Creating a new language for OCR

antonio111 · Post by **antonio111** » Tue Oct 10, 2017 7:39 pm

Dear Recipients,

I was wondering whether it is possible to install a new language (or new languages) for the OCR function in PDF-XChange Editor.

The language I would most like to use is Pāli, an early language in which the Buddha's teachings are preserved. Pāli is written in many alphabets, among them an extended Roman alphabet with characters such as ā, ī, ū, ṅ, ṇ, ñ, ṭ, ḍ, ṃ (or ṁ), and ḷ. Do we need a particular language pack for this, or just one that supports Latin Extended-A and Latin Extended Additional, which would also work for Vietnamese, for example?
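For reference, the blocks these letters fall into can be checked with a short Python sketch. The block ranges are taken from the Unicode standard; the names `BLOCKS`, `PALI_LETTERS` and `block_of` are only illustrative. Note that block coverage alone says nothing about whether a given OCR language file was actually trained on these letters.

```python
import unicodedata

# Unicode block ranges relevant to romanized Pāli.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
]

# The diacritic letters listed in the post above.
PALI_LETTERS = "āīūṅṇñṭḍṃṁḷ"


def block_of(ch):
    """Return the Unicode block name of a single (possibly decomposed) letter."""
    composed = unicodedata.normalize("NFC", ch)
    if len(composed) != 1:
        raise ValueError(f"{ch!r} has no single precomposed code point")
    code = ord(composed)
    for name, start, end in BLOCKS:
        if start <= code <= end:
            return name
    return "other"


# Normalize first so each letter is one code point, then report its block.
for letter in unicodedata.normalize("NFC", PALI_LETTERS):
    print(f"{letter}  U+{ord(letter):04X}  {block_of(letter)}")
```

Running this shows that ā, ī and ū are in Latin Extended-A, the dotted consonants are in Latin Extended Additional, and ñ is in Latin-1 Supplement.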

Thank you for your attention.

Kind regards,
Antonio

Bhikkhu Pesala · Post by **Bhikkhu Pesala** » Wed Oct 11, 2017 9:44 am

I see that there is an Additional Language Pack for Vietnamese.

Try that, and let us know your results.

Wed Oct 11, 2017 10:38 am

Thanks for the help, Bhikkhu!

Let us know how it went with the Vietnamese language file, Antonio!

Regards,
Stefan

antonio111 · Post by **antonio111** » Wed Oct 11, 2017 9:07 pm

Thank you for your replies, Bhikkhu Pesala and Stefan. I will try it and let you know how it goes.

Thu Oct 12, 2017 10:39 am

Looking forward to your feedback, antonio111!

Cheers,
Stefan

antonio111 · Post by **antonio111** » Thu Oct 12, 2017 7:01 pm

Dear Stefan and Bhikkhu Pesala,

I have used the OCR function on a Pāli text, with Vietnamese set as the recognition language. Recognition is good where there are no diacritical marks, but letters such as ā, ī, ū, ṇ, ṅ, ñ, ṃ, ṭ, ḍ seem to be completely misread by the program.

I then compared the result with an earlier recognition of the same text made with a different recognition language. I did that some months ago and don't remember which language I used, probably English. The result with that language was very similar, except that it could recognize ñ.

I remember that some time ago I tried to recognize another Pāli text (with PDF-XChange) using many languages, in order to test which one suited Pāli best. Among these languages was Latvian, which seems to have some of the same diacritical marks as Pāli. However, I remember that the results were not good even in that case. This is why I have asked myself and others whether we could get better results with a Pāli language pack. What do you think?

Kind regards,
Antonio

Post by **Paul - Tracker Supp** » Fri Oct 13, 2017 6:29 pm

Hi antonio111,

Unfortunately, we do not maintain the OCR libraries. This is the only component of PDF-XChange that is not entirely our own code. We use the Tesseract libraries for the OCR: https://en.wikipedia.org/wiki/Tesseract_(software)

As such, we do not manage the available languages, and you would be best to get in touch with the Tesseract project about adding new language support.

One suggestion, since you are seeing relatively good results with some of the existing languages, is to enable more than one of them in the Editor. That may well give you the best of several different languages.

It is also possible to add any of the languages currently listed here: https://github.com/tesseract-ocr/tessdata

To make any of those language files usable in the Editor, rename the desired file from XXX.traineddata to XXX_pxvocr.dat, and place it into "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages".

You will also need to create a corresponding XML file with name XXX_pxvocr.lng and the following content:

Code:

<?xml version="1.0" encoding="utf-8"?>
<language name="UI Name of the Language" prefix="XXX" version="1.00"/>

That will give you access to any language Tesseract supports.
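The rename-and-create-XML steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `install_language` is made up for illustration, the file naming and XML content follow the description in this post, and writing into the Program Files folder will normally require administrator rights.

```python
import shutil
from pathlib import Path
from xml.sax.saxutils import quoteattr


def install_language(traineddata, ui_name, target_dir):
    """Copy XXX.traineddata to XXX_pxvocr.dat and write XXX_pxvocr.lng.

    traineddata: path to a file downloaded from the tessdata repository
    ui_name:     name to show in the Editor's language list
    target_dir:  the Editor's OCRLanguages folder (must already exist)
    """
    traineddata = Path(traineddata)
    target_dir = Path(target_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"OCR language folder not found: {target_dir}")

    prefix = traineddata.stem  # "lav" for "lav.traineddata"
    dat_path = target_dir / f"{prefix}_pxvocr.dat"
    lng_path = target_dir / f"{prefix}_pxvocr.lng"

    # Copy rather than move, so the downloaded file is left untouched.
    shutil.copyfile(traineddata, dat_path)

    # quoteattr() adds the surrounding quotes and escapes special characters.
    lng_path.write_text(
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f"<language name={quoteattr(ui_name)} prefix={quoteattr(prefix)} "
        'version="1.00"/>\n',
        encoding="utf-8",
    )
    return dat_path, lng_path


# Example call (paths as given in this post; adjust to your installation):
# install_language(
#     r"C:\Downloads\lav.traineddata",
#     "Latvian",
#     r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages",
# )
```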

I hope that helps.

antonio111 · Post by **antonio111** » Fri Oct 13, 2017 8:24 pm

Dear Paul,

Thank you very much for your support. I appreciate it!

Kind regards,
Antonio

Post by **Paul - Tracker Supp** » Fri Oct 13, 2017 8:35 pm

My pleasure Antonio.
