Creating a new language for OCR

Forum for the PDF-XChange Editor - Free and Licensed Versions


antonio111
User
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Creating a new language for OCR

Post by antonio111 » Tue Oct 10, 2017 7:39 pm

Dear Recipients,

I was wondering whether it is possible to install a new language (or several) for the OCR function in PDF-XChange Editor.

The language I would most like to use is Pāli, an early language in which the Buddha's teachings are preserved. Pāli is written in many scripts, among them an extended Roman alphabet with characters such as ā, ī, ū, ṅ, ṇ, ñ, ṭ, ḍ, ṃ (or ṁ), and ḷ. Do we need a particular language pack for this, or just one that supports Latin Extended-A and Latin Extended Additional, which would also work for Vietnamese, for example?
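
To make the question about Unicode blocks concrete, here is a minimal Python sketch that sorts the letters listed above by block. The block ranges come from the Unicode Standard; the names `BLOCKS`, `PALI_LETTERS`, `block_of` and `group_by_block` are made up for this illustration.

```python
# Unicode block ranges relevant to romanized Pali (from the Unicode Standard).
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
]

# The letters from the post, written as escapes so the precomposed
# code points are unambiguous: a-macron, i-macron, u-macron, n-dot-above,
# n-dot-below, n-tilde, t-dot-below, d-dot-below, m-dot-below,
# m-dot-above, l-dot-below.
PALI_LETTERS = (
    "\u0101\u012b\u016b\u1e45\u1e47\u00f1"
    "\u1e6d\u1e0d\u1e43\u1e41\u1e37"
)


def block_of(ch):
    """Return the name of the Unicode block containing ch, or 'other'."""
    cp = ord(ch)
    for name, low, high in BLOCKS:
        if low <= cp <= high:
            return name
    return "other"


def group_by_block(text):
    """Map each block name to the characters of text that fall in it."""
    groups = {}
    for ch in text:
        groups.setdefault(block_of(ch), []).append(ch)
    return groups


if __name__ == "__main__":
    for name, chars in group_by_block(PALI_LETTERS).items():
        print(f"{name}: {' '.join(chars)}")
```

Under these ranges the three long vowels fall in Latin Extended-A, ñ falls in Latin-1 Supplement, and the dotted consonants fall in Latin Extended Additional. Block coverage only says which characters a font or encoding can represent; whether a given OCR language model was actually trained on these letters is a separate question.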

Thank you for your attention.

Kind regards,
Antonio

Bhikkhu Pesala
User
Posts: 1774
Joined: Tue May 29, 2007 9:29 am
Location: East London

Re: Creating a new language for OCR

Post by Bhikkhu Pesala » Wed Oct 11, 2017 9:44 am

I see that there is an Additional Language Pack for Vietnamese.

Try that, and let us know your results.
Windows 10 64-bit • AMD A10-6800K, 8 Gbyte RAM
Review: http://www.softerviews.org/PDF-XChange.html

Tracker Supp-Stefan
Site Admin
Posts: 14196
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Creating a new language for OCR

Post by Tracker Supp-Stefan » Wed Oct 11, 2017 10:38 am

Thanks for the help, Bhikkhu!

Let us know how it went with the Vietnamese language file, Antonio!

Regards,
Stefan

antonio111
User
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Post by antonio111 » Wed Oct 11, 2017 9:07 pm

Thank you for your replies, Bhikkhu Pesala and Stefan. I will try it and let you know how it goes.

Tracker Supp-Stefan
Site Admin
Posts: 14196
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Creating a new language for OCR

Post by Tracker Supp-Stefan » Thu Oct 12, 2017 10:39 am

Looking forward to your feedback, antonio111!

Cheers,
Stefan

antonio111
User
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Post by antonio111 » Thu Oct 12, 2017 7:01 pm

Dear Stefan and Bhikkhu Pesala,

I have used the OCR function on a Pāli text, with Vietnamese set as the recognition language. Recognition is good where there are no diacritical marks, but letters such as ā, ī, ū, ṇ, ṅ, ñ, ṃ, ṭ, ḍ seem to be completely misread by the program.

I then compared the result with an earlier recognition of the same text, done with a different recognition language. I did it some months ago and do not remember which language I used, probably English. The result with that language was very similar, except that it could recognize ñ.

I remember that some time ago I tried to recognize another Pāli text (with PDF-XChange) using many languages, in order to test which one was best suited to Pāli. Among these languages was Latvian, which seems to have some of the same diacritical marks as Pāli. However, I remember that the results were not good even in that case. This is why I have been asking myself and others whether we could get better results with a Pāli language pack. What do you think?

Kind regards,
Antonio

Paul - Tracker Supp
Site Admin
Posts: 5108
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Creating a new language for OCR

Post by Paul - Tracker Supp » Fri Oct 13, 2017 6:29 pm

Hi antonio111,

Unfortunately, we do not maintain the OCR libraries. This is the only component of PDF-XChange that is not entirely our own code. We use the Tesseract libraries for OCR: https://en.wikipedia.org/wiki/Tesseract_(software)

As such, we do not manage the available languages, and you would be best to get in touch with the Tesseract project about adding new language support.

One suggestion, since you are seeing relatively good results with some of the existing languages, is to enable more than one language in the Editor. That may well give you the best of several different languages.

It is also possible to add any of the languages currently listed here: https://github.com/tesseract-ocr/tessdata

To make any of those language files usable in the Editor, rename the desired file from XXX.traineddata to XXX_pxvocr.dat and place it in "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages".

You will also need to create a corresponding XML file named XXX_pxvocr.lng with the following content:

Code:

<?xml version="1.0" encoding="utf-8"?>
<language name="UI Name of the Language" prefix="XXX" version="1.00"/>

That will give you access to any language Tesseract supports.
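
For anyone who prefers to script the rename-and-create steps above, here is a minimal Python sketch. It is not a Tracker tool: the function name `install_ocr_language` and its parameters are hypothetical, and it assumes only what is described above, namely that the Editor looks for a XXX_pxvocr.dat file and a matching XXX_pxvocr.lng file in the OCRLanguages folder.

```python
import shutil
from pathlib import Path
from xml.sax.saxutils import quoteattr


def install_ocr_language(traineddata, ui_name, target_dir):
    """Copy a Tesseract .traineddata file into target_dir under the
    XXX_pxvocr.dat name and write the matching XXX_pxvocr.lng file.

    traineddata: path to a downloaded XXX.traineddata file
    ui_name:     the name to show in the Editor's language list
    target_dir:  the OCRLanguages folder (assumed to exist already)
    """
    src = Path(traineddata)
    prefix = src.stem  # "XXX" from "XXX.traineddata"
    target = Path(target_dir)

    dat_path = target / f"{prefix}_pxvocr.dat"
    lng_path = target / f"{prefix}_pxvocr.lng"

    # Copy rather than rename so the downloaded original is kept.
    shutil.copyfile(src, dat_path)

    # quoteattr adds the surrounding quotes and escapes special characters.
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f"<language name={quoteattr(ui_name)} "
        f'prefix={quoteattr(prefix)} version="1.00"/>\n'
    )
    lng_path.write_text(xml, encoding="utf-8")
    return dat_path, lng_path
```

A call would look like `install_ocr_language("XXX.traineddata", "UI Name of the Language", r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages")`. Writing under Program Files normally needs an elevated (administrator) prompt, so the script may have to be run that way.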

I hope that helps.
_________________
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com

antonio111
User
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Post by antonio111 » Fri Oct 13, 2017 8:24 pm

Dear Paul,

Thank you very much for your support. I appreciate it!

Kind regards,
Antonio

Paul - Tracker Supp
Site Admin
Posts: 5108
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Creating a new language for OCR

Post by Paul - Tracker Supp » Fri Oct 13, 2017 8:35 pm

:D

My pleasure, Antonio.
