Creating a new language for OCR

Forum for the PDF-XChange Editor - Free and Licensed Versions

antonio111
User
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Creating a new language for OCR

Post by antonio111 »

Dear Recipients,

I was wondering whether it is possible to install a new language (or new languages) for the OCR function in PDF-XChange Editor.

The language I would most like to use is Pāli, an early language in which the Buddha's teachings are preserved. Pāli is written in many alphabets, among them an extended Roman alphabet with characters such as ā, ī, ū, ṅ, ṇ, ñ, ṭ, ḍ, ṃ (or ṁ), and ḷ. Do we need a particular language pack for this, or just one that supports Latin Extended-A and Latin Extended Additional, which would also work for Vietnamese, for example?
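
As a quick check of which Unicode blocks these letters fall in, here is a small Python sketch. The block ranges are copied from the Unicode code charts, and the helper name block_of is only illustrative. It says nothing about what any particular OCR language pack can recognize.

```python
import unicodedata

# Block ranges as published in the Unicode code charts.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]

# The letters listed above, written as escapes: ā ī ū ṅ ṇ ñ ṭ ḍ ṃ ṁ ḷ
PALI_LETTERS = "\u0101\u012b\u016b\u1e45\u1e47\u00f1\u1e6d\u1e0d\u1e43\u1e41\u1e37"


def block_of(ch):
    """Return the name of the Unicode block containing ch (hypothetical helper)."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


if __name__ == "__main__":
    for ch in PALI_LETTERS:
        print(f"{ch}  U+{ord(ch):04X}  {block_of(ch)}  ({unicodedata.name(ch)})")
```

Note that ñ (U+00F1) sits in Latin-1 Supplement rather than in either of the two extended blocks.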

Thank you for your attention.

Kind regards,
Antonio
Bhikkhu Pesala
User
Posts: 1776
Joined: Tue May 29, 2007 9:29 am
Location: East London

Re: Creating a new language for OCR

Post by Bhikkhu Pesala »

I see that there is an Additional Language Pack for Vietnamese.

Try that, and let us know your results.
Windows 10 Home 64-bit • AMD Ryzen 5 3400G, 8 Gb
Review: http://www.softerviews.org/PDF-XChange.html
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Creating a new language for OCR

Post by Tracker Supp-Stefan »

Thanks for the help, Bhikkhu!

Let us know how it went with the Vietnamese language file, Antonio!

Regards,
Stefan

Re: Creating a new language for OCR

Post by antonio111 »

Thank you for your replies, Bhikkhu Pesala and Stefan. I will try it and let you know how it goes.

Re: Creating a new language for OCR

Post by Tracker Supp-Stefan »

Looking forward to your feedback, antonio111!

Cheers,
Stefan

Re: Creating a new language for OCR

Post by antonio111 »

Dear Stefan and Bhikkhu Pesala,

I have used the OCR function on a Pāli text, with Vietnamese set as the recognition language. Recognition is good where there are no diacritical marks, but letters such as ā, ī, ū, ṇ, ṅ, ñ, ṃ, ṭ, ḍ seem to be completely misread by the program.

I then compared the result with an earlier recognition of the same text, made with a different recognition language. I did that some months ago and don't remember which language I used, probably English. The result with that language was very similar, except that it could recognize ñ.

I remember that some time ago I tried to recognize another Pāli text (with PDF-XChange) using many languages, in order to test which one suited Pāli best. Among these languages was Latvian, which seems to have some diacritical marks that also occur in Pāli. However, I remember that the results were not good in that case either. This is why I have asked myself and others whether we could get better results with a Pāli language pack. What do you think?

Kind regards,
Antonio
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Creating a new language for OCR

Post by Paul - Tracker Supp »

Hi antonio111,

Unfortunately, we do not maintain the OCR libraries. This is the only component of PDF-XChange that is not 100% our own code. We use the Tesseract libraries for OCR: https://en.wikipedia.org/wiki/Tesseract_(software)

As such, we do not manage the available languages, and you would do best to get in touch with the Tesseract project about adding support for a new language.

One suggestion, since you are seeing relatively good results with some of the existing languages, is to enable more than one of them in the Editor. That may well give you the best of a number of different languages.

It is also possible to add any of the languages currently listed here: https://github.com/tesseract-ocr/tessdata

To make any of those language files usable in the Editor, rename the desired file from XXX.traineddata to XXX_pxvocr.dat and place it in "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages".

You will also need to create a corresponding XML file with the name XXX_pxvocr.lng and the following content:

<?xml version="1.0" encoding="utf-8"?>
<language name="UI Name of the Language" prefix="XXX" version="1.00"/>

That will give you access to any language Tesseract supports.
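
The two manual steps above can also be scripted. The following is only a sketch in Python, not something Tracker ships: the function name install_language is made up for illustration, the default folder is the one quoted above and may differ on your system, and writing to it will normally need administrator rights.

```python
import shutil
from pathlib import Path
from xml.sax.saxutils import quoteattr

# Folder quoted in the post above; adjust it to your own installation.
OCR_DIR = Path(r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages")


def install_language(traineddata, ui_name, target_dir=OCR_DIR):
    """Copy XXX.traineddata to XXX_pxvocr.dat and write XXX_pxvocr.lng beside it.

    Hypothetical helper: it only automates the rename and the XML file
    described above and does not check that the Editor accepts the result.
    """
    src = Path(traineddata)
    prefix = src.stem  # "XXX" for a file named XXX.traineddata
    target = Path(target_dir)

    dat = target / f"{prefix}_pxvocr.dat"
    lng = target / f"{prefix}_pxvocr.lng"

    # Copy rather than rename, so the downloaded file is left untouched.
    shutil.copyfile(src, dat)

    # quoteattr adds the surrounding quotes and escapes special characters.
    lng.write_text(
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f'<language name={quoteattr(ui_name)} prefix={quoteattr(prefix)} version="1.00"/>\n',
        encoding="utf-8",
    )
    return dat, lng
```

For example, install_language("XXX.traineddata", "UI Name of the Language") would create XXX_pxvocr.dat and XXX_pxvocr.lng in the folder above.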

I hope that helps.
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com

Re: Creating a new language for OCR

Post by antonio111 »

Dear Paul,

Thank you very much for your support. I appreciate it!

Kind regards,
Antonio

Re: Creating a new language for OCR

Post by Paul - Tracker Supp »

:D

My pleasure Antonio.
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com