Creating a new language for OCR
-
- User
- Posts: 4
- Joined: Tue Oct 10, 2017 7:19 pm
Creating a new language for OCR
Dear Recipients,
I was wondering whether it is possible to install a new language (or new languages) for the OCR function in PDF-XChange Editor.
The language I would most like to use is Pāli, an early language in which the Buddha's teachings are preserved. Pāli is written in many scripts, among them extended Roman with characters such as ā, ī, ū, ṅ, ṇ, ñ, ṭ, ḍ, ṃ (or ṁ), and ḷ. Do we need a particular language pack for this, or just one that supports Latin Extended-A and Latin Extended Additional, which would also work for Vietnamese, for example?
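To make the question about Unicode blocks concrete, here is a minimal sketch that reports which block each of the romanized Pāli letters listed above falls into. The helper name `block_of` and the short `BLOCKS` table are illustrative assumptions, not part of any OCR tool; the ranges themselves come from the Unicode standard.

```python
import unicodedata

# Unicode block ranges relevant to romanized Pāli (from the Unicode standard).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]


def block_of(ch: str) -> str:
    """Return the name of the Unicode block containing a single character."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


# NFC ensures each letter is one precomposed code point rather than
# a base letter followed by a combining mark.
PALI_LETTERS = unicodedata.normalize("NFC", "āīūṅṇñṭḍṃṁḷ")

for ch in PALI_LETTERS:
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}  [{block_of(ch)}]")
```

The long vowels ā, ī, ū sit in Latin Extended-A, ñ in Latin-1 Supplement, and the dotted consonants in Latin Extended Additional. Note that a language pack covering these blocks is not necessarily one whose recognizer was trained on these particular letters.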
Thank you for your attention.
Kind regards,
Antonio
- Bhikkhu Pesala
- User
- Posts: 1776
- Joined: Tue May 29, 2007 9:29 am
- Location: East London
- Contact:
Re: Creating a new language for OCR
I see that there is an Additional Language Pack for Vietnamese.
Try that, and let us know your results.
Windows 10 Home 64-bit • AMD Ryzen 5 3400G, 8 Gb
Review: http://www.softerviews.org/PDF-XChange.html
- Tracker Supp-Stefan
- Site Admin
- Posts: 17948
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Creating a new language for OCR
Thanks for the help Bhikkhu!
Let us know how it goes with the Vietnamese language file, Antonio!
Regards,
Stefan
-
- User
- Posts: 4
- Joined: Tue Oct 10, 2017 7:19 pm
Re: Creating a new language for OCR
Thank you for your replies, Bhikkhu Pesala and Stefan. I will try it and let you know how it goes.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17948
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Creating a new language for OCR
Looking forward to your feedback antonio111!
Cheers,
Stefan
-
- User
- Posts: 4
- Joined: Tue Oct 10, 2017 7:19 pm
Re: Creating a new language for OCR
Dear Stefan and Bhikkhu Pesala,
I have used the OCR function on a Pāli text, with Vietnamese set as the recognition language. Recognition is good where there are no diacritical marks, but letters such as ā, ī, ū, ṇ, ṅ, ñ, ṃ, ṭ, ḍ seem to be completely misread by the program.
I then compared this with another recognition of the same text, which I did some months ago with a different recognition language; I don't remember which one, probably English. The result was very similar, except that the other language could recognize ñ.
I remember that some time ago I tried recognizing another Pāli text (with PDF-XChange) using many languages, to test which one suited Pāli best. Among them was Latvian, which seems to have some of the same diacritical marks as Pāli. However, I remember that the results were not good even in that case. This is why I have been asking myself and others whether we could get better results with a Pāli language pack. What do you think?
Kind regards,
Antonio
- Paul - Tracker Supp
- Site Admin
- Posts: 6901
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
- Contact:
Re: Creating a new language for OCR
Hi antonio111,
Unfortunately we do not maintain the OCR libraries. This is the only component of PDF-XChange that is not entirely our own code. We use the Tesseract libraries for OCR: https://en.wikipedia.org/wiki/Tesseract_(software)
As such, we do not manage the available languages, and you would be best to get in touch with the Tesseract project about adding new language support.
One suggestion, since you are seeing relatively good results with some of the existing languages, is to enable more than one of them in the Editor. That may well give you the best of a number of different languages.
It is also possible to add any of the languages currently listed here: https://github.com/tesseract-ocr/tessdata
To make any of those language files usable in the Editor, rename the desired file from XXX.traineddata to XXX_pxvocr.dat, and place it in "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages".
You will also need to create a corresponding XML file named XXX_pxvocr.lng with the following content:
Code:
<?xml version="1.0" encoding="utf-8"?>
<language name="UI Name of the Language" prefix="XXX" version="1.00"/>
That will give you access to any language Tesseract supports.
I hope that helps.
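For anyone who wants to script the steps above, here is a minimal sketch under stated assumptions. The function name `install_language` is hypothetical, the folder is the one quoted in this post (adjust it if the Editor is installed elsewhere), and writing into Program Files will normally require administrator rights. It copies the downloaded file rather than renaming it, so the original download stays intact.

```python
import shutil
from pathlib import Path
from xml.sax.saxutils import quoteattr

# Folder quoted in the post; an assumption for any other install location.
OCR_DIR = Path(r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages")


def install_language(traineddata: Path, ui_name: str, ocr_dir: Path = OCR_DIR):
    """Copy XXX.traineddata to XXX_pxvocr.dat and write the matching XXX_pxvocr.lng.

    Returns the paths of the two files that were created.
    """
    prefix = traineddata.stem  # "XXX" for "XXX.traineddata"
    dat_path = ocr_dir / f"{prefix}_pxvocr.dat"
    lng_path = ocr_dir / f"{prefix}_pxvocr.lng"

    # Copy the language data under the name described in the post.
    shutil.copyfile(traineddata, dat_path)

    # quoteattr escapes the values and wraps them in quotes for use as XML attributes.
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f'<language name={quoteattr(ui_name)} prefix={quoteattr(prefix)} version="1.00"/>\n'
    )
    lng_path.write_text(xml, encoding="utf-8")
    return dat_path, lng_path


# Example call (hypothetical file name and display name):
# install_language(Path("XXX.traineddata"), "UI Name of the Language")
```

After running it, restart the Editor and check whether the new entry appears in the OCR language list.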
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
-
- User
- Posts: 4
- Joined: Tue Oct 10, 2017 7:19 pm
Re: Creating a new language for OCR
Dear Paul,
Thank you very much for your support. I appreciate it!
Kind regards,
Antonio
- Paul - Tracker Supp
- Site Admin
- Posts: 6901
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
- Contact:
Re: Creating a new language for OCR
My pleasure Antonio.
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com