bug(high_accuracy): OCR layer area

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


SashaChernykh
User
Posts: 11
Joined: Tue Aug 20, 2019 8:01 am

bug(high_accuracy): OCR layer area

Post by SashaChernykh »

1. Summary

When I selected High Accuracy in the OCR settings, I got an incorrect area for the OCR layer.

2. Possibly related issues

Please read my issue for Tesseract.

3. Environment and settings

3.1. Environment
  1. Windows 10 Enterprise LTSB 64-bit EN
  2. PDF-XChange Editor Free Version 8.0, Build 332.0, Portable
3.2. Settings
  1. Default OCR-Engine
  2. “High” Accuracy in “Convert” → “OCR Page(s)” settings:
Image

4. Example data

Page of my real book without OCR:

5. Steps to reproduce

I opened KiraTemperament.pdf → Convert → OCR Page(s) → I applied the settings from item 3.2 → OK → in the file with the OCR layer, I tried to find the word “темперамента”.

6. Expected behavior

With “Medium” accuracy:
Image

7. Unexpected behavior

With “High” accuracy:
Image

The last letter of the word “темперамента” is not selected.

8. Do not offer

Yes, I have read about the cases in which I should use “High” or “Medium” Accuracy.

Thanks.
TrackerSupp-Daniel
Site Admin
Posts: 8439
Joined: Wed Jan 03, 2018 6:52 pm

Re: bug(high_accuracy): OCR layer area

Post by TrackerSupp-Daniel »

Hello SashaChernykh,

Thank you for the report. You are quite correct in point 2 that the issue is related; in fact, it is the exact same issue that you described there, and that others described in the "duplicate post" another user mentioned there.

Our "Default" OCR engine is indeed the Tesseract engine and, unfortunately, if the engine itself is having these sorts of issues, there is nothing we can do from our end until it is resolved over there first.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
SashaChernykh
User
Posts: 11
Joined: Tue Aug 20, 2019 8:01 am

Re: bug(high_accuracy): OCR layer area

Post by SashaChernykh »

Type: Question

Which OCR engine did the program use in version 7 and earlier? Was it Tesseract or not?

Thanks.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: bug(high_accuracy): OCR layer area

Post by Will - Tracker Supp »

Hi Sasha,

That's correct, and the current release still uses the Tesseract engine for the free, default OCR. The Enhanced OCR licenses Lead Tools' technology. If you have a valid license for the Editor with active maintenance, you can try the Enhanced OCR for 30 days. You can switch engines under File --> Preferences --> OCR. If you have a valid license but the Enhanced OCR does not appear as an option, you would need to re-install Version 8 and make sure that the option to install the Enhanced OCR plugin is selected.

Thanks,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
SashaChernykh
User
Posts: 11
Joined: Tue Aug 20, 2019 8:01 am

Re: bug(high_accuracy): OCR layer area

Post by SashaChernykh »

Type: Clarifying question

For free PDF-XChange Editor without any watermarks:
  1. Version 7 — LeadTools OCR SDK
  2. Version 8 Default OCR Engine (Enhanced is not free) — Tesseract
Am I wrong somewhere?

Thanks.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: bug(high_accuracy): OCR layer area

Post by Will - Tracker Supp »

Hi Sasha,

That's not quite correct.
  1. Version 7 = Tesseract Engine only :: We did not have the Enhanced OCR in Version 7
  2. Version 8 default, free OCR engine = Tesseract; Enhanced OCR = Lead Tools SDK (free for 30 days with any license valid for the Editor)
Thanks,
SashaChernykh
User
Posts: 11
Joined: Tue Aug 20, 2019 8:01 am

Re: bug(high_accuracy): OCR layer area

Post by SashaChernykh »

Type: Objection

1. Citation
there is nothing we can do from our end until it is resolved over there first.
2. Objection
  1. For versions 7 and 8, PDF-XChange Editor uses the same third-party software: Tesseract.
  2. But in version 7, with the same settings, I got a better selected area; I had no serious problems with the selected area.
Maybe it is a problem in PDF-XChange Editor, not in the third-party tool? Why not :) ?

3. Example
  • 7.0 Build 328.1:
Image
  • 8.0, Build 332.0:
Image

If more examples are needed, please tell me.

Thanks.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: bug(high_accuracy): OCR layer area

Post by Will - Tracker Supp »

Hi Sasha,

Thanks for that - While there is definitely and absolutely a bug in the Tesseract engine itself, a change in the results does strongly suggest an issue in our software (unless the Tesseract libraries have changed and I'm not aware). Is there any difference at all in your OCR settings between Version 7 Build 328.1 and Version 8 Build 332?

Also, have you tried using Medium accuracy? If not, please do and see if that helps.

Thanks,
SashaChernykh
User
Posts: 11
Joined: Tue Aug 20, 2019 8:01 am

Re: bug(high_accuracy): OCR layer area

Post by SashaChernykh »

Type: Additional data

1. Example book
  1. KiraFullWithoutOCR.pdf — book without OCR
  2. KiraFullOCRVersion7.pdf — book with OCR layer from version 7
  3. KiraFullOCRVersion8.pdf — book with OCR layer from version 8
2. Settings

2.1. OCR
  • 7.0 Build 328.1:
Image
  • 8.0 Build 332.0:
Image

2.2. Note

In both cases I use Medium, not High Accuracy.

3. Results

2 examples; compare KiraFullOCRVersion7.pdf and KiraFullOCRVersion8.pdf for more.

3.1. “темперамента”
  • 7:
Image
  • 8:
Image

3.2. “Гремиславы”
  • 7:
Image
  • 8:
Image
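
The difference visible in the screenshots above (the version 8 selection stopping short of the last letter) could also be expressed as a number, which makes it easier to compare many words at once. The following is only a minimal sketch: it assumes the word rectangles from each OCR layer have already been extracted by some PDF tool as (x0, y0, x1, y1) values. The `Box` type, the function names, the 5% tolerance, and the sample coordinates are all hypothetical and do not come from PDF-XChange Editor or Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned word rectangle in page units (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def horizontal_coverage(reference: Box, candidate: Box) -> float:
    """Fraction of the reference box's width that the candidate box overlaps."""
    width = reference.x1 - reference.x0
    if width <= 0:
        raise ValueError("reference box must have positive width")
    overlap = min(reference.x1, candidate.x1) - max(reference.x0, candidate.x0)
    return max(0.0, overlap) / width


def is_truncated(reference: Box, candidate: Box, tolerance: float = 0.05) -> bool:
    """True when more than `tolerance` of the reference width is left uncovered."""
    return horizontal_coverage(reference, candidate) < 1.0 - tolerance


# Made-up numbers for illustration only, not measured from the attached PDFs.
v7_box = Box(100.0, 200.0, 220.0, 215.0)  # selection covers the whole word
v8_box = Box(100.0, 200.0, 210.0, 215.0)  # selection stops before the last letter

print(horizontal_coverage(v7_box, v8_box))
print(is_truncated(v7_box, v8_box))
```

Running the same check over every word shared by KiraFullOCRVersion7.pdf and KiraFullOCRVersion8.pdf would show whether the truncation is systematic or limited to a few words.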

4. Reasons for using PDF-XChange Editor

See my detailed Software Recommendations answer. I recommended PDF-XChange Editor precisely because I didn't have big problems with selected areas. But I can't see why users should prefer version 8 if it has the same problems as other Tesseract-based tools.

Thanks.
TrackerSupp-Daniel
Site Admin
Posts: 8439
Joined: Wed Jan 03, 2018 6:52 pm

Re: bug(high_accuracy): OCR layer area

Post by TrackerSupp-Daniel »

Hello Sasha,

I am unsure what you would like me to say here. Just as our software receives occasional updates, so too does the Tesseract engine; it is entirely possible that V7 uses an older version of the engine from before this issue was introduced. We will look into this and see if there is anything to be done from our end, but it does currently still look very likely that the issue is on the engine side, not in our software.

I cannot say that using the Tesseract engine in our software would offer anything superior to using the exact same version of the engine in any other application. I can, however, say that the other features of our application, which can be used to edit the document both before and after running OCR, are obvious benefits. Otherwise, if you are looking for advanced functions and clearly noticeable benefits in the OCR alone, you would need to look into our new EOCR plugin, which uses the LeadTools OCR engine instead of Tesseract.

We will endeavour to have this issue resolved as soon as possible, but we are likely going to be stuck waiting for a new version of the Tesseract engine to be made available.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
