"Jumping" letters as a result of OCR

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

"Jumping" letters as a result of OCR

Post by Jensen Head »

After OCR is run on scanned text and an invisible text layer is added, some of the letters appear vertically offset when the document is viewed on some computers:
_
Jumping letters as a result of OCR.png
_
It makes no difference whether the OCR was run in PDF-XChange Editor Plus 9.2, build 359.0 (Enhanced OCR) or in PDF-Tools. "Convert to searchable PDF document" in ABBYY FineReader PDF 15 creates a document without such artifacts, but the document is almost 20 times larger. In both versions, the same font, Times New Roman, is used for the compared text fragment. However, the sets of fonts embedded in the new files differ. In any case, the fonts should not affect how the text is displayed, because the text layer lies below the image.

The artifacts are observed on the same computer in Microsoft Edge 100.0.1185.29 and in Google Chrome 99.0.4844 and 100.0.4896.60. On the same computer, Adobe Acrobat Pro DC 21.001.2014Z 30912 displays the documents without artifacts. The operating system is Windows 10. I noticed that the bug does not appear at every page zoom level.

On the computer on which recognition is performed, documents in all applications are displayed without artifacts.

FineReader-ed.pdf (22.2 MB) — https://drive.google.com/file/d/1GuFDOxxpl4ZAwfzwbqK2x9gla756MAGh/view

Update 1: third computer (Windows 10 Pro 21H2 build 19044.1348):
Adobe Acrobat Reader DC 2021.007.200911 x64 — PDF-Tools-ed.pdf and FineReader-ed.pdf OK
Google Chrome 100.0.4896.60 x64 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK
Microsoft Edge 100.0.1185.29 x64 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK
Mozilla Firefox 98.0.2 x64 — PDF-Tools-ed.pdf and FineReader-ed.pdf OK

Update 2: fourth computer (Microsoft Windows 10.0.19044.1586):
PDF-XChange Viewer 2.5 build 201.0 — PDF-Tools-ed.pdf and FineReader-ed.pdf OK
Microsoft Edge 100.0.1185.29 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK
Mozilla Firefox 98.0.2 x86 — PDF-Tools-ed.pdf and FineReader-ed.pdf OK

Update 3: fifth computer (Microsoft Windows 10.0.19041.1620):
Microsoft Edge 100.0.1185.29 x64 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK

Update 4: sixth computer (Microsoft Windows 11):
PDF-XChange Editor Plus 9.2 359.0 (Enhanced OCR) — PDF-Tools-ed.pdf and FineReader-ed.pdf OK
Opera 85.0.4341.47 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK
Google Chrome 99.0.4844.84 x64 — PDF-Tools-ed.pdf bad, FineReader-ed.pdf OK
Attachments
PDF-Tools-ed.pdf (1.22 MiB)
Original.pdf (1.14 MiB)
Last edited by Jensen Head on Wed Apr 06, 2022 7:28 am, edited 2 times in total.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: "Jumping" letters as a result of OCR

Post by TrackerSupp-Daniel »

Hello, Jensen Head

I am afraid that I do not see any substantial changes between these two files. The "skew" of the image was altered by a small amount to line it up with the newly generated invisible text, but beyond that I cannot identify any of the artifacts you describe when opening the file in Edge or Chrome. Could I ask you for screenshots showing the artifacts you are seeing, and any other details you can offer about how to locate them in each case?

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: "Jumping" letters as a result of OCR

Post by Jensen Head »

It seemed to me that the defect is striking in the screenshot given in the first post. Take a look at another screenshot, taken in different applications:
_
Jumping letters as a result of OCR — comparison.png
_
(above: a variant with letters shifted vertically; below: a correctly displayed document) The original document and the one saved in ABBYY FineReader PDF 15 are displayed without artifacts in all applications on all test computers.
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada
Contact:

Re: "Jumping" letters as a result of OCR

Post by Paul - Tracker Supp »

Aaaah - thanks for pointing that out.

I will need to get the dev that does the OCR to tell us what can be done about this.

He's not available this morning however, so I would ask your patience until I can get him to look at this.

please and thanks
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada
Contact:

Re: "Jumping" letters as a result of OCR

Post by Paul - Tracker Supp »

The devs took a look at this and said we will fix it.

The ticket is RT6014: "Jumping" letters as a result of OCR

As always, the ticket is internal only, ask here for updates.

regards
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: "Jumping" letters as a result of OCR

Post by Vasyl-Tracker Dev Team »

Hi Jensen.

We found the reason for this issue. It looks like you OCRed that Original.pdf with the 'Detect skew..' and 'Fix skew..' options enabled. As a result, the images on the pages were rotated slightly clockwise. Everything is OK with that, except that Chrome/Edge display such rotated images incorrectly, for unknown reasons, while all other PDF viewer apps show this doc correctly. So there is definitely a bug in Chrome/Edge, and you may report this case to them.

However, I can suggest one good workaround for this: you can disable the 'Fix skew..' option and keep the 'Detect skew..' option enabled. In this case the OCR will still be able to correctly recognize text on slightly rotated scans, but will not correct the rotation.

Cheers.
Vasyl Yaremyn
Tracker Software Products
Project Developer

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: "Jumping" letters as a result of OCR

Post by Jensen Head »

Vasyl-Tracker Dev Team wrote: Tue Apr 05, 2022 7:02 pm It looks like you OCRed that Original.pdf with 'Detect skew.. ' and 'Fix skew..' options enabled. As result - the images on the pages were slightly clock-wise rotated.
You are right. Since the discussion «"Create PDF from Images": "Deskew" and "Detect skew of image"» (https://forum.pdf-xchange.com/viewtopic.php?f=70&t=37563), I always enable both checkboxes: "Detect skew of page content" [1] and "Fix content skew and incorrect page rotation" [2].

[1] Enable this option to detect random skew of content on scanned page to straighten it up before the recognition. It may significantly improve the quality of recognition result.
[2] Enable this option to automatically straighten skewed or incorrectly rotated scanned pages. This option is associated with the corresponding recognition options for automatic detection of rotations and skews.
Vasyl-Tracker Dev Team wrote: Tue Apr 05, 2022 7:02 pm Everything is ok with that, except the fact that Chrome/Edge incorrectly displays such rotated images, for unknown reasons.
I updated my first post with tests on other computers, and I can assume that the display bug in question is related to the Chromium browser engine.
Vasyl-Tracker Dev Team wrote: Tue Apr 05, 2022 7:02 pm However, I can suggest one good workaround for this: you can disable the 'Fix skew..' option and keep the 'Detect skew..' option enabled. In this case the OCR will be able to correctly recognize text on slightly rotated scans, but will not correct such rotations.
Unfortunately, in our case that leaves a choice between two kinds of sloppy-looking document: one with more than one or two percent skew, and one with random vertical letter misalignment. In addition, our small study of the problem with you does not answer the question, how did your competitors achieve the display of a document with corrected skew without the bug discussed? And how can this method of correcting the skew be implemented in your application package?
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: "Jumping" letters as a result of OCR

Post by Tracker Supp-Stefan »

Hello Jensen Head,

The issue here is that Chrome does not handle our files correctly. Why others produce documents that are handled better, I cannot comment. However, the files that we create follow the PDF specification as they should, and the fact that Chrome cannot handle them properly is something you will need to discuss with Google, asking them to implement fixes at their end. It might be that they do not correctly interpret the way we specify the coordinates for those objects, but it is still a proper and valid PDF file, and all other PDF tools render it correctly. We have offered you a workaround for an issue with third-party software, but if that is not your desired outcome, please do get in touch with them and ask them to investigate further.

Kind regards,
Stefan
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: "Jumping" letters as a result of OCR

Post by Vasyl-Tracker Dev Team »

Hi Jensen.
...our small study of the problem with you does not answer the question, how did your competitors achieve the display of a document with corrected skew without the bug discussed?
The explanation is as follows: in this particular case the page contains not just one big image but a bunch of them, overlapped and not. When our app deskews such a page's content, it just rotates the whole content by the calculated angle. Some other apps rotate the content the same way we do, and then flatten all the original images into one big image that replaces them in the content. The problem is that this approach typically increases the size of the resulting PDF file (and may sometimes reduce the quality of the images).

For example, after using Adobe Acrobat's OCR the size of your Original.pdf was increased significantly:

Original.pdf - 1.14 MB
Original_PXEditorOCR.pdf - 1.23 MB (+8%)
Original_AcroOCR.pdf - 2.83 MB (+148%)
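
The percentages above follow directly from the listed sizes. A minimal Python check, assuming the sizes are exactly as quoted (the function name `percent_increase` is only illustrative):

```python
def percent_increase(original_mb: float, new_mb: float) -> float:
    """Return how much larger new_mb is than original_mb, in percent."""
    return (new_mb - original_mb) / original_mb * 100.0


original = 1.14  # Original.pdf, in MB, as quoted above
results = [("Original_PXEditorOCR.pdf", 1.23), ("Original_AcroOCR.pdf", 2.83)]

for name, size in results:
    # Rounded to whole percent, as in the list above.
    print(f"{name} - {size} MB ({percent_increase(original, size):+.0f}%)")
```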

As far as I can see, the Chromium PDF viewer has problems displaying slightly rotated images: it seems that their display position is not calculated correctly. When the content has only one rotated image, this problem is hard to see. But when it has a group (or groups) of images with visual "relationships" between them, any mistake in their display positioning, even a minor one, causes easily visible defects...
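
To illustrate why a group of rotated images is more sensitive than a single one, here is a small Python sketch. It is a hypothetical model only, not Chromium's actual rendering code: it assumes a renderer that snaps each rotated tile's origin to a whole pixel independently. The tile layout, the 1 degree angle, and the names `rotate` and `tile_errors` are assumptions made for the example.

```python
import math


def rotate(x, y, angle_deg):
    """Rotate the point (x, y) about the origin by angle_deg degrees."""
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def tile_errors(tile_origins, angle_deg, scale):
    """Vertical error per tile if each rotated origin is snapped to a whole pixel.

    This snapping rule is an assumed, simplified renderer behaviour.
    """
    errors = []
    for x, y in tile_origins:
        _, exact_y = rotate(x * scale, y * scale, angle_deg)
        errors.append(round(exact_y) - exact_y)
    return errors


# Hypothetical layout: five small image tiles along one line of text.
tiles = [(i * 40.0, 0.0) for i in range(5)]

for scale in (1.0, 1.5):
    errs = tile_errors(tiles, angle_deg=1.0, scale=scale)
    print(f"zoom {scale}: " + ", ".join(f"{e:+.2f}" for e in errs))
```

In this model the rounding error differs from tile to tile and changes with the zoom factor, which would be consistent with neighbouring letters appearing to jump and with the defect not showing at every zoom level. Whether Chromium does anything like this is not established in this thread.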

As another potential workaround, you may use the sequence DeskewPages+RasterizePages+OCR(without deskewing). In PDF-Tools it is easy to automate this job. It will work well, but as a side effect it may increase the processing time and the size of the file.

But in general it is better to ask the Chromium devs to fix this problem in their embedded PDF viewer, because its behavior there is clearly incorrect...

Cheers.
Vasyl Yaremyn
Tracker Software Products
Project Developer

DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: "Jumping" letters as a result of OCR

Post by DIV »

Great explanation, Vasyl! :-)
Vasyl-Tracker Dev Team wrote: Wed Apr 06, 2022 6:59 pm in this particular case the page contains not just one big image but a bunch of them, overlapped and not.
This is something that I have felt slightly uneasy about, since I first realised it. But the substantial reduction in file size that usually results is decisive for me. (Until/Unless I subsequently run into a horrible side-effect ...at which point it's likely 'too late' — can't go back in time.)
Vasyl-Tracker Dev Team wrote: Wed Apr 06, 2022 6:59 pm As another potential workaround, you may use the sequence DeskewPages+RasterizePages+OCR(without deskewing). In PDF-Tools is easy to automate this job. It will work well but [...].
Just wondering, must this workaround follow the sequence
DeskewPages+RasterizePages+OCR(without deskewing)
or would
RasterizePages+DeskewPages+OCR(without deskewing)
be just as good? Or potentially slightly faster?

—DIV
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: "Jumping" letters as a result of OCR

Post by Tracker Supp-Stefan »

Hello DIV,

I would recommend testing both options. If you rasterize first, the file size might be slightly bigger, though the files should then still be handled correctly by Chromium's engine. As long as you do not do the deskew at the OCR step, you should be good.

If you have an Editor Plus/Pro license, then the EOCR with "Fine Page content" should create even smaller files, as it will replace the image pixels with actual text. Have you tried that option as well?

Kind regards,
Stefan
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: "Jumping" letters as a result of OCR

Post by DIV »

Thanks, Stefan.

These are the settings I have tended towards
image.png
I don't deskew the output image, as — in the files I've been dealing with — the rotations are small enough that overlaid (invisible) text will still be pretty well aligned with the image; but if I deskew, then I found there was a (small) loss in image quality.

Cutting file size is not my top priority: I will go for the smallest file size that can deliver good visual fidelity with the original resource.
Also, I really don't want to bother spending my time to fix OCR errors (even if they rarely occur).

—DIV
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: "Jumping" letters as a result of OCR

Post by Tracker Supp-Stefan »

Hello DIV,

In your previous post you said "... reduction in file size that usually results is decisive for me." so I assumed that this is the most important element of the process!

Regarding your screenshot: I was suggesting that you try the "Editable Text and Images" or "Fine Page Content" options:
image.png
However, it seems that you do want to preserve the original image and display it as it was scanned, so those options are unlikely to appeal to you (though the file size will be smaller).

Kind regards,
Stefan