highlight text consistency in PDFs with OCR layer text below image

Forum for the PDF-XChange Editor - Free and Licensed Versions

Kunjan
User
Posts: 6
Joined: Thu Nov 17, 2022 7:39 am

highlight text consistency in PDFs with OCR layer text below image

Post by Kunjan »

I would like to suggest a feature improvement.

Under certain circumstances, when highlighting text that has been OCR'd with the original image of the text showing on top, the highlight appears jagged. See the attachment. To improve readability, the highlight ought to be smooth. See the same attachment for how Xodo handles highlighting in the same document: a lot smoother. TY.
PDF X-Change Highlight Text Inconsistent.png
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: highlight text consistency in PDFs with OCR layer text below image

Post by TrackerSupp-Daniel »

Hello, Kunjan

Our Highlight feature is based on the text position and font height. Did you use our OCR process, or was this OCR done by a third-party app? I ask because our OCR function should generate a more consistent text block, instead of giving the text itself jagged positioning and size, which is what causes this issue.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Kunjan
User
Posts: 6
Joined: Thu Nov 17, 2022 7:39 am

Re: highlight text consistency in PDFs with OCR layer text below image

Post by Kunjan »

Hello Daniel

I believe it was OCR'd via FileCenter DMS software.

I ran the OCR function in PDF-XChange Editor and, yes, the highlighted text is no longer jagged. It's also a pleasant surprise that none of my previously highlighted text was "un-highlighted" in the process. Thank you.

It's interesting that Xodo managed, however. If you are curious, I am happy to send you a few pages.

Thanks again.
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: highlight text consistency in PDFs with OCR layer text below image

Post by TrackerSupp-Daniel »

Hello, Kunjan

It wouldn't hurt to have a sample to look at. Best-case scenario, the Devs may be able to find a way to keep our highlights a bit more consistent after seeing it.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Kunjan
User
Posts: 6
Joined: Thu Nov 17, 2022 7:39 am

Re: highlight text consistency in PDFs with OCR layer text below image

Post by Kunjan »

Hello Daniel

Please see attached. Regards, Kunjan
Sample PDF OCR with FileCenter.pdf
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: highlight text consistency in PDFs with OCR layer text below image

Post by TrackerSupp-Daniel »

Hello, Kunjan

Thank you for the files. The quality of the FileCenter OCR on this is, quite frankly, extremely bad. In this screenshot we have made the invisible "searchable" text visible on top of the scanned page for comparison:
image (2).png
The "off center" text you see in this screenshot is why the output is jagged, and unfortunately there is little we can do here. In the past, when we took Xodo's approach and normalized the highlight across an entire line of text, we saw complaints that this generalized too much. Many users (especially those in mathematics/science fields, and authors/editors) found that highlighting a line containing an equation or a publishing "drop cap", respectively, caused multiple lines of text to be highlighted. As such, it is not likely we will be able to change this without reverting that critical fix.
The best solution for you here is to re-do the OCR on the file with a more capable engine whenever you notice such issues (or possibly to avoid the FileCenter OCR entirely and use a more powerful engine from the start), to minimize how often you encounter this. I have passed this along to the Dev team and they are discussing it right now, but at this time it is not looking likely that any changes will be made. I just wanted to give you an honest answer instead of instilling false hope.
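
To illustrate the trade-off described above, here is a minimal Python sketch. It is not the actual implementation used by PDF-XChange or Xodo; the `Box` type, the function names, and all coordinates are hypothetical and chosen only for illustration. It compares one rectangle per word (which follows each word's own position and height, so badly placed OCR text looks jagged) with one rectangle normalized across the line (smooth, but stretched by a tall element such as a drop cap).

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box: (x0, y0) is one corner, (x1, y1) the opposite."""
    x0: float
    y0: float
    x1: float
    y1: float


def per_word_highlight(words):
    """One rectangle per word, following each word's own position and height."""
    return [Box(w.x0, w.y0, w.x1, w.y1) for w in words]


def line_normalized_highlight(words):
    """One rectangle for the whole selection, using the extreme extents."""
    return Box(
        min(w.x0 for w in words),
        min(w.y0 for w in words),
        max(w.x1 for w in words),
        max(w.y1 for w in words),
    )


# Made-up OCR output: three words on one line with inconsistent vertical placement.
words = [Box(10, 100, 40, 112), Box(44, 103, 80, 113), Box(84, 98, 120, 111)]

# A made-up drop cap at the start of the line that extends well below it.
drop_cap = Box(0, 76, 8, 113)

jagged = per_word_highlight(words)                          # three uneven rectangles
smooth = line_normalized_highlight(words)                   # one even rectangle
too_tall = line_normalized_highlight([drop_cap] + words)    # stretched by the drop cap
```

With these sample values, the normalized rectangle for the plain line is 15 units tall, while including the drop cap makes it 37 units tall, which is how a single-line selection can end up covering the lines below it.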

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Kunjan
User
Posts: 6
Joined: Thu Nov 17, 2022 7:39 am

Re: highlight text consistency in PDFs with OCR layer text below image

Post by Kunjan »

Hello Daniel

Wow, that is bad...!!! Thankfully it's easy to re-OCR using the Tracker engine without losing highlights.

Thank you for the extra information. It makes a big difference, as one relies on Windows indexing to index the text.

May I ask how you made the text visible?

Regards, Kunjan
Willy Van Nuffel
User
Posts: 2347
Joined: Wed Jan 18, 2006 12:10 pm

Re: highlight text consistency in PDFs with OCR layer text below image

Post by Willy Van Nuffel »

How to make the OCR'd text visible:

- Activate the Content-pane (via the View-ribbon > Panes > Content)
- In the Content-pane, in the toolbar, click Options > Select > Text
- Go into the Text/Arrange tab/ribbon that now appears
- Change the "Fill Color" (first icon in the ribbon) from 'none' to a color you like to see for the text
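
For readers curious about what may be happening underneath: many OCR tools hide the searchable text layer by drawing it in text rendering mode 3 (neither fill nor stroke), which is set with the `Tr` operator defined in the PDF specification. Whether this particular file does so is an assumption, and the steps above do not depend on it. The sketch below uses a hypothetical helper, `make_text_visible`, that works on an already decoded content stream given as bytes and switches mode 3 to mode 0 (fill). It is deliberately naive: it does not parse the stream, so it would also rewrite a matching byte sequence inside a string literal.

```python
import re

# Matches the operand 3 followed by the Tr operator, e.g. b"3 Tr".
# The word boundaries keep it from matching operands such as b"13 Tr".
_INVISIBLE_MODE = re.compile(rb"\b3(\s+)Tr\b")


def make_text_visible(content_stream: bytes) -> bytes:
    """Replace text rendering mode 3 (invisible) with mode 0 (fill)."""
    return _INVISIBLE_MODE.sub(rb"0\1Tr", content_stream)


# Made-up content stream fragment for illustration.
sample = b"BT 3 Tr /F1 12 Tf (hello) Tj ET"
visible = make_text_visible(sample)
```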

Kind regards.
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

highlight text consistency in PDFs with OCR layer text below image

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Kunjan
User
Posts: 6
Joined: Thu Nov 17, 2022 7:39 am

Re: highlight text consistency in PDFs with OCR layer text below image

Post by Kunjan »

TY Willy and Daniel
Paul - Tracker Supp
Site Admin
Posts: 6831
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

highlight text consistency in PDFs with OCR layer text below image

Post by Paul - Tracker Supp »

:)
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com