I can't OCR a file

Forum for the PDF-XChange Editor - Free and Licensed Versions

mmoni
User
Posts: 16
Joined: Sun Jul 20, 2008 9:57 am

I can't OCR a file

Post by mmoni »

I have a file that I have tried repeatedly to OCR, but it doesn't work. The OCR seems to progress as usual, but after it finishes I still cannot use the "find a word" function, as if nothing changed. What could be wrong? AFAIK this is a problem only with this one file (which is from an e-book, so it shouldn't need OCR in the first place).
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: I can't OCR a file

Post by Tracker Supp-Stefan »

Hello mmoni,

Can you please send us the file in question to support@pdf-xchange.com so that we can take a look at it from our end?
Also - are you using the Default or Enhanced OCR engine?

Kind regards,
Stefan
mmoni
User
Posts: 16
Joined: Sun Jul 20, 2008 9:57 am

Re: I can't OCR a file

Post by mmoni »

I've tried both default and enhanced. I'll send you the problematic file in a moment.
TrackerSupp-Daniel
Site Admin
Posts: 8556
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can't OCR a file

Post by TrackerSupp-Daniel »

Hi, mmoni

This document already contains text content, so the OCR process skips it to prevent multiple overlapping text layers.

Searching does not work because the text in this file (and particularly the font information it uses) is corrupted. This means that what you see on the page does not reflect the actual text data in the document. As a comparison, this is what happens when copying the existing text to Notepad:
[image.png: screenshot of the copied text pasted into Notepad]
This gibberish of broken characters is what every application sees when you search for text within the file, which is why your search returns no results. Meanwhile, the presence of "text" content, broken or otherwise, prevents the OCR function from placing any new text in those positions.
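
As a rough illustration of the point above, here is a minimal Python sketch of how one might check whether text copied out of a PDF looks corrupted. It is only a heuristic under stated assumptions: the function name `looks_corrupted` and the 30% threshold are hypothetical, and it only counts private-use, unassigned, control, and replacement characters. A broken font mapping can also produce ordinary-looking but wrong letters, which this check would not catch.

```python
import unicodedata


def looks_corrupted(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: True if many visible characters are not normal text.

    `threshold` is an assumed cut-off, not a value from any PDF tool.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False  # nothing to judge, e.g. a page with no text layer

    suspicious = 0
    for ch in visible:
        category = unicodedata.category(ch)
        # Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
        # U+FFFD is the Unicode replacement character.
        if category in {"Co", "Cn", "Cc", "Cs"} or ch == "\ufffd":
            suspicious += 1

    return suspicious / len(visible) >= threshold
```

Pasting a sample of the copied text into `looks_corrupted(...)` gives a quick yes/no hint before deciding whether a page needs to be rasterized and re-OCR'd.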

To rectify this, you will need to "rasterize" the entire document, converting all existing text and other content into a single image per page. The "Rasterize Pages" tool is located on the Convert tab. Once rasterization is complete, you can run OCR, and the new OCR'd text added to the document will be searchable. Do note that OCR is not perfect, so text generated this way may contain some errors. If at all possible, it may be better to locate an original version of this document that does not have corrupted font information to begin with.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our website domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
PHK
User
Posts: 938
Joined: Tue Nov 24, 2020 4:02 pm

Re: I can't OCR a file

Post by PHK »

Daniel, why do you recommend "To rectify this, you will need to "rasterize" the entire document..." when the problem might affect only some pages, which could be fixed by rasterizing just those targeted pages?
All best,

FringePhil
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: I can't OCR a file

Post by Tracker Supp-Stefan »

Hello PHK,

If it's only a few pages you have issues with - by all means - please only do the rasterization on those pages.

Alternatively - you can open the Content pane in the Editor and, for the affected page(s), just remove the existing gibberish OCR text and leave the original image as it is. Avoiding re-rasterizing the content will save you some image quality :)
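
For anyone narrowing the work to only the affected pages, a small Python sketch like the one below can turn a list of problem page numbers into a compact page-range string. The helper name `to_page_ranges` is hypothetical, and the "2-4, 7" style is an assumption about what a page-range field accepts, so check the dialog you are using before pasting the result in.

```python
def to_page_ranges(pages: list[int]) -> str:
    """Collapse 1-based page numbers into a string such as '2-4, 7'."""
    ordered = sorted(set(pages))  # drop duplicates, sort ascending
    if not ordered:
        return ""

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        # The run ended: record it as a single page or as start-end.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")

    return ", ".join(parts)
```

For example, if pages 7, 2, 3 and 4 are the ones with broken text, `to_page_ranges([7, 2, 3, 4])` groups them into one run plus a single page, so only those pages need rasterizing and OCR.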

Kind regards,
Stefan