I can't OCR a file
I have a file that I have tried repeatedly to OCR, but somehow it doesn't work. The OCR seems to progress as usual, but after it has finished I still cannot use the "find a word" function, as if nothing had changed. What could be wrong? As far as I know, this problem affects only this one file (which is from an e-book, so it shouldn't need OCR in the first place).
- Tracker Supp-Stefan
- Site Admin
- Posts: 17929
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: I can't OCR a file
Hello mmoni,
Can you please send us the file in question to support@pdf-xchange.com so that we can take a look at it from our end?
Also - are you using the Default or Enhanced OCR engine?
Kind regards,
Stefan
Re: I can't OCR a file
I've tried both default and enhanced. I'll send you the problematic file in a moment.
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: I can't OCR a file
Hi, mmoni
This document already contains text content, so the OCR process skips it in order to prevent multiple overlapping text layers.
Searching does not work because the text in this file (and particularly its font information) is corrupted. This means that what you see on the page does not reflect the actual data in the document. As a comparison, copying the existing text into Notepad produces a gibberish of broken characters. That gibberish is what every application sees when you search for text within the file, which is why your search returns no results. Meanwhile, the presence of "text" content, broken or otherwise, prevents the OCR function from placing any new text in those positions.
To rectify this, you will need to "rasterize" the entire document, converting all existing text and other types of content into a single image per page. The "rasterize pages" tool is located on the Convert tab. Once rasterization is complete, you can run OCR, and the new OCR'd text will be added to the document and can be searched. Do note that OCR is not perfect, and text generated this way may contain some errors. If at all possible, it may be better to locate an original version of this document whose font information is not corrupted to begin with.
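If you want a quick way to confirm that the copied text really is gibberish before rasterizing, a rough check along these lines may help. This is only a sketch and not part of PDF-XChange Editor: the function names `suspicious_ratio` and `looks_corrupted` and the 0.3 threshold are assumptions made for illustration, and the text is assumed to have been copied or extracted from the page by some other means.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are unlikely in normal prose."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # No text at all: nothing to judge (the page may simply have no text layer).
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
        # U+FFFD is the Unicode replacement character.
        if category in ("Cc", "Co", "Cn", "Cs") or c == "\ufffd":
            bad += 1
    return bad / len(chars)


def looks_corrupted(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: flag text when too many characters look broken."""
    return suspicious_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_corrupted("The quick brown fox"))
    print(looks_corrupted("\ue001\ue002\x01\x02 ab"))
```

Keep in mind that this only catches control, private-use and replacement characters. A broken font mapping can also produce ordinary-looking but wrong letters, which this check will not notice, so searching for a word you can clearly see on the page remains the more reliable test.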
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Re: I can't OCR a file
Daniel, why is your recommendation "To rectify this, you will need to "rasterize" the entire document..." when the problem might affect only some pages, which could be fixed by rasterizing just those targeted pages?
All best,
FringePhil
- Tracker Supp-Stefan
- Site Admin
- Posts: 17929
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: I can't OCR a file
Hello PHK,
If it's only a few pages you have issues with, then by all means rasterize only those pages.
Alternatively, you can open the content pane in the Editor and, for the affected page(s), just remove the existing gibberish text and leave the original image as it is. Avoiding re-rasterizing the content will save you some image quality.
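To work out which pages are affected, one option is to note a word you can clearly see on each page and check whether it appears in the text copied from that page. The sketch below assumes you have already gathered that text yourself. The names `pages_to_fix` and `as_page_ranges` are hypothetical helpers, not Editor features, and the "2-4, 7" output is just a compact way to write the page list down.

```python
def pages_to_fix(extracted, visible_words):
    """Return page numbers whose extracted text lacks a word visible on that page.

    extracted:     dict of page number -> text copied from that page
    visible_words: dict of page number -> a word you can see on that page
    """
    flagged = []
    for page, word in sorted(visible_words.items()):
        text = extracted.get(page, "")
        if word.lower() not in text.lower():
            flagged.append(page)
    return flagged


def as_page_ranges(pages):
    """Collapse page numbers into a compact string of ranges."""
    ranges = []
    for p in sorted(set(pages)):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1][1] = p  # extend the current run of consecutive pages
        else:
            ranges.append([p, p])  # start a new run
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    extracted = {1: "Hello world", 2: "\ue001\ue002", 3: "\ue003", 4: "Chapter two"}
    visible = {1: "hello", 2: "preface", 3: "contents", 4: "chapter"}
    print(as_page_ranges(pages_to_fix(extracted, visible)))
```

Pages that pass the check can be left alone, which avoids the loss of image quality that comes with rasterizing them.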
Kind regards,
Stefan