I can't OCR a file

Post by **mmoni** » Sat Jun 19, 2021 8:57 pm

I have a file which I have tried repeatedly to OCR, but somehow it doesn't work. The OCR seems to progress as usual, but after it has finished I still cannot use the "find a word" function, as if nothing had changed. What could be wrong? AFAIK this is a problem only with this one file (which comes from an e-book, so it shouldn't need OCR in the first place).

Post by **Tracker Supp-Stefan** » Mon Jun 21, 2021 1:23 pm

Hello mmoni,

Can you please send us the file in question to support@pdf-xchange.com so that we can take a look at it from our end?
Also - are you using the Default or Enhanced OCR engine?

Kind regards,
Stefan

Post by **mmoni** » Mon Jun 21, 2021 11:32 pm

I've tried both default and enhanced. I'll send you the problematic file in a moment.

Post by **TrackerSupp-Daniel** » Mon Jun 21, 2021 11:50 pm

Hi, mmoni

This document already contains text content, so the OCR process skips it in order to avoid creating multiple overlapping text layers.

The reason that searching does not work is that the text in this file (and particularly the font information it uses) is corrupted. This means that what you see on the page visually does not reflect the actual data in the document. As a comparison, this is what happens when copying the existing text to notepad:

This gibberish of broken characters is what every application sees when you try to search for text within the file, which is why your search returns no results. Meanwhile, the presence of "text" content, broken or otherwise, prevents the OCR function from placing any new text in those positions.
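
For anyone who wants to check a file for this kind of problem outside the Editor, here is a minimal sketch in Python. It is not part of PDF-XChange: the function names `garbled_ratio` and `looks_garbled` and the 0.3 threshold are made up for illustration, and it assumes you have already copied or extracted the page text into a string by some other means. It only counts characters that are obviously broken (private-use, unassigned, control, or replacement characters), so a font that maps to valid but wrong letters would not be caught.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like broken glyph mappings."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    suspicious = 0
    for ch in visible:
        category = unicodedata.category(ch)
        # Co = private use, Cn = unassigned, Cc = control character;
        # U+FFFD is the Unicode replacement character.
        if category in {"Co", "Cn", "Cc"} or ch == "\ufffd":
            suspicious += 1
    return suspicious / len(visible)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical rule of thumb: flag text when too many characters are suspicious."""
    return garbled_ratio(text) > threshold


if __name__ == "__main__":
    sample = "\ue001\ue002\ue003 \ufffd\x01"  # stand-in for text pasted from a broken page
    print(looks_garbled(sample))
    print(looks_garbled("Plain readable text"))
```

If the text copied from a page is flagged this way, that page is a candidate for the rasterize-then-OCR approach rather than a plain OCR run.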

To rectify this, you will need to "rasterize" the entire document, converting all existing text and other types of content into a single image per page. The "rasterize pages" tool is located on the Convert tab. Once the rasterization is complete, you can run OCR, and you will find that new OCR'd text is added to the document and can be searched. Do note that OCR is not perfect, and it is possible that there will be some errors in text generated this way. If at all possible, it may be better to try to locate an original version of this document that does not have corrupted font information to begin with.

Kind regards,

Post by **PHK** » Sun Aug 08, 2021 5:47 pm

Daniel, why is your recommendation "To rectify this, you will need to "rasterize" the entire document..." when the problem might affect only some pages, which could be fixed by rasterizing just those pages?

Post by **Tracker Supp-Stefan** » Mon Aug 09, 2021 7:59 am

Hello PHK,

If it's only a few pages you have issues with, then by all means rasterize only those pages.

Alternatively, you can open the content pane in the Editor and, for the affected page(s), remove the existing gibberish OCR text while leaving the original image as it is. Avoiding re-rasterizing the content will save you some image quality.

Kind regards,
Stefan
