OCR engine not detecting text and therefore not recognizing it

Forum for the PDF-XChange Editor - Free and Licensed Versions

JohnCLeBlanc
User
Posts: 18
Joined: Tue Jul 01, 2014 1:02 am

OCR engine not detecting text and therefore not recognizing it

Post by JohnCLeBlanc »

Hi,
"Test OCR -- BEFORE.pdf" is an extract from a government publication that contains both text and images but is not searchable. I ran OCR (enhanced) and saved it as "Test OCR -- SEARCHABLE IMAGES.pdf".

Unfortunately, only the text in the headers/footers and graphics is searchable. This can be shown by searching for "2": it only shows up in the footers and figures.

Can the OCR engine be configured to convert the main text in the document?

btw, is there a way to tell if a pdf is searchable other than by trying to search for something? It would be nice to know at a glance whether or not I have to run OCR on it.
Attachments
Test OCR -- SEARCHABLE IMAGES.pdf
(283.41 KiB) Downloaded 61 times
Test OCR -- BEFORE.pdf
(276.25 KiB) Downloaded 57 times
Willy Van Nuffel
User
Posts: 2347
Joined: Wed Jan 18, 2006 12:10 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by Willy Van Nuffel »

To me it seems like there is something wrong with the embedded Cambria fonts in the document.
When you select text and copy this to (for example) Notepad, it is unreadable.
When you look at the text via the Content pane, the text seems to be missing.
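The symptom described above (copied text that is unreadable in Notepad) can be approximated with a rough check. This is only a sketch, not something PDF-XChange provides: the function names and the 0.7 threshold are assumptions, and the text is assumed to have been obtained some other way, for example pasted from the clipboard. It would miss garbling that happens to consist of ordinary letters.

```python
def readable_share(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c.isalnum() for c in visible) / len(visible)


def looks_garbled(text: str, threshold: float = 0.7) -> bool:
    """Guess whether copied text is unreadable.

    Empty text is reported as not garbled: there is simply no text.
    The threshold is an arbitrary assumption, not a tuned value.
    """
    if not text.strip():
        return False
    return readable_share(text) < threshold


if __name__ == "__main__":
    # Control characters, U+FFFD and private-use code points are typical
    # of text copied from a font with a broken character mapping.
    print(looks_garbled("\x01\x02\x03\ufffd\ue000\ue001"))
    print(looks_garbled("The quick brown fox jumps over the lazy dog."))
```

Ordinary prose scores well above the threshold because punctuation is only a small share of its characters.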

Can you check with which software-application the faulty PDF has been made?
I suppose the original file (from which you made the extract) was not made with PDF-XChange?

A possible work-around for now, to make it "searchable", is to print the PDF 'as image' to a new PDF and then run OCR onto it (see attached file). However, take care, it will not be possible to directly edit it anymore.

To see whether a PDF is only images and needs to be OCR'ed, look in the Document Properties, under Fonts.
If there are no fonts at all, then the PDF is purely image and is not searchable.
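For checking many files at once, the same idea can be roughed out outside the editor. The sketch below is an assumption-laden heuristic, not how PDF-XChange builds its Fonts list: it only scans the raw bytes of the file for the `/Font` name. A PDF that stores its objects in compressed object streams can contain fonts without the name being visible this way, so a negative result is only a hint. And, as this thread shows, the presence of fonts does not guarantee that the text is readable.

```python
import re
from pathlib import Path

# Match "/Font" only as a complete PDF name, so that "/FontDescriptor"
# or "/FontFile2" do not count on their own.
_FONT_NAME = re.compile(rb"/Font(?![A-Za-z0-9])")


def mentions_fonts(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF data contains a /Font resource name."""
    return _FONT_NAME.search(pdf_bytes) is not None


def fonts_mentioned_in_file(path: str) -> bool:
    """Convenience wrapper (hypothetical helper) that reads the file first."""
    return mentions_fonts(Path(path).read_bytes())


if __name__ == "__main__":
    page_with_text = b"<< /Type /Page /Resources << /Font << /F1 5 0 R >> >> >>"
    image_only = b"<< /Type /XObject /Subtype /Image /Width 100 >>"
    print(mentions_fonts(page_with_text), mentions_fonts(image_only))
```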

The problem with your current PDF is rather a question for Tracker Support.
Attachments
Test OCR -- AFTER.pdf
(3.01 MiB) Downloaded 46 times
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

Willy Van Nuffel wrote: Wed Oct 06, 2021 6:26 pm ....A possible work-around for now, to make it "searchable", is to print the PDF 'as image' to a new PDF and then run OCR onto it...
This is the technique I find most useful. To elaborate on Willy's suggestion, there is a tick box towards the lower left of the print dialog labelled "Print as Image". Click the box to its left to show a tick mark and proceed.

It seems to me the problem arises from the way the source document is published, not a shortcoming of this software. Some publishers have the wrong-headed notion that they should "protect" their output which renders the text unsearchable.
All best,

FringePhil
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: OCR engine not detecting text and therefore not recognizing it

Post by Dimitar - Tracker Supp »

Hello JohnCLeBlanc,

It seems that you have used an older version of the product when you converted this file.

Could you please install the latest version of the PDF Editor and then try again to convert the file?

Also, the capabilities of the Default OCR tool are limited so my suggestion is to use the Enhanced OCR, which in version 9 products is based on ABBYY's OCR engine which is one of the best OCR applications out there.
JohnCLeBlanc wrote: btw, is there a way to tell if a pdf is searchable other than by trying to search for something? It would be nice to know at a glance whether or not I have to run OCR on it.
The easiest way is to try to select the text. If it is selectable, this means that it is already converted or the document originally has searchable/editable text.

Regards.
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

Dimitar - Tracker Supp wrote: Thu Oct 07, 2021 12:06 pm Hello JohnCLeBlanc,

It seems that you have used an older version of the product when you converted this file....
My version is bang up to date and I know exactly what JohnCLeBlanc is experiencing, as my experiences are similar. As I wrote above, I do not believe it is a PDF-XChange issue but rather a document formatting issue.
All best,

FringePhil
JohnCLeBlanc
User
Posts: 18
Joined: Tue Jul 01, 2014 1:02 am

Re: OCR engine not detecting text and therefore not recognizing it

Post by JohnCLeBlanc »

Illuminating responses, thanks. I looked at the original full document (attached) in case Tracker Support wants to see why my enhanced OCR scan failed. The document properties show no security and the list of fonts appears to be okay, but clearly there is a problem with how the embedded font was interpreted. This document is from the Tanzanian Ministry of Health and was created 5 years ago. There is no way to find out how it was created.

I updated to the latest version of PDF-XChange and tried it again. Same result. btw, I'm using the enhanced OCR engine, not the basic one.

I had assumed that OCR engine would treat the text as an image but perhaps it got fooled because there are characters and fonts; they're just unreadable to humans :-) Is that why this document was not treated as images? Shouldn't the OCR engine at least prompt the unsuspecting user that the document has an uninterpretable font and then offer to treat the pages as images? I had no idea I'd have to do an extra step to save a non-searchable document as a separate image file and then do OCR.
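The kind of prompt asked for here could rest on a small decision rule. The sketch below is purely illustrative and does not describe how the OCR engine actually decides: `has_fonts` is assumed to come from the document's font list, `readable_share` is a hypothetical measure (the fraction of extracted characters that are letters or digits), and the 0.7 threshold is arbitrary.

```python
from enum import Enum


class OcrPlan(Enum):
    SKIP = "text already searchable, no OCR needed"
    OCR_DIRECTLY = "image-only pages, run OCR as usual"
    RASTERIZE_THEN_OCR = "unreadable text layer, convert pages to images first"


def choose_plan(has_fonts: bool, readable_share: float,
                threshold: float = 0.7) -> OcrPlan:
    """Pick a course of action for one document.

    No fonts at all means the pages are pure images. Fonts with mostly
    readable text means nothing needs doing. Fonts with unreadable text
    is the case from this thread: warn the user and offer to treat the
    pages as images.
    """
    if not has_fonts:
        return OcrPlan.OCR_DIRECTLY
    if readable_share >= threshold:
        return OcrPlan.SKIP
    return OcrPlan.RASTERIZE_THEN_OCR


if __name__ == "__main__":
    # Fonts present but the extracted text is almost entirely unreadable.
    print(choose_plan(has_fonts=True, readable_share=0.05).value)
```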

Couple of questions that have arisen out of this helpful discussion:
1. Even though I clicked "Notify" to be told when this topic is updated, I'm not getting any notifications. I'm having to log on and check manually. What do I need to do to get an email notification that the topic has been updated?

2. My version of PDF-XChange was out of date. Is there a way to be prompted when an update is available? All I see in Preferences is "Check for updates now".

Thanks!
Attachments
Tanzania Sharpened One Plan -- Tz RMNCH Plan 2014 to 2015.pdf
(2.99 MiB) Downloaded 42 times
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

JohnCLeBlanc wrote: Fri Oct 08, 2021 2:04 pm... Shouldn't the OCR engine at least prompt the unsuspecting user that the document has an uninterpretable font and then offer to treat the pages as images? ...
Fair point. Also, it would be nice to have a one-click function that did the conversion (print to PDF->OCR->save as new file) for these awkward files.
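The shape of such a one-click function might look like the sketch below. It is not a PDF-XChange feature or API: `rasterize` and `ocr` are hypothetical callables standing in for whatever actually performs those steps, and the output file name is an assumption modelled on the file names used in this thread. The source file is never modified.

```python
import tempfile
from pathlib import Path
from typing import Callable

# A step reads the PDF at the first path and writes its result to the second.
Step = Callable[[Path, Path], None]


def make_searchable_copy(source: Path, rasterize: Step, ocr: Step) -> Path:
    """Run rasterize -> OCR and save the result as a new file next to the source."""
    target = source.with_name(f"{source.stem} -- SEARCHABLE{source.suffix}")
    with tempfile.TemporaryDirectory() as tmp:
        image_only = Path(tmp) / "image-only.pdf"
        rasterize(source, image_only)  # stand-in for printing or rasterizing to images
        ocr(image_only, target)        # stand-in for running OCR and saving
    return target
```

Passing the two steps in as parameters keeps the sketch independent of any particular tool; the intermediate image-only file lives in a temporary directory and is discarded afterwards.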

Interesting specimen file submitted by OP. See https://1drv.ms/b/s!AsBVtdXZAI89j8AtDA6PBEYMC9deCA?e=vX842A (too large for a simple attachment) for the converted equivalent.
All best,

FringePhil
mCHSNUg5Pz8cPap
User
Posts: 124
Joined: Wed Aug 04, 2021 6:36 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by mCHSNUg5Pz8cPap »

John LeBlanc,
I OCR'd your test document and it worked fine. I think what you need to do is make sure that you uncheck the box for "Ignore existing text on page." I did that and had no problems.
Screenshot.png
The other alternative is to rasterize the document--i.e., convert it into an image--and then OCR the image. You can rasterize the document with a press of a button ("Rasterize Pages" button in the Convert menu) so it's a little easier than printing as an image. That said, I don't know why you would do this when you can just uncheck the box so it does not ignore existing text on the page.
Features I really want:
1. Fully customizable toolbars: https://forum.tracker-software.com/viewtopic.php?p=167585
2. Sanitize documents w/o being forced into save as dialog (Acrobat has this!): https://forum.tracker-software.com/viewtopic.php?p=156130
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

mCHSNUg5Pz8cPap wrote: Sat Oct 09, 2021 12:50 am John LeBlanc,
I OCR'd your test document and it worked fine. I think what you need to do is make sure that you uncheck the box for "Ignore existing text on page." I did that and had no problems.

The other alternative is to rasterize the document--i.e., convert it into an image--and then OCR the image. You can rasterize the document with a press of a button ("Rasterize Pages" button in the Convert menu) so it's a little easier than printing as an image. That said, I don't know why you would do this when you can just uncheck the box so it does not ignore existing text on the page.
My OCR box has the ignore text box ticked but that does make the document OCRable for me.

Your rasterizing pages suggestion works well if the security settings of the document permit it, as in the example of this thread. Otherwise, one must go the print route.
All best,

FringePhil
mCHSNUg5Pz8cPap
User
Posts: 124
Joined: Wed Aug 04, 2021 6:36 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by mCHSNUg5Pz8cPap »

PHK wrote: Sun Oct 10, 2021 1:26 am My OCR box has the ignore text box ticked but that does make the document OCRable for me.
Well, that is strange. I took the document from the first post titled "Test OCR -- BEFORE" and OCR'd it with the settings shown above. It produced the document below, which I can now easily search. It's strange that it doesn't work for you and the OP.
Test OCR -- AFTER.pdf
(304.19 KiB) Downloaded 40 times
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

mCHSNUg5Pz8cPap wrote: Mon Oct 11, 2021 7:10 pm

Well, that is strange. ..
No, sorry, my post must not have been sufficiently clear thereby causing your misunderstanding and I hope this helps.

I agree that rasterizing OP's file makes it OCR-able just as you posted. So, my results on that specimen file are the same as yours. But rasterizing will not work if a document's security settings do not permit it, which is not the case with the specimen file. If the file can be printed, though, it can be OCRed per the above.

PDF documents meant to be widely distributed are less likely to have security restrictions than perhaps others. For instance, one of my securities brokers produces my monthly statements that only I can access on their website (a very narrow distribution) with many document permissions set to "not allowed." Therefore, none of the apparent text is available as text in a PDF reader except the account number and who cares about that? That means that there is very little I can do with the statements except stare at them. I cannot delete empty pages, I cannot combine multiple monthly reports into a single file, I cannot OCR, I cannot do searches in the document, etc. And I cannot change the security settings without a password which, of course, they will not provide. I have queried them on the logic of all this but never got an answer. So, I have to print up a new PDF file with everything "allowed" before it is any use to me. These sorts of secured documents are obviously different from the specimen file of this thread.
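For anyone curious what sits behind such "not allowed" entries: in a PDF protected with the standard security handler, the permissions are stored as a single integer (the `/P` entry of the encryption dictionary) in which individual bits allow or deny operations. The sketch below only decodes such a value for display. The bit positions follow the PDF specification; the wording of the labels is mine, and how any given viewer presents them is an assumption. It does not read, change, or remove security on a file.

```python
# Bit positions in the /P value (bit 1 is the least significant bit).
PERMISSION_BITS = {
    "print": 3,
    "modify contents": 4,
    "copy or extract text": 5,
    "add or edit annotations": 6,
    "fill in form fields": 9,
    "extract for accessibility": 10,
    "assemble (insert, delete, rotate pages)": 11,
    "print in high quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map each permission name to True (allowed) or False (not allowed)."""
    return {name: bool(p & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}


if __name__ == "__main__":
    # -3904 has every permission bit cleared, so everything is "not allowed".
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name}: {'allowed' if allowed else 'not allowed'}")
```

The value is negative because the specification treats it as a signed 32-bit integer with the unused high bits set; Python's bitwise operators handle that without extra work.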
All best,

FringePhil
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by TrackerSupp-Daniel »

Hello, PHK

I am glad to see that the crux of the issue here seems to have been resolved through the above discussion. I did want to make one important note for you about printing to PDF though. As I am sure you have seen me mention in numerous other threads, this is never recommended, as the process of multiple conversions can damage the file in a way that is not immediately apparent, but leads to later corruption or data loss. Wherever possible, you should strive to avoid printing to PDF for this reason.

I am not sure if you have tried, but I would very highly recommend that you point this fact out to your brokers and request that, as you need it for your own purposes, they either provide you the password for editing, or send you an unlocked file to avoid this necessity.
We as a PDF software developer must respect the document security, and will not offer an option intended to circumvent it in any official capacity. You can do as you wish with files on your device, but we cannot help you remove the security.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

TrackerSupp-Daniel wrote: Tue Oct 12, 2021 11:15 pm ...We as a PDF software developer must respect the document security, and will not offer an option intended to circumvent it in any official capacity. You can do as you wish with files on your device, but we cannot help you remove the security.
...
I fully understand your position and I am totally sympathetic and not asking that you do anything along those lines.

As for corrupting files by printing them, I am not sure I track fully your point. First of all, printing a new document based on a source file does not in any way alter that source document. I don't think I could if I wanted to. What the "printing" seems to be doing is to create a new document that has much of the content of the source document without destroying, altering, or even saving the source document. What I do with this new semi-mirrored PDF file seems to be up to me from there on. I can keep the downloaded source file but why waste storage space as I can go back to the broker at a later date and download the file again, should I have the need?

Creating semi-mirrored PDFs from originals via virtual printing is one of the most useful things PDF-XChange Editor does for me.
All best,

FringePhil
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by TrackerSupp-Daniel »

Hello, PHK

The original would not be affected, but due to the process of converting the document to a physical printer friendly format, and then immediately back to PDF, in some cases, it is possible for specific content items to be damaged in the newly created version of the file. Not necessarily corrupted on the spot, but ever so slightly damaged in a way that later becomes a problem due to all the variables present in the PDF format.

In any case, if you ever need a copy of a file, the "save as" function is more than enough, and if you need to convert a page entirely to images, the "rasterize pages" function will suffice. With these two options, there should never be a reason, besides circumventing security options, for which printing a PDF directly back to PDF would be necessary (and frankly, the other two methods are simply faster, safer, and less impactful ways to accomplish the same thing).

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

I am missing something here, Daniel.

If I "copy as" a PDF file, it seems to bring along the security settings of the source document which do not allow many things including rasterizing pages so I do not see how that obviates the concerns you express.
All best,

FringePhil
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by TrackerSupp-Daniel »

Hello, PHK

I never said that saving or rasterizing would be the solution in this specific case, only that officially, I cannot endorse circumventing document security in any way. I then proceeded to offer a fair warning of the possible issues you may encounter if you choose to take the route you mentioned before.

There was nothing for you to miss, but I do think I may have overexplained and caused you to look too far into it, so my apologies for any confusion I may have caused.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

PHK
User
Posts: 896
Joined: Tue Nov 24, 2020 4:02 pm

Re: OCR engine not detecting text and therefore not recognizing it

Post by PHK »

No problem, Daniel.
All best,

FringePhil
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

OCR engine not detecting text and therefore not recognizing it

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
