I can OCR "fuzzy" text but can NOT OCR "clean" text

Forum for the PDF-XChange Editor - Free and Licensed Versions

DWC121
User
Posts: 44
Joined: Thu Jul 30, 2015 5:18 am

I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by DWC121 » Tue Aug 13, 2019 11:22 pm

Sigh...

Several months ago I had a jpg on my PC with text that was fuzzy. Today I opened it in PDF-XChange and saved it as a pdf. I realized I had not applied OCR so I re-opened the pdf and applied OCR. Naturally, since the text was not clear, the OCR results were terrible.

The jpg and the pdf are attached as "A - Can OCR.jpg/pdf". NOTE - I did not save the pdf with the OCR. I figured someone else could play with the files.

Since the text was fuzzy, I used another program (irfanview) to clean the text in the jpg. I opened the new jpg in PDF-XChange and I applied OCR. What puzzles me is no text was recognized. It did not even detect what it thought were odd looking characters and symbols. I've used irfanview before to clean up fuzzy text and to remove blemishes and had no problems applying OCR.

The jpg and the pdf are attached as "A - Can NOT OCR.jpg/pdf". NOTE - I did not save the pdf with the OCR. I figured someone else could play with the files.

The original file was printed on a dot-matrix printer. Below are the settings I used when I tried to apply OCR both times. What am I doing wrong?
[Screenshot of OCR settings: image.png]

Attachments:
- B - Can NOT OCR.pdf (551.51 KiB)
- B - Can NOT OCR.jpg
- A - Can OCR.pdf (564.42 KiB)
- A - Can OCR.jpg

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Tue Aug 13, 2019 11:50 pm

Hello DWC121,

Thank you for the report. I am seeing this here as well. Choosing High or Low accuracy seems to allow the OCR to work, although the results are not good.
Note that the OCR engine operates best with black-on-white text, so colored text or lighter shades of grey will cause it more difficulty. I have forwarded this to our Dev team for investigation and will get back to you once I hear from them.
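
To illustrate the black-on-white point, here is a minimal sketch of turning tinted or grey pixels into pure black and white before OCR. It is not what PDF-XChange or the OCR engine does internally. The function names, the BT.601 luma weights, and the fixed threshold of 128 are all assumptions chosen for illustration, and the image is modeled as a plain list of rows of (r, g, b) tuples.

```python
def to_grayscale(rgb):
    """Convert one (r, g, b) pixel (0-255 per channel) to a luminance value."""
    r, g, b = rgb
    # ITU-R BT.601 luma weights, a common choice for grayscale conversion.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255).

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    `threshold` is a hypothetical cut-off; real scans may need a different value.
    """
    return [
        [0 if to_grayscale(p) < threshold else 255 for p in row]
        for row in pixels
    ]


# A dark purple-tinted pixel and a pale bluish background pixel.
row = [(40, 30, 90), (230, 230, 240)]
print(binarize([row]))  # [[0, 255]]
```

A fixed threshold is the simplest option; an adaptive method would likely cope better with unevenly inked dot-matrix output.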

Kind regards,
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Wed Aug 14, 2019 12:28 am

After speaking with the Dev team we have created a ticket for this matter, as below:
RT#4873: EOCR cannot see "clean" text, works on "Fuzzy" text.

You can ask any member of our support team for an update on the progress; however, in the meantime I unfortunately do not have any workarounds to resolve this. As I mentioned before, changing the accuracy setting allows some text to be found, but the results make no sense. Changing the engine to the old OCR engine gives slightly better results, but nothing good enough to be considered useful.

Hopefully we can find a swift resolution for this. However, we will need to bring this to the LeadTools OCR team so they can resolve the issue in their plugin before we can implement the fix here, so it may be a while.


DWC121
User
Posts: 44
Joined: Thu Jul 30, 2015 5:18 am

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by DWC121 » Wed Aug 14, 2019 1:59 am

Daniel,

Thank you for your reply.

You stated...
"Note that the OCR engine operates best with black on white text, so any colored or less dark shades of grey will cause it to have more difficulties."

That is why I cleaned up the text in one of the jpg's. Most of the text was darker and crisper so I thought sure OCR would find something. No luck. Not even one letter was found. :(

The original jpg had a very wide variety of greys. OCR worked but as expected, a lot of strange characters and symbols were found.

By the way... you may have noticed I mis-labeled the files in the text of my posting. The "A" in "A - Can NOT OCR.jpg/pdf" should have been "B - Can NOT OCR.jpg/pdf"

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Wed Aug 14, 2019 5:51 pm

Hello DWC121,

This is certainly an odd case. Thankfully, we have reproduced it and are working with the LeadTools devs to have it resolved, hopefully sooner rather than later.

I did notice the discrepancy in your text there, but was able to figure out what you meant while I was testing and creating the ticket, so no worries there.

Have an excellent day!

Timur Born
User
Posts: 613
Joined: Tue Jun 26, 2012 1:50 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by Timur Born » Mon Aug 19, 2019 12:15 pm

This does not seem to be a problem of the OCR module, but of how XChange handles the PDF document.

- When you copy and paste the image inside the same document then XChange will at least try to OCR it, as in create a new document when you tell it to. There will be no OCR text, though, but only the image. Trying to paste the image onto a newly created empty page inside the same document does not improve things.

- When you extract the page into a new document then XChange will try to OCR it, but only creating an image document without OCR text, same as above.

- If you copy and paste the image into a new/different document then XChange will properly OCR it! So something is wrong with the PDF file or with how XChange handles the file.

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Mon Aug 19, 2019 6:37 pm

Hello Timur,

I was unable to get the image OCR'd when copying it to a new document, or alternative document where OCR has worked in the past. Could you please verify that you were testing with the file "B - Can not OCR"? In the case you were using the correct file, can I ask you to send us a copy of this new document (before and after) as well as your current OCR settings?

We are certain that the EOCR plugin is the root of the issue, as the old "Default" OCR module does work with this document, while only the New EOCR Plugin does not.


Timur Born
User
Posts: 613
Joined: Tue Jun 26, 2012 1:50 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by Timur Born » Tue Aug 20, 2019 9:46 am

I definitely used the "not" example and tried several different methods, including conversion to grayscale, because there is a blue/purple tint. I tried OCR with all options enabled and with all options disabled. Both Low and High work; Medium does not.

Once I copy and paste the image into a new document (CTRL-N) I also get OCR results for Medium. Now you wrote that this does not work for you, so there must be a difference between our setups. I found said difference: paper size. My document was created using A4, yours likely was created at original Letter size. And this makes all the difference between OCR Medium working or not working.

But there is more: I can make Letter size work, too. When I paste the image into Paint.Net and then copy and paste it back into XChange, XChange inserts the image as a stamp (no idea why). Once I flatten said stamp I can OCR the image.

And there is still more: when I do the stamp and flatten using Photoshop, I can control the image resolution in pixels/inch without changing the image size (pixels). It turns out that on a Letter page OCR Medium does not work for resolutions between 145 and 157 ppi. For A4 this "does not work" range changes to 147 - 154 ppi. I suspect that the original scan uses a ppi that works for A4 but not for Letter.
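
As a rough sketch of the ranges above (not anything XChange or the OCR plugin exposes), the check could be written like this. The ranges are the ones I measured; the function and dictionary names are made up for illustration, and treating the end points as inclusive is an assumption, since I did not test the exact boundaries.

```python
# "Does not work" ppi ranges for OCR accuracy Medium, as observed in this thread.
# Treating both end points as inclusive is an assumption.
REPORTED_DEAD_ZONE_PPI = {
    "Letter": (145, 157),
    "A4": (147, 154),
}


def ppi_from_size(pixels, inches):
    """Resolution of an image that spans `pixels` across `inches` on the page."""
    return pixels / inches


def medium_ocr_in_reported_dead_zone(ppi, page):
    """True if `ppi` falls inside the range where Medium found no text."""
    low, high = REPORTED_DEAD_ZONE_PPI[page]
    return low <= ppi <= high


# Hypothetical image: 1275 pixels wide, placed 8.5 inches wide.
print(ppi_from_size(1275, 8.5))                          # 150.0
print(medium_ocr_in_reported_dead_zone(146, "Letter"))   # True
print(medium_ocr_in_reported_dead_zone(146, "A4"))       # False
```

The 146 ppi case shows how the same image resolution can land inside the Letter range but outside the A4 range.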

(Yeah, I make a living analyzing complex software/hardware problems. Just tell me where to send the invoice. ;P)

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Tue Aug 20, 2019 5:04 pm

Hello Timur,
Thank you very much for this! You are correct that my default page was Letter, and in testing with A4 I am also able to find text with a Medium accuracy scan.
I have added this information to the Development ticket and informed our OCR Team Lead directly, but I cannot provide a timeline for the resolution at this time.

(Unfortunately I'm just your typical Tech Support guy, so I don't know where our invoices need to go :lol: Try again next time!)

Timur Born
User
Posts: 613
Joined: Tue Jun 26, 2012 1:50 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by Timur Born » Tue Aug 20, 2019 9:33 pm

Why are images from within XChange pasted as images, but outside ones are pasted as stamps?

And to clarify: The "does not work" ppi range means that any image with ppi higher or lower is handled properly by OCR Medium.

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Tue Aug 20, 2019 10:15 pm

Hi Timur,

When moving any content within the application, we include some extra information in the clipboard so that we can paste it into another document or page in our application without making any alterations whatsoever, as copying content into the clipboard can sometimes cause unexpected recompression and the like.
As for why content from outside is sometimes pasted as a stamp: when you copy something, the clipboard can hold multiple different instances of it, and during the paste process the Editor takes in as much information as it can. In some instances, you may copy from an application and find that, when pasted into the Editor, it comes out as a stamp because there are multiple content items within the clipboard information. By flattening this stamp, you can split apart the content. The image from Paint.Net behaves much the same way, although in that case there is only one content item, a single image.


Timur Born
User
Posts: 613
Joined: Tue Jun 26, 2012 1:50 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by Timur Born » Wed Aug 21, 2019 9:29 am

On a side-note: This thread does not appear under the "Your posts" listing. The last thread/post listed there is the "White image..." one from Aug 19th.

TrackerSupp-Daniel
Site Admin
Posts: 2421
Joined: Wed Jan 03, 2018 6:52 pm

Re: I can OCR "fuzzy" text but can NOT OCR "clean" text

Post by TrackerSupp-Daniel » Wed Aug 21, 2019 11:35 pm

Hmmm, curious.

I see this too, though I am unsure why it has happened. I will report this to our web devs and see what they think.

