OCR struggles to recognise "4" in this typewritten text

Forum for the PDF-XChange Editor - Free and Licensed Versions

sjm8-tf@aptile.co.uk
User
Posts: 6
Joined: Fri Nov 30, 2018 10:26 am

Post by sjm8-tf@aptile.co.uk »

Hello,

I've been comparing OCR results for scanned documents. In PDF-XChange Editor, the Enhanced Scanned Pages OCR is generally accurate, but one error stands out: it very frequently fails to recognise the number 4 in some typewritten pages from the 1980s. Page and chapter numbers, and text like "45 degrees", all suffer. (Also, the degree symbol seems to be recognised better when the resolution is lower!)

I've attached a sample page in case it helps you evaluate any future OCR changes you're making in PDF-XChange. Aligning the text first makes no difference, and it makes very little difference to the OCR whether I use the original greyscale or convert it to black and white.
TrackerSupp-Daniel
Site Admin
Posts: 8613
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR struggles to recognise "4" in this typewritten text

Post by TrackerSupp-Daniel »

Hello sjm8,

This appears to be due to the nonstandard format of the number. While the human eye can easily discern that it is the number 4, a computer is trained to consider all possibilities and then takes the one it thinks is most likely (in my tests I saw ")4", "h", "L", and "l+"). As our OCR developers are now focusing on our new OCR engine, I must inform you that this is unlikely to see much attention. When the new engine (which is much more robust) comes out, it should hopefully be able to interpret this correctly, and if not, we will certainly take another look at what is causing the issue.
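
To illustrate the "most likely candidate" idea in general terms, here is a minimal Python sketch. It is not the actual engine code: the function name and the confidence scores are made up for illustration, and only the candidate strings come from my tests.

```python
def pick_most_likely(candidates):
    """Return the candidate reading with the highest confidence score."""
    return max(candidates, key=candidates.get)


# Hypothetical confidence scores for one nonstandard typewritten "4".
# The candidate strings are the ones seen in testing; the numbers are invented.
scores = {"4": 0.21, ")4": 0.24, "h": 0.27, "L": 0.15, "l+": 0.13}

best = pick_most_likely(scores)
print(best)
```

With scores like these, the correct reading "4" is among the candidates but loses to a lookalike, which is the kind of result described above.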

For now, I apologize for the inconvenience.

Kind Regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com