OCR struggles to recognise "4" in this typewritten text

Forum for the PDF-XChange Editor - Free and Licensed Versions

sjm8-tf@aptile.co.uk
User
Posts: 6
Joined: Fri Nov 30, 2018 10:26 am

Post by sjm8-tf@aptile.co.uk » Thu Dec 06, 2018 1:43 pm

Hello,

I've been comparing OCR results for scanned documents. In PDF-XChange Editor, the Enhanced Scanned Pages OCR is generally accurate, but one error that stands out is its very frequent failure to recognise the number 4 in some typewritten pages from the 1980s. Page and chapter numbers, and text like "45 degrees", all suffer. (Also, the degree symbol seems to be recognised better when the resolution is lower!)

I've attached a sample page in case it helps you evaluate any future OCR changes you make in PDF-XChange. It makes no difference if I align the text first, and very little difference to the OCR whether I use the original greyscale or convert it to black and white.
Attachment: Singmaster_Notes_P19_OCR_Test.pdf (52.04 KiB)

TrackerSupp-Daniel
Site Admin
Posts: 1803
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR struggles to recognise "4" in this typewritten text

Post by TrackerSupp-Daniel » Thu Dec 06, 2018 6:15 pm

Hello sjm8,

This appears to be due to the nonstandard format of the number. While the human eye can easily discern that it is the number 4, a computer is trained to consider all possibilities and then takes the one it thinks is most likely (in my tests I saw ")4", "h", "L", and "l+"). As our OCR developers are now focusing on our new OCR engine, I must inform you that this is unlikely to see much attention. When the new engine (which is much more robust) comes out, it should hopefully be able to interpret this correctly, and if not, we will certainly take another look at what is causing the issue.
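
To illustrate the "most likely candidate" step described above, here is a minimal Python sketch. It is not how the PDF-XChange engine works internally: the function `pick_candidate`, the scores, and the digit prior are all hypothetical, and only the candidate strings ("h", "L", "l+", "4") come from the tests mentioned in this post. It shows how a recogniser that simply takes the top-scoring guess can miss the 4, and how a context hint for numeric text could change the outcome.

```python
def pick_candidate(scores, prior=None):
    """Return the candidate whose score, weighted by an optional prior, is highest."""
    prior = prior or {}
    return max(scores, key=lambda c: scores[c] * prior.get(c, 1.0))


# Hypothetical per-glyph scores; the numbers are invented for illustration only.
glyph_scores = {"h": 0.31, "L": 0.27, "4": 0.24, "l+": 0.18}

# With no context, the top raw score wins, so the 4 is misread.
no_context = pick_candidate(glyph_scores)

# Assumed prior for numeric context (page numbers, "45 degrees"): digits count double.
digit_prior = {"4": 2.0}
with_context = pick_candidate(glyph_scores, digit_prior)

print(no_context, with_context)
```

As a workaround outside the OCR engine, the same idea could be applied as a post-processing pass over the recognised text, but that would need checking against real pages before relying on it.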

For now, I apologize for the inconvenience.

Kind Regards,
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623
