OCR Accuracy: Auto vs High [Paperless Office]

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

OCR Accuracy: Auto vs High [Paperless Office]

Post by patrickm »

I desire the best possible OCR result. This post (https://forum.pdf-xchange.com/viewtopic.php?f=63&t=35943) seems to suggest that using an Accuracy of Auto is better than using High.

Could you please confirm if that is correct?

Thank you,
Patrick
Last edited by patrickm on Mon Jan 23, 2023 10:29 pm, edited 1 time in total.
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom
Contact:

Re: OCR Accuracy: Auto vs High

Post by John - Tracker Supp »

Yes, that is correct. Some pre-analysis of the file is done in 'Auto' mode, which is not done when you simply select High. Although it may seem to defy logic, High will NOT always produce the best results, as many factors affect this.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: OCR Accuracy: Auto vs High

Post by patrickm »

Got it. My intuitive concern with Auto was that it might prioritize speed over accuracy.

Maybe renaming it to "Prioritize Speed" and "Prioritize Accuracy" would be more accurate? :)
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: OCR Accuracy: Auto vs High

Post by Dimitar - Tracker Supp »

Hi,

Thank you for your suggestion.

I will forward it to our team of developers for consideration.

Regards.
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: OCR Accuracy: Auto vs High

Post by DIV »

Without rehashing what I've already said here
https://forum.pdf-xchange.com/viewtopic.php?f=63&t=37455&p=158413#p158413
yes, I think rewording is required.
And I'm glad to see I'm not the only one who feels that the existing phrasing might seem to "defy logic" :-)

Now, with regard to the "Auto" setting, I still don't know for sure what it does, but I think it is neither prioritising speed nor prioritising output quality (which is nominally always set to 'maximum', according to the above-linked thread).

Per the thread linked above, I am guessing that Auto will be a multi-step procedure:
  1. analyse input document to determine resolution, font sizes, imperfections (such as blur or speckle);
  2. categorise the quality of images in the input document;
  3. run the OCR with the so-called "Accuracy" of images in the input document set to the above-selected category.
So this couldn't be faster than specifying the input image quality yourself.
Also, the output should be better than what a user gets by manually choosing an inappropriate category, but should be only as good as, or worse than, what a user gets by manually choosing the most appropriate category.

Please confirm or correct.
And I furthermore suggest adding a brief (non-technical) description to the help page https://help.pdf-xchange.com/pdfxe9/ocr-pages_ed.html — currently "Auto" is not mentioned at all.

—DIV
Paul - Tracker Supp
Site Admin
Posts: 6831
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada
Contact:

Re: OCR Accuracy: Auto vs High

Post by Paul - Tracker Supp »

Seeing that there is a lot of discussion about this on the forums, we will be having a conversation about it internally.

It's not a super high priority, so it will likely be brought up at a regular development meeting. I am mentioning that here because I don't expect a decision today.

We'll have to see how this discussion pans out in the next few weeks.
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: OCR Accuracy: Auto vs High

Post by Vasyl-Tracker Dev Team »

Hi DIV.
DIV wrote:
Per the thread linked above, I am guessing that Auto will be a multi-step procedure:
  1. analyse input document to determine resolution, font sizes, imperfections (such as blur or speckle);
  2. categorise the quality of images in the input document;
  3. run the OCR with the so-called "Accuracy" of images in the input document set to the above-selected category.
So this couldn't be faster than specifying the input image quality yourself.

Not exactly. The OCR engine works only with raster images, while a PDF page may contain (many) rasters, text, and graphics. So the application must 'convert' the PDF page to a corresponding raster image, recognize it, and then apply the OCR result back to the existing PDF content on the page.

At the moment, Accuracy=Auto means that the application is permitted to try to OCR the existing image on a page directly, provided that all of the following hold:
1. The PDF page contains only one raster image. This is the typical situation with 'scanned PDFs' that result from simply scanning paper documents.
2. That single image has sufficient resolution: at least 300 dpi (a value often used for scanning).
3. That single image isn't distorted too much by advanced geometrical transformations: only rotation and scaling are allowed, but not sloping (skewing), for example.

Otherwise, when Accuracy≠Auto, or when any condition above isn't met, the application might decide to rasterize the whole PDF page and use the resulting image in the recognition process. This additional rasterization might (and in most cases will) reduce the performance of the recognition process, generally speaking.
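
The rule described above can be sketched as a small predicate. This is only an illustration of the three conditions as stated in this post, not the application's actual code: the `Page` class, its field names, and the function name are hypothetical, and treating "one raster image only" as "exactly one image and no other page content" is an assumption.

```python
from dataclasses import dataclass

MIN_DIRECT_DPI = 300  # threshold quoted in the post above


@dataclass
class Page:
    """Hypothetical summary of one PDF page (all field names are assumptions)."""
    image_count: int          # number of raster images on the page
    has_text_or_vector: bool  # True if the page also holds text or vector graphics
    image_dpi: float          # effective resolution of the image
    is_skewed: bool           # True if the image is sloped/sheared, not just rotated or scaled


def may_ocr_existing_image(accuracy: str, page: Page) -> bool:
    """Return True if the existing image may be passed to OCR without re-rasterizing the page."""
    if accuracy != "Auto":
        return False
    single_image_only = page.image_count == 1 and not page.has_text_or_vector
    return (
        single_image_only
        and page.image_dpi >= MIN_DIRECT_DPI
        and not page.is_skewed
    )


scan = Page(image_count=1, has_text_or_vector=False, image_dpi=300, is_skewed=False)
print(may_ocr_existing_image("Auto", scan))
print(may_ocr_existing_image("High", scan))
```

When the predicate is false, the post says only that the application "might" rasterize the whole page, so the sketch deliberately stops at the yes/no decision.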

And when:
Accuracy=High - it forces the application to ensure 600(±50) dpi for each image that will be processed by OCR
Accuracy=Medium - it forces the application to ensure 400(±50) dpi for each image that will be processed by OCR
Accuracy=Low - it forces the application to ensure 150(±50) dpi for each image that will be processed by OCR
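
For reference, the three values above can be written as a lookup table with the ±50 dpi tolerance. This is a minimal sketch of the numbers quoted in this post; the names `TARGET_DPI`, `TOLERANCE_DPI`, and `within_target` are made up for illustration, and how the application actually reaches the target resolution is not described here.

```python
# Target resolutions quoted in the post; Auto has no fixed target.
TARGET_DPI = {"High": 600, "Medium": 400, "Low": 150}
TOLERANCE_DPI = 50


def within_target(accuracy: str, image_dpi: float) -> bool:
    """Return True if image_dpi lies inside target ± tolerance for the given setting."""
    target = TARGET_DPI[accuracy]
    return abs(image_dpi - target) <= TOLERANCE_DPI


print(within_target("High", 620))
print(within_target("Medium", 300))
```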

Cheers.
Vasyl Yaremyn
Tracker Software Products
Project Developer

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: OCR Accuracy: Auto vs High

Post by DIV »

Thanks, Vasyl, for a detailed technical insight into what the various configurations would yield!

Just to confirm, when so-called Accuracy is not set to Auto, then does it mean that images (maybe of various resolutions) on the page that are below the implied threshold resolutions (e.g. ~400 dpi for "Medium") would be resampled to increase their resolution to the specified level?

It kind of seems to me now that these settings are still inherently about setting the OCR analysis.
Reviewing the phrasing in the GUI is still worthwhile.

Following my current understanding, other options for the dialogue box phrasing (besides my previous suggestions) could therefore be something like:
  • No upsampling [replaces "Auto"]
  • Upsample to 150 dpi minimum [replaces "Low"]
  • Upsample to 400 dpi minimum [replaces "Medium"]
  • Upsample to 600 dpi minimum [replaces "High"]
Again, that doesn't have to be the precise wording, but just throwing in some more options to spark better ideas. At first glance such wording is more precise (better), but also rather technical (worse?).

Note: I'm assuming above that downsampling doesn't occur.
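
The proposed wording can be illustrated with a small calculation. This sketch follows only the assumptions stated in this post (upsample to a minimum resolution, never downsample); neither behaviour has been confirmed in this thread, and the names `MIN_DPI` and `upsample_factor` are hypothetical.

```python
# Minimum resolutions from the suggested wording; None means "no upsampling".
MIN_DPI = {"Auto": None, "Low": 150, "Medium": 400, "High": 600}


def upsample_factor(setting: str, image_dpi: float) -> float:
    """Scale factor needed to reach the minimum dpi; 1.0 if no upsampling is needed."""
    minimum = MIN_DPI[setting]
    if minimum is None or image_dpi >= minimum:
        return 1.0  # assumed: images are never downsampled
    return minimum / image_dpi


print(upsample_factor("Medium", 200))  # 200 dpi image, 400 dpi minimum
print(upsample_factor("High", 720))    # already above the minimum
```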

—DIV
Tracker Supp-Stefan
Site Admin
Posts: 17823
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: OCR Accuracy: Auto vs High

Post by Tracker Supp-Stefan »

Hello DIV,

Thanks for getting back to us!
Indeed, we are still continuing the discussion on this internally within our team (but off the forums). If a decision is made to change anything here, we might include it in a future build.

Kind regards,
Stefan