Slow performance of enhanced OCR on (old) US patents

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Post by DIV »

Hi there.

I've been using the enhanced OCR feature a fair bit lately. I am running Editor version 9.4.364.0, installed on 64-bit Windows 8.1.

Generally the enhanced OCR has been working great on documents I've scanned myself (300 dpi, restricted-colour-palette greyscale stippling, high-quality PDF) from books in a library. It's been pretty accurate, and reasonably fast — or, at least, much faster than the old OCR engine (the one that's available on portable installations).
The only feature I've missed (to a small extent) is that, unlike the old engine, there no longer seem to be selectable options for historical print forms such as "Old English" printing in English or "Fraktur" in German (i.e. "Blackletter"). Apparently this is a known limitation currently.

Anyway, after having a favourable experience with those documents, I applied the enhanced OCR feature to some old US patents. Specifically around numbers 281**** (1950's), 320**** (1960's), and 360**** (1960's). These were patents containing some technical drawings. Ultimately the output was fairly accurate: good on the text, but poor on the hand-lettered technical drawings (mediocre recognition of the hand-lettering, and spurious recognition of some drawing elements as text, which is somewhat to be expected).
My main concern was the time taken. The second-listed patent is only three pages long (the first page contains ten figures, the following couple of pages are full of text in two columns per page), but it took almost two minutes (actually 1:45) to complete the OCR.
My laptop is not new, but it was a very high-specification machine when I bought it, and the RAM has just been upgraded. FWIW, the OCR seemed to run on just one of the eight cores (I didn't investigate this in detail: I have surmised it from CPU usage of ~12.5%).
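The single-core guess above is just arithmetic on the overall CPU figure. A minimal sketch of that reasoning follows; the function name is made up for illustration, and it assumes the overall percentage is averaged evenly across all logical cores. It only estimates how many cores' worth of work is being done, not which cores are busy, since the operating system can move a single thread between cores.

```python
def busy_cores(cpu_percent: float, logical_cores: int) -> float:
    """Estimate how many cores' worth of work an overall CPU percentage represents.

    Assumes the reported percentage is an average over all logical cores.
    """
    return cpu_percent / 100.0 * logical_cores


# Figures from the post: ~12.5% overall usage on an eight-core machine.
print(busy_cores(12.5, 8))
```

By the same arithmetic, a job saturating two of eight cores would show up as roughly 25% overall usage.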

Anyway, I know that slow OCR is something you will have heard about over and over again already, but I thought it might nevertheless be worth reporting in case some features of these particular documents are contributing.
If so, perhaps there is something that either the user could address by modifying the settings (or document properties?!), or the software engineers could address by modifying the internal OCR preparation and/or recognition process.

To my mind some notable features of those old patents are:
  • technical drawings containing line art — including, specifically, extensive hatching — and hand lettering;
  • text in two columns per page (with line numbering); and
  • large reported page sizes at low nominal resolution. I presume that the original (paper) documents are roughly A4 size (perhaps foolscap or quarto??), but the PDF document information reports page sizes of 818.4 mm × 1202.3 mm for all three of the above patents. I thought that this might be because the scans were of unusually high resolution, but if I edit one of the images, IrfanView reports it as 2320 pixels by 3408 pixels (similar to the scans I made at the library myself!). So I estimate the true scan resolution to be around 280 to 300 dots per inch based on the original paper document, whereas the PDF document properties imply a nominal resolution of 72 dpi (a common low-resolution default).
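The resolution estimate in the last point can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name is hypothetical, the pixel and page dimensions are the ones quoted above, and the A4 paper size (210 mm × 297 mm) is an assumption, since the true size of the original sheets is not known.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels: int, length_mm: float) -> float:
    """Dots per inch implied by a pixel count spread over a physical length."""
    return pixels / (length_mm / MM_PER_INCH)


# Image size reported by IrfanView and page size reported by the PDF.
width_px, height_px = 2320, 3408
pdf_width_mm, pdf_height_mm = 818.4, 1202.3

# Nominal resolution implied by the PDF page size.
nominal_w = effective_dpi(width_px, pdf_width_mm)
nominal_h = effective_dpi(height_px, pdf_height_mm)

# Assumption: the paper original was about A4 (210 mm x 297 mm).
a4_w = effective_dpi(width_px, 210.0)
a4_h = effective_dpi(height_px, 297.0)

print(f"nominal: {nominal_w:.1f} x {nominal_h:.1f} dpi")
print(f"if A4:   {a4_w:.1f} x {a4_h:.1f} dpi")
```

With these inputs the nominal figure comes out at roughly 72 dpi in both directions, and the A4 assumption gives about 281 dpi across and 291 dpi down, which is consistent with the 280 to 300 dpi estimate.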
I look forward to your comments and/or suggestions.

—DIV
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Slow performance of enhanced OCR on (old) US patents

Post by Tracker Supp-Stefan »

Hello DIV,

You did mention some specific documents, but I can't find them listed. Can I please have an exact patent number (or better yet the actual PDF file) so that we can run some tests here on our end?

Also, can you please try updating to build 366? Does that work better/faster?

Kind regards,
Stefan
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: Slow performance of enhanced OCR on (old) US patents

Post by DIV »

Specifically around numbers 281**** (1950's), 320**** (1960's), and 360**** (1960's).
Specifically around numbers 281xxxx (1950's), 320xxxx (1960's), and 360xxxx (1960's).
I don't think this is unique to the particular patents I looked at. I was just giving you an indication of the time period.
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Slow performance of enhanced OCR on (old) US patents

Post by Tracker Supp-Stefan »

Hello DIV,

Yes, I can grab any file from those periods, but it would really help to keep things consistent, with me testing the same source file as yourself. Can I have a copy of that PDF please?

Kind regards,
Stefan
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: Slow performance of enhanced OCR on (old) US patents

Post by DIV »

The sample timing given was specifically for US 3206958 (1965). Looking at it again, it appears that I incorrectly described it as having 10 figures on the first page: actually there are 12.

These were the settings used (I believe):
image.png
—DIV
DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Re: Slow performance of enhanced OCR on (old) US patents

Post by DIV »

A substantial part of this seems to be the initial overhead of "Recognition..." — setting up the OCR.
image.png
In a 31-page US patent (this time from March 2023!), it took a full three minutes to get through the INITIAL set-up (as above), to the second page, as below:
image(1).png
Could that exceedingly long time for the initial set-up and OCR of the first page be spent deciding the size of text and font style, the contrast and so on, which it then uses on subsequent pages? In this document the first page is about half text, and half 'greyscale shaded' line drawing, plus a barcode in the top-right corner. (In the PDF the shading is actually implemented as pixel-level black-and-white stippling.)

Below is the status at 8 minutes in:
image(2).png
Page 1 — see description above.
Pages 2 to 8 are primarily 'greyscale shaded' drawings (B&W stippled), with text labels within images rotated 90° on the page.
Pages 9 to 22 are primarily 'B&W line' drawings (actually B&W stippled/rasterised), with text labels within images rotated 90° on the page.
Pages 23 to 31 are purely text (B&W rasterised).

TOTAL time taken for all 31 pages was almost exactly fifteen minutes.
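To put the timings in this thread on a common footing, here is a rough per-page calculation. It is only a sketch: the times are the approximate ones reported above (three minutes to reach the second page, about fifteen minutes in total, and 1:45 for the earlier three-page patent), and the variable names are purely illustrative.

```python
# Approximate timings for the 31-page patent, in seconds.
total_s = 15 * 60                 # whole job
setup_and_first_page_s = 3 * 60   # initial set-up plus OCR of page 1
pages = 31

# Average over the whole job, including the slow start.
average_s_per_page = total_s / pages

# Rate for pages 2 to 31 once the initial phase is over.
later_s_per_page = (total_s - setup_and_first_page_s) / (pages - 1)

# Earlier three-page patent (US 3206958): 1 min 45 s in total.
earlier_s_per_page = (1 * 60 + 45) / 3

print(f"average:       {average_s_per_page:.1f} s/page")
print(f"after page 1:  {later_s_per_page:.1f} s/page")
print(f"3-page patent: {earlier_s_per_page:.1f} s/page")
```

On these numbers the first page alone takes about as long as seven or eight of the later pages, which is why the initial "Recognition..." phase stands out.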

The job seemed to run on 2 of my 8 cores. (I am not doing anything else computationally intensive at the moment.)
RAM usage rose from roughly 9 GB to 12 GB (out of 16 GB).

The PDF document information reports page sizes of 903.1 mm × 1164.2 mm for this 2023 US patent (similar to the size of the older patents, albeit a slightly different aspect ratio). I wonder whether these large page sizes could be an important factor.

—DIV
TrackerSupp-Daniel
Site Admin
Posts: 8437
Joined: Wed Jan 03, 2018 6:52 pm

Re: Slow performance of enhanced OCR on (old) US patents

Post by TrackerSupp-Daniel »

Hello, DIV

I was about to check in with the Dev team for their input when I realized that the patent number you sent earlier is for only a 3-page document, which completed its OCR process in under 40 seconds. When I tried to look for patents from March 2023, I came to realize there are dozens, if not hundreds, each day in that database. Could I please ask again that you share with us the exact document you are looking at when you report issues like this, so that we can quickly and accurately try to reproduce what you are describing and speak with the dev team about it?

Once we have the same file you are using here to test with, I will be able to bring your questions to the team and get you some answers.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our web site domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com