Slow performance of enhanced OCR on (old) US patents

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan, Ivan - Tracker Software

DIV
User
Posts: 204
Joined: Fri Jun 23, 2017 1:47 am

Slow performance of enhanced OCR on (old) US patents

Post by DIV »

Hi there.

I've been using the enhanced OCR feature a fair bit lately. I am running Editor version 9.4.364.0, installed on 64-bit Windows 8.1.

Generally the enhanced OCR has been working great on documents I've scanned myself (300 dpi, restricted-colour-palette greyscale stippling, high-quality PDF) from books in a library. It's been pretty accurate, and reasonably fast — or, at least, much faster than the old OCR engine (the one that's available on portable installations).
The only feature I've missed (to a small extent) is that, unlike the old engine, there no longer seem to be selectable options for historical print forms: "Old English" printing in English, or "Fraktur" in German, for instance (i.e. "Blackletter"). Apparently this is currently a known limitation.

Anyway, after having a favourable experience with those documents, I applied the enhanced OCR feature to some old US patents. Specifically around numbers 281**** (1950's), 320**** (1960's), and 360**** (1960's). These were patents containing some technical drawings. Ultimately the output was fairly accurate: good on the text, but poor on the hand-lettered technical drawings (mediocre recognition of the hand-lettering, plus spurious recognition of some drawing elements as text, which is somewhat to be expected).
My main concern was the time taken. The second-listed patent is only three pages long (the first page contains ten figures, the following couple of pages are full of text in two columns per page), but it took almost two minutes (actually 1:45) to complete the OCR.
My laptop is not new, but it was a very high-specification machine when I bought it, and the RAM has just been upgraded. FWIW, the OCR seemed to run on just one of the eight cores (I didn't investigate this in detail: I have surmised it from CPU usage of ~12.5%).

Anyway, I know that slow OCR is something that you would have heard about over and over again already, but I thought it might nevertheless be worth reporting in case there are some features of these particular documents that are contributing.
If so, perhaps there is something that either the user could address by changing the settings (or document properties?!), or the software engineers could address by modifying the internal OCR preparation and/or recognition process.

To my mind some notable features of those old patents are:
  • technical drawings containing line art — including, specifically, extensive hatching — and hand lettering;
  • text in two columns per page (with line numbering); and
  • large reported page sizes at a low nominal resolution. I presume that the original (paper) documents are roughly A4 size (perhaps foolscap or quarto??), but the PDF document information reports page sizes of 818.4 mm × 1202.3 mm for all three of the above patents. I thought that this might be because the scans were of unusually high resolution, but if I edit one of the images, IrfanView reports it as 2320 pixels by 3408 pixels (similar to the scans I made at the library myself!). So I estimate the true scan resolution to be around 280 to 300 dots per inch relative to the original paper document, whereas the PDF document properties imply a nominal resolution of 72 dpi (a common low-resolution default).
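
The page-size arithmetic in the last point can be checked with a short Python sketch. The function names are purely illustrative, and the paper sizes (A4 and US Letter) are assumptions, since the true size of the original paper documents is not known here.

```python
MM_PER_INCH = 25.4


def nominal_page_size_mm(width_px, height_px, dpi=72.0):
    """Page size implied when an image is placed at `dpi` pixels per inch."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


def effective_dpi(width_px, height_px, paper_width_mm, paper_height_mm):
    """Scan resolution implied if the image covers a sheet of the given size."""
    return (
        width_px / (paper_width_mm / MM_PER_INCH),
        height_px / (paper_height_mm / MM_PER_INCH),
    )


# Pixel dimensions reported by IrfanView for one of the page images.
WIDTH_PX, HEIGHT_PX = 2320, 3408

# Page size implied by a nominal 72 dpi.
page_w, page_h = nominal_page_size_mm(WIDTH_PX, HEIGHT_PX)
print(f"Page size at 72 dpi: {page_w:.1f} mm x {page_h:.1f} mm")

# Effective resolution under two ASSUMED paper sizes (in millimetres).
for name, (w_mm, h_mm) in {"A4": (210.0, 297.0), "US Letter": (215.9, 279.4)}.items():
    dpi_x, dpi_y = effective_dpi(WIDTH_PX, HEIGHT_PX, w_mm, h_mm)
    print(f"{name}: about {dpi_x:.0f} dpi horizontally, {dpi_y:.0f} dpi vertically")
```

At 72 dpi the 2320 × 3408 pixel image works out to 818.4 mm × 1202.3 mm, matching the reported page size. If the paper was A4, the same pixels correspond to roughly 281 to 291 dpi, which agrees with the 280 to 300 dpi estimate.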
I look forward to your comments and/or suggestions.

—DIV
Tracker Supp-Stefan
Site Admin
Posts: 16190
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Slow performance of enhanced OCR on (old) US patents

Post by Tracker Supp-Stefan »

Hello DIV,

You did mention some specific documents, but I can't find them listed. Can I please have an exact patent number (or better yet the actual PDF file) so that we can run some tests here on our end?

Also, can you please try updating to build 366 and let us know whether that works better/faster?

Kind regards,
Stefan
DIV
User
Posts: 204
Joined: Fri Jun 23, 2017 1:47 am

Re: Slow performance of enhanced OCR on (old) US patents

Post by DIV »

Specifically around numbers 281**** (1950's), 320**** (1960's), and 360**** (1960's).
Specifically around numbers 281xxxx (1950's), 320xxxx (1960's), and 360xxxx (1960's).
I don't think this is unique to the particular patents I looked at. I was just giving you an indication of the time period.
Tracker Supp-Stefan
Site Admin
Posts: 16190
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Slow performance of enhanced OCR on (old) US patents

Post by Tracker Supp-Stefan »

Hello DIV,

Yes, I can grab any file from those periods, but it would really help to keep things consistent by testing with the same source file as you. Can I have a copy of that PDF, please?

Kind regards,
Stefan