I've been using the enhanced OCR feature a fair bit lately. I am running Editor version 9.4.364.0, installed on 64-bit Windows 8.1.
Generally the enhanced OCR has been working great on documents I've scanned myself (300 dpi, restricted-colour-palette greyscale stippling, high-quality PDF) from books in a library. It's been pretty accurate, and reasonably fast — or, at least, much faster than the old OCR engine (the one that's available on portable installations).
The only feature I've missed (to a small extent) is that, compared with the old engine, there no longer seem to be selectable options for historical print forms (i.e. "Blackletter"): "Old English" printing in English, or "Fraktur" in German, for instance. Apparently this is a known limitation currently.
Anyway, after having a favourable experience with those documents, I applied the enhanced OCR feature to some old US patents, specifically around numbers 281**** (1950s), 320**** (1960s), and 360**** (1960s). These patents contain some technical drawings. Ultimately the output was fairly accurate: good on the text, but poor on the hand-lettered technical drawings (mediocre recognition of the hand-lettering, plus spurious recognition of some drawing elements as text, which is somewhat to be expected).
My main concern was the time taken. The second-listed patent is only three pages long (the first page contains ten figures, the following couple of pages are full of text in two columns per page), but it took almost two minutes (actually 1:45) to complete the OCR.
My laptop is not new, but it was a very high-specification machine when I bought it, and the RAM has just been upgraded. FWIW, the OCR seemed to run on just one of the eight cores (I didn't investigate this in detail; I surmised it from CPU usage of ~12.5%, i.e. one-eighth of the total).
Anyway, I know that slow OCR is something that you will have heard about over and over again already, but I thought it might nevertheless be worth reporting in case some features of these particular documents are contributing.
If so, perhaps for such documents there's something that either the user could address by modifying the settings (or document properties?!), or the software engineers could address by modifying the internal OCR preparation and/or recognition process.
To my mind some notable features of those old patents are:
- technical drawings containing line art — including, specifically, extensive hatching — and hand lettering;
- text in two columns per page (with line numbering); and
- large reported page sizes at low nominal resolution. I presume that the original (paper) documents are roughly A4 size (perhaps foolscap or quarto??), but the PDF document information reports page sizes of 818.4 mm × 1202.3 mm for all three of the above patents. I thought this might be because the scans were of unusually high resolution, but if I edit one of the images, IrfanView reports it as 2320 × 3408 pixels (similar to the scans I made at the library myself!). So I estimate the true scan resolution to be around 280 to 300 dots per inch relative to the original paper document, whereas the PDF document properties imply a nominal resolution of only 72 dpi (a common low-resolution default).
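
For anyone who wants to check the arithmetic behind that last point, here is a rough sketch in Python. It assumes the scanned image fills the whole PDF page and was placed at one pixel per PDF point (1/72 inch). The A4 and US Letter dimensions are only my guesses at the original paper size, and the helper names are made up for illustration.

```python
MM_PER_INCH = 25.4
PDF_POINTS_PER_INCH = 72  # PDF default user space: 1 point = 1/72 inch


def implied_pixels(page_mm, dpi=PDF_POINTS_PER_INCH):
    """Pixel count along one side if the page is rendered at `dpi`."""
    return round(page_mm / MM_PER_INCH * dpi)


def effective_dpi(pixels, paper_mm):
    """Scan resolution if `pixels` actually span `paper_mm` of real paper."""
    return pixels / (paper_mm / MM_PER_INCH)


# Figures quoted above.
reported_page_mm = (818.4, 1202.3)  # from the PDF document properties
image_px = (2320, 3408)             # as reported by IrfanView

# Does the reported page size correspond to the image at 72 dpi?
implied = tuple(implied_pixels(mm) for mm in reported_page_mm)

# ASSUMPTION: candidate sizes for the original paper (width, height) in mm.
candidate_paper_mm = {
    "A4": (210.0, 297.0),
    "US Letter": (215.9, 279.4),
}

estimates = {
    name: tuple(effective_dpi(px, mm) for px, mm in zip(image_px, size))
    for name, size in candidate_paper_mm.items()
}

if __name__ == "__main__":
    print("Pixels implied by reported page size at 72 dpi:", implied)
    for name, (dpi_w, dpi_h) in estimates.items():
        print(f"{name}: about {dpi_w:.0f} dpi across, {dpi_h:.0f} dpi down")
```

If the original really was close to A4, this gives roughly 280 dpi across and 290 dpi down, which is consistent with my estimate above; the reported 818.4 mm × 1202.3 mm is simply what 2320 × 3408 pixels works out to at 72 dpi.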