OCR Library result tweaking

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

OCR Library result tweaking

Post by scdawson »

I am evaluating the OCR library to process CAD drawings, and am getting mixed results. As an example, one of the things that I am encountering is that often a word will be detected with spaces between the letters. For example, "DESCRIPTION" will be OCRed as "D E S C R I P T I O N".

Is there any way to tweak these results? For example, modify the parameter files when we know better what font will be used? Or, give the library a dictionary of words that it might be likely to find. Can it be trained in any way?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR Library result tweaking

Post by Tracker Supp-Stefan »

Hello Shaun,

Below is the explanation from one of our OCR developers:

"Added and missing whitespaces"
If you use the SDK functions OCR_GetText and OCR_GetField, you will find that word formatting and whitespace placement are very good. The OCR_MakeSearchable function is meant primarily for producing PDFs that you can search, not for text extraction. It places characters one at a time in the output PDF, in the way that best matches their position on the scanned page. Consequently, the word spacing is determined by the viewer (whether PDF-XChange or any of our competitors), which analyzes the position of characters on the page and makes assumptions about word connectivity.

This design works well for searchable PDFs because it allows "virtual" character placement and size to match the input image's character placement almost exactly, so that the highlighting of recognized words when you search closely matches the page image. All common PDF readers (including PDF-XChange Viewer) will properly match words and strings regardless of assumed spacing (i.e., a search for POTATO will match "POT A TO" and highlight POTATO in the searchable PDF image).

The trade-off is that if you cut and paste text from a searchable PDF document, you will find mistakes in the assumed spacing. The recommended method of retrieving plain text is to use the SDK functions OCR_GetText and OCR_GetField(s) instead, which return normally formatted plain text with proper word spacing. Font, language, and source quality also greatly affect the result.
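To make the "POT A TO" example concrete, here is a minimal Python sketch of a space-insensitive search. It is only an illustration of the idea, not the viewer's actual implementation; the function name `find_ignoring_spaces` is hypothetical and it works on plain strings rather than on PDF content.

```python
def find_ignoring_spaces(text, query):
    """Return (start, end) spans of `text` that match `query` when
    whitespace and letter case are ignored (illustrative helper only)."""
    kept = []    # characters of `text` with whitespace removed, lowercased
    origin = []  # origin[i] = index in `text` that kept[i] came from
    for index, char in enumerate(text):
        if char.isspace():
            continue
        # lower() can expand one character into several, so map each one back
        for folded in char.lower():
            kept.append(folded)
            origin.append(index)
    compact = "".join(kept)
    needle = "".join(query.lower().split())

    spans = []
    if not needle:
        return spans
    pos = compact.find(needle)
    while pos != -1:
        start = origin[pos]
        end = origin[pos + len(needle) - 1] + 1
        spans.append((start, end))
        pos = compact.find(needle, pos + 1)
    return spans


line = "PEEL THE POT A TO NOW"
for start, end in find_ignoring_spaces(line, "potato"):
    # The span covers the original, badly spaced text, so it can be highlighted.
    print(repr(line[start:end]))
```

Because the spans refer to the original string, the stray spaces stay inside the highlighted region. Note that a search like this also matches across genuine word boundaries (for example, "thepot" would match "THE POT"), which is a trade-off of ignoring spaces entirely.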

Best,
Stefan
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Ah, interesting. That makes sense, but doesn't completely fit with what I'm getting. For example, the word 'DESCRIPTION' appears on both pages of the document that I'm testing. However, if I search for the text 'description' using Acrobat, that text is only found once, on the second page. The first page is missed.

If I copy and paste the word from the first page, I get:

DE S C RI P T I O N

On the second page, I get:

DESCRIPTION

Only the one on the second page got found. That's why I thought it had something to do with the spacing, but if what you say is true, it doesn't. Any idea why the word is not getting found on the first page?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR Library result tweaking

Post by Tracker Supp-Stefan »

Hello Shaun,

Maybe the search in Adobe works differently, but I believe that our Viewer should also find the
DE S C RI P T I O N
from your first page.

Have you tried that?

And can we get the sample before and after you OCR it?

Best,
Stefan
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

I can't share the document publicly, but I can share it privately. How should I send it to you in that case?

I have not tried it with the Tracker viewer yet, but that is on my list to try :).

Thanks!

Shaun
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: OCR Library result tweaking

Post by Paul - Tracker Supp »

Hi scdawson,

You can send it to support@pdf-xchange.com as an attachment if it's under 20MB zipped. If it is larger, please still email us and I'll send you FTP credentials for the upload.

Please be sure you refer to this thread by URL in your email so we can cross reference it.

:-)
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

I've tried the PDF-XChange Viewer, and sure enough, it works as you describe. For this particular solution, the documents are going to be indexed by a search engine, so the really important thing is that the search engine can find the appropriate words. Do you have any sense of how I might get a comfort level with that without having access to the search engine itself?

Thanks,

Shaun
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: OCR Library result tweaking

Post by Paul - Tracker Supp »

Hi Shaun,

The devs are actively working on this spacing issue this week. We'll keep you posted.

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Fantastic. Thanks, Paul!

Shaun
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: OCR Library result tweaking

Post by Paul - Tracker Supp »

:wink:
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Any news?
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

scdawson wrote: Any news?
Yes. We have improved placement in the current development version (to be available in the coming days), but are still tweaking it to improve output accuracy even more. One issue that came up in your sample document was that other elements on the page sometimes confuse the detection of line and word boundaries, which impacts text size determination. With a couple of days of development we have made some improvement in this area already, but expect further improvement in the coming days. I would expect a new build in about two weeks, possibly sooner. It will also feature a couple of additional image pre-processing options to deal with 1-bit scans and other poor quality input scans.

A brief explanation of why this spacing issue happens is in order. Word connectivity in documents is determined by the PDF viewer (whether ours or our competitors'), based on the position of characters, the font used, and the font size. There is a slight margin of error, but essentially the right side of the nth character must touch the left side of the (n+1)th character. To reproduce the spacing of words accurately, we must identify the font and the size exactly. Of course, we cannot always match the font in an input document perfectly (for example, the letter I in a sans-serif proportional font will be narrower than I in a monospaced serif font), but we do know the connectivity to a fairly high degree of accuracy, so we are developing a workaround to reproduce the spacing in the document properly.
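The connectivity rule described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the logic of any particular viewer: the `Glyph` class, the `place` helper, and the 0.5-point `tolerance` are all hypothetical, and real viewers work with full font metrics rather than a single width per line.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    left: float   # x position of the glyph's left edge, in points
    width: float  # width the viewer assumes for the glyph, in points


def join_glyphs(glyphs, tolerance=0.5):
    """Rebuild a line of text, inserting a space wherever the gap between
    one glyph's right edge and the next glyph's left edge exceeds
    `tolerance` points."""
    pieces = []
    previous_right = None
    for glyph in sorted(glyphs, key=lambda g: g.left):
        if previous_right is not None and glyph.left - previous_right > tolerance:
            pieces.append(" ")
        pieces.append(glyph.char)
        previous_right = glyph.left + glyph.width
    return "".join(pieces)


def place(word, lefts, width):
    """Place each character of `word` at the given left edge, all with the
    same assumed width (a deliberate simplification)."""
    return [Glyph(c, x, width) for c, x in zip(word, lefts)]


# Characters sit 6 points apart on the scanned page.
lefts = [0.0, 6.0, 12.0, 18.0]

# Font width identified correctly: every glyph touches its neighbour.
print(join_glyphs(place("DESC", lefts, 6.0)))

# Font assumed too narrow: each glyph leaves a 2-point gap, read as a space.
print(join_glyphs(place("DESC", lefts, 4.0)))
```

In this model the character positions are identical in both cases; only the assumed glyph width changes, which is enough to turn one word into separated letters.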

In the meantime, the best way to retrieve the correct spacing is to use the text retrieval functions in the OCR SDK, namely OCR_GetText() and OCR_GetField() or OCR_GetFields(). If you use this method, which is the recommended way to retrieve plain text (rather than cutting and pasting from a searchable document), you will see that OCR does capture text spacing correctly. The spacing issue does not affect searching in our viewer, because we ignore spaces: a search for "DESCRIPTION" will always match "DE SCRIP TION" or similar in the searchable PDF.
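A search engine that indexes whole tokens behaves differently from a viewer that ignores spaces. As a rough way to check a document without access to the search engine itself, the Python sketch below tokenizes extracted text and reports which expected words would be missing from a token-based index. It assumes a naive lowercase alphanumeric tokenizer, which real search engines only approximate; the function names are hypothetical, and the input strings stand in for whatever text your own extraction step yields (the sketch does not call the OCR SDK).

```python
import re


def index_terms(text):
    """Split text into lowercase alphanumeric tokens, roughly as a naive
    token-based indexer might."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(extracted_text, expected_terms):
    """Return the expected terms that would be absent from the index."""
    terms = index_terms(extracted_text)
    return [term for term in expected_terms if term.lower() not in terms]


expected = ["description", "part"]

# Text as copied from a page with broken spacing, and from a correct page.
page_one = "DE S C RI P T I O N OF PART"
page_two = "DESCRIPTION OF PART"

print(missing_terms(page_one, expected))  # terms the index would miss
print(missing_terms(page_two, expected))
```

Running a check like this over a handful of representative drawings, with a list of words you know appear in them, gives a quick estimate of how often spacing errors would hide a term from a token-based search.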
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Thank you for the update, Walter! I look forward to the new release in a couple of weeks, and I'll play around with this one in the meantime to see how good I can get the results with the assumption that the spacing issues will be resolved.

Thanks!

Shaun
Jamie - Tracker Supp
User
Posts: 191
Joined: Thu Jun 02, 2011 3:23 pm

Re: OCR Library result tweaking

Post by Jamie - Tracker Supp »

:D
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

Just another quick update: the word spacing issue has been improved quite a bit now, to the point that the placement will match OCR results word for word (i.e., it preserves the formatting you would get from OCR_GetText()). The trade-off is that sometimes virtual (invisible) character placements will not perfectly match those in the document (though they will still be close enough to overlap very significantly), but whole words end up matching even better than before, and this will improve searchability with other viewers.

You can expect a revised release within a week.

We may add the option to choose your preferred placement method, either to match characters as closely as possible, or to ensure word spacing is maintained as accurately as possible. Both functionalities now exist in the code and it's just a matter of determining how to best add this option to the exposed part of the SDK.

In case I forgot to mention it, you can also expect an improvement in font size determination.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

That's fantastic, Walter. Thank you for the update!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

:)
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

Just an update: the new build, which fixes word connectivity, is now up. :)
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Hello, Walter!

Thanks for the update. I've downloaded and installed the latest version, put the new ocrtools.dll in the 'Debug' directory of my VS2010 project, and run my project, but I don't seem to get any OCR. At least, I can't select or find using Acrobat.

The file size is affected the same way it would be if it DID do OCR (I think), but I can't interact with the OCR'd text.

Your sample file attached.

Any ideas?

Shaun
Attachments
_ocr_output.zip
(1.06 MiB) Downloaded 281 times
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

Incidentally, when I revert to the original ocrtools.dll, I get the same behavior, so whatever happened persists across multiple versions of ocrtools.dll. I'll do further investigation to rule out my program and environment and keep you posted.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

I take that back. I'm pretty confident that even though I thought I rolled back to the original .dll, I was actually using the new one. Now that I've installed the new version, I can't get a program compiled with the new .libs to successfully call the Init function without an ESP error about an invalid calling convention.

When I run your OCR_TEST program using the new .dll, I get the same behavior as reported previously. When I try to revert to the old .dll, I get the ESP error.

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

I'm looking into this. Did you recompile with the latest headers and link the latest .lib file?
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: OCR Library result tweaking

Post by scdawson »

I believe so. I installed the new .exe over the previous installation, and am compiling against the copies in the installation directory. Is there any way to make absolute sure?
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

Thank you for informing us of this issue. I have reproduced it with the files included in the installer and will be releasing a fix shortly. If you need it ASAP, please email support@pdf-xchange.com and we will send you the updated package as soon as it is available.

Thank you for your patience.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

The problem stems from an incorrect English-language OCR data file that was inadvertently included in the distribution. A new language file will be available on the website shortly, and I am sending you one via email.

We apologize for this inconvenience.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

In the meantime I have attached the correct English language data file.

We will be checking the validity of all of the language data files in the next hour and placing new ones on the website for download.

-Walter
Attachments
english.zip
(1.09 MiB) Downloaded 236 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR Library result tweaking

Post by Walter-Tracker Supp »

There will also be a rebuild of the DLL available sometime this weekend (version 1.0.3), which catches a related issue that could occur with other language files.