Controlling OCR Results

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

aitchisj
User
Posts: 47
Joined: Mon Apr 04, 2011 4:44 am

Controlling OCR Results

Post by aitchisj »

Hi There,

My company is using the PDF-XChange Viewer SDK ActiveX version 2.5.201 within our software. We have been using the OCR feature but have been running into accuracy problems with the resulting OCR Text and wonder if we can limit/tweak/control the process in any way to provide a better result.

To illustrate an example of the problem we're experiencing, I've attached a PDF document which we're running through the OCR feature using medium accuracy and English language settings.

Here is an excerpt of the result it produces...
The Sequoia C512 system will provide us with maximized diagnostic information and increased
exam efficiency through significantly advanced acoustic imaging
This looks good, but on closer examination the word "significantly" actually contains a strange Unicode character in which the 'f' and the 'i' are combined as one: 'ﬁ' (U+FB01).

This is problematic for us because we copy the text out of the PDF and place it into a report which is subsequently spell checked and causes a nightmare for some of our users. Is there anything we can do to control the character set which is used to produce the resulting OCR text, say by somehow limiting it to ASCII characters only? Is there any other control over the process which I'm not considering that could help my situation?

Thanks in advance for any help.
-John
Attachment: 001.pdf, Sample OCR Document (62.53 KiB)
Tracker Supp-Stefan
Site Admin
Posts: 17889
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Controlling OCR Results

Post by Tracker Supp-Stefan »

Hello John,

These special characters are called ligatures, and I do not believe you can control how they are used in the current OCR tool, but I will ask our OCR SDK developers to comment further here.

As a side note, I tried the OCRed text in MS Office Word 2010, and its spell checker didn't complain about this ligature at all.

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Controlling OCR Results

Post by Walter-Tracker Supp »

The OCR engine is currently trained to recognize a wide variety of standard ASCII and Unicode characters, and while there are cases where these ligatures are undesirable, overall we felt it wouldn't be appropriate to impose limitations in the viewer (which, by design, gives a standard, straightforward interface to the OCR function).

The ocrtools.dll library (available in the PDF-XChange PRO SDK) provides a character blacklist / whitelist feature, but this is not available in the viewer (or the ActiveX viewer) at the moment. You can expect these features to be available in the end-user viewer and the viewer API in the future.

However, I would warn you that in the case of ligatures this may not provide a perfect solution, as identification is very dependent on the "chopping" step of recognition, where the document image is segmented into individual characters. For the "ﬁ" ligature, it is likely that the page layout analysis determined "ﬁ" was a single character (perhaps because it was ligated in the original text), so the Unicode ligature was the best match. Blacklisting the "ﬁ" character would probably cause it to be recognized as some other (wrong) single character, like "b" or "h". Unfortunately this is just one of the costs of current-generation OCR, which is never perfect (neither for us nor for our competitors). If you often OCR documents with this font, perhaps you can come up with a workaround when you extract the text and replace the ligatures afterwards (i.e., search the extracted text for the (ﬁ) character and replace it with (f)(i)).
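
A minimal sketch of that search-and-replace workaround, in Python purely for illustration (it is not part of the viewer SDK). The names `LIGATURE_MAP` and `expand_ligatures` are hypothetical, and the table only covers the Latin ligatures in the Unicode "Alphabetic Presentation Forms" block (U+FB00 to U+FB06); whether the OCR engine emits all of them is an assumption.

```python
# Hypothetical helper: expand Latin presentation-form ligatures in text
# that has already been extracted from the OCRed PDF.
LIGATURE_MAP = {
    "\uFB00": "ff",   # LATIN SMALL LIGATURE FF
    "\uFB01": "fi",   # LATIN SMALL LIGATURE FI
    "\uFB02": "fl",   # LATIN SMALL LIGATURE FL
    "\uFB03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\uFB04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\uFB05": "st",   # LATIN SMALL LIGATURE LONG S T
    "\uFB06": "st",   # LATIN SMALL LIGATURE ST
}

# str.translate accepts a table that maps one character to a string.
_LIGATURE_TABLE = str.maketrans(LIGATURE_MAP)


def expand_ligatures(text: str) -> str:
    """Return text with each known ligature replaced by plain letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("signi\uFB01cantly advanced acoustic imaging"))
```

An explicit table like this leaves every other character untouched, so it will not alter characters you did not intend to change.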

Re: Controlling OCR Results

Post by aitchisj »

Gentlemen,

Appreciate the quick responses (as usual). While your Microsoft Word spell checker might accept the word despite the ligatures, our spell checker isn't so robust, I'm afraid to say. It's unfortunate that I'd have to replace the ligatures, as Walter suggests, with the characters I want them to be, such as (f)(i) in the case of (ﬁ); however, it's doable. I've seen other ligatures produced by the OCR as well, although I can't think of them all offhand, so the question becomes: can you direct me to a resource where I may find a list of the ligatures that the OCR tool will produce, so that we can program these replacements?

Not sure if my vote counts much (or if I even have a vote :)), but it would definitely be nice to be able to limit the characters (using blacklists/whitelists or whatever) that are produced by the control, so I'd like to place a vote on that feature!

Re: Controlling OCR Results

Post by Walter-Tracker Supp »

aitchisj wrote:
Gentlemen,

Appreciate the quick responses (as usual). While your Microsoft Word spell checker might accept the word despite the ligatures, our spell checker isn't so robust I'm afraid to say. It's unfortunate that I'd have to replace the ligatures like what Walter suggests with the characters I want them to be, such as (f),(i) in the case of (ﬁ); however, it's doable. I've seen other ligatures as well that are produced by the OCR although I can't think of them all off hand, so the question becomes: can you direct me to a resource where I may find a list of ligatures that the OCR tool will produce so that we can program these replacements?

Sure, although the list is sort of fragmented.

The main ones are listed in this table:

http://en.wikipedia.org/wiki/List_of_pr ... d_digraphs


But there may be a few more found on the main page here:
http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

Specifically, check the following sections:

Section "Alphabetic presentation forms" on the main page (hyperlink redirects; just search the page): http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

Section "Latin Extended A" (e.g. ij and oe) on the main page, or here ---> http://en.wikipedia.org/wiki/Latin_Extended-A

Section "Latin Extended B" (e.g. dz and lj) on the main page, or here --> http://en.wikipedia.org/wiki/Latin_Extended-B
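
As a rough illustration of handling the characters in those sections, here is a Python sketch (not something the SDK provides). It assumes Unicode NFKC normalization is acceptable for your text: NFKC expands characters such as ij, lj and dz, but it also rewrites other compatibility characters (superscripts, full-width forms and so on), and it does not touch œ, which has no decomposition. The names `MANUAL_TABLE` and `fold_compat` are hypothetical.

```python
import unicodedata

# Characters that NFKC leaves alone and that need an explicit mapping.
MANUAL_TABLE = str.maketrans({
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
})


def fold_compat(text: str) -> str:
    """Apply NFKC normalization, then the manual replacements."""
    return unicodedata.normalize("NFKC", text).translate(MANUAL_TABLE)


if __name__ == "__main__":
    for sample in ("\u0133", "\u01C9", "\u01F3", "c\u0153ur", "\uFB01"):
        print(repr(sample), "->", repr(fold_compat(sample)))
```

If NFKC changes too much for your reports, an explicit replacement table limited to the characters you have actually seen is the more conservative choice.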



aitchisj wrote:
Not sure if my vote counts much (or if I even have a vote :)), but it would definitely be nice to be able to limit the characters (using blacklists/whitelists or whatever) that are produced by the control, so I'd like to place a vote on that feature!

Will keep it in mind :)

Re: Controlling OCR Results

Post by aitchisj »

Thanks a lot. "I can hardly wait to start programming these in there..." he says sarcastically :D .
No worries, I'm armed with more answers now so this definitely helps.

Cheers,
-John 8)
Paul - Tracker Supp
Site Admin
Posts: 6894
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Controlling OCR Results

Post by Paul - Tracker Supp »

:)
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com

Re: Controlling OCR Results

Post by Walter-Tracker Supp »

You may also want to check the PDF links on the same main Wikipedia page, which link to the specification documents from unicode.org.

http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

In these documents, ligatures seem to all be denoted as such, so you can search for the word "ligature". However, they are spread across multiple documents (Latin Extended A and B, Latin Ligatures, etc).
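
The same "search for the word ligature" idea can be sketched in Python using the character names shipped with its standard `unicodedata` module, rather than reading the PDFs by hand. The function name `find_ligatures` and the ranges scanned are illustrative assumptions, and the result depends on the Unicode version bundled with your Python. Note that some digraphs (for example dz and lj) are named "LETTER" rather than "LIGATURE", so a name search alone will miss them.

```python
import unicodedata


def find_ligatures(start: int, end: int) -> dict:
    """Map each character in [start, end] whose Unicode name contains
    the word LIGATURE to that name."""
    found = {}
    for code_point in range(start, end + 1):
        char = chr(code_point)
        name = unicodedata.name(char, "")  # "" for unnamed code points
        if "LIGATURE" in name:
            found[char] = name
    return found


if __name__ == "__main__":
    # Basic Latin through Latin Extended-B, plus the Latin part of
    # Alphabetic Presentation Forms.
    ligatures = find_ligatures(0x0000, 0x024F)
    ligatures.update(find_ligatures(0xFB00, 0xFB06))
    for char, name in ligatures.items():
        print(f"U+{ord(char):04X} {char} {name}")
```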