Controlling OCR Results

aitchisj · Post by **aitchisj** » Wed Sep 26, 2012 7:08 am

Hi There,

My company is using the PDF-XChange Viewer SDK ActiveX version 2.5.201 within our software. We have been using the OCR feature but have been running into accuracy problems with the resulting OCR Text and wonder if we can limit/tweak/control the process in any way to provide a better result.

To illustrate an example of the problem we're experiencing, I've attached a PDF document which we're running through the OCR feature using medium accuracy and English language settings.

Here is an excerpt of the result it produces...

The Sequoia C512 system will provide us with maximized diagnostic information and increased
exam efficiency through signiﬁcantly advanced acoustic imaging

This looks good, but on closer examination the word "significantly" actually contains a single Unicode character in which the 'f' and the 'i' are combined: 'ﬁ'

This is problematic for us because we copy the text out of the PDF and place it into a report which is subsequently spell checked and causes a nightmare for some of our users. Is there anything we can do to control the character set which is used to produce the resulting OCR text, say by somehow limiting it to ASCII characters only? Is there any other control over the process which I'm not considering that could help my situation?

Thanks in advance for any help.
-John

Wed Sep 26, 2012 11:13 am

Hello John,

These special characters are called ligatures, and I do not believe you can control how these are used in the current OCR tool - but will ask our OCR SDK developers for a further comment in here.

As a side note - I tried the OCRed text in MS Office Word 2010 - and its spell checker didn't complain at all for this ligature.

Best,
Stefan

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Wed Sep 26, 2012 3:46 pm

The OCR engine is currently trained to recognize a wide variety of standard ASCII and Unicode characters, and while there are cases where these ligatures are undesirable, overall we felt it wouldn't be appropriate to impose limitations in the viewer (which, by design, gives a standard, straightforward interface to the OCR function).

The ocrtools.dll library (available in the PDF-XChange PRO SDK) provides a character blacklist/whitelist feature, but this is not available in the viewer (or the ActiveX viewer) at the moment. You can expect these features to become available in the end-user viewer and the viewer API in the future.

However, I would warn you that for ligatures this may not be a perfect solution, because identification depends heavily on the "chopping" step of recognition, where the document image is segmented into individual characters. For the "fi" ligature, the page layout analysis most likely decided that "fi" was a single character (perhaps because it was ligated in the original text), so the Unicode ligature was the best match. Blacklisting the "fi" ligature would probably cause it to be recognized as some other (wrong) single character, like "b" or "h". Unfortunately this is one of the costs of current-generation OCR, which is never perfect (neither ours nor our competitors'). If you often OCR documents that use this font, you could work around it when you extract the text by replacing the ligatures afterwards (i.e., search the extracted text for the (fi) character and replace it with (f)(i)).
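A minimal sketch of that search-and-replace step, written in Python purely for illustration (the thread does not say which language the host application uses). The function name and the mapping are assumptions, not part of the PDF-XChange SDK; the table covers only the Latin ligatures in Unicode's Alphabetic Presentation Forms block (U+FB00 to U+FB06).

```python
# Hypothetical post-processing helper; NOT part of the PDF-XChange SDK.
# Keys are the Latin ligature code points U+FB00..U+FB06, values are the
# plain-letter spellings we want in the extracted text.
LIGATURE_MAP = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # "long s + t" ligature, spelled here with a plain s
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURE_MAP)


def replace_ligatures(text: str) -> str:
    """Return text with the mapped ligatures expanded to separate letters."""
    return text.translate(_LIGATURE_TABLE)


# The excerpt from the first post, with U+FB01 inside "significantly".
ocr_text = "exam efficiency through signi\ufb01cantly advanced acoustic imaging"
print(replace_ligatures(ocr_text))
```

An explicit table keeps the replacement predictable: only the listed characters change, and new entries can be added as other ligatures turn up in OCR output.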

aitchisj · Post by **aitchisj** » Wed Sep 26, 2012 4:26 pm

Gentlemen,

Appreciate the quick responses (as usual). While your Microsoft Word spell checker might accept the word despite the ligatures, our spell checker isn't so robust I'm afraid to say. It's unfortunate that I'd have to replace the ligatures like what Walter suggests with the characters I want them to be, such as (f),(i) in the case of (fi); however, it's doable. I've seen other ligatures as well that are produced by the OCR although I can't think of them all off hand, so the question becomes: can you direct me to a resource where I may find a list of ligatures that the OCR tool will produce so that we can program these replacements?

Not sure if my vote counts much (or if I even have a vote), but it would definitely be nice to be able to limit the characters (using blacklists/whitelists or whatever) that are produced by the control, so I'd like to place a vote on that feature!

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Wed Sep 26, 2012 5:00 pm

> Gentlemen,
>
> Appreciate the quick responses (as usual). While your Microsoft Word spell checker might accept the word despite the ligatures, our spell checker isn't so robust I'm afraid to say. It's unfortunate that I'd have to replace the ligatures like what Walter suggests with the characters I want them to be, such as (f),(i) in the case of (fi); however, it's doable. I've seen other ligatures as well that are produced by the OCR although I can't think of them all off hand, so the question becomes: can you direct me to a resource where I may find a list of ligatures that the OCR tool will produce so that we can program these replacements?

Sure, although the list is sort of fragmented.

The main ones are listed in this table:

http://en.wikipedia.org/wiki/List_of_pr ... d_digraphs

But there may be a few more found on the main page here:
http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

Specifically, check the following sections:

Section "Alphabetic presentation forms" on the main page (hyperlink redirects; just search the page): http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

Section "Latin Extended A" (e.g. ij and oe) on the main page, or here ---> http://en.wikipedia.org/wiki/Latin_Extended-A

Section "Latin Extended B" (e.g. dz and lj) on the main page, or here --> http://en.wikipedia.org/wiki/Latin_Extended-B
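As a rough illustration of how those sections differ in practice, the sketch below (Python, standard library only) applies Unicode NFKC normalization to one character from each. This is a swapped-in technique, not something the OCR tool or SDK does: NFKC expands characters that carry a compatibility decomposition, such as fi, ij, dz and lj, but leaves oe untouched because U+0153 has no decomposition. The helper name is hypothetical.

```python
import unicodedata


def fold_compatibility(text: str) -> str:
    """Apply NFKC normalization, which expands compatibility characters."""
    return unicodedata.normalize("NFKC", text)


# One sample per section mentioned above (labels are informal).
samples = {
    "fi (Alphabetic Presentation Forms)": "\ufb01",
    "ij (Latin Extended-A)": "\u0133",
    "oe (Latin Extended-A)": "\u0153",
    "dz (Latin Extended-B)": "\u01f3",
    "lj (Latin Extended-B)": "\u01c9",
}

for label, ch in samples.items():
    folded = fold_compatibility(ch)
    status = "expanded" if folded != ch else "unchanged"
    print(f"U+{ord(ch):04X} {label}: {folded!r} ({status})")
```

Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms, non-breaking spaces), so it may change more than intended; characters like oe still need an explicit replacement rule.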

> Not sure if my vote counts much (or if I even have a vote), but it would definitely be nice to be able to limit the characters (using blacklists/whitelists or whatever) that are produced by the control, so I'd like to place a vote on that feature!

Will keep it in mind.

aitchisj · Post by **aitchisj** » Wed Sep 26, 2012 5:22 pm

Thanks a lot. "I can hardly wait to start programming these in there..." he says sarcastically.
No worries, I'm armed with more answers now so this definitely helps.

Cheers,
-John

Post by **Paul - Tracker Supp** » Wed Sep 26, 2012 6:24 pm

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Wed Sep 26, 2012 6:56 pm

You may also want to check the PDF links on the same main Wikipedia page, which point to the specification documents from unicode.org.

http://en.wikipedia.org/wiki/Latin_char ... in_Unicode

In these documents, ligatures all seem to be denoted as such, so you can search for the word "ligature". However, they are spread across multiple documents (Latin Extended A and B, Latin Ligatures, etc.).
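The same "search for the word ligature" idea can be sketched against the Unicode character names bundled with Python's standard `unicodedata` module, instead of the PDF charts. This is only an illustration: the function name is hypothetical, the result depends on the Unicode version shipped with the interpreter, and it lists characters Unicode names as ligatures, not the set the OCR engine actually emits.

```python
import unicodedata


def latin_ligatures(limit: int = 0x10000) -> list:
    """Collect characters below `limit` whose Unicode name starts with
    LATIN and contains the word LIGATURE."""
    found = []
    for code_point in range(limit):
        # Unnamed or unassigned code points yield the empty default.
        name = unicodedata.name(chr(code_point), "")
        if name.startswith("LATIN") and "LIGATURE" in name:
            found.append(chr(code_point))
    return found


for ch in latin_ligatures():
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Characters that are ligatures typographically but are named as letters, such as U+00E6 (LATIN SMALL LETTER AE), will not appear in this list and would have to be added by hand.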
