OCR hangs forever ...

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan

guebert
User
Posts: 151
Joined: Sun Apr 06, 2008 7:05 pm

OCR hangs forever ...

Post by guebert »

Hello,

Attached is a PDF on which OCR never finishes.

Any ideas?

Michael
Attachments
OCR Hang1.zip
(385.98 KiB) Downloaded 260 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR hangs forever ...

Post by Walter-Tracker Supp »

I am unable to reproduce the behaviour you describe (hanging) with build 201 of the viewer, although it did take a while to complete, and OCR only detects the clear text in the header and title of the page (against the white background).

The reason for this is the heavy speckling of the background: the speckles are too large to be removed by the pre-OCR despeckling algorithm, and they are numerous enough to confuse the OCR engine into attempting to recognize them as characters. This is a common limitation of OCR and certainly not limited to our product (in fact I tested your document with some other OCR engines and they also have difficulty recognizing that text).
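
As a rough illustration of the despeckling limitation described above, here is a minimal sketch of size-based speckle removal. It is not the product's actual algorithm: the function name `despeckle` and the `max_speckle_area` parameter are hypothetical, and the page is assumed to be a tiny binary image (1 = ink, 0 = background). Blobs larger than the size limit are left in place, so they still reach the OCR engine.

```python
from collections import deque


def despeckle(image, max_speckle_area):
    """Return a copy of a binary image with small dark blobs removed.

    image is a list of rows; 1 = ink, 0 = background.
    A blob is a 4-connected group of ink pixels. Blobs with at most
    max_speckle_area pixels are erased; larger blobs are kept.
    (Hypothetical helper for illustration only.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one blob and collect its pixel coordinates.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the blob only if it is small enough to count as a speckle.
            if len(blob) <= max_speckle_area:
                for by, bx in blob:
                    result[by][bx] = 0
    return result


# Two single-pixel specks and one 2x2 blob standing in for a large speckle.
page = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
cleaned = despeckle(page, max_speckle_area=1)
```

With a limit of 1 pixel the isolated specks disappear but the 2x2 blob survives. Raising the limit enough to remove it would also start erasing real marks of similar size, such as dots and punctuation.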

There is a potential workaround for this, but because we are focused on putting new features into the new major revision there is a good possibility it will not be made available until that release in a couple of months' time.

-Walter
Attachments
OCR Hang1-out.pdf
(636.48 KiB) Downloaded 355 times
guebert
User
Posts: 151
Joined: Sun Apr 06, 2008 7:05 pm

Re: OCR hangs forever ...

Post by guebert »

Walter-Tracker Supp wrote:I am unable to reproduce the behaviour you describe (hanging) with build 201 of the viewer, although it did take awhile to complete, and OCR only detects the clear text in the header and title of the page (against the white background).
In your output document the headline on the white background is not detected either. So maybe the very large font size is the problem (or at least part of it)?

Michael
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR hangs forever ...

Post by Tracker Supp-Stefan »

Hi Michael,

In "Low" accuracy the big heading is actually recognized, so the font size is not the problem. It may be that the dark column to the left of the heading interferes with how this line is recognized at the "Medium" accuracy setting.

We will see if we can post any additional information on this case.

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR hangs forever ...

Post by Walter-Tracker Supp »

Hi Michael,

The first step in OCR is to analyze the page layout before attempting to recognize characters. Page layout analysis is never perfect - not even expensive, top-of-the-line document analysis suites get it right in all cases - but for most documents it is good enough (and for many documents of simple to medium complexity it is, in fact, pretty much perfect). The document you posted, however, has a very complex structure because of the stippled background, despite the simple layout of the actual text on the page. If you zoom in on the page you can see quite clearly how complex the image is. The errors in page layout analysis ultimately stem from this complexity.

Of course we are always working on improvements in all aspects of our products, but with OCR there will always be limitations. We are endeavouring to remove as many of them as possible :)

We do have a method that can improve results with documents like this (applying a slight blur to the document), but in all likelihood it will not be added to the current PDF Viewer (version 3 is our focus for new features). For now you can use it in our developer's SDK (option flag: OCR_Image_GaussianBlur or OCR_Image_EdgeRefine), if you have it.
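
To show why a slight blur can help with a stippled background, here is a minimal sketch that applies a 3x3 Gaussian approximation to a small grayscale image and then thresholds it. It does not use the SDK or the option flags named above; the helper names `gaussian_blur_3x3` and `binarize`, the kernel, and the threshold of 128 are all assumptions chosen for illustration.

```python
# 3x3 Gaussian approximation; the weights sum to 16.
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def gaussian_blur_3x3(gray):
    """Blur a grayscale image given as a list of rows (0 = black, 255 = white).

    Border pixels are handled by clamping coordinates, i.e. the edge
    row or column is repeated. (Hypothetical helper for illustration.)
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    sy = min(max(y + ky - 1, 0), height - 1)
                    sx = min(max(x + kx - 1, 0), width - 1)
                    total += KERNEL[ky][kx] * gray[sy][sx]
            out[y][x] = total // 16
    return out


def binarize(gray, threshold=128):
    """Map pixels darker than the threshold to ink (1), the rest to background (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


WHITE, BLACK = 255, 0

# A white 5x8 page with one isolated stipple dot and a 3-pixel-wide stroke.
page = [[WHITE] * 8 for _ in range(5)]
page[2][1] = BLACK                 # stipple dot
for row in page:
    row[4:7] = [BLACK] * 3         # stroke occupying columns 4 to 6

without_blur = binarize(page)
with_blur = binarize(gaussian_blur_3x3(page))
```

In this toy example the isolated dot is averaged with its white neighbours and falls on the background side of the threshold, while the wider stroke stays dark. The trade-off is that a stronger blur would also soften thin character strokes, which is presumably why only a slight blur is suggested.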

-Walter

