Searching multiple pdf files for OCR generated text

PDF-XChange Viewer SDK
Simple DLL and ActiveX

fiscal
User
Posts: 138
Joined: Fri Aug 06, 2004 12:09 am

Searching multiple pdf files for OCR generated text

Post by fiscal » Wed Jun 20, 2012 8:52 am

Hi,

What would you suggest, if it is possible at all, for searching through multiple "scanned" PDF files (in the background), assuming that each file has already been run through the OCR routine?

I want to filter a list of files (located in a browse) down to those that return TRUE when searched for a particular item of text.

Thanks

Tony

Tracker - Clarion Support
Site Admin
Posts: 1412
Joined: Wed Jun 30, 2004 4:45 pm
Location: Maryland, USA

Re: Searching multiple pdf files for OCR generated text

Post by Tracker - Clarion Support » Wed Jun 20, 2012 11:51 am

Hi Tony!

If you want to use PDF-Tools to search, then you'll have to use the Text Extraction functions. I haven't tried to use these on a document containing OCR-generated text, but they should work. I'll check to see if there are any problems with text extracted from an OCR document. Also, Bob Roos <spudman@wybatap.com> has done extensive work in the PDF-Tools Text Extraction arena, and you might want to touch base with him about how to handle the extracted text elements.
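
For the filtering step itself, here is a rough sketch in Python (not Clarion, and not a PDF-Tools API) of what happens once you have the extracted text for each file. The names normalize, files_containing and extract_text are all made up for illustration; extract_text stands in for whatever wrapper you write around the real text extraction call, and the sample text is invented.

```python
import re
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text, so that line
    breaks between extracted text elements do not hide a phrase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def files_containing(paths: Iterable[str], phrase: str,
                     extract_text: Callable[[str], str]) -> List[str]:
    """Return the paths whose extracted text contains the phrase.

    extract_text is a hypothetical callback: plug in whatever wraps the
    real text extraction call for your toolkit.
    """
    needle = normalize(phrase)
    return [p for p in paths if needle in normalize(extract_text(p))]


if __name__ == "__main__":
    # Stand-in for real extraction: a dict of invented page text.
    fake_text = {
        "a.pdf": "The quick\nbrown   fox",
        "b.pdf": "Nothing relevant here",
    }
    print(files_containing(fake_text, "quick brown fox", fake_text.get))
```

Normalizing whitespace before comparing is the design choice that matters here: extracted text often arrives as separate elements with arbitrary line breaks, so a plain substring test on the raw text can miss a phrase that is visibly on the page.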

However, you may find that using the PDF-Viewer ActiveX in "hidden" mode works better, as I believe that some of the rough spots in text search have been smoothed out, so that if you are searching for "quick brown fox" it should find it no matter how the text is arranged internally in the PDF.

Later: After doing some tests, I think that the Viewer ActiveX is the easiest to use. However, there's a "gotcha" you should be careful about.

No OCR is 100% accurate. However, the way our OCR works on rasterized pages might make you think that it is. The reason is that we do not substitute the interpreted text for the page graphics; instead, the text fields are created invisibly under the graphic text!

When I ran a test using the sample "Little House" PDF, I saw the phrase "chopping block." When I searched for "chopping block," the Viewer did not find it. When I copied the text out and pasted it visibly into my text editor, the reason was quite obvious: the OCR text actually read "ehopping block!" When I searched for that, the Viewer text search found it easily.
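
If you go the text extraction route, one way to soften this gotcha is to match approximately rather than exactly. The following is only a sketch in Python using the standard difflib module, not something the Viewer or PDF-Tools does for you; the function name fuzzy_contains and the 0.85 threshold are assumptions picked for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text: str, phrase: str, threshold: float = 0.85) -> bool:
    """Return True if some window of text, as long as the phrase, is at
    least `threshold` similar to it (difflib ratio, from 0.0 to 1.0)."""
    text, phrase = text.lower(), phrase.lower()
    n = len(phrase)
    if n == 0 or phrase in text:
        return True  # empty phrase or exact hit: no fuzzy work needed
    # Slide a phrase-sized window over the text and score each position.
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        if SequenceMatcher(None, phrase, window).ratio() >= threshold:
            return True
    return False


if __name__ == "__main__":
    ocr_text = "He set the log on the ehopping block."
    print(fuzzy_contains(ocr_text, "chopping block"))
```

A looser threshold catches more OCR slips but also lets in more false matches, and the sliding window is slow on very large documents, so treat this as a starting point to tune against your own files.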
Craig Ransom
Tracker Software - Clarion Support
http://www.tracker-software.com
