Searching OCRed text in Adobe Reader

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

mcjtom
User
Posts: 2
Joined: Sun Dec 23, 2012 3:09 am

Searching OCRed text in Adobe Reader

Post by mcjtom »

Hi,

I successfully OCR'ed a PDF in PDF-XChange Viewer - now I can select, copy and search text.

When the same document is opened in Adobe Reader X, the text is there - it can be highlighted and copied, but a search for anything returns no results (as if the text wasn't there).

Related question: is there a way of checking how much text the document actually contains (e.g. word or character count)?

Cheers!
Paul - Tracker Supp
Site Admin
Posts: 6837
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada
Contact:

Re: Searching OCRed text in Adobe Reader

Post by Paul - Tracker Supp »

Hi mcjtom

the text should be searchable in Adobe. Can you post the document for us to look at?

regards
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
mcjtom
User
Posts: 2
Joined: Sun Dec 23, 2012 3:09 am

Re: Searching OCRed text in Adobe Reader

Post by mcjtom »

Thanks. I can't post the document because it is some 100 MB and also somewhat sensitive. I'm not sure what else to do to show you the problem.

The document is searchable in Xchange, and some others (e.g. Sumatra), but not Adobe. Again, Adobe recognizes the text somehow (one can highlight and copy it), but the search returns nothing.

I tried to extract a single page from this document to show it, but the single page has no problem (i.e. it becomes Adobe searchable, while the whole document - some 500 pages - is not). Any suggestions?

Cheers!
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom
Contact:

Re: Searching OCRed text in Adobe Reader

Post by John - Tracker Supp »

Not really - without seeing the document, it's very hard to comment, I am afraid.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

Hi,

I also noticed that the OCR was searchable in PDF-XChange Viewer and saved the file. When it was opened in Adobe Reader only a few nonsense fragments were recognized.

I’ll zip the files for your review.

System: Windows XP Pro with SP3
Download URL: www.tracker-software.com/product/pdf-xchange-viewer
PDF-XChange Viewer: Portable Zip version 2.5.210 (Feb 25, 2013 – date from Help. Date on the URL-page above is 5 Mar 2013)
OCR_zip: ocrdats (dat file dates Dec 5, 2011)
Adobe Reader XI (11)

Hopefully there is a workaround. I want to wean myself from years of using Acrobat.

MarkinAZ
Attachments
OCR-problem-info.zip
more info: see .txt file in zip
(2.73 MiB) Downloaded 236 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

Hi,

Thanks very much for your samples. We appreciate this kind of feedback as it really helps us improve our products.

I have looked into this and have found the following:

First, I was not able to completely reproduce your problem - using the document you provided, in Adobe Reader XI (11.0.2) I was able to find 3 instances of "Motley" (although it missed the one in the image box, upper left corner of first page), and I did not get matches to the garbage text you noticed. Maybe you need to upgrade your Adobe software?

The bigger issue (the reason the second instance of "Motley" is missed by Adobe) is one of OCR layout analysis. Pages with complex text layout (like this one - with multiple text boxes, images with text overlaid on them, etc) present a few small problems to the OCR layout analysis engine implemented in our PDF-Viewer. Because text with complex layout can be placed in somewhat irregular ways, different search engines have different levels of success in finding things - partly because of things like spacing and alignment thresholds used to determine whether or not a series of letters on the page constitute a contiguous "word" or not.
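To illustrate the threshold point, here is a minimal sketch, not how our engine or Adobe's actually works. The `Glyph` type, the `group_into_words` function, and the coordinates are all made up for the example. It joins letters on one line into words and starts a new word whenever the gap between two letters exceeds a threshold:

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points (hypothetical values)
    width: float  # advance width of the glyph, in points


def group_into_words(glyphs, gap_threshold):
    """Join glyphs on one line into words.

    A horizontal gap wider than gap_threshold starts a new word.
    """
    words = []
    current = ""
    prev_right = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_right is not None and g.x - prev_right > gap_threshold:
            words.append(current)
            current = ""
        current += g.char
        prev_right = g.x + g.width
    if current:
        words.append(current)
    return words


# "Motley" placed by OCR with an irregular 2.5 pt gap between "t" and "l".
line = [
    Glyph("M", 0.0, 8.0),
    Glyph("o", 8.5, 5.0),
    Glyph("t", 14.0, 3.0),
    Glyph("l", 19.5, 2.5),
    Glyph("e", 22.5, 5.0),
    Glyph("y", 28.0, 5.0),
]

strict = group_into_words(line, gap_threshold=2.0)   # splits at the wide gap
lenient = group_into_words(line, gap_threshold=3.0)  # keeps the word whole
print(strict, lenient)
```

With the stricter threshold, a search for "Motley" finds nothing on this line, even though the same letters are selectable and copyable. A reader with a more lenient threshold finds the word.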

Having said this, the new PDF-XChange Editor has a vastly improved OCR layout analyzer. After running your document through the Editor's OCR plugin I was able to find all instances of Motley, and the saved file was searchable in the Viewer, Editor, and competitors' products. Selection of text on the page is also much better in this iteration of our OCR offerings.

Hope this helps.

-Walter
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

Hi -

I checked, and Adobe 11.0.02 was installed for tests I ran. I have removed and re-installed it with identical results viewing the file with Adobe.

The PDF-XChange OCR does work well and the document is completely readable within PDF-XChange Viewer. It does a smart job of producing OCR on a scanned document with pictures and columns.

It is not clear whether a PDF-XChange document with OCR sent to others who use Adobe will get the correct search results or see the problems with somewhat random characters as indicated in the screen shot I forwarded.

Your test of the document found fewer problems than I found. Adobe 11.0.02 found 3 instances of motley (and not the odd characters I see), but it did not see all 8 instances identified by PDF-XChange Viewer.

I am not sure how to proceed next.

MarkinAZ
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

Checking again, it looks like the document has two layers of invisible OCR text in it (one on top of the other). Maybe it had an extra OCR layer added by your scanning software, or perhaps it was OCR'd twice by using the "Preserve existing content" option in the viewer. This is probably what is causing all of your woes.

There are really only four cases of the word "Motley" that I can find in the text; the viewer finds two copies of each instance because of the doubled up text layers. I guess this problem is confusing Adobe somehow.
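As a rough sketch of why doubled layers double the count (the layer data, coordinates, and function names below are hypothetical, not taken from your file): if each text layer is a list of words with positions, a plain search reports every copy, while collapsing hits that sit at nearly the same spot gives the true number.

```python
def find_hits(layers, term):
    """Return (page, x, y) for every word in every layer that matches term, ignoring case."""
    hits = []
    for layer in layers:
        for word, page, x, y in layer:
            if word.lower() == term.lower():
                hits.append((page, x, y))
    return hits


def distinct_hits(hits, tolerance=1.0):
    """Collapse hits on the same page that lie within tolerance points of a kept hit."""
    kept = []
    for page, x, y in hits:
        duplicate = any(
            page == p and abs(x - kx) <= tolerance and abs(y - ky) <= tolerance
            for p, kx, ky in kept
        )
        if not duplicate:
            kept.append((page, x, y))
    return kept


# Hypothetical first OCR layer: (word, page, x, y) in points.
layer_a = [
    ("Motley", 1, 72.0, 700.0),
    ("report", 1, 120.0, 700.0),
    ("Motley", 1, 72.0, 400.0),
    ("Motley", 2, 90.0, 650.0),
    ("motley", 2, 200.0, 300.0),
]
# A second OCR pass lays the same words on top, shifted by a fraction of a point.
layer_b = [(w, p, x + 0.3, y - 0.2) for w, p, x, y in layer_a]

raw = find_hits([layer_a, layer_b], "Motley")
unique = distinct_hits(raw)
print(len(raw), len(unique))
```

Here the raw search reports twice as many hits as there are distinct positions on the pages, which is the same pattern as the Viewer's count versus the four real occurrences.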

Can you go back to the source document and OCR it just once? Or, take this document and OCR it using the "Convert Page Content to Image Only" option to make sure the old text layers are removed?
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

When you mentioned running OCR only once that struck a chord, and I probably ran some files more than once.

I’ll include the scanned jpg files since they are only 1+ MB each. However, the OCR-image-only files are each 10.6+ MB so I’ll send them if you still want them.

The jpg files were converted using PDF-24 Creator / Editor v5.4.0, which has these selections: PDF1.2, PDF1.3, PDF1.4, PDF1.5, PDF/X-3, PDF/A-1 and PDF/A-2. There is no manual so I briefly reviewed information on Wikipedia (wiki/Portable_Document_Format), where there are many additional versions.

The scanned jpg files were converted using PDF1.4 and PDF1.5 because 1.4 was default and 1.5 was next to it. This isn’t very well chosen, but I don’t know enough to make an informed choice. I’ll need to explore the forums on this.

If you have a recommendation I’ll try it.

Here are attached files:
MF-1.jpg: jpg of page 1
MF-2.jpg: jpg of page 2
MF-1&2-jpg-to-pdf-1.4-to-ocr.pdf: Converted with PDF-XChange Viewer portable v2.5.210
MF-1&2-jpg-to-pdf-1.5-to-ocr.pdf: Converted with PDF-XChange Viewer portable v2.5.210

Care was taken to run the OCR conversion only once and to save the file only once.

Results:
PDF-XChange Viewer
  MF-1&2-jpg-to-pdf-1.4-to-ocr.pdf: 4 instances of “motley”, same as you described
  MF-1&2-jpg-to-pdf-1.5-to-ocr.pdf: 4 instances of “motley”, same as you described
Adobe Reader 11.0.02
  MF-1&2-jpg-to-pdf-1.4-to-ocr.pdf: 0 instances of “motley”
  MF-1&2-jpg-to-pdf-1.5-to-ocr.pdf: 3 instances of “motley”

Looks like you’re working after hours to answer this – Thanks,

MarkinAZ

PS: Acrobat has an Optimize Scanned Document feature that reduces the file size after OCR.
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

Hi MarkinAZ,

I did not get your attachments; make sure to zip them and attach the zip file. Our forum software removes some file extensions automatically to mitigate spamming.

The search output differs because words in a PDF document are not necessarily connected logically, the way they might be in a text file or Word document, so search engines also look at the placement of letters to decide which sets of letters make up "words". Since everybody (Adobe, us, other competitors) does this differently, search results will differ between products.

There are two ways to help with this. The first is to use the best OCR layout engine possible - the layout engine is the part that figures out how words, sentences, and paragraphs are connected on the page. We have added a greatly improved layout analyzer to our newest offering (the PDF-XChange Editor), and I have tried your document here and the results are a lot better. The second is to make sure documents are de-skewed (i.e., level). Your example document looks good in this respect, but we also offer the Auto-Deskew option in the PDF-XChange Editor. Both of these things will improve layout and the placement of text into continuous words, and therefore searchability across products.
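A small sketch of why skew matters, under simplifying assumptions: letter positions are plain (x, y) points, and lines are found by clustering on y with a fixed tolerance. The helper names and numbers are invented for the example, and a real de-skew works on the scanned image rather than on coordinates.

```python
import math


def rotate(points, degrees):
    """Rotate (x, y) points about the origin by the given angle in degrees."""
    a = math.radians(degrees)
    return [
        (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))
        for x, y in points
    ]


def count_lines(points, y_tolerance):
    """Count text lines by clustering on y.

    A point joins the current line if its y lies within y_tolerance
    of the first (lowest) point of that line; otherwise it starts a new line.
    """
    line_starts = []
    for _, y in sorted(points, key=lambda p: p[1]):
        if line_starts and y - line_starts[-1] <= y_tolerance:
            continue
        line_starts.append(y)
    return len(line_starts)


# One level line of text: nine letters on the same baseline (hypothetical).
level = [(float(x), 100.0) for x in range(0, 401, 50)]
skewed = rotate(level, 2.0)       # the page was scanned 2 degrees off level
deskewed = rotate(skewed, -2.0)   # rotating back restores the baseline

print(count_lines(level, 3.0), count_lines(skewed, 3.0), count_lines(deskewed, 3.0))
```

On the skewed copy the single line is broken into several "lines", so its letters can no longer be joined into continuous words. After rotating back it is one line again.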

As for version numbers, for OCR it is really not too critical which PDF version number you create when scanning. PDF 1.4 or 1.5 are fine choices. I wouldn't recommend using PDF/A-1 and PDF/A-2, since these are restricted formats (the "A" stands for "Archiving", so a lot of operations are not allowed on these files in an effort to preserve the integrity of their contents). PDF/X also has some restrictions on it.
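If you want to check which version a converter actually wrote, a PDF file begins with a header of the form `%PDF-x.y`. The sketch below reads it from raw bytes. The function name is made up, and it looks only at the header: from PDF 1.4 onward the document catalog may carry a `/Version` entry that overrides it, and PDF/A or PDF/X conformance is declared elsewhere in the file, so treat the result as a first hint only.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header near the start of the data, or None."""
    # Search only the first 1024 bytes, where the header is expected to appear.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


# Usage on in-memory samples; for a real file, pass open(path, "rb").read(1024).
sample_14 = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
sample_15 = b"%PDF-1.5\n"
print(pdf_header_version(sample_14), pdf_header_version(sample_15))
```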

-Walter
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

Walter,

Thanks for the reply and info about OCR.

I missed the "file too big" message (5 MB max). This time there are 2 separate submissions.

This submission: OCR-problem-info-2a.zip

Mark
Attachments
OCR-problem-info-2a.zip
(3.69 MiB) Downloaded 230 times
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

Walter -

2nd submission: OCR-problem-info-2b.zip

MarkinAZ
Attachments
OCR-problem-info-2b.zip
(3.9 MiB) Downloaded 234 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

Thanks for the files.

I can find 3 instances of "Motley" in both files using Adobe's reader - I'm not sure why you see differences.

As explained already, our Editor performs a better layout analysis on this particular document. I will attach the result in the next post.

Not sure what else I can say that I haven't said above... thanks for the supporting information though, we do appreciate it.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

Here's the output using our new OCR layout engine from the Editor. Can you confirm that you can also find all 4 instances of "Motley" using your reader(s) of choice?

If you select text in this file and in the OCR output from the Viewer (which you already provided), you should be able to clearly see the improvement in the text layout from our new OCR version.

Nevertheless, in our testing (and that of our customers), for most documents the old OCR does a very good to excellent job and provides easy searching of text.

Hope this helps!

-Walter
Attachments
MotleyOutput.zip
(2.15 MiB) Downloaded 235 times
MarkinAZ
User
Posts: 6
Joined: Sat May 04, 2013 2:03 am

Re: Searching OCRed text in Adobe Reader

Post by MarkinAZ »

Walter,

Yes, very nice!

There were just the 4 instances of "motley" detected as desired.

Searches for other words also performed as expected. Also, no extraneous characters appeared in the searches - one of my original reasons for posting on this forum.

Bottom line - time to get the PDF-XChange Editor !!!

I take it this means to choose (referring to the product comparison page):
* PDF-Tools
* PDF-XChange Standard
* PDF-XChange Pro

Thanks for being so responsive. It's quite a confidence builder.

MarkinAZ
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Searching OCRed text in Adobe Reader

Post by Walter-Tracker Supp »

The free version of the editor is currently available for download, and it includes the OCR feature.

If you want to activate the "pro" features (document editing, etc), they are available to those who hold a current viewer, tools, or pro license. After we finalize a few features we will offer the editor (pro version) as a stand-alone product.

-Walter