Editing "Text Under Image" PDF Files

Forum for the PDF-XChange Editor - Free and Licensed Versions


Ned Nowotny
User
Posts: 1
Joined: Thu Apr 18, 2013 7:56 pm

Editing "Text Under Image" PDF Files

Post by Ned Nowotny »

When editing "Text Under Image" PDF files, the PDF-XChange Editor deletes the OCR-produced hidden text object and replaces it with a visible text object that overlays the original scanned page image. This is never the desired result.

If the PDF-XChange Editor made it easy to review the OCR-generated text hidden "under" the original scanned page image (similar to the dialog box view provided by PDF-XChange Viewer) and allowed that text to be edited in place so that OCR errors can be corrected, it could be the premier application for OCR review. As it is, the Infix PDF editor is the best available tool for reviewing and editing OCR-generated PDF files, but its overlay of the hidden OCR-generated text on top of the original scanned page image can make it difficult to spot subtle OCR errors.
Paul - Tracker Supp
Site Admin
Posts: 6901
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Editing "Text Under Image" PDF Files

Post by Paul - Tracker Supp »

Hi Ned,

Thanks for the post and welcome to the forum. I am not sure I fully understand the subtleties of the differences between the way the Viewer and the Editor work with OCR documents. I have asked the OCR lead developer to take a look at this thread and comment if he feels it necessary.

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
BobM
User
Posts: 102
Joined: Thu Feb 14, 2008 11:28 am

Re: Editing "Text Under Image" PDF Files

Post by BobM »

Is there a chance of a comment from the lead developer?

The OCR capabilities of PDF Editor (build 301) are noticeably inferior to those of PDF Viewer. I've experienced missing and wrong characters, OCR'd documents that have a different orientation to the original and blank pages containing invisible text.
Although the blank pages issue is a known bug, and allegedly fixed in 302, it would be good to understand why there are differences in capability between the two applications, and when the Editor will achieve as good results as the Viewer.
Tracker Supp-Stefan
Site Admin
Posts: 17948
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Editing "Text Under Image" PDF Files

Post by Tracker Supp-Stefan »

Hi Bob,

This is due to the transition and different internal structure of the two products. I assure you that we are working on improving the OCR in the Editor and making it even better than in the Viewer, and I will also pass this topic to Walter (our OCR lead dev) so that he can also post here any tips "from the kitchen" he might have.

Best,
Stefan
BobM
User
Posts: 102
Joined: Thu Feb 14, 2008 11:28 am

Re: Editing "Text Under Image" PDF Files

Post by BobM »

Thanks for the update Stefan.
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Editing "Text Under Image" PDF Files

Post by Walter-Tracker Supp »

If you could provide an example document for us to examine we would appreciate it. In our testing, the OCR in the editor has much better layout analysis and generally gives higher quality results than the viewer. In particular, differentiation of text and image regions is much better and the overall quality of recognition is improved.

However, any examples that exhibit problems would be nice to see, so that we can fix undiscovered bugs or otherwise provide improvements. If you could attach something to this forum post we would love to see it, otherwise you can email examples to support@pdf-xchange.com for us to look at.

As for editing OCR'd documents, that's a separate issue related to returning text to invisible status after editing and I will defer that to one of the editing gurus.
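For readers unfamiliar with the term, "invisible status" most likely corresponds to PDF text rendering mode 3 (set with the `Tr` operator), which draws text that is neither filled nor stroked. Whether the Editor uses exactly this mechanism is an assumption, not something confirmed in this thread. A minimal Python sketch that checks a decoded content stream for that mode could look like the following; the stream fragments are hand-written for illustration, and the regular expression is deliberately naive (it does not skip string literals).

```python
import re

# Matches the PDF text rendering mode operator, e.g. "3 Tr".
TR_PATTERN = re.compile(r"(\d+)\s+Tr\b")


def rendering_modes(content_stream: str) -> list[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(m.group(1)) for m in TR_PATTERN.finditer(content_stream)]


def has_invisible_text(content_stream: str) -> bool:
    """True if any text is set to mode 3 (neither filled nor stroked)."""
    return 3 in rendering_modes(content_stream)


# Hypothetical content stream fragments, not taken from a real file.
ocr_layer = "BT /F1 12 Tf 3 Tr 72 700 Td (Hello) Tj ET"
edited_layer = "BT /F1 12 Tf 0 Tr 72 700 Td (Hello) Tj ET"

print(has_invisible_text(ocr_layer))
print(has_invisible_text(edited_layer))
```

If a page's hidden OCR layer reports mode 3 before an edit and mode 0 afterwards, the text has become visible on top of the scan, which matches the behaviour described in the first post.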

Thanks!

-Walter
BobM
User
Posts: 102
Joined: Thu Feb 14, 2008 11:28 am

Re: Editing "Text Under Image" PDF Files

Post by BobM »

Walter,

I've attached a couple of examples:
1. The "source" archive contains a cropped scan that has been OCR'd by both PDF-XChange Viewer and Editor (the latter as a new file). The select tool was activated for each file, and all text from each file was copied and pasted into a corresponding txt file. The txt file from PDF Viewer is significantly more readable, and contains words not present in the PDF Editor text file. Also, page 2 of the pdf produced by PDF Editor is completely blank.

2. The "emergency" archive is admittedly harder to OCR as it's in colour. Nevertheless, virtually no words are correctly identified by PDF Editor. Also, the portrait page has been rotated 90° but is still shown in portrait, causing part of the page to be truncated.
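
For anyone wanting to repeat the comparison in item 1 in a less manual way, the two pasted txt files can be compared with Python's standard `difflib` module. This is only a rough sketch: the function names are made up for illustration, and the sample strings below are invented stand-ins, not the contents of the attached files.

```python
import difflib


def ocr_similarity(text_a: str, text_b: str) -> float:
    """Word-level similarity between two OCR outputs, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def words_missing_from(reference: str, candidate: str) -> list[str]:
    """Words in the reference text that never appear in the candidate text."""
    candidate_words = {w.lower() for w in candidate.split()}
    return [w for w in reference.split() if w.lower() not in candidate_words]


# Invented sample text standing in for the two pasted txt files.
viewer_text = "In case of emergency break the glass"
editor_text = "In ca5e of emergency the gla55"

print(ocr_similarity(viewer_text, editor_text))
print(words_missing_from(viewer_text, editor_text))  # ['case', 'break', 'glass']
```

A low similarity score, or a long list of missing words, gives a quick indication of which application produced the more complete text layer for a given scan.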
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Editing "Text Under Image" PDF Files

Post by Walter-Tracker Supp »

Thanks; this appears to be a bug in the handling of certain types of page layout, not directly related to OCR but definitely having a big impact as it results in incorrect page orientations being passed to the OCR routines. It will most likely be addressed in the next build (probably a week or so away).

Thanks for bringing it to our attention.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Editing "Text Under Image" PDF Files

Post by Walter-Tracker Supp »

The bug has been resolved and the fix will be present in the next release version of the Editor and OCR Plugin (next week, I believe, although I'm not responsible for the release schedule so I may defer to someone else to weigh in on that one).

However, I would like to point out that your document "source.pdf" has an intrinsic problem for OCR, in that it is scanned as two pages with different tilts to them (i.e., an open book). Auto-deskew treats the PDF page as a single document page. For the first page of source.pdf, this means that the deskew ends up being correct for the second book page (right side of the scan), but the first book page (left side of the scan) is therefore even more tilted. This results in a lot of OCR errors.
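
To make the arithmetic concrete, here is a small Python sketch of what a single page-wide deskew rotation does to each half of an open-book scan. The tilt angles are invented for illustration, and the function is a simplified model, not the actual auto-deskew code.

```python
def residual_tilts(left_tilt: float, right_tilt: float) -> tuple[float, float]:
    """Tilt left on each half after one rotation that levels the right half.

    Angles are in degrees. The whole PDF page is rotated by a single
    correction, here assumed to be the one that levels the right half.
    """
    correction = right_tilt
    return left_tilt - correction, right_tilt - correction


# Hypothetical tilts: left page leans 2 degrees one way, right page 3 the other.
left, right = residual_tilts(left_tilt=-2.0, right_tilt=3.0)
print(left, right)  # -5.0 0.0
```

With these example numbers the right half ends up level, while the left half goes from 2 degrees of tilt to 5 degrees, which is why recognition on that side gets worse rather than better.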

The best way to OCR text like this is to ensure that the pages are both level (or tilted the same amount), or even better, to scan the pages separately.

However, the major issues with weird layout have been corrected and emergency.pdf and source.pdf now OCR as expected.

-Walter
BobM
User
Posts: 102
Joined: Thu Feb 14, 2008 11:28 am

Re: Editing "Text Under Image" PDF Files

Post by BobM »

Walter - thanks for the feedback and clarification.
Tracker Supp-Stefan
Site Admin
Posts: 17948
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Editing "Text Under Image" PDF Files

Post by Tracker Supp-Stefan »

:)
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Editing "Text Under Image" PDF Files

Post by Walter-Tracker Supp »

BobM wrote: Walter - thanks for the feedback and clarification.

No problem!

Please note that the feedback I gave was more general; the problems you saw are directly related to the bug which has been resolved now (will be in the release available most likely by the end of today).