Editing "Text Under Image" PDF Files
-
- User
- Posts: 1
- Joined: Thu Apr 18, 2013 7:56 pm
Editing "Text Under Image" PDF Files
When editing "Text Under Image" PDF files, the PDF-XChange Editor deletes the OCR-produced hidden text object and replaces it with a text object that overlays the original scanned page image. This is never the desired result.
The PDF-XChange Editor could be the premier application for OCR review if it did two things: made it easy to review the OCR-generated text hidden "under" the original scanned page image, similar to the dialog box view provided by PDF-XChange Viewer, and allowed that text to be edited in place so that OCR errors can be corrected. As it is, the Infix PDF editor is the best available tool for reviewing and editing OCR-generated PDF files, but the overlay of the hidden OCR-generated text on top of the original scanned page image can make it difficult to spot subtle OCR errors.
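For readers unfamiliar with how "text under image" files work: in the PDF format, text can be made invisible by setting text rendering mode 3 with the `Tr` operator in the page content stream, so it stays selectable and searchable without being painted over the scan. Whether PDF-XChange's OCR uses exactly this mechanism is an assumption here. The sketch below only illustrates the idea of returning edited text to invisible status on a hand-written, uncompressed content stream fragment; the function name and the sample stream are hypothetical, and the regular expressions are deliberately naive (they would also touch a literal string that happens to contain `3 Tr`).

```python
import re

# Matches a text rendering mode operator such as "0 Tr" or "3 Tr".
_TR_OPERATOR = re.compile(r"\b([0-7])\s+Tr\b")

INVISIBLE = 3  # PDF text rendering mode 3: neither fill nor stroke the glyphs


def make_text_invisible(content: str) -> str:
    """Return the content stream with every text block set to render mode 3."""
    # Rewrite any existing "N Tr" operator to "3 Tr".
    content = _TR_OPERATOR.sub(f"{INVISIBLE} Tr", content)

    # Add "3 Tr" to text blocks (BT ... ET) that do not set a mode at all.
    def ensure_mode(match: re.Match) -> str:
        block = match.group(0)
        if _TR_OPERATOR.search(block):
            return block
        return block.replace("BT", f"BT\n{INVISIBLE} Tr", 1)

    return re.sub(r"\bBT\b.*?\bET\b", ensure_mode, content, flags=re.S)


# Hypothetical fragment: one visible text block, one with no mode set.
sample = "BT\n/F1 12 Tf\n0 Tr\n(Hello) Tj\nET\nBT\n/F1 12 Tf\n(World) Tj\nET"
hidden = make_text_invisible(sample)
```

A real tool would parse the content stream properly and track the graphics state rather than pattern-match on text, but the end result is the same: the corrected text keeps its position under the image and is simply not drawn.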
-
- Site Admin
- Posts: 6901
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: Editing "Text Under Image" PDF Files
Hi Ned,
Thanks for the post and welcome to the forum. I am not sure I fully understand the subtleties of the differences between the way the Viewer and the Editor work with OCR documents. I have asked the OCR lead developer to take a look at this thread and comment if he feels it necessary.
hth
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
-
- User
- Posts: 102
- Joined: Thu Feb 14, 2008 11:28 am
Re: Editing "Text Under Image" PDF Files
Is there a chance of a comment from the lead developer?
The OCR capabilities of PDF Editor (build 301) are noticeably inferior to those of PDF Viewer. I've experienced missing and wrong characters, OCR'd documents that have a different orientation to the original and blank pages containing invisible text.
Although the blank pages issue is a known bug, and allegedly fixed in 302, it would be good to understand why there are differences in capability between the two applications, and when the Editor will achieve as good results as the Viewer.
-
- Site Admin
- Posts: 17948
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Editing "Text Under Image" PDF Files
Hi Bob,
This is due to the transition and different internal structure of the two products. I assure you that we are working on improving the OCR in the Editor and making it even better than in the Viewer, and I will also pass this topic to Walter (our OCR lead dev) so that he can also post here any tips "from the kitchen" he might have.
Best,
Stefan
-
- User
- Posts: 102
- Joined: Thu Feb 14, 2008 11:28 am
Re: Editing "Text Under Image" PDF Files
Thanks for the update Stefan.
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Editing "Text Under Image" PDF Files
If you could provide an example document for us to examine we would appreciate it. In our testing, the OCR in the editor has much better layout analysis and generally gives higher quality results than the viewer. In particular, differentiation of text and image regions is much better and the overall quality of recognition is improved.
However, any examples that exhibit problems would be nice to see, so that we can fix undiscovered bugs or otherwise provide improvements. If you could attach something to this forum post we would love to see it; otherwise, you can email examples to support@pdf-xchange.com for us to look at.
As for editing OCR'd documents, that's a separate issue related to returning text to invisible status after editing and I will defer that to one of the editing gurus.
Thanks!
-Walter
-
- User
- Posts: 102
- Joined: Thu Feb 14, 2008 11:28 am
Re: Editing "Text Under Image" PDF Files
Walter,
I've attached a couple of examples:
1. The "source" archive contains a cropped scan that has been OCR'd by PDF-XChange Viewer and Editor (the latter as a new file). The select tool was activated for each file, and all text from each file was copied and pasted to a corresponding txt file. The txt file from PDF Viewer is significantly more readable and contains words not present in the PDF Editor text file. Also, page 2 of the pdf produced by PDF Editor is completely blank.
2. The "emergency" archive is admittedly harder to OCR as it's in colour. Nevertheless, virtually no words are correctly identified by PDF Editor. Also, the portrait page has been rotated 90° but is still shown in portrait, causing part of the page to be truncated.
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Editing "Text Under Image" PDF Files
Thanks; this appears to be a bug in the handling of certain types of page layout, not directly related to OCR but definitely having a big impact as it results in incorrect page orientations being passed to the OCR routines. It will most likely be addressed in the next build (probably a week or so away).
Thanks for bringing it to our attention.
-Walter
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Editing "Text Under Image" PDF Files
The bug has been resolved and the fix will be present in the next release version of the Editor and OCR Plugin (next week, I believe, although I'm not responsible for the release schedule so I may defer to someone else to weigh in on that one).
However, I would like to point out that your document "source.pdf" has an intrinsic problem for OCR, in that it is scanned as two pages with different tilts to them (i.e., an open book). Auto-deskew treats the PDF page as a single document page. For the first page of source.pdf, this means that it ends up being corrected for the second page (right side of scan) but the first page (left side of scan) is therefore even more tilted. This results in a lot of OCR errors.
The best way to OCR text like this is to ensure that the pages are both level (or tilted the same amount), or even better, to scan the pages separately.
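As a rough illustration of the open-book problem: a single rotation chosen for one half of the scan is applied to both halves, so the other half can end up more tilted than it started. The angles below are made up and the function is hypothetical; this is plain arithmetic, not how the auto-deskew is actually implemented.

```python
def residual_tilts(left_deg: float, right_deg: float, correction_deg: float):
    """Tilt left on each half after rotating the whole page by one angle."""
    return left_deg + correction_deg, right_deg + correction_deg


# Made-up example: left half tilted -2 degrees, right half tilted +3 degrees.
left, right = -2.0, 3.0

# Whole-page deskew that happens to lock onto the right half.
whole_page = residual_tilts(left, right, correction_deg=-right)

# Each half corrected by its own angle, as if the pages were scanned separately.
separate = (left - left, right - right)

print("whole page:", whole_page)
print("separate:  ", separate)
```

With these numbers the right half comes out level, while the left half goes from 2 degrees of tilt to 5. When each half is corrected by its own angle, both come out level.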
However, the major issues with weird layout have been corrected and emergency.pdf and source.pdf now OCR as expected.
-Walter
-
- User
- Posts: 102
- Joined: Thu Feb 14, 2008 11:28 am
Re: Editing "Text Under Image" PDF Files
Walter - thanks for the feedback and clarification.
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Editing "Text Under Image" PDF Files
BobM wrote: Walter - thanks for the feedback and clarification.
No problem!
Please note that the feedback I gave was more general; the problems you saw are directly related to the bug which has been resolved now (will be in the release available most likely by the end of today).