cropped scanned pdf problem

Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

cropped scanned pdf problem

Post by Ricaz »

Hi there!
First of all, thanks a lot for the OCR function. It is a wonderful surprise to have such a good function for free!
I am trying to OCR a file I scanned with a photocopying machine (it saves the PDF directly to a USB stick).

First I tried the "preserve original content and add text layer" function, but as far as I could tell there was no result.

Then I tried the "convert page content to image only - add text as a layer" function.
This time the OCRed text was recognised and added to the PDF as a transparent layer. It can be searched, highlighted, selected, copied and so on...
The problem is that in the process the original PDF image was moved leftwards, and now half of it lies beyond the left border of the page and is unreadable. This means that the invisible OCRed text is where it should be, but the original image of the text is somewhere else.

I have tried to attach an image of the resulting OCRed PDF.

I guess this is because I cropped the PDF after scanning it with the photocopier. To crop it, I used BRISS (BRIght Snippet Sire; do you happen to know it?). It seems, therefore, that when the OCR function processes the cropped PDF, it moves the scanned image to the 0,0 coordinates of the original, uncropped PDF, which means that the scanned image now lies partly beyond the visible limits of the cropped file.

I don't know whether I am the only one with this problem, but I thought it could be useful to let you know about it for future development.

Thank you very much. By the way, I selected and copied the invisible OCRed text, and the recognition seems to work pretty well.

Ricaz
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

Of course, the image did not work...
Here is the link:

https://picasaweb.google.com/1126646376 ... n-aVt8CFNg
Paul - Tracker Supp
Site Admin
Posts: 6829
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: cropped scanned pdf problem

Post by Paul - Tracker Supp »

Hi

thanks for that.

I'm not 100% sure what has happened here. I suspect it is related to the crop that was performed prior to the OCR. By definition, a crop in a PDF redefines the visible area of the page; it doesn't actually remove the cropped content. Would that seem consistent with what you are seeing? The best thing you can do here is send us the original PDF, preferably both pre- and post-crop. If you do send the PDFs for us to look at, be sure to zip them in an archive or, like your image, the forum software will strip the attachment.
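
To make the point about cropping concrete, here is a minimal sketch in plain Python. It is an illustration only, not how PDF-XChange works internally: the `Box` type, the `visible_part` helper and all page dimensions are made up for the example. It models a page whose full area (the media box) is unchanged while a crop box hides the left part, and shows what happens if a crop-sized image is anchored at the origin of the full page instead of at the corner of the visible area.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """A PDF-style rectangle: lower-left (x0, y0), upper-right (x1, y1), in points."""
    x0: float
    y0: float
    x1: float
    y1: float


def visible_part(content: Box, crop: Box) -> Optional[Box]:
    """Return the part of `content` that falls inside `crop`, or None if nothing shows."""
    x0, y0 = max(content.x0, crop.x0), max(content.y0, crop.y0)
    x1, y1 = min(content.x1, crop.x1), min(content.y1, crop.y1)
    if x0 >= x1 or y0 >= y1:
        return None
    return Box(x0, y0, x1, y1)


# Hypothetical A4-sized scan whose left 200 pt were hidden by a crop.
media_box = Box(0, 0, 595, 842)    # full page, untouched by the crop
crop_box = Box(200, 0, 595, 842)   # visible area after cropping

# The scanned image still covers the whole media box; only the view changed.
scan = media_box
print(visible_part(scan, crop_box))       # Box(x0=200, y0=0, x1=595, y1=842)

# A re-created image as wide as the crop, but anchored at the media box origin.
crop_width = crop_box.x1 - crop_box.x0    # 395 pt
misplaced = Box(0, 0, crop_width, 842)
print(visible_part(misplaced, crop_box))  # Box(x0=200, y0=0, x1=395, y1=842)
```

With these invented numbers, only about half of the misplaced image (195 of 395 pt) remains inside the visible area, which would look much like the symptom described above.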

While on the subject of posting attachments, since you have that image on a publicly accessible URL you can simply use the BBCode 'Img' tag to use it in your post directly like this:

[img]https://picasaweb.google.com/112664637657417330279/RecentlyUpdated?authkey=Gv1sRgCM-un-aVt8CFNg[/img]
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

Dear Paul,
thanks for your reply.
I guess the cropping is indeed the cause of the problem.
I tried with other files, and it worked when they had not been previously cropped.
I also tried to OCR one file first and only then crop it, and it works perfectly.

if you think it is useful, I attach some files:
goldman: the original, uncropped unOCRed
goldman_cropped: only cropped, not OCRed
goldman_cropped_OCRmode2: cropped first, and only then OCRed... it has the problem of the movement of the image
goldman_mode2_72: only OCRed, with the "convert page content to image only" mode
goldman_mode2_72_cropped: the file above, but cropped after the OCR, and this is good...

As I said, I used BRISS to crop the files.
As far as you know, is there any way to recover the "visible area" of cropped PDF files whose originals have been deleted?
Another question: why does the "preserve original content and add text layer" function not do anything (at least anything I can see)?
Attachments
CroppedPdf.zip
files cropped and OCRed
(2.03 MiB) Downloaded 487 times
Paul - Tracker Supp
Site Admin
Posts: 6829
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: cropped scanned pdf problem

Post by Paul - Tracker Supp »

Hi again Ricaz,

thanks for the samples.
Quote: As far as you know, is there any way to recover the "visible area" of cropped pdf files whose originals have been deleted?
I will need to discuss this with one of my dev team, as I do not know how to recover that cropped area, or whether it is possible. Given that the content is still there and it is just the viewable area that is redefined, I suspect this may be possible; I just need to find out how/if it can be done. It may not be possible with the end user tools.
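
As a rough illustration of why an "undo crop" may be possible in principle, here is a sketch in plain Python. It uses an ordinary dictionary as a hypothetical stand-in for a PDF page object, so it is not a real PDF library and not a Tracker tool; `uncrop` and `effective_crop` are invented names. It relies on one fact from the PDF specification: a page without a /CropBox entry is displayed using its /MediaBox. It also assumes the cropping tool only added a /CropBox and left the /MediaBox and the page content alone, which may not be true for every tool.

```python
def effective_crop(page: dict) -> list:
    """Visible area of a page: its /CropBox if present, otherwise its /MediaBox."""
    return page.get("/CropBox", page["/MediaBox"])


def uncrop(page: dict) -> dict:
    """Return a copy of the page dictionary without its /CropBox entry.

    The original dictionary is left untouched, so the cropped state is not lost.
    """
    return {key: value for key, value in page.items() if key != "/CropBox"}


# Hypothetical page: full area 595 x 842 pt, left 200 pt hidden by a crop.
cropped_page = {"/MediaBox": [0, 0, 595, 842], "/CropBox": [200, 0, 595, 842]}

print(effective_crop(cropped_page))          # [200, 0, 595, 842]
print(effective_crop(uncrop(cropped_page)))  # [0, 0, 595, 842]
```

If the cropping tool rewrote the /MediaBox itself, this approach would not bring the hidden area back, even where the content is still stored in the file.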
Quote: Another question: why does the "preserve original content and add text layer" function does not do anything (at least anything I can see)?
If a PDF is a combination of text and images, this option will leave the text as is and only OCR the images. This is usually the best method to use, as it preserves what is already known about that text. The option to 'Convert Page Content to Image only - Add Text as a Layer' will discard what is already known about the text, convert the entire document to an image, then OCR the entire document. This may be preferable to some users, but it comes at the price of possible OCR errors in text that was already known, so if you don't need this option I suggest using 'Preserve Original Content & Add Text Layer'.

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

Dear Paul,

just to be sure I got it..

In your last post you answered my question:

Quote:
Question: why does the "preserve original content and add text layer" function does not do anything (at least anything I can see)?
Your Answer: If a PDF is a combination of text and images this option will leave the text as is and only OCR the images.

When I tried to use it on image-only PDFs (like the original of the one I attached), I had no result whatsoever. Is this normal?


Isn't there any way to add a transparent OCR layer WITHOUT resampling the original PDF image? The "convert page content to image only" mode of course does convert the image (and it asks for the image quality). This has a heavy cost in terms of file size: with the small samples I attached earlier, I went from a non-OCRed original of 310 KB (with very good image quality, though) to an OCRed file of 530 KB at the lowest resolution (72), which resulted in very bad image quality. I guess that increasing the replacement image resolution to, say, 200 instead of 72 would give a final file about four times the size of the original.
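
A quick back-of-the-envelope check of that guess, as a sketch only. The number of pixels in a resampled page grows with the square of the resolution; how the compressed file size follows depends on the image content and the compression used, so the size figure below assumes, naively, that size scales with pixel count. `pixel_ratio` is a made-up helper name.

```python
def pixel_ratio(old_dpi: float, new_dpi: float) -> float:
    """Factor by which the pixel count grows when resampling from old_dpi to new_dpi."""
    return (new_dpi / old_dpi) ** 2


ratio = pixel_ratio(72, 200)
print(f"{ratio:.1f}x the pixels")  # 7.7x the pixels

# Naive estimate, ASSUMING file size is proportional to pixel count.
size_at_72_kb = 530  # size reported above for the 72 dpi result
print(f"roughly {size_at_72_kb * ratio:.0f} KB at 200 dpi under that assumption")
```

Real compression rarely scales this simply, so this is better read as a pessimistic bound than as a prediction.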

Sorry if I don't get it...

Thank you very much!

Ricaz
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: cropped scanned pdf problem

Post by Tracker Supp-Stefan »

Hello Ricaz,

We will investigate both issues: it should be possible to add just the new invisible layer with the first OCR option (which doesn't work with your sample), and the image should not shift.

As it's the week between Christmas and New Year - this might not be investigated today - but we will get to it as soon as possible and update this topic when we have any more news.

Best,
Stefan
Paul - Tracker Supp
Site Admin
Posts: 6829
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: cropped scanned pdf problem

Post by Paul - Tracker Supp »

Hi Ricaz,

sorry to have to tell you but the OCR specialist who would look at this post is on holidays until next week. Will it be OK for you to wait for his return?

regards
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

HI there!
Mates, I'm not in a rush... there is no problem at all (I should be writing my thesis, so that's better actually)... let him enjoy his holiday :)
thank you very much for your attention, though!

Ricaz
Jamie - Tracker Supp
User
Posts: 191
Joined: Thu Jun 02, 2011 3:23 pm

Re: cropped scanned pdf problem

Post by Jamie - Tracker Supp »

Thanks for your patience!

Regards,
Jamie
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: cropped scanned pdf problem

Post by Walter-Tracker Supp »

As I understand it, there are two issues here?

1. Image placement in cropped PDFs is incorrect after OCR.
2. Preserve original content is not working for you with an image-only PDF (different PDF, not cropped?)

Thank you,
Walter
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

Hi Walter,
thanks for your post.
Yes, there are two issues, and you got them right. Anyway, I will try to summarize:

1. Image placement in cropped PDFs is incorrect after OCR.
If the PDF has been previously cropped, then when the OCR is done in "Convert Page Content to Image only" mode the invisible text layer is placed in the right place, but the visible re-created image layer is displaced, I guess because its 0,0 coordinates are reset according to the pre-crop PDF page dimensions. Therefore, since I normally crop the left part of the page, the re-created image is moved leftwards. There is a snapshot of this "wrong" outcome above.
However, if I first do the OCR in "Convert Page Content to Image only" mode and only crop afterwards, the file is OK.
Is there any way to get a correct result for PDFs whose uncropped original has been deleted? Maybe a kind of undo-crop? Or a way of making the OCR understand that the original image has been cropped?

2. Second point:
Preserve original content is not working for you with an image-only PDF.
In your reply you added: "different PDF, not cropped"... actually, it does not seem to work with any kind of scanned pdf...

Actually I found out (just now!) that this "preserve original" mode works on a particular kind of file, i.e. PDF files I created from pictures with "bullzip pdf printer" (by pictures I literally mean photographs of book pages, taken with a camera). There is just a small, quite funny problem, i.e. the orientation of the words in the invisible layer (see below and the attachment).

Seeing that it works with PDFs made from pictures, I just tried it with a PDF which I had downloaded and which was originally not searchable. The OCR works perfectly here too.
Therefore, I guess the problem I had lies in the nature of the PDF I was trying to OCR. I created those PDFs with a photocopying machine which scans and saves as PDF directly to a USB stick: with files created in this way (both cropped and uncropped), OCR with "preserve original content" does not work at all. If I OCR one of these PDFs, there is no change (no search, no selection); however, if I save the PDF after the OCR, the file size has increased by roughly a third.

... however, I just tried to "print" one of these photocopier-scanned PDFs with "bullzip pdf printer"... if I OCR the re-printed file it works lol


However, there is the same, quite funny problem mentioned above for the picture-based PDF. Individual words are (sort of) rotated 90 degrees clockwise (see the attachment; it is rather difficult to explain). This issue affects the font-based tools (highlight, underline) and selection, but the search function is fine (and that's already a massive improvement for me...).

Tomorrow I'll try to find out the brand and model of the photocopying machine I used. Are you interested in knowing it?
In the meantime I attach a sample of a file scanned with that machine, reprinted with "Bullzip pdf printer" and only then OCRed...
Attachments
godmanBULLZIP2.7z
(382.21 KiB) Downloaded 470 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: cropped scanned pdf problem

Post by Walter-Tracker Supp »

Thank you; part of this relates to another issue that we are in the process of fixing, but we will investigate the other issues here and provide fixes as soon as possible.

Thanks for letting us know and we will update you as we work on this.
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: cropped scanned pdf problem

Post by Walter-Tracker Supp »

An internal ticket has been created to track this issue (Ticket #1396).

-Walter
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

Sorry to open this discussion again... just asking whether you are going to post here when this issue is dealt with...
thanks!
Paul - Tracker Supp
Site Admin
Posts: 6829
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: cropped scanned pdf problem

Post by Paul - Tracker Supp »

Hi Ricaz,

no need to apologize! Yes - this forum thread is referenced in the Support Ticket so once a resolution is found we will update you here on this thread.

Looking at the ticket just now I see it has a developer assigned to it but has not yet been resolved.

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Ricaz
User
Posts: 22
Joined: Sun Oct 30, 2011 11:18 am

Re: cropped scanned pdf problem

Post by Ricaz »

It seems the problem has been solved, hasn't it? At least, with 2.5.201.0 the OCR works without major problems...
Paul - Tracker Supp
Site Admin
Posts: 6829
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: cropped scanned pdf problem

Post by Paul - Tracker Supp »

Hi Ricaz

indeed that was addressed in 201. Sorry for not posting that here. The ticket is closed.

Have a great day!

:-)
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com