cropped scanned pdf problem
Hi there!
First of all, thanks a lot for the OCR function. It is a wonderful surprise to have such a good function for free!
I am trying to OCR a file I scanned with a photocopying machine (pdf saved directly to the usb stick).
First I tried the "preserve original content and add text layer" function, but as far as I could tell there was no result.
Then I tried the "convert page content to image only - add text as a layer" function.
At this point, the OCRed text was recognised and added as a transparent layer to the pdf. It can be searched, highlighted, selected, copied and so on...
The problem is that in the process the original pdf image has been moved leftwards, and now half of it lies beyond the left border of the page and is unreadable. This means that the OCRed "invisible" text is in the place where it should be, but the original text is somewhere else.
I will try to attach an image of the resulting OCRed pdf:
I guess that this is due to the fact that I had cropped the pdf after scanning it with the photocopier... To crop it, I used BRISS (BRight Snippet Sire... do you happen to know it?). It seems, therefore, that when the OCR function processes the cropped pdf, it moves the scanned image to the 0,0 coordinates of the original, uncropped pdf (which means that the scanned image now lies beyond the visible limits of the cropped file)...
I don't know if I am the only one having this problem, but I guess it could be useful to let you know about it, just for future developments...
Thank you very much... By the way, I selected and copied the invisible OCRed text, and it seems the recognition works pretty well...
Ricaz
Re: cropped scanned pdf problem
of course, the image did not work...
here is the link
https://picasaweb.google.com/1126646376 ... n-aVt8CFNg
- Paul - Tracker Supp
- Site Admin
- Posts: 6829
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
- Contact:
Re: cropped scanned pdf problem
Hi
thanks for that.
I'm not 100% sure what has happened here. I suspect it is related to the crop that was performed prior to the OCR. By definition, a crop in a PDF redefines the visible area of a page; it doesn't actually remove the cropped content. Would that seem consistent with what you are seeing? The best thing you can do here is send us the original PDF, preferably pre and post crop. If you do send the PDFs for us to look at, be sure to zip them in an archive or, like your image, the forum software will strip the attachment.
While on the subject of posting attachments, since you have that image on a publicly accessible URL you can simply use the BBCode 'Img' tag to use it in your post directly like this:
[img]https://picasaweb.google.com/112664637657417330279/RecentlyUpdated?authkey=Gv1sRgCM-un-aVt8CFNg[/img]
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: cropped scanned pdf problem
Dear Paul,
thanks for your reply.
I guess the cropping is indeed the cause of the problem.
I tried with other files, and it worked when they had not been previously cropped.
I also tried to OCR one file first, and only then crop it... and it works perfectly.
If you think it is useful, I attach some files:
goldman: the original, uncropped, not OCRed
goldman_cropped: only cropped, not OCRed
goldman_cropped_OCRmode2: cropped first, and only then OCRed... it has the problem of the displaced image
goldman_mode2_72: only OCRed, with the "convert page content to image only" mode
goldman_mode2_72_cropped: the file above, but cropped after the OCR, and this is good...
As I said, I used BRISS to crop the files.
As far as you know, is there any way to recover the "visible area" of cropped pdf files whose originals have been deleted?
Another question: why does the "preserve original content and add text layer" function not do anything (at least anything I can see)?
- Attachments
-
- CroppedPdf.zip
- files cropped and OCRed
- (2.03 MiB) Downloaded 487 times
Re: cropped scanned pdf problem
Hi again Ricaz,
thanks for the samples.
Quote:
As far as you know, is there any way to recover the "visible area" of cropped pdf files whose originals have been deleted?
I will need to discuss this with one of my dev team, as I do not know how to recover that cropped area, or whether it is possible. Given that the content is still there and it is just the viewable area that is redefined, I suspect this may be possible; I just need to find out how/if it can be done. It may not be possible with the end user tools.
Quote:
Another question: why does the "preserve original content and add text layer" function not do anything (at least anything I can see)?
If a PDF is a combination of text and images, this option will leave the text as is and only OCR the images. This is usually the best method to use as it preserves what is already known about that text. The option to 'Convert Page Content to Image only - Add Text as a Layer' will discard what is already known about the text, convert the entire document to an image, then OCR the entire document. This may be preferable to some users but comes at the price of possible OCR errors in text that was already known, so if you don't need this option I suggest using 'Preserve Original Content & Add Text Layer'.
hth
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
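As a rough illustration of the point about the viewable area (a sketch only, not something the Tracker tools are confirmed to do): in the PDF format a page's visible region is given by its /CropBox entry, which falls back to /MediaBox when it is absent. Assuming the cropping tool only added or changed /CropBox and left /MediaBox and the page content alone (this has not been verified for BRISS), dropping /CropBox would bring the full page back. The page is modelled here as a plain Python dict, and `visible_box` and `undo_crop` are hypothetical helper names; a real file would need a PDF library to read and write these entries.

```python
def visible_box(page: dict) -> list:
    """Return the region a viewer shows: /CropBox, or /MediaBox if no crop is set."""
    return page.get("/CropBox", page["/MediaBox"])


def undo_crop(page: dict) -> dict:
    """Return a copy of the page with /CropBox removed, so /MediaBox applies again.

    This only helps if the cropping tool left /MediaBox untouched (an assumption).
    """
    restored = dict(page)
    restored.pop("/CropBox", None)
    return restored


# Made-up example: an A4 page (in points) whose left 150 points were cropped away.
cropped_page = {
    "/MediaBox": [0, 0, 595, 842],
    "/CropBox": [150, 0, 595, 842],
}

full_page = undo_crop(cropped_page)
print("visible before:", visible_box(cropped_page))
print("visible after: ", visible_box(full_page))
```

If the cropping tool also rewrote /MediaBox or discarded page content, nothing can be restored this way.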
Re: cropped scanned pdf problem
Dear Paul,
just to be sure I got it..
In your last post you answered my question...
Quote:
Question: why does the "preserve original content and add text layer" function not do anything (at least anything I can see)?
Your Answer: If a PDF is a combination of text and images this option will leave the text as is and only OCR the images.
When I tried to use it on "images only" pdfs (like the original of the one I attached), I had no result whatsoever: is this normal?
Isn't there any way to have an OCR transparent layer added WITHOUT re-sampling the original pdf image? The "convert page content to image only" mode of course does convert the image (and it asks for the image quality)... This has a heavy cost in terms of file size: with the small samples I attached earlier, I went from a non-OCRed original of 310 KB (very good image quality, though) to an OCRed file of 530 KB at the lowest resolution (72), which resulted in very bad image quality. I guess that increasing the replaced image quality to, say, 200 instead of 72 would mean a final file four times the size of the original.
Sorry if I don't get it...
Thank you very much!
Ricaz
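A quick check of the size guess above, as a sketch only: the number of pixels in a rendered page grows with the square of the resolution, so going from 72 to 200 multiplies the pixel count by roughly 7.7 rather than 4. The final file size depends on the image compression used and need not scale by the same factor; `pixel_scale` is a hypothetical helper name.

```python
def pixel_scale(old_dpi: float, new_dpi: float) -> float:
    """Factor by which the pixel count of a page image grows when the
    rendering resolution changes from old_dpi to new_dpi."""
    # Both width and height scale linearly with resolution,
    # so the total pixel count scales with the square of the ratio.
    return (new_dpi / old_dpi) ** 2


scale = pixel_scale(72, 200)
print(f"72 -> 200 dpi: about {scale:.2f} times as many pixels")
```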
- Tracker Supp-Stefan
- Site Admin
- Posts: 17820
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: cropped scanned pdf problem
Hello Ricaz,
We will investigate both issues: it should be possible to just add the new invisible layer with the first OCR option (which doesn't work with your sample), and the image should not be shifted.
As it's the week between Christmas and New Year - this might not be investigated today - but we will get to it as soon as possible and update this topic when we have any more news.
Best,
Stefan
Re: cropped scanned pdf problem
Hi Ricaz,
sorry to have to tell you but the OCR specialist who would look at this post is on holidays until next week. Will it be OK for you to wait for his return?
regards
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: cropped scanned pdf problem
Hi there!
Mates, I'm not in a rush... there is no problem at all (I should be writing my thesis, so that's better actually)... let him enjoy his holiday
thank you very much for your attention, though!
Ricaz
-
- User
- Posts: 191
- Joined: Thu Jun 02, 2011 3:23 pm
Re: cropped scanned pdf problem
Thanks for your patience!
Regards,
Jamie
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: cropped scanned pdf problem
As I understand it, there are two issues here?
1. Image placement in cropped PDFs is incorrect after OCR.
2. Preserve original content is not working for you with an image-only PDF (different PDF, not cropped?)
Thank you,
Walter
Re: cropped scanned pdf problem
Hi Walter,
thanks for your post.
Yes, there are two issues, and you got them right... anyway, I'll try to summarize...
1. Image placement in cropped PDFs is incorrect after OCR.
If the pdf has been previously cropped, when the OCR is made with the "Convert Page Content to Image Only" mode the invisible text layer is placed in the right place, but the visible re-created image layer is displaced, I guess because its 0,0 coordinates are reset according to the pre-crop pdf page dimensions. Therefore, since I normally crop the left part of the page, the re-created image is moved leftwards. There is a snapshot of this "wrong" outcome above.
However, if I first make the OCR with the "Convert Page Content to Image Only" mode, and only after that I crop, the file is ok.
Is there any way to get a correct result for pdfs whose un-cropped original has been deleted? Maybe a kind of undo-crop? Or making the OCR understand that the original image has been cropped?
2. Second point:
Preserve original content is not working for you with an image-only PDF.
In your reply you added: "different PDF, not cropped"... actually, it does not seem to work with any kind of scanned pdf...
Actually I found out -- JUST NOW -- that this "preserve original" mode works on a particular kind of file, i.e. pdf files I created from pictures with "bullzip pdf printer" (by pictures I literally mean photographs of book pages, taken with a camera). There is just a small, quite funny problem, i.e. the orientation of the words in the invisible layer (see below and the attachment).
Seeing that it works with pdfs made from pictures, I just tried it with a pdf which I downloaded, and which was originally not searchable... The OCR works perfectly even here...
Therefore, I guess the problem I had was in the nature of the pdf I was trying to OCR. I created those pdfs with a photocopying machine which scans and saves as pdf directly to the usb stick: with the files created in this way (both cropped and uncropped) the OCR with "preserve original content" does not work at all. If I OCR one of these pdfs, there is no change (no search, no selection); [however, if I save the pdf after the OCR, the file size has increased by roughly a third].
... however, I just tried to "print" one of these "photocopying-machine-scanned pdfs" with "bullzip pdf printer"... if I OCR the re-printed file it works lol
However, there is the same, quite funny problem mentioned above for the "picture-based pdf". Individual words are (sort of) rotated 90 degrees clockwise... (see attachment, it is rather difficult to explain).... This issue affects font-based tools (highlight, underline) and selection, but the search function is fine (and that's already a massive improvement, for me...)
Now, I'll try tomorrow to find out the brand and model of the photocopying machine I used... are you interested in knowing about it?
In the meantime I attach a sample of a file scanned with that machine, reprinted with "Bullzip pdf printer" and only then OCRed...
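The guess in point 1 can be modelled with a little box arithmetic. This is only a sketch under that assumption (the actual cause inside the OCR code is not known from this thread): the re-created image has the size of the cropped area but is anchored at the origin of the uncropped page instead of the origin of the crop. The page size and crop values are made-up examples, and `Box`, `place_at` and `overlap_fraction` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.top - self.bottom


def place_at(x: float, y: float, width: float, height: float) -> Box:
    """Rectangle of the given size with its lower-left corner at (x, y)."""
    return Box(x, y, x + width, y + height)


def overlap_fraction(image: Box, window: Box) -> float:
    """Fraction of the image area that falls inside the visible window."""
    w = max(0.0, min(image.right, window.right) - max(image.left, window.left))
    h = max(0.0, min(image.top, window.top) - max(image.bottom, window.bottom))
    return (w * h) / (image.width * image.height)


# Made-up example: A4 page in points, with the left 150 points cropped away.
media_box = Box(0, 0, 595, 842)
crop_box = Box(150, 0, 595, 842)

# Expected placement: the image of the visible area sits exactly on the crop box.
expected = place_at(crop_box.left, crop_box.bottom, crop_box.width, crop_box.height)
# Suspected placement: same image, but anchored at the uncropped page origin.
suspected = place_at(media_box.left, media_box.bottom, crop_box.width, crop_box.height)

shift_x = suspected.left - expected.left
print("horizontal shift in points:", shift_x)
print("visible share of the image:", round(overlap_fraction(suspected, crop_box), 2))
```

In this model the shift equals the width of the strip cropped from the left, which would match the image moving leftwards while the text layer stays in place.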
- Attachments
-
- godmanBULLZIP2.7z
- (382.21 KiB) Downloaded 470 times
Re: cropped scanned pdf problem
Thank you; part of this relates to another issue that we are in the process of fixing, but we will investigate the other issues here and provide fixes as soon as possible.
Thanks for letting us know and we will update you as we work on this.
Re: cropped scanned pdf problem
An internal ticket has been created to track this issue (Ticket #1396).
-Walter
Re: cropped scanned pdf problem
Sorry to open this discussion again... just asking whether you are going to post here when this issue is dealt with...
thanks!
Re: cropped scanned pdf problem
Hi Ricaz,
no need to apologize! Yes - this forum thread is referenced in the Support Ticket so once a resolution is found we will update you here on this thread.
Looking at the ticket just now I see it has a developer assigned to it but has not yet been resolved.
hth
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: cropped scanned pdf problem
It seems the problem has been solved, hasn't it? At least, with 2.5.201.0 the OCR works without major problems...
Re: cropped scanned pdf problem
Hi Ricaz
indeed that was addressed in 201. Sorry for not posting that here. The ticket is closed.
Have a great day!
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com