Urgent -- despeckle PDF without/before OCR

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

David.P
User
Posts: 835
Joined: Thu Feb 28, 2008 8:16 pm
Location: Germany

Post by David.P » Tue May 13, 2014 10:58 am

Hi forum,

I'm not exactly sure if I'm in the correct sub-forum, but I've got an urgent problem regarding OCR.

I have some large PDF files here that were generated elsewhere by scanning paper documents (or something similar). Unfortunately, some mad conversion setting was used, so the text looks like this:
Image

With all of the OCR tools that I own, such text is not recognisable.

However, after running a Median Filter (aka "despeckle") over the image (using IrfanView), the image looks like this:
Image

... and is perfectly converted to text for example by Ad*be Acr*bat ClearScan:
Image
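
For reference, a minimal pure-Python sketch of what a median filter ("despeckle") does to a scanned page. This is not IrfanView's implementation; the function name `median_filter`, the `radius` parameter and the list-of-rows pixel layout are assumptions made for illustration. Each pixel is replaced by the median of its 3x3 neighbourhood, so an isolated dot is outvoted by its surroundings while solid strokes largely survive.

```python
from statistics import median_low


def median_filter(pixels, radius=1):
    """Return a new image where each value is the median of its neighbourhood.

    `pixels` is a list of rows of integers (0 = black, 255 = white).
    The window is clipped at the image borders.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect all values in the (2*radius+1)^2 window around (x, y).
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(median_low(window))
        result.append(row)
    return result


# A white 5x5 patch with one isolated black speckle in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter(page)

# A 3-pixel-wide black vertical stroke on a white background.
stroke = [[0 if 1 <= x <= 3 else 255 for x in range(5)] for _ in range(5)]
filtered_stroke = median_filter(stroke)
```

On a real scan you would more likely use an imaging tool's built-in median filter; the sketch only shows why isolated speckles vanish while the centre of a stroke stays black.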

Now the problem is: how can I batch-despeckle those PDF documents, each several hundred pages long, so that I can OCR them afterwards with the tool of my choice?

IrfanView & Ghostscript basically are a terrible team for handling PDF. Batch conversion seems to be almost impossible there (or takes like days to complete, if at all).

Thanks HEAPS already for any help,

Regards David.P
Last edited by David.P on Tue May 13, 2014 5:05 pm, edited 2 times in total.
David.P
PDF-XChange Pro

Tracker Supp-Stefan
Site Admin
Posts: 13331
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan » Tue May 13, 2014 1:08 pm

Hello David,

If those files contain one image per page you can try to use the PDF Tools to export all pages from the PDF file(s), then batch despeckle them, and then make new PDF files and OCR them at the same time using the Editor's File -> New Document -> From Images... option.

Regards,
Stefan

Post by David.P » Tue May 13, 2014 2:07 pm

Yep Stefan, thanks -- that's the way I could do it with IrfanView as well (IrfanView is great for images, but terrible with PDF).

However, I would rather (if that is possible at all) despeckle the PDF files directly because of their size (like hundreds of pages) instead of having to disassemble and reassemble them.

Actually, Adobe Acrobat can do it (despeckle PDF files) -- sort of, because the effect is much too weak in the present case.

If there is no other way, I shall disassemble those files... :(

Best regards
David

Post by Tracker Supp-Stefan » Tue May 13, 2014 3:13 pm

Hi David,

There's no way (yet) to directly manipulate the images inside the existing files. However, have you tried "reprinting" them through our printing drivers with some graphics compression options? This might give you the desired despeckled images.

Regards,
Stefan

Post by David.P » Tue May 13, 2014 3:16 pm

Thank you Stefan, that could indeed be another option, provided that some kind of blurring can be achieved when printing.

I'll let you know the results.

Best regards
David.P

Post by Tracker Supp-Stefan » Tue May 13, 2014 3:36 pm

Looking forward to it David!

Cheers,
Stefan

Post by David.P » Wed May 14, 2014 2:09 pm

Stefan! That worked a treat!

Thank you for your idea!

PDF document before, not OCR'able:
Image

After printing to PDF with PDF-XChange with the settings below
(downsample from 300dpi b/w to 200dpi grey in order to get blurring):
Image

Then finally, after OCR:
Image

And here are the printing settings:
Image
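
As a rough illustration of why downsampling from 300 dpi black/white to 200 dpi grey blurs away speckles, here is a small pure-Python sketch using area-weighted averaging. It is an assumption about how such resampling can work in general, not a description of what the PDF-XChange printer driver actually does; `overlap_weights` and `downsample_to_grey` are hypothetical names. At a 3:2 ratio each output pixel averages a 1.5 x 1.5 block of input pixels, so a single black dot turns into light grey that a later binarisation step can treat as background.

```python
def overlap_weights(index, scale, limit):
    """Return (source_index, weight) pairs covered by output cell `index`."""
    start = index * scale
    end = min((index + 1) * scale, limit)
    weights = []
    i = int(start)
    while i < end:
        # Length of the overlap between source cell [i, i+1) and [start, end).
        w = min(i + 1, end) - max(i, start)
        if w > 0:
            weights.append((i, w))
        i += 1
    return weights


def downsample_to_grey(bitmap, src_dpi=300, dst_dpi=200):
    """Downsample a 1-bit image (1 = white, 0 = black) to 8-bit grey."""
    scale = src_dpi / dst_dpi
    src_h, src_w = len(bitmap), len(bitmap[0])
    dst_h, dst_w = int(src_h / scale), int(src_w / scale)
    out = []
    for y in range(dst_h):
        rows = overlap_weights(y, scale, src_h)
        out_row = []
        for x in range(dst_w):
            cols = overlap_weights(x, scale, src_w)
            total = 0.0
            area = 0.0
            for sy, wy in rows:
                for sx, wx in cols:
                    total += bitmap[sy][sx] * wy * wx
                    area += wy * wx
            # Fraction of white coverage mapped to the 0..255 grey range.
            out_row.append(round(255 * total / area))
        out.append(out_row)
    return out


# A white 6x6 patch at 300 dpi with one black speckle.
page = [[1] * 6 for _ in range(6)]
page[1][1] = 0
grey = downsample_to_grey(page)  # 4x4 result at 200 dpi
```

Real strokes are several pixels wide at 300 dpi, so they stay dark after averaging, while single-pixel noise is diluted into its white surroundings.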

Thanks once more!
Regards David.P
Last edited by David.P on Fri Sep 18, 2015 3:56 pm, edited 1 time in total.

Post by Tracker Supp-Stefan » Wed May 14, 2014 2:27 pm

Thanks for sharing this David!

And glad I could help.

Cheers,
Stefan

Post by David.P » Wed May 14, 2014 2:55 pm

PS:
Here's an example of the amount of text that was (not) available before the procedure:
Image
(only like 5 word fragments on two entire pages)

...and afterwards:
Image

This saved my day (and possibly, my liability insurance:)

Regards David.P
:)
Last edited by David.P on Sat May 17, 2014 10:59 am, edited 1 time in total.

Post by Tracker Supp-Stefan » Wed May 14, 2014 3:14 pm

:)
