File size of OCRed PDF docs

cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

File size of OCRed PDF docs

Post by cajuba »

Good Morning,

I have set up a small workflow for digitizing all my daily paper stuff. In doing so, I ran into an issue with the file size of documents OCRed by PDF-Tools. My workflow:
  - Scanning a batch of several docs on my Brother MFC (300 dpi, greyscale, output as PDF)
  - Opening this batch PDF file with PDF-Tools V6
  - Splitting the batch into a number of separate docs
  - OCR of the separate docs (German language, medium/high)
  - Doc optimization (standard settings)
  - Transformation to PDF/A (Auto)
  - Timestamping
  - Saving
After running through that workflow, a typical single-page greyscale PDF/A has a size of at least 15 MByte. Unacceptable. Processing the same scanner output file with another application (I tried Nuance Power PDF and Horland's Scan2PDF) gives me files of approx. 750 KByte, which seems an appropriate size for a single-page doc. After some testing (skipping different workflow steps, using different apps), I am pretty sure the PDF-Tools OCR module is responsible for blowing up the files. Interestingly, a two-page document is not 30 MByte, but only slightly larger than a one-page doc. So I guess the OCR module writes some really large header or overhead information into the files.
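The two-page observation can be turned into a rough estimate. A minimal sketch in Python, assuming file size grows as a fixed overhead plus a constant amount per page; the 15.0 and 15.5 MByte inputs are hypothetical placeholders, since the post only says the two-page file is "slightly larger":

```python
def split_fixed_and_per_page(size_one_page, size_two_pages):
    """Solve size(n) = fixed + n * per_page from the 1-page and 2-page sizes."""
    per_page = size_two_pages - size_one_page
    fixed = size_one_page - per_page
    return fixed, per_page


# Hypothetical measurements in MByte (not taken from the thread).
fixed_mb, per_page_mb = split_fixed_and_per_page(15.0, 15.5)
print(f"fixed overhead: {fixed_mb} MByte, per page: {per_page_mb} MByte")
```

If the fixed part dominates like this, the bulk of the file is something written once per document rather than once per page.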

Do you have any idea what I can do to stop this behaviour? Have I missed any relevant settings?

Best regards,
cajuba
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Additional Information:

The behaviour is the same if I scan the docs using PDF-Tools' scan module.
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: File size of OCRed PDF docs

Post by Tracker Supp-Stefan »

Hello cajuba,

Can you please confirm the build of PDF Tools that you are using?
Also - your Tools license allows you to also use the Editor - can you please try the File -> New Document -> From Scanner... feature in the Editor (you can OCR at the same time too), and let us know the result there?

Regards,
Stefan
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Dear Stefan,

thank you for your reply.

My build is: V 6.0 Build 317.1

I have tested the issue with the PDF Editor and it behaves exactly the same way as PDF Tools. BUT it seems I was wrong earlier today: it is not the OCR module but the PDF/A transformation that causes the trouble, in PDF-Tools as well as in the PDF-Editor. The same image file leads to a file size of 586 K after OCR and saving as "plain" PDF, and a file size of 15416 K after OCR and saving as PDF/A (Auto).

Sorry, my mistake, but the issue remains the same for me: files that are too big!

Best regards,
cajuba
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: File size of OCRed PDF docs

Post by Tracker Supp-Stefan »

Hello cajuba,

I've just checked with a colleague from the dev team, and while we believe we can replicate the issue with the information already provided, it will greatly help us if you can provide step-by-step instructions of what you are doing and a sample file, and we will then investigate this in further detail.

Also, if you scan and OCR the file first and save it (with the small file size), and then use the Editor to resave that file and select to save under the PDF/A sub-format from the menu, does that still result in a large file size?

Regards,
Stefan
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Dear Stefan,

in the meantime I continued testing, always with the same results.
Attached you can find a zip archive containing one of my test files plus some extras:

- 1 PDF file I scanned on my Brother MFC using PDF-Editor (no OCR, no PDF/A, saved optimized)
- an export file from my registry (subtree HKEY_CURRENT_USER...CustomTools); as far as I have learned from another post here on the forum, you may be able to import my custom tool onto your machine
- 3 screenshots giving you an impression of the custom tool I have created and use for my workflow

Here is the step-by-step description:
1. Scanning the paper doc (Brother MFC, PDF-XChange Editor)
2. Saving the image as PDF (PDF-XChange Editor, no OCR, no PDF/A)
3. Starting my custom tool (PDF Tools) with the following steps:
    a) opening the PDF file created in step 2
    b) splitting/merging the docs (here I changed nothing, just renamed the doc and pressed enter)
    c) OCR (German, medium)
    d) Convert to PDF/A (auto, sRGB)
    e) Timestamp (http://zeitstempel.dfn.de/)
    f) Saving the doc
The size of the initial file was 91 KByte (saved optimized). First I ran the workflow with the PDF/A conversion unchecked (so OCR yes, PDF/A no). This led to a file size of 328 KByte. Then I ran the workflow with the initial file again, this time with step d) checked (OCR yes, PDF/A yes). The resulting file size was 15.1 MByte. Whatever I do, saving as PDF/A really blows up the file size, no matter whether I use PDF-Editor or PDF-Tools, or whether I OCR with one application and convert to PDF/A with the other.
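For reference, the growth at each step can be computed from the three sizes reported above. A small Python sketch, assuming 1 MByte = 1024 KByte; the step labels are only illustrative names:

```python
KB_PER_MB = 1024  # assumption: binary units, 1 MByte = 1024 KByte

# Sizes reported in the post, converted to KByte, in workflow order.
sizes_kb = {
    "scan only": 91,
    "OCR": 328,
    "OCR + PDF/A": 15.1 * KB_PER_MB,
}


def growth_factors(sizes):
    """Return the size ratio between each consecutive pair of steps."""
    names = list(sizes)
    return {
        f"{before} -> {after}": sizes[after] / sizes[before]
        for before, after in zip(names, names[1:])
    }


for step, factor in growth_factors(sizes_kb).items():
    print(f"{step}: x{factor:.1f}")
```

The OCR step accounts for a growth of under four times, while the PDF/A step multiplies the result by well over forty, which matches the conclusion that the PDF/A conversion is where the size explodes.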

I also tried to upload the final OCR + PDF/A file, but it is too big to be uploaded here. Even if I zip it, the file size stays nearly the same.

Best regards,
cajuba
Attachments
Upload.zip
(128.74 KiB) Downloaded 138 times
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Dear Stefan,

I hope the information and files I provided in my previous post were detailed enough for you to replicate the issue. Have you found the root cause, and will you provide a fix soon?

Best regards,
cajuba
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: File size of OCRed PDF docs

Post by Tracker Supp-Stefan »

Hello cajuba,

Sorry that I didn't follow up sooner. I just checked with my colleague who is working on your case, and he asked me to create this ticket in our internal system so that we can get this properly fixed:
#3589: OCR: Font subset embedding in PDF/A plugin - Large Files created.
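Since the ticket points at font embedding, one rough way to check whether a given PDF carries embedded font programs is to scan its raw bytes for the `/FontFile`, `/FontFile2` and `/FontFile3` keys that the PDF format uses in font descriptors. The sketch below is an illustration under assumptions, not Tracker's diagnostic: the function name is made up here, the sample bytes are a hand-written fragment, and the scan will miss keys stored inside compressed object streams.

```python
import re
from collections import Counter

# Font descriptor keys that reference an embedded font program.
# The lookahead stops the pattern from matching longer, unrelated names.
FONT_KEY = re.compile(rb"/FontFile[23]?(?![A-Za-z0-9])")


def count_font_file_keys(pdf_bytes: bytes) -> Counter:
    """Count /FontFile, /FontFile2 and /FontFile3 keys in raw PDF bytes."""
    return Counter(
        match.group(0).decode("ascii") for match in FONT_KEY.finditer(pdf_bytes)
    )


# Hand-written fragment standing in for a real file (hypothetical content).
sample = (
    b"<< /Type /FontDescriptor /FontFile2 12 0 R >>\n"
    b"<< /Type /FontDescriptor /FontFile3 13 0 R >>\n"
    b"<< /Type /FontDescriptor /FontFile2 14 0 R >>\n"
)
print(count_font_file_keys(sample))

# For a real document, read the file in binary mode first, e.g.:
# counts = count_font_file_keys(open("some_file.pdf", "rb").read())
```

Comparing the counts for the plain OCRed file and the PDF/A version of the same scan would show whether the conversion added font programs, though not how large each one is.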

As soon as there is any news in the ticket, we will let you know.

Regards,
Stefan
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Dear People at Tracker-Software,

anything new about #3589: OCR: Font subset embedding in PDF/A plugin - Large Files created?
Will it be fixed in the next build and when will this be available?

Regards,
cajuba
Radi - Tracker Supp
Site Admin
Posts: 600
Joined: Tue Mar 03, 2015 12:46 pm

Re: File size of OCRed PDF docs

Post by Radi - Tracker Supp »

Hi cajuba,

Unfortunately, there are no updates in this ticket at the moment.
I'll request one and get back to you when I have any information on the subject.

Regards,
Radi
cajuba
User
Posts: 12
Joined: Fri Jun 24, 2016 5:27 am

Re: File size of OCRed PDF docs

Post by cajuba »

Any news on this? :D

Regards,
cajuba
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: File size of OCRed PDF docs

Post by Will - Tracker Supp »

Hi cajuba,

Thanks for the post. No further update is available yet. I can see that a work item has been issued for the ticket, meaning that it has been assigned to a member of the development team and is on their road map, but no further information is available.

Thanks,

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com