After OCR, output files are 4x larger than input files.

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

Post by docu-track99 » Fri May 23, 2014 3:14 pm

Hello Tracker,

I am using OCRTools.dll (version 1.0.14.1) in my custom OCR application. After files are processed, they are about 4x larger than the originals. Could you please tell me what might cause this? I can also share my code if required.

Post by Will - Tracker Supp » Fri May 23, 2014 7:10 pm

Hi docu-track99,

Thanks for the post - I've passed this along to the Dev Team. In the meantime, could you send us the code?

Cheers,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com

Post by docu-track99 » Fri May 23, 2014 7:53 pm

Hello Tracker,

Below is the main portion of the code that uses OCRTools.dll:

' Initialize the OCR library.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_Init(pdf, key, code)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    lblStatus2.Text = "Error"
    MessageBox.Show("OCR Initialization failure.")
    Return ' do not continue after a failed initialization
End If

' Register the callback.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SetCallback(pdf, AddressOf thecallback, 0)

' Load the source PDF.
SavePath1 = System.IO.Path.Combine(m_SourceFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_LoadW(pdf, SavePath1)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error loading file: " & vbLf & sFileName, "OCR Library Error")
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf) ' release the handle before leaving
    Return
End If

' OCR options: no character filtering, automatic region detection.
Dim Options As New PDFXOCR.PDFXOCR_Funcs.PXO_Options()
Options.blacklist = ""
Options.whitelist = ""
Options.raster_dpi = m_DPI
Options.ImageFlags = CUInt(PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate)
Options.DataPath = m_Datapath
Options.lang = m_Language
Options.RegionMode = PDFXOCR.PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto
Options.reserved = 0

Dim pxoPagelist As IntPtr = IntPtr.Zero
' A null pointer passed to OCR_MakeSearchable() results in all pages being OCRed.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_MakeSearchable(pdf, Options, pxoPagelist)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error running searchable." & vbLf & "Error code: " & hResult.ToString())
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf) ' release the handle before leaving
    Return
Else
    OCRretcode = hResult.ToString()
End If

' Save the searchable PDF to the destination folder.
lblStatus2.Text = "OCR saving output..."
SavePath = System.IO.Path.Combine(m_DestFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SaveW(pdf, SavePath)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error saving output PDF file." & vbLf & "Error code: " & hResult.ToString())
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf) ' release the handle before leaving
    Return
Else
    Dim outputstring As String
    outputstring = "File saved to: " & m_DestFilename & " (OCR_MakeSearchable returned: " & OCRretcode & ")"
    textBox1.Text = outputstring
    textBox1.Update()
End If

' Release the OCR handle once the output has been written.
PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
lblStatus2.Text = "OCR complete"

Post by Will - Tracker Supp » Fri May 23, 2014 8:15 pm

Hi docu-track99,

Thanks for that - I'll make sure it's passed along.

Cheers,

Post by docu-track99 » Wed May 28, 2014 2:44 pm

Hello Tracker,

Is there any update on this issue yet?

Post by John - Tracker Supp » Wed May 28, 2014 6:05 pm

Not at this time, I'm afraid. Could you send us a 'before and after' PDF example that is not too big, so that we can analyse and investigate further?
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com

Post by docu-track99 » Thu May 29, 2014 8:12 pm

Hello Tracker,

Here are the links:


Before: http://doc-it.ftpstream.com/158159/ded1 ... JBIBLE.pdf


After: http://doc-it.ftpstream.com/158159/ab4e ... KJBIBLE.7z


This happens using the code provided before. Please let us know if you need any further information.

Post by docu-track99 » Thu May 29, 2014 8:45 pm

Hello Tracker,

So, our main question is this: with the PDF Editor -> OCR Pages option, the process takes only about 12 minutes and the output file grows slightly, to 2.8 MB. With the code above (which was also provided by Tracker), the whole process takes about 1 hour and the output file grows significantly, to 890 MB.

Post by John - Tracker Supp » Fri May 30, 2014 8:13 pm

Thanks - we are investigating using the info supplied and will come back.

Post by Ivan - Tracker Software » Fri May 30, 2014 8:59 pm

Can you provide a screenshot of the Editor's OCR Pages dialog before running the OCR? It looks like the Editor is using different settings from those used in the SDK.
Tracker Software (Project Director)

When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.

Post by docu-track99 » Fri May 30, 2014 9:45 pm

Hello Tracker,

Attached is the image requested. Please let us know if any code changes are required.
Attachments
ocr page dialog.zip
Screenshot of OCR Page Dialog.

Post by docu-track99 » Tue Jun 03, 2014 1:26 pm

Hello Tracker,

Just following up on this ticket. Please let me know if you have found the cause of the size increase.

Post by docu-track99 » Fri Jun 06, 2014 1:09 pm

Hello Tracker,

I am following up on this ticket again. Please let us know your findings.

Post by Ivan - Tracker Software » Fri Jun 06, 2014 5:53 pm

When you use the OCR SDK, it rasterizes each page to an image (at 300 dpi) and adds this image to the new document, removing the old content.

So you end up with an image-based document with an invisible text layer. Because your document contains many pages, the resulting PDF is big.

When you use the OCR feature in the viewer, a different method is used: the page is rasterized and the resulting image is passed to the OCR process, but only the invisible text layer is added to the document, not the image. The original content is retained as is.

That is much faster, and the resulting PDF is much smaller.
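
To put rough numbers on the rasterization cost, here is a minimal sketch in Python (used only for the arithmetic). It assumes a US Letter page, 24-bit colour and no compression; none of these values comes from the SDK, and the helper name raster_bytes is hypothetical.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Uncompressed size in bytes of one page rasterized at the given DPI."""
    width_px = round(width_in * dpi)    # page width in pixels
    height_px = round(height_in * dpi)  # page height in pixels
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter (8.5 x 11 in), 24-bit colour, no compression.
per_page = raster_bytes(8.5, 11, 300)
print(f"{per_page / 2**20:.1f} MiB per page before compression")

# Halving the DPI quarters the pixel count, and so the raw size.
print(raster_bytes(8.5, 11, 300) // raster_bytes(8.5, 11, 150))
```

Under these assumptions one page is about 24 MiB of raw pixels. Embedded images are normally compressed, so real files will be smaller than this figure, but the pixel count still grows with the square of the DPI, so the raster resolution is likely to be the main factor in the output size.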

We will need to revise this, but I'm afraid it will take a little time and may not be done before the release of the Editor SDK, which would potentially make this superfluous anyway.

Post by docu-track99 » Fri Jun 06, 2014 7:54 pm

Hello Tracker,

Is there a way to use the PDF Viewer ActiveX Control to OCR a PDF programmatically? Secondly, if we need to wait for the new SDK, do you have an approximate time frame?

Post by Paul - Tracker Supp » Fri Jun 06, 2014 8:32 pm

Hi again Paul,

We have raised a support ticket for getting this change into the SDK: RT#2543: After OCR output file are 4x larger than Input files. At this point, however, I can't speculate as to when it will be done.

hth

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com

Post by Tom Princen » Tue Jun 07, 2016 9:54 am

Has this been solved? I still have this problem.

Post by Tracker Supp-Stefan » Tue Jun 07, 2016 11:46 am

Hello Tom,

We are in the process of rewriting our OCR SDK, and once that is released, the problem discussed above should be resolved.

Regards,
Stefan
