After OCR output file are 4x larger than Input files.

Posted: Fri May 23, 2014 3:14 pm
by docu-track99
Hello Tracker,

I am using OCRTools.dll in my custom OCR application; the version number is 1.0.14.1. After files are processed, they are 4x larger than their original size. Could you please tell me what could cause this? I can also show you my code if required.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 23, 2014 7:10 pm
by Will - Tracker Supp
Hi docu-track99,

Thanks for the post - I've passed this along to the Dev Team. In the meantime, could you send us the code?

Cheers,

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 23, 2014 7:53 pm
by docu-track99
Hello Tracker,

Below is the main portion of the code that uses OCRTools.dll:

' Initialize the OCR engine.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_Init(pdf, key, code)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    lblStatus2.Text = "Error"
    MessageBox.Show("OCR Initialization failure.")
    Return ' do not continue with an uninitialized engine
End If

hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SetCallback(pdf, AddressOf thecallback, 0)
lblStatus2.Text = "OCR saving output..."

' Load the source PDF.
SavePath1 = System.IO.Path.Combine(m_SourceFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_LoadW(pdf, SavePath1)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error loading file: " & vbLf & sFileName, "OCR Library Error")
    Return
End If

' Set up the OCR options.
Dim Options As New PDFXOCR.PDFXOCR_Funcs.PXO_Options()
Options.blacklist = ""
Options.whitelist = ""
Options.raster_dpi = m_DPI
Options.ImageFlags = CUInt(PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate)
Options.DataPath = m_Datapath
Options.lang = m_Language
Options.RegionMode = PDFXOCR.PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto
Options.reserved = 0

Dim pxoPagelist As IntPtr = IntPtr.Zero
' A null pointer passed to OCR_MakeSearchable() results in all pages being OCRed.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_MakeSearchable(pdf, Options, pxoPagelist)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error running searchable." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    OCRretcode = hResult.ToString()
End If

' Save the OCRed PDF to the destination folder.
SavePath = System.IO.Path.Combine(m_DestFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SaveW(pdf, SavePath)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error saving output PDF file." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    Dim outputstring As String
    outputstring = "File saved to: " & m_DestFilename & " (OCR_MakeSearchable returned: " & OCRretcode & ")"
    textBox1.Text = outputstring
    textBox1.Update()
End If
PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
lblStatus2.Text = "OCR complete"

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 23, 2014 8:15 pm
by Will - Tracker Supp
Hi docu-track99,

Thanks for that - I'll make sure it's passed along.

Cheers,

Re: After OCR output file are 4x larger than Input files.

Posted: Wed May 28, 2014 2:44 pm
by docu-track99
Hello Tracker,

Is there any update on this issue yet?

Re: After OCR output file are 4x larger than Input files.

Posted: Wed May 28, 2014 6:05 pm
by John - Tracker Supp
Not at this time, I am afraid. Could you provide a 'before and after' PDF example that is not too big, so that we can analyse and investigate further?

Re: After OCR output file are 4x larger than Input files.

Posted: Thu May 29, 2014 8:12 pm
by docu-track99
Hello Tracker,

Here are the links:

Before: http://doc-it.ftpstream.com/158159/ded1 ... JBIBLE.pdf

After: http://doc-it.ftpstream.com/158159/ab4e ... KJBIBLE.7z

This happens using the code provided before. Please let us know if you need any further information.

Re: After OCR output file are 4x larger than Input files.

Posted: Thu May 29, 2014 8:45 pm
by docu-track99
Hello Tracker,

So, our main question is this: with the PDF Editor -> OCR Pages option, the process takes only about 12 minutes and the output file size increases slightly, to 2.8 MB. With the code above, which was also provided by Tracker, the whole process takes about 1 hour and the output file size increases significantly, to 890 MB.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 30, 2014 8:13 pm
by John - Tracker Supp
Thanks - we are investigating using the info supplied and will come back.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 30, 2014 8:59 pm
by Ivan - Tracker Software
Can you provide a screenshot of the Editor's OCR Pages dialog before doing the OCR? It looks like you have different settings in the Editor than are used in the SDK.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri May 30, 2014 9:45 pm
by docu-track99
Hello Tracker,

Attached is the image requested. Please let us know if any code changes are required.

Re: After OCR output file are 4x larger than Input files.

Posted: Tue Jun 03, 2014 1:26 pm
by docu-track99
Hello Tracker,

Just following up on this ticket. Please let me know if you find the cause of the size increase.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri Jun 06, 2014 1:09 pm
by docu-track99
Hello Tracker,

I am following up on this ticket. Please let us know your findings.

Re: After OCR output file are 4x larger than Input files.

Posted: Fri Jun 06, 2014 5:53 pm
by Ivan - Tracker Software
When you use the OCR SDK, it rasterizes each page to an image (at 300 dpi) and adds this image to the new document, removing the old content.

So you end up with an image-based document with an invisible text layer. And because your document contains many pages, the resulting PDF is big.

When you use the OCR feature in the viewer, a different method is used: the page is rasterized, the resulting image is used by the OCR process, and then an invisible text layer is added to the document, but not the image. The original content is retained as is.

This is much faster, and the resulting PDF is much smaller.

We will need to revise this, but it will take a little time, I am afraid, and may not be done before the release of the Editor SDK, which would potentially make this superfluous anyway.
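
To put rough numbers on the rasterization explanation above, here is a minimal sketch that estimates the uncompressed size of one page rendered at 300 dpi and how that scales with page count. The 300 dpi figure comes from the explanation above; the US Letter page size, the 24-bit colour depth, and the page counts are assumptions for illustration only, and the helper name RasterBytes is hypothetical. Real output will be smaller, because the images embedded in a PDF are normally compressed.

```vbnet
Imports System

Module RasterSizeEstimate
    ' Uncompressed size, in bytes, of one page rasterized at the given resolution.
    ' (Hypothetical helper for illustration; not part of any SDK.)
    Function RasterBytes(widthInches As Double, heightInches As Double,
                         dpi As Integer, bytesPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthInches * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightInches * dpi))
        Return widthPx * heightPx * bytesPerPixel
    End Function

    Sub Main()
        ' Assumptions (not from the thread): US Letter pages, 24-bit colour.
        Dim perPage As Long = RasterBytes(8.5, 11.0, 300, 3)
        Console.WriteLine("Uncompressed raster per page: {0:N0} bytes", perPage)

        ' Hypothetical page counts, to show how the total grows linearly.
        For Each pages As Integer In {10, 100, 1000}
            Dim totalMiB As Long = perPage * pages \ (1024 * 1024)
            Console.WriteLine("{0,5} pages: about {1:N0} MiB before compression", pages, totalMiB)
        Next
    End Sub
End Module
```

Under these assumptions each page is 2550 x 3300 pixels, so every page adds a full-page image to the output regardless of how little the original page content weighed. That is consistent with the observation that the viewer's method, which adds only a text layer, produces a much smaller file.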

Re: After OCR output file are 4x larger than Input files.

Posted: Fri Jun 06, 2014 7:54 pm
by docu-track99
Hello Tracker,

Is there a way we can use the PDF Viewer ActiveX Control to OCR a PDF programmatically? Secondly, if we need to wait for the new SDK, do you have any idea of the approximate time frame?

Re: After OCR output file are 4x larger than Input files.

Posted: Fri Jun 06, 2014 8:32 pm
by Paul - Tracker Supp
Hi again Paul,

We have raised a support ticket for getting this change made to the SDK: RT#2543: After OCR output file are 4x larger than Input files. At this point, however, I can't speculate as to when it will be done.

hth

Re: After OCR output file are 4x larger than Input files.

Posted: Tue Jun 07, 2016 9:54 am
by Tom Princen
Is this solved? I still have this problem.

Re: After OCR output file are 4x larger than Input files.

Posted: Tue Jun 07, 2016 11:46 am
by Tracker Supp-Stefan
Hello Tom,

We are in the process of rewriting our OCR SDK, and when that is released, the problem discussed above should be resolved.

Regards,
Stefan