After OCR output file are 4x larger than Input files.

Post by **docu-track99** » Fri May 23, 2014 3:14 pm

Hello Tracker,

I am using OCRTools.dll in my custom OCR application. The version number is 1.0.14.1. After files are processed, they are 4x larger than their original size. Could you please tell me what could cause this? I can also show you my code if required.

Post by **Will - Tracker Supp** » Fri May 23, 2014 7:10 pm

Hi docutrack99,

Thanks for the post - I've passed this along to the Dev Team. In the meantime, could you send us the code?

Cheers,

Post by **docu-track99** » Fri May 23, 2014 7:53 pm

Hello Tracker,

Below is the main portion of the code that uses OCRTools.dll:

hResult = PDFXOCR.PDFXOCR_Funcs.OCR_Init(pdf, key, code)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    lblStatus2.Text = "Error"
    MessageBox.Show("OCR Initialization failure.")
    ' Stop here: the calls below need an initialized OCR object.
    Return
End If

hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SetCallback(pdf, AddressOf thecallback, 0)
lblStatus2.Text = "OCR saving output..."

' Load the source PDF.
SavePath1 = System.IO.Path.Combine(m_SourceFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_LoadW(pdf, SavePath1)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error loading file: " & vbLf & sFileName, "OCR Library Error")
    Return
End If

' OCR options.
Dim Options As New PDFXOCR.PDFXOCR_Funcs.PXO_Options()
Options.blacklist = ""
Options.whitelist = ""
Options.raster_dpi = m_DPI
Options.ImageFlags = CUInt(PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate)
Options.DataPath = m_Datapath
Options.lang = m_Language
Options.RegionMode = PDFXOCR.PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto
Options.reserved = 0

Dim pxoPagelist As IntPtr = IntPtr.Zero
' A null pointer passed to OCR_MakeSearchable() results in all pages being OCRed.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_MakeSearchable(pdf, Options, pxoPagelist)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error running searchable." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    OCRretcode = hResult.ToString()
End If

' Save the OCRed PDF to the destination folder.
SavePath = System.IO.Path.Combine(m_DestFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SaveW(pdf, SavePath)

If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error saving output PDF file." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    Dim outputstring As String
    outputstring = "File saved to: " & m_DestFilename & " (OCR_MakeSearchable returned: " & OCRretcode & ")"
    textBox1.Text = outputstring
    textBox1.Update()
End If
PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
lblStatus2.Text = "OCR complete"

Post by **Will - Tracker Supp** » Fri May 23, 2014 8:15 pm

Hi docu-track99,

Thanks for that - I'll make sure it's passed along.

Cheers,

Post by **docu-track99** » Wed May 28, 2014 2:44 pm

Hello Tracker,

Is there any update on this issue yet?

Post by **John - Tracker Supp** » Wed May 28, 2014 6:05 pm

Not at this time, I am afraid. Could we ask for a 'before and after' PDF example that is not too big, so that we can analyse and investigate further?

Post by **docu-track99** » Thu May 29, 2014 8:12 pm

Hello Tracker,

Here are the links:

Before: http://doc-it.ftpstream.com/158159/ded1 ... JBIBLE.pdf

After: http://doc-it.ftpstream.com/158159/ab4e ... KJBIBLE.7z

This happens using the code provided before. Please let us know if you need any further information.

Post by **docu-track99** » Thu May 29, 2014 8:45 pm

Hello Tracker,

So, our main question is this: with the PDF Editor -> OCR Pages option, the process takes only about 12 minutes and the output file size increases slightly, to 2.8 MB. With the code above (which was also provided by Tracker), on the other hand, the whole process takes about 1 hour and the output file size increases significantly, to 890 MB.

Post by **John - Tracker Supp** » Fri May 30, 2014 8:13 pm

Thanks - we are investigating using the info supplied and will come back.

Fri May 30, 2014 8:59 pm

Can you provide a screenshot of the Editor's OCR page dialog before doing the OCR? It looks like you have different settings in the Editor than those used in the SDK.

Post by **docu-track99** » Fri May 30, 2014 9:45 pm

Hello Tracker,

Attached is the image requested. Please let us know if any code changes are required.

Post by **docu-track99** » Tue Jun 03, 2014 1:26 pm

Hello Tracker,

Just following up on this ticket. Please let me know if you find the cause of the size increase.

Post by **docu-track99** » Fri Jun 06, 2014 1:09 pm

Hello Tracker,

I am following up with this ticket. Please let us know your findings.

Fri Jun 06, 2014 5:53 pm

When you use the OCR SDK, it rasterizes each page to an image (at 300 dpi) and adds this image to the new document, replacing the original content.

So you end up with an image-based document with an invisible text layer, and because your document contains many pages, the resulting PDF is big.

When you use the OCR feature in the viewer, a different method is used: the page is rasterized and the resulting image is passed to the OCR process, then an invisible text layer is added to the document, but not the image. The original content is retained as is.

This is much faster, and the resulting PDF is much smaller.
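As a rough illustration of why replacing each page with a 300 dpi image inflates the file, the sketch below estimates the uncompressed size of one rasterized page. The page size (US Letter), the colour depths, and the names `RasterSizeEstimate` and `RasterBytes` are assumptions made for illustration and are not part of the OCR SDK. Images stored in a PDF are normally compressed, so real figures will be lower.

```vb
Imports System

Module RasterSizeEstimate

    ' Uncompressed size, in bytes, of one page rasterized at the given DPI.
    ' widthIn/heightIn are the page dimensions in inches;
    ' bitsPerPixel is 24 for RGB, 8 for grayscale, 1 for black-and-white.
    Function RasterBytes(widthIn As Double, heightIn As Double,
                         dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthIn * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightIn * dpi))
        ' Round the bit count up to whole bytes.
        Return (widthPx * heightPx * bitsPerPixel + 7) \ 8
    End Function

    Sub Main()
        ' Assumption: US Letter pages (8.5 x 11 in) at the 300 dpi mentioned above.
        For Each bpp As Integer In New Integer() {24, 8, 1}
            Dim bytes As Long = RasterBytes(8.5, 11.0, 300, bpp)
            Console.WriteLine("{0,2} bpp: {1:N0} bytes per page", bpp, bytes)
        Next
    End Sub

End Module
```

Under these assumptions a page is 2550 x 3300 pixels, so 24-bit colour comes to 25,245,000 bytes per page before compression. Multiplied across a document with many pages, that could plausibly account for a very large output file even after compression.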

We will need to revise this, but I am afraid it will take a little time and may not be done before the release of the Editor SDK, which would potentially make this superfluous anyway ...

Post by **docu-track99** » Fri Jun 06, 2014 7:54 pm

Hello Tracker,

Is there a way we can use the PDF Viewer ActiveX Control to OCR a PDF programmatically? Secondly, if we need to wait for the new SDK, do you have any idea of the approximate time frame?

Post by **Paul - Tracker Supp** » Fri Jun 06, 2014 8:32 pm

Hi again Paul,

We have raised a support ticket for getting this change made to the SDK. RT#2543: After OCR output file are 4x larger than Input files. At this point, however, I can't speculate as to when it will be done.

hth

Post by **Tom Princen** » Tue Jun 07, 2016 9:54 am

Has this been solved? I still have this problem.

Tue Jun 07, 2016 11:46 am

Hello Tom,

We are in the process of rewriting our OCR SDK, and when that is released, the problem discussed above should be resolved.

Regards,
Stefan

Post by **CharlesTug** » Thu Aug 27, 2020 12:55 am

Output
Pop up asking user to specify location where the output PDF file is to be stored.
File stored in the path provided by the user.
Single SAPScript output in PDF format.

Converting Multiple SAP Script outputs into single PDF file.

Mon Aug 31, 2020 10:00 pm

I am sorry Charles, I am not following your comment.

Can you explain in any more detail what your point is?
