After OCR, output files are 4x larger than input files.

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

docu-track99
User
Posts: 518
Joined: Thu Dec 06, 2007 8:13 pm

Post by docu-track99 »

Hello Tracker,

I am using OCRTools.dll, version 1.0.14.1, in my custom OCR application. After files are processed, they are 4x larger than their original size. Could you please tell me what could cause this? I can also show you my code if required.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Post by Will - Tracker Supp »

Hi docu-track99,

Thanks for the post - I've passed this along to the Dev Team. In the meantime, could you send us the code?

Cheers,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com

Post by docu-track99 »

Hello Tracker,

Below is the main portion of the code that uses OCRTools.dll:

' Initialise the OCR engine and stop if initialisation fails.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_Init(pdf, key, code)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    lblStatus2.Text = "Error"
    MessageBox.Show("OCR Initialization failure.")
    Return
End If

hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SetCallback(pdf, AddressOf thecallback, 0)
lblStatus2.Text = "OCR saving output..."

' Load the source PDF.
SavePath1 = System.IO.Path.Combine(m_SourceFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_LoadW(pdf, SavePath1)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error loading file: " & vbLf & sFileName, "OCR Library Error")
    ' Release the OCR handle before leaving.
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
    Return
End If

' Configure the OCR options.
Dim Options As New PDFXOCR.PDFXOCR_Funcs.PXO_Options()
Options.blacklist = ""
Options.whitelist = ""
Options.raster_dpi = m_DPI
Options.ImageFlags = CUInt(PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate)
Options.DataPath = m_Datapath
Options.lang = m_Language
Options.RegionMode = PDFXOCR.PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto
Options.reserved = 0

' A null pointer passed to OCR_MakeSearchable() results in all pages being OCRed.
Dim pxoPagelist As IntPtr = IntPtr.Zero
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_MakeSearchable(pdf, Options, pxoPagelist)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error running searchable." & vbLf & "Error code: " & hResult.ToString())
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
    Return
End If
OCRretcode = hResult.ToString()

' Save the searchable PDF to the destination folder.
SavePath = System.IO.Path.Combine(m_DestFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SaveW(pdf, SavePath)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error saving output PDF file." & vbLf & "Error code: " & hResult.ToString())
    PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
    Return
End If

Dim outputstring As String = "File saved to: " & m_DestFilename & " (OCR_MakeSearchable returned: " & OCRretcode & ")"
textBox1.Text = outputstring
textBox1.Update()

' Release the OCR handle.
PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
lblStatus2.Text = "OCR complete"

Post by Will - Tracker Supp »

Hi docu-track99,

Thanks for that - I'll make sure it's passed along.

Cheers,

Post by docu-track99 »

Hello Tracker,

Is there any update on this issue yet?
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Post by John - Tracker Supp »

Not at this time, I am afraid. Could you send us a 'before and after' PDF example that is not too big, so that we can analyse and investigate further?
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com

Post by docu-track99 »

Hello Tracker,

Here are the links:


Before: http://doc-it.ftpstream.com/158159/ded1 ... JBIBLE.pdf


After: http://doc-it.ftpstream.com/158159/ab4e ... KJBIBLE.7z


This happens using the code provided before. Please let us know if you need any further information.

Post by docu-track99 »

Hello Tracker,

So, our main question is this: with the PDF Editor -> OCR Pages option, the process takes only about 12 minutes and the output file size increases slightly, to 2.8 MB. With the code above, which was also provided by Tracker, the whole process takes about 1 hour and the output file size increases significantly, to 890 MB.

Post by John - Tracker Supp »

Thanks - we are investigating using the info supplied and will come back.
Ivan - Tracker Software
Site Admin
Posts: 3549
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada

Post by Ivan - Tracker Software »

Can you provide a screenshot of the Editor's OCR page dialog, taken before running the OCR? It looks like the settings in the Editor differ from those used in the SDK.
Tracker Software (Project Director)

When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.

Post by docu-track99 »

Hello Tracker,

Attached is the image requested. Please let us know if any code changes are required.
Attachment: ocr page dialog.zip (screenshot of the OCR Page dialog, 32.02 KiB)

Post by docu-track99 »

Hello Tracker,

Just following up on this ticket. Please let me know if you find the cause of the size increase.

Post by docu-track99 »

Hello Tracker,

I am following up on this ticket. Please let us know your findings.

Post by Ivan - Tracker Software »

When you use the OCR SDK, it rasterizes each page to an image (at 300 dpi) and adds this image to the new document, removing the old content.

So you end up with an image-based document with an invisible text layer. Because your document contains many pages, the resulting PDF is big.

When you use the OCR feature in the viewer, a different method is used: the page is rasterized, the resulting image is used by the OCR process, and then an invisible text layer is added to the document, but not the image. The original content is retained as is.

This is much faster, and the resulting PDF is much smaller.
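
As a rough illustration of why replacing every page with a 300 dpi image inflates the file, the sketch below estimates the uncompressed size of a single rasterized page. The US Letter page size, the bit depths, and the names RasterSizeEstimate and RasterBytes are assumptions made for this example and are not part of the OCR SDK; the stored images are typically compressed, so real figures will be lower.

```vbnet
Imports System

Module RasterSizeEstimate
    ' Hypothetical helper: uncompressed size, in bytes, of one page
    ' rasterized at the given resolution and bit depth.
    Function RasterBytes(widthInches As Double, heightInches As Double,
                         dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthInches * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightInches * dpi))
        Return (widthPx * heightPx * bitsPerPixel) \ 8
    End Function

    Sub Main()
        ' Assumed page size: US Letter (8.5 x 11 in), i.e. 2550 x 3300 pixels at 300 dpi.
        Dim colourBytes As Long = RasterBytes(8.5, 11.0, 300, 24) ' 25,245,000 bytes
        Dim monoBytes As Long = RasterBytes(8.5, 11.0, 300, 1)    ' 1,051,875 bytes
        Console.WriteLine("24-bit colour: " & colourBytes.ToString() & " bytes per page")
        Console.WriteLine("1-bit mono: " & monoBytes.ToString() & " bytes per page")
    End Sub
End Module
```

Multiplied across a document with many pages, even well-compressed page images can add up to far more than the original page content.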

We will need to revise this, but it will take a little time, I am afraid, and may not be done before the release of the Editor SDK, which would potentially make this superfluous anyway.

Post by docu-track99 »

Hello Tracker,

Is there a way we can use the PDF Viewer ActiveX Control to OCR a PDF programmatically? Secondly, if we need to wait for the new SDK, do you have any idea of the approximate time frame?
Paul - Tracker Supp
Site Admin
Posts: 6836
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Post by Paul - Tracker Supp »

Hi again Paul,

We have raised a support ticket for getting this change into the SDK: RT#2543: After OCR output file are 4x larger than Input files. At this point, however, I can't speculate as to when it will be done.

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Post by Tom Princen »

Is this solved? I still have this problem.
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan »

Hello Tom,

We are in the process of rewriting our OCR SDK, and when that is released, the problem discussed above should be resolved.

Regards,
Stefan
CharlesTug
User
Posts: 2
Joined: Sat Aug 15, 2020 3:43 am
Location: Colombia

Post by CharlesTug »

Output
Pop up asking user to specify location where the output PDF file is to be stored.
File stored in the path provided by the user.
Single SAPScript output in PDF format.

Converting Multiple SAP Script outputs into single PDF file.

Post by Paul - Tracker Supp »

I am sorry Charles, I am not following your comment.

Can you explain in any more detail what your point is?