After OCR output file are 4x larger than Input files.
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
After OCR output file are 4x larger than Input files.
Hello Tracker,
I am using OCRTools.dll in my custom OCR application. The version number is 1.0.14.1. After files are processed, they are 4x larger than their original size. Could you please tell me what could cause this? I can also share my code if required.
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
Re: After OCR output file are 4x larger than Input files.
Hi docutrack99,
Thanks for the post - I've passed this along to the Dev Team. In the meantime, could you send us the code?
Cheers,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Below is the main portion of the code that uses OCRTools.dll:
' Initialise the OCR engine with the licence key and code.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_Init(pdf, key, code)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    lblStatus2.Text = "Error"
    MessageBox.Show("OCR Initialization failure.")
    Return ' Stop here: the calls below need an initialised engine.
End If
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SetCallback(pdf, AddressOf thecallback, 0)
lblStatus2.Text = "OCR saving output..."
' Load the source PDF.
SavePath1 = System.IO.Path.Combine(m_SourceFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_LoadW(pdf, SavePath1)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error loading file: " & vbLf & sFileName, "OCR Library Error")
    Return
End If
' Configure the OCR options.
Dim Options As New PDFXOCR.PDFXOCR_Funcs.PXO_Options()
Options.blacklist = ""
Options.whitelist = ""
Options.raster_dpi = m_DPI
Options.ImageFlags = CUInt(PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate)
Options.DataPath = m_Datapath
Options.lang = m_Language
Options.RegionMode = PDFXOCR.PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto
Options.reserved = 0
Dim pxoPagelist As IntPtr = IntPtr.Zero
' A null pointer passed to OCR_MakeSearchable() results in all pages being OCRed.
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_MakeSearchable(pdf, Options, pxoPagelist)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error running searchable." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    OCRretcode = hResult.ToString()
End If
' Save the searchable PDF to the destination folder.
SavePath = System.IO.Path.Combine(m_DestFilename, sFileName)
hResult = PDFXOCR.PDFXOCR_Funcs.OCR_SaveW(pdf, SavePath)
If PDFXOCR.PDFXOCR_Funcs.IS_DS_FAILED(hResult) Then
    MessageBox.Show("Error saving output PDF file." & vbLf & "Error code: " & hResult.ToString())
    Return
Else
    Dim outputstring As String
    outputstring = "File saved to: " & m_DestFilename & " (OCR_MakeSearchable returned: " & OCRretcode & ")"
    textBox1.Text = outputstring
    textBox1.Update()
End If
' Release the OCR object.
PDFXOCR.PDFXOCR_Funcs.OCR_Delete(pdf)
lblStatus2.Text = "OCR complete"
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
Re: After OCR output file are 4x larger than Input files.
Hi docu-track99,
Thanks for that - I'll make sure it's passed along.
Cheers,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Is there any update on this issue yet?
- John - Tracker Supp
- Site Admin
- Posts: 5219
- Joined: Tue Jun 29, 2004 10:34 am
- Location: United Kingdom
Re: After OCR output file are 4x larger than Input files.
Not at this time, I am afraid. Could you send us a 'before and after' PDF example that is not too large, so that we can analyse and investigate further?
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Here are the links:
Before: http://doc-it.ftpstream.com/158159/ded1 ... JBIBLE.pdf
After: http://doc-it.ftpstream.com/158159/ab4e ... KJBIBLE.7z
This happens using the code provided before. Please let us know if you need any further information.
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
So, our main question is this: with the PDF Editor -> OCR Pages option, the process takes only about 12 minutes and the output file size increases slightly, to 2.8 MB. By contrast, with the code posted above (which was also provided by Tracker), the whole process takes about 1 hour and the output file size increases significantly, to 890 MB.
- John - Tracker Supp
- Site Admin
- Posts: 5219
- Joined: Tue Jun 29, 2004 10:34 am
- Location: United Kingdom
Re: After OCR output file are 4x larger than Input files.
Thanks - we are investigating using the info supplied and will come back.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
Re: After OCR output file are 4x larger than Input files.
Can you provide a screenshot of the Editor's OCR page dialog before doing the OCR? It looks like the settings you have in the Editor differ from those used in the SDK.
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Attached is the image requested. Please let us know if any code changes are required.
- Attachments
-
- ocr page dialog.zip
- Screenshot of OCR Page Dialog.
- (32.02 KiB) Downloaded 345 times
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Just following up with this ticket. Please let me know if you find the cause of size increase.
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
I am following up with this ticket. Please let us know your findings.
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
Re: After OCR output file are 4x larger than Input files.
When you use the OCR SDK, it rasterizes each page to an image (at 300 dpi) and adds this image to the new document, removing the old content.
So you end up with an image-based document with an invisible text layer. Because your document contains many pages, the resulting PDF is big.
When you use the OCR feature in the viewer, a different method is used: the page is rasterized, the resulting image is used by the OCR process, and then an invisible text layer is added to the document, but not the image. The original content is retained as is.
This is much faster, and the resulting PDF is much smaller.
We will need to revise this, but I am afraid it will take a little time and may not be done before the release of the Editor SDK, which would potentially make this superfluous anyway ...
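To put rough numbers on the rasterization described in this post, here is a back-of-the-envelope sketch in Python. It is not part of the OCR SDK and does not call it. The page size (US Letter, 8.5 x 11 in), the colour depths, the page count and the function names are all assumptions chosen for illustration. The figures are for uncompressed pixels, so a real PDF, which stores compressed images, will be smaller than this.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size, in bytes, of one page rasterized at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def document_raster_bytes(pages, width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a document in which every page becomes one image."""
    return pages * raster_bytes(width_in, height_in, dpi, bits_per_pixel)


if __name__ == "__main__":
    # Assumed page: US Letter at 300 dpi, i.e. 2550 x 3300 pixels.
    print(raster_bytes(8.5, 11, 300, 24))   # 24-bit colour: 25245000 bytes
    print(raster_bytes(8.5, 11, 300, 1))    # 1-bit black and white: 1051875 bytes
    # Hypothetical 100-page document, 24-bit colour.
    print(document_raster_bytes(100, 8.5, 11, 300, 24))
```

Because the pixel count grows with the square of the resolution, doubling the DPI quadruples the raw size. If the output size matters, the raster_dpi and colour settings are the first things worth comparing between the two OCR methods, although how much they affect the final file depends on the compression the SDK applies.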
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
-
- User
- Posts: 518
- Joined: Thu Dec 06, 2007 8:13 pm
Re: After OCR output file are 4x larger than Input files.
Hello Tracker,
Is there a way we can use the PDF Viewer ActiveX Control to OCR a PDF programmatically? Secondly, if we need to wait for the new SDK, do you have any idea of the approximate time frame?
- Paul - Tracker Supp
- Site Admin
- Posts: 6897
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: After OCR output file are 4x larger than Input files.
Hi again Paul,
We have raised a support ticket for getting this change into the SDK: RT#2543: After OCR output file are 4x larger than Input files. At this point, however, I can't speculate as to when it will be done.
hth
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: After OCR output file are 4x larger than Input files.
Is this solved? I still have this problem.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17906
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: After OCR output file are 4x larger than Input files.
Hello Tom,
We are in the process of rewriting our OCR SDK, and when that is released, the problem discussed above should be resolved.
Regards,
Stefan
-
- User
- Posts: 2
- Joined: Sat Aug 15, 2020 3:43 am
- Location: Colombia
After OCR output file are 4x larger than Input files
Output
Pop up asking user to specify location where the output PDF file is to be stored.
File stored in the path provided by the user.
Single SAPScript output in PDF format.
Converting Multiple SAP Script outputs into single PDF file.
- Paul - Tracker Supp
- Site Admin
- Posts: 6897
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: After OCR output file are 4x larger than Input files.
I am sorry Charles, I am not following your comment.
Can you explain in any more detail what your point is?
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com