
OCR of pdf and pictures

Posted: Sat Jan 16, 2016 1:51 am
by crimsonlogic
We bought Pro SDK license under CrimsonLogic Pte Ltd.

I have 3 problems now while doing OCR in my WPF application.

1) I am not able to OCR pdf with 17 pages and above.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));

Please help to advise. Thank you very much.

Re: OCR of pdf and pictures

Posted: Mon Jan 18, 2016 2:11 pm
by John - Tracker Supp
Hi,

Can we please keep all OCR-related questions in one forum, or in email? You are posting in multiple forums and also sending emails, which is not helpful: it divides the effort to assist you, because we first have to check whether items have already been answered in emails or other forums.

I will move this one, and any others, to the OCR forum so we can address them all in one place. Thank you.

Re: OCR of pdf and pictures

Posted: Mon Jan 18, 2016 2:17 pm
by John - Tracker Supp
RE: Questions;

1) I am not able to OCR pdf with 17 pages and above.

Please advise which version of our products is being used and the spec of the hardware (processor, drive space, RAM, OS). Please also provide an example of the PDF being OCR'd. Could it be that you are running out of resources? Perhaps try breaking the job into 'chunks'.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?

Please supply before/after PDF files for us to analyse along with a snippet of the code you are using for this specific task.

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));

I have asked a colleague to help and advise on this specifically...

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 2:14 am
by crimsonlogic
Hi John,

1) I am not able to OCR pdf with 17 pages and above.
>> We bought the license of PDF Xchange PRO SDK
>> On your website it shows
**NEW OCR Module Included** - Now includes PDF-X OCR SDK Module for converting image based PDF files to fully text searchable PDF files at no charge. For more information on this exciting new module and usage requirements for the free new add-on please visit our PDF-X OCR SDK Module page
>> We are using this PDF-X OCR SDK.
>> machine : 8 GB ram, I7, 64Bit OS.
>> Attached the pdf of 17 pages where you can try to OCR and update us on the outcome.
>> (please note that this 17 pages PDF was converted from word doc as your forum does not allow upload)
>> (let us know if you need the word copy to email to you.)
>> please see the code below.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?
>> Attached the pdf for your investigation. Please go through the pdf to see the issue.
>> ( Provide the program file on the OCR code)

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));
>> This we will wait for your feedback.

>> The code for OCR pdf.
private string ConvertPDFToOCR(string m_SourceFilename, string m_DestFilename, string language)
{
    string result = "OK";
    IntPtr pdf;
    int hResult;
    string OCRretcode;
    int m_DPI;
    string m_Datapath = Path.GetDirectoryName(Assembly.GetExecutingAssembly().GetName().CodeBase).Replace("file:\\", "") + @"\OCRLanguages\";

    PDFXOCR_Funcs.PXO_Language m_Language = (PDFXOCR_Funcs.PXO_Language)Array.IndexOf(PDFXOCR_Funcs.OCR_LangFullArrayW, language); //GetOCRLanguage(language);

    string langinit = PDFXOCR_Funcs.OCR_LangArrayW[Array.IndexOf(PDFXOCR_Funcs.OCR_LangFullArrayW, language)];

    // Check if language file exists
    string langfile = m_Datapath + @"ocrdats\" + langinit + "_pxvocr.dat";// m_Datapath + @"ocrdats\eng_pxvocr.dat"; //OCR Language file

    // string err = string.Empty;

    try
    {
        if (!System.IO.File.Exists(langfile))
        {
            result += "Language File Missing";
        }
        m_DPI = 200; //quality of OCR

        string regkey = "XXXXXXXXXXXXXXXXXXXXXXX";
        string devcode = "XXXXXXXXXXXXXXXXXXXXXXX";

        //string key = "YOUR PRODUCT KEY";
        //string code = "YOUR DEVELOPER CODE";
        hResult = PDFXOCR_Funcs.OCR_Init(out pdf, regkey, devcode);

        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "OCR Initialization failure.";
        }

        hResult = PDFXOCR_Funcs.OCR_SetCallback(pdf, thecallback, 0);

        hResult = PDFXOCR_Funcs.OCR_LoadW(pdf, m_SourceFilename);
        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error loading file: \n" + m_SourceFilename + "OCR Library Error";
        }

        PDFXOCR_Funcs.PXO_Options Options = new PDFXOCR_Funcs.PXO_Options();
        Options.blacklist = string.Empty;
        Options.whitelist = string.Empty;
        Options.raster_dpi = m_DPI;
        Options.ImageFlags = (uint)PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate;
        Options.DataPath = m_Datapath;
        Options.lang = m_Language;
        Options.RegionMode = PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto;
        Options.reserved = 0;

        IntPtr pxoPagelist = IntPtr.Zero; // null pointer passed to OCR_MakeSearchable() will result in all pages being OCRd.

        hResult = PDFXOCR_Funcs.OCR_MakeSearchable(pdf, ref Options, pxoPagelist);

        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error running searchable.\nError code: " + hResult.ToString();
        }
        else
        {
            OCRretcode = hResult.ToString();
        }

        hResult = PDFXOCR_Funcs.OCR_SaveW(pdf, m_DestFilename);
        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error saving output PDF file.\nError code: " + hResult.ToString();
        }
        PDFXOCR_Funcs.OCR_Delete(out pdf);
    }
    catch (Exception ex)
    {
        //throw ex;
        result += "[EXCEPTION]" + ex.GetType();
        result += "[EXCEPTION]" + ex.Message;
        result += "[EXCEPTION]" + ex.StackTrace;
        //Dispose();
        //result += "Disposed OCRHelper class";
    }
    return result;
}

>> The code of Convert Word to PDF
private bool ConvertToPDF(string pdfpath, string inputfile)
{
    bool isDone = false;
    PXCComLib5.CPXCPrinter PDFPrinter;
    PXCComLib5.CPXCControlEx prnFactory = new PXCComLib5.CPXCControlEx();
    string regkey = "XXXXXXXXXXXX";
    string devcode = "XXXXXXXXXXXX";
    PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
    PDFPrinter.Option["Save.ShowSaveDialog"] = false;
    PDFPrinter.Option["Save.RunApp"] = false;
    PDFPrinter.Option["Save.Path"] = pdfpath;
    PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

    PDFPrinter.SetAsDefaultPrinter();

    System.Diagnostics.Process printJob = new System.Diagnostics.Process();
    printJob.StartInfo.FileName = inputfile;
    printJob.StartInfo.UseShellExecute = true;
    printJob.StartInfo.Verb = "print";
    printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
    printJob.Start();
    printJob.WaitForExit();
    isDone = true;
    return isDone;
}

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 7:41 am
by Lzcat - Tracker Supp
Hi.
3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));
If you read the help for the PXC_PlaceImage function, you will see that the last two parameters specify the width and height of the image in points (1/72 inch). I cannot see the code of your I2L function, so I cannot say why you are getting such small images: either there is an error in I2L, or the values 3 and 2 are simply too small.
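
As an illustration of the point arithmetic, here is a minimal sketch. It assumes that I2L converts inches to points (which the thread does not confirm); `ImagePlacement`, `InchesToPoints` and `FitInside` are hypothetical helpers, not part of the PDF-XChange SDK. Under that assumption, I2L(3) and I2L(2) would place the image in a 216 x 144 pt box, i.e. only 3 x 2 inches on the page.

```csharp
using System;

// Hypothetical helper, not part of the PDF-XChange SDK.
public static class ImagePlacement
{
    // One PDF point is 1/72 inch.
    public const double PointsPerInch = 72.0;

    public static double InchesToPoints(double inches)
    {
        return inches * PointsPerInch;
    }

    // Scales an image of the given dimensions (any unit, only the aspect
    // ratio matters) so that it fits inside a box given in points.
    public static void FitInside(double imageWidth, double imageHeight,
                                 double maxWidthPt, double maxHeightPt,
                                 out double widthPt, out double heightPt)
    {
        if (imageWidth <= 0 || imageHeight <= 0)
            throw new ArgumentOutOfRangeException("imageWidth", "Image dimensions must be positive.");

        // Use the smaller scale factor so both sides stay inside the box.
        double scale = Math.Min(maxWidthPt / imageWidth, maxHeightPt / imageHeight);
        widthPt = imageWidth * scale;
        heightPt = imageHeight * scale;
    }
}
```

For example, on a Letter page with 1-inch margins the available box is `InchesToPoints(6.5)` by `InchesToPoints(9)`, i.e. 468 x 648 pt; the resulting width and height could then be passed as the last two arguments of PXC_PlaceImage.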
HTH.

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 7:51 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

As for the error code: it is OCR_ERR_INVALID_DICT_PATH, which means that you gave the wrong path to the dictionary folder.

Please use these functions when investigating problems in future:

OCRCORE_API LONG OCR_API OCRE_Err_FormatSeverity(HRESULT errorcode, LPSTR buf, LONG maxlen);
OCRCORE_API LONG OCR_API OCRE_Err_FormatFacility(HRESULT errorcode, LPSTR buf, LONG maxlen);
OCRCORE_API LONG OCR_API OCRE_Err_FormatErrorCode(HRESULT errorcode, LPSTR buf, LONG maxlen);
HTH,
Alex

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 9:14 am
by crimsonlogic
Hi Sasha,

Sorry, don't quite understand. which error code you are referring to??

Thanks

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 9:32 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

It's about the error code that you asked about: ERROR CODE -2113263855 == 0x820A2711
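
The two notations are the same 32-bit value: the negative decimal number is the signed reading of the hexadecimal code. A small sketch of the conversion follows; `ErrorCodeFormat` is a hypothetical helper name used only for illustration.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class ErrorCodeFormat
{
    // Reinterprets the signed 32-bit result code as unsigned and
    // formats it as eight hexadecimal digits.
    public static string ToHex(int code)
    {
        return "0x" + unchecked((uint)code).ToString("X8");
    }
}
```

`ErrorCodeFormat.ToHex(-2113263855)` returns "0x820A2711", which is the form the SDK headers and support staff usually quote.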

HTH

Re: OCR of pdf and pictures

Posted: Tue Jan 19, 2016 9:04 pm
by Sasha - Tracker Dev Team
By the way, it would be better if you could provide a small sample project (with your DLLs included) in which the problems occur, plus a guide on how to reproduce them. Then we could help you more efficiently, because right now there are many questions on our side that could be answered if we had a working project.

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 9:23 am
by crimsonlogic
Hi Sasha,

We will email a sample program and documents for you to try out (to support@tracker-software.com), because of the attachment size limit in this forum. We will send them in 2 separate emails. Thanks for your help.

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 9:30 am
by crimsonlogic
Hi Sasha,

We've tried to send you the programs and sample files via email but failed to send due to the file size. Do you have any other alternative way to deposit our files? Thanks.

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 9:44 am
by John - Tracker Supp
How big are the attachments ?

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 9:51 am
by crimsonlogic
Program file is about 25MB and sample files are about 4MB after zipping

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 9:58 am
by Sasha - Tracker Dev Team
Please post them to google drive or dropbox and give us a link.

Cheers,
Alex

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 10:32 am
by crimsonlogic
Hi Sasha,

Our client is a government agency, and they prohibit us from uploading their code to the cloud due to security concerns.

Please provide a secure repository to which we can upload the files. Thank you very much.

Re: OCR of pdf and pictures

Posted: Wed Jan 20, 2016 10:40 am
by Tracker Supp-Stefan
Hello crimsonlogic,

Maybe you can upload the files to our ftp server?
You can find the details for it here:
http://www.tracker-software.com/knowledgebase/321
However, as the FTP server is open to anyone, we recommend that you password-protect the uploaded files and then send us the password, e.g. via e-mail to support@tracker-software.com

Regards,
Stefan

Re: OCR of pdf and pictures

Posted: Thu Jan 21, 2016 3:30 am
by crimsonlogic
Hi Stefan,

Thank you for your reply. We have uploaded the files and sent password in email.

Re: OCR of pdf and pictures

Posted: Thu Jan 21, 2016 7:30 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

Thanks for the sample - we'll look at it.

Re: OCR of pdf and pictures

Posted: Fri Jan 22, 2016 1:36 am
by crimsonlogic
Hi Sasha,

Any updates??

Thanks

Re: OCR of pdf and pictures

Posted: Fri Jan 22, 2016 11:12 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

Looking at the files in media.zip, this is what we have found so far:
The DWC.pdf file had already been OCR'd by some external converter (libtiff / tiff2pdf - 2.3.606.0), which added a text overlay consisting of invisible text.

When this file is OCR'd, that text becomes visible, and the background image plus this text goes through our OCR engine. Thus you get the visible text (aligned to the top in your example) and the OCR'd background image with the invisible text on top of it. Of course, this text will be corrupted wherever the image was overlaid with the previously invisible text.
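
For readers wondering how such an overlay could be spotted: in PDF content streams, invisible text is normally produced by setting text rendering mode 3 with the `Tr` operator. The sketch below is only a rough heuristic under stated assumptions. It assumes the page content stream has already been decompressed to text by some PDF library (not shown), it can be fooled by a literal string that happens to contain "3 Tr", and `InvisibleTextHeuristic` is a hypothetical name, not a PDF-XChange SDK feature.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical heuristic, not part of the PDF-XChange SDK.
public static class InvisibleTextHeuristic
{
    // Matches the operand 3 followed by the Tr operator
    // (text rendering mode 3 = neither fill nor stroke, i.e. invisible).
    private static readonly Regex InvisibleMode =
        new Regex(@"(?<![\d.+-])3\s+Tr(?![A-Za-z0-9])", RegexOptions.Compiled);

    // decodedContentStream: an already-decompressed page content stream.
    public static bool UsesInvisibleText(string decodedContentStream)
    {
        if (decodedContentStream == null)
            throw new ArgumentNullException("decodedContentStream");
        return InvisibleMode.IsMatch(decodedContentStream);
    }
}
```

A page whose stream contains something like `BT /F1 12 Tf 3 Tr (hello) Tj ET` on top of a full-page image is a strong hint that the file already carries an OCR text layer.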

HTH,
Alex

Re: OCR of pdf and pictures

Posted: Mon Jan 25, 2016 8:21 am
by crimsonlogic
HI Sasha,

Is it possible to know whether a file has already been OCR'd when it is passed through the PDF-XChange SDK?

Any updates on the other issue?


Thanks
fya

Re: OCR of pdf and pictures

Posted: Mon Jan 25, 2016 8:28 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

Maybe it's better to look at the PDF generator and its options, so that it won't generate any text?

Do you mean the 17 page problem as the other problem?

Cheers,
Alex

Re: OCR of pdf and pictures

Posted: Tue Jan 26, 2016 2:45 am
by crimsonlogic
HI Sasha,

yes, we need the solution of the 17 pages error.


Thanks
fya

Re: OCR of pdf and pictures

Posted: Tue Jan 26, 2016 2:52 am
by crimsonlogic
HI Sasha,

We don't understand your statement:

Maybe it's better to look at the pdf generator and its options so that it won't generate any text?

The PDF program given performs OCR, which causes the overlay. What do you mean by the PDF generator?

The other issue concerns a Word file that is converted to PDF and then OCR'd.
The conversion to PDF has no issue,
whereas the OCR process throws an error.
Please try the program, as we took the effort to build it to show the issue.
Please get a developer to look at the code if you are not able to do so.

We need the solution ASAP, as we reported these issues over a week ago and there has been no progress.

thanks
fya

Re: OCR of pdf and pictures

Posted: Tue Jan 26, 2016 7:54 am
by Ivan - Tracker Software
yes, we need the solution of the 17 pages error.
As we already mentioned, the problem is that your process is 32-bit.
32-bit processes have limited address space available and, most importantly, in modern OSes Address Space Layout Randomization (https://en.wikipedia.org/wiki/Address_s ... domization) makes this address space highly fragmented, so an application often cannot allocate a large contiguous memory buffer (for example, rasterizing one Letter page at 300 dpi requires about 32 MB of memory).
The only solutions I can recommend here:
1. Create a separate .exe that will OCR the document, and turn off ASLR for this .exe (I am not sure whether .NET allows that).
2. Convert your app to 64-bit.
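
To make the memory figure concrete, here is a back-of-the-envelope sketch. It assumes an uncompressed raster at 4 bytes per pixel, which is an assumption chosen because it reproduces the "about 32 MB" quoted above; the engine's real internal format is not stated in this thread. `RasterMemory` is a hypothetical helper name.

```csharp
using System;

// Hypothetical helper for estimating raster buffer sizes.
public static class RasterMemory
{
    // Bytes needed for one uncompressed page raster.
    public static long BytesForPage(double widthInches, double heightInches,
                                    int dpi, int bytesPerPixel)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }
}
```

A Letter page (8.5 x 11 in) at 300 dpi gives 2550 x 3300 x 4 = 33,660,000 bytes, and each such buffer must be one contiguous allocation. At the 200 dpi used in the code posted earlier the figure is 14,960,000 bytes. Whether the current process is 64-bit can be checked at run time with `Environment.Is64BitProcess`.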

HTH

Re: OCR of pdf and pictures

Posted: Tue Feb 02, 2016 4:49 am
by crimsonlogic
Hi,

As Alex said above, the overlaid text is due to the PDF we use having already been OCR'd. How can we know whether a PDF has already been OCR'd?

We have another problem, in converting a Word file to PDF.

First, we open one Word document (doc1.docx). Then we launch our application and upload another Word document (doc2.docx), which runs the code below to convert it to PDF. The default printer is set to a physical printer.

The code below still uses the physical printer instead of the PDF-XChange printer: doc2.docx is printed on the physical printer instead of being converted to PDF. Please advise us ASAP, as this issue is stopping business flows in our live system.


PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
PDFPrinter.Option["Save.ShowSaveDialog"] = false;
PDFPrinter.Option["Save.RunApp"] = false;
PDFPrinter.Option["Save.Path"] = pdfpath;
PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

PDFPrinter.SetAsDefaultPrinter();

System.Diagnostics.Process printJob = new System.Diagnostics.Process();
printJob.StartInfo.FileName = inputfile;
printJob.StartInfo.UseShellExecute = true;
printJob.StartInfo.Verb = "print";
printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
printJob.Start();
printJob.WaitForExit(60000);

PDFPrinter.RestoreDefaultPrinter();

Re: OCR of pdf and pictures

Posted: Tue Feb 02, 2016 8:39 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

We suspect that this is a Windows 10 issue.
Please try this; we have just tested this code and it worked for us:

            PXCComLib5.CPXCPrinter PDFPrinter;
            PXCComLib5.CPXCControlEx prnFactory = new PXCComLib5.CPXCControlEx();

            PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
            PDFPrinter.Option["Save.ShowSaveDialog"] = false;
            PDFPrinter.Option["Save.RunApp"] = false;
            PDFPrinter.Option["Save.Path"] = ocrfile;
            PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

            System.Diagnostics.Process printJob = new System.Diagnostics.Process();
            printJob.StartInfo.FileName = inputfile;
            printJob.StartInfo.UseShellExecute = true;
            printJob.StartInfo.Verb = "printto";
            printJob.StartInfo.Arguments = "\"" + PDFPrinter.Name + "\"";
            printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
            printJob.Start();
            printJob.WaitForExit(60000);

            return "ok";
HTH
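
The key difference from the earlier code is the shell verb: "print" sends the document to whatever the handling application considers the default printer, while "printto" takes the target printer name as an argument, so the system default does not need to be changed. A minimal sketch of building such a job is below; `PrintJobs.BuildPrintTo` is a hypothetical helper, and whether "printto" is registered depends on the application associated with the file type (the `ProcessStartInfo.Verbs` property lists the verbs available for a given file).

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only.
public static class PrintJobs
{
    // Builds a shell "printto" job that targets a named printer
    // instead of relying on the system default printer.
    public static ProcessStartInfo BuildPrintTo(string inputFile, string printerName)
    {
        if (string.IsNullOrEmpty(inputFile))
            throw new ArgumentException("Input file is required.", "inputFile");
        if (string.IsNullOrEmpty(printerName))
            throw new ArgumentException("Printer name is required.", "printerName");

        return new ProcessStartInfo
        {
            FileName = inputFile,
            UseShellExecute = true,   // shell verbs require ShellExecute
            Verb = "printto",
            // Quote the name because printer names usually contain spaces.
            Arguments = "\"" + printerName + "\"",
            WindowStyle = ProcessWindowStyle.Minimized
        };
    }
}
```

The returned object can be passed to `Process.Start`, followed by `WaitForExit(60000)` as in the code above.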

Re: OCR of pdf and pictures

Posted: Wed Feb 17, 2016 10:25 am
by crimsonlogic
Hi Support,

I converted my application to 64-bit on Tracker's advice.
I am now not able to convert image files to PDF. I've replaced all DLLs with those from the Bin.64 folder under Tracker Software\PDF-XChange PRO 5 SDK\Examples.
Our code is as follows:
if (Common.IS_DS_FAILED(PDFXC_Funcs.PXC_NewDocument(out pdf, regkey, devcode)))
    resultstr += "ConvertOthersToOCR: IS_DS_FAILED";
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Author, "Tracker Software");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Title, "PDF-XChange 4.0 Examples");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Creator, "PDF-XChange 4.0");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Keywords, "PDF-XChange; Examples; 4.0; C#");
PDFXC_Funcs.PXC_EnableLinkAnalyzer(pdf, true);
PDFXC_Funcs.PXC_SetCompression(pdf, false, false, PDFXC_Funcs.PXC_CompressionType.ComprType_C_Auto,
    75, PDFXC_Funcs.PXC_CompressionType.ComprType_I_Auto, PDFXC_Funcs.PXC_CompressionType.ComprType_M_Auto);

int res = PDFXC_Funcs.PXC_AddPage(pdf, Common.PW, Common.PH, out page);
if (Common.IS_DS_FAILED(res))
    resultstr += "ConvertOthersToOCR: " + res;
cpage = page;

double iw, ih;
res = PDFXC_Funcs.PXC_AddImageA(pdf, inputfile, out p);
if (Common.IS_DS_FAILED(res))
    resultstr += "ConvertOthersToOCR: " + res;
PDFXC_Funcs.PXC_GetImageDimension(pdf, p, out iw, out ih);
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(7), Common.I2L(8));

PDFXC_Funcs.PXC_WriteDocumentExA(pdf, extractfile, extractfile.Length, fl, "");
PDFXC_Funcs.PXC_ReleaseDocument(pdf);

I am getting error code -2113667071 from the line below, and no PDF is generated.

res = PDFXC_Funcs.PXC_AddImageA(pdf, inputfile, out p);

Please advise.

Thank you very much.

Re: OCR of pdf and pictures

Posted: Wed Feb 17, 2016 10:58 am
by Sasha - Tracker Dev Team
Hello crimsonlogic,

Please do not post error codes only; use the PXC_Err_FormatErrorCode method to get the message.
The error code that you provided means Invalid Argument.
The code sample does not contain enough information to tell why that method fails.
Please provide samples with FULL problem data.

Re: OCR of pdf and pictures

Posted: Wed Feb 17, 2016 11:30 am
by crimsonlogic
Hi Sasha,

We are uploading sample project (TestPDFXChangeORG.zip) to Tracker's FTP . Please unzip with the password sent in a separate email to 'support@tracker-software.com'

The sample data file (CL.TIF) is in Temp.zip.

Please advise how we can use PXC_Err_FormatErrorCode in our program too.

Thank you very much.

Re: OCR of pdf and pictures

Posted: Wed Feb 17, 2016 3:18 pm
by Sasha - Tracker Dev Team
How to use the FormatErrorCode method:

            byte[] bytes = new byte[128 * sizeof(char)];
            PDFXC_Funcs.PXC_Err_FormatErrorCode(-2113667071, bytes, bytes.Length);
            // The buffer is zero-padded, so trim the trailing NULs after decoding.
            string str = System.Text.Encoding.ASCII.GetString(bytes).TrimEnd('\0');
Please post the error message together with the error code whenever you need to include one in your message.

Cheers,
Alex

Re: OCR of pdf and pictures

Posted: Thu Feb 18, 2016 12:42 pm
by Sasha - Tracker Dev Team
Hello crimsonlogic,

I've updated the zip archive ClassLibrary1.zip, using the same password that you specified.
The problem was in the int type: C# treats int as a 32-bit value, so when you switched to x64 the pointers that were stored in int variables became corrupted. I changed them to IntPtr and it all worked properly.
The archive contains the files that I modified.
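
To show why an int cannot hold a native handle in a 64-bit process, here is a small sketch that simulates the truncation. `HandleWidth` is a hypothetical name and the handle values are made up for illustration; the actual declarations changed in the archive are not shown in this thread.

```csharp
using System;

// Hypothetical demonstration, not taken from the modified project files.
public static class HandleWidth
{
    // Returns true if a native handle value survives being stored in a
    // 32-bit C# int, false if the upper bits are lost.
    public static bool SurvivesInt32(long handleValue)
    {
        int truncated = unchecked((int)handleValue); // keeps only the low 32 bits
        return truncated == handleValue;
    }
}
```

`HandleWidth.SurvivesInt32(0x12345678)` is true, but a typical 64-bit address such as `0x00007FF612345678` does not survive. `IntPtr` avoids the problem because its size follows the process: `IntPtr.Size` is 4 in a 32-bit process and 8 in a 64-bit one.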

HTH,
Alex