
Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 16, 2018 5:12 am
by MartinCS
Hi Tracker-Team,

I'm using the following code to run OCR on PDF documents (https://sdkhelp.pdf-xchange.com/vi ... t_OCRPages):

Code:

var nId = pdfCtl.Inst.Str2ID("op.document.OCRPages", false);
var pOp = pdfCtl.Inst.CreateOp(nId);
var input = pOp.Params.Root["Input"];

input.v = pdfDocumentModel.PxvDocument.CoreDoc;

ICabNode options = pOp.Params.Root["Options"];
options["PagesRange.Type"].v = rangeType;
options["OutputType"].v = outputType;
options["OutputDPI"].v = outputDpi;

pdfCtl.Inst.AsyncDoAndWaitForFinish(pOp);
The code works without problems. My question is: how can I get the searchable text of the newly created layer? I need to save this searchable text in our backend database system.

Additionally, I'd like to ask how to use the 'Languages' folder in order to get the best OCR results. I have put the folder in the same directory where the PDFXEdit DLLs reside:
16-01-_2018_07-29-37.jpg
But I don't get the same OCR result as when I test with the standalone PDF Editor. In the constructor of my main form (which contains the PDF control) I load the OCR plugin, and the saved PDF document contains OCR text. So I don't think there is a problem with the plugin.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 16, 2018 1:36 pm
by Sasha - Tracker Dev Team
Hello Martin,

The OCR operation internally calls the https://sdkhelp.pdf-xchange.com/vi ... addContent operation for each page, and that operation has a Content option. You can take the text from there.
As for your second question, make sure that the settings of the End-User Editor and of your sample are the same.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Jan 17, 2018 10:14 am
by MartinCS
Hello Alex,

Thank you for the information. Unfortunately, it's not clear to me how I can access the "addContent" function, as it is internal to the "OcrPages" function. Is there maybe an event that I've not seen before that would give me access to this stage of the code execution, or could you please elaborate on how I would go about getting access to the "Content" property?

Thank you!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Jan 17, 2018 10:42 am
by Sasha - Tracker Dev Team
Hello Martin,

In this case the OCR Pages operation calls the AddContent operations from inside of it, so you can listen to these inner operations and take their data.
Listen to the https://sdkhelp.pdf-xchange.com/vi ... oreExecute event and check whether it is for the OCR pages operation; if so, set a bool flag. When the e.operExecuted event arrives for the OCR operation, reset that flag.
While the flag is true, watch the operBeforeExecute events for addContent operations: these are the ones being executed from within the OCR operation, and you can read the needed information from each of them.

Cheers,
Alex
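
Note: the flag-based approach described above can be sketched roughly as follows. This is a minimal sketch, not SDK code. The class and method names are hypothetical, and because the thread does not show how the event arguments expose the operation ID or the Content option, both are passed in as plain strings. The ID of the inner addContent operation is also not spelled out in the thread, so it is a constructor parameter here.

```csharp
using System;
using System.Collections.Generic;

// Collects text from inner operations that run while an outer operation is active.
// Hypothetical helper: call its methods from your own handlers for the
// e.operBeforeExecute and e.operExecuted events.
public sealed class NestedOperationTextCollector
{
    private readonly string _outerOpId;
    private readonly string _innerOpId;
    private readonly List<string> _texts = new List<string>();
    private bool _outerRunning; // the bool flag described in the post above

    public NestedOperationTextCollector(string outerOpId, string innerOpId)
    {
        _outerOpId = outerOpId;
        _innerOpId = innerOpId;
    }

    // Text gathered so far, in the order the inner operations were seen.
    public IReadOnlyList<string> Texts => _texts;

    // Call from the "before execute" handler. For inner operations, pass the
    // value read from their Content option; otherwise pass null.
    public void OnBeforeExecute(string opId, string content)
    {
        if (opId == _outerOpId)
        {
            _outerRunning = true;
        }
        else if (_outerRunning && opId == _innerOpId && content != null)
        {
            _texts.Add(content);
        }
    }

    // Call from the "executed" handler; resets the flag once the outer operation ends.
    public void OnExecuted(string opId)
    {
        if (opId == _outerOpId)
        {
            _outerRunning = false;
        }
    }
}
```

The outer ID would be "op.document.OCRPages" from the first post. Inner operations seen before the outer one starts, or after it finishes, are ignored, so addContent calls triggered elsewhere do not end up in the collected text.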

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 1:10 pm
by MartinCS
Hi Alex,

thank you for your help!

Regarding my second question: I'm not able to configure which languages should be used by default. In my implementation I'm using the SDK with the ActiveX control, and I have put the OCR languages in the same folder where your DLLs reside:
18-01-_2018_13-51-36.jpg
The 'OCRLanguages' folder has all available language files included:
18-01-_2018_13-52-37.jpg
When my application starts, I load the OCR plugin (right after 'InitializeComponent'):

Code:

public FrmMain()
{
	this.InitializeComponent();
	
	Instance.PxvInst.StartLoadingPlugins();
	PxvInst.AddPluginFromFile(
                Path.Combine(EplassConfiguration.FilePath, "Plugins.x86", "OCRPlugin.pvp"));
	PxvInst.FinishLoadingPlugins();
}
Right after that I import my *.xcs settings file, hoping to set the default languages that I configured and exported via the standalone PDF Editor:
18-01-_2018_14-00-08.jpg
with this code:

Code:

var op = pdfCtl.Inst.CreateOp(pdfCtl.Inst.Str2ID("op.settings.import"));

if (op == null)
{
    return;
}

op.Params.Root["Options.History"].v = false;

op.Params.Root["Input"].v = FsInst.DefaultFileSys.StringToName(filePathSettings);
op.Do();
After that I run the OCR function (see the code in my very first post). I don't get any exceptions and everything seems to run without problems. The new OCR-processed file is also created.

But the text in the searchable layer differs from what I get when I run OCR in the standalone PDF Editor.

I hope you guys can help me!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 1:22 pm
by Sasha - Tracker Dev Team
Hello Martin,

Try this in the OCR operation:

Code:

ICabNode options = pOp.Params.Root["Options"];
options["ExtParams.Language"].v = "deu+eng+fra+spa"; //separate the needed languages with +
options["ExtParams.Accuracy"].v = 300;
Cheers,
Alex
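
Note: if the language list comes from configuration rather than a literal, the "+"-separated value shown above could be assembled with a small helper. This is only a sketch; the `OcrLanguages` class is hypothetical and not part of the SDK. It trims entries, skips blanks, and drops duplicates while keeping the first-seen order.

```csharp
using System;
using System.Collections.Generic;

public static class OcrLanguages
{
    // Joins language codes such as "deu" and "eng" into "deu+eng".
    public static string Join(IEnumerable<string> codes)
    {
        if (codes == null) throw new ArgumentNullException(nameof(codes));

        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var kept = new List<string>();
        foreach (var code in codes)
        {
            var trimmed = (code ?? string.Empty).Trim();
            // HashSet.Add returns false for a code that was already kept.
            if (trimmed.Length > 0 && seen.Add(trimmed))
            {
                kept.Add(trimmed);
            }
        }
        return string.Join("+", kept);
    }
}
```

The result would then be assigned to `options["ExtParams.Language"].v` exactly as in the snippet above.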

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 1:46 pm
by MartinCS
Hi Alex,

I added your code lines but it made no difference. Can I send you the PDF I'm using via PM or email? I would just need your email address.

The recognized text from the OCR function is:
Schon heule isl der an- und abschwellende Gülerzuglörm für uns sehr
slörend und hal bereils slark zugenommen. Die vorgelegle Planung eriülll níchl
einmal die geseizlíchen Grenzwene. Schienenlörm isl níchl harmloser als anderer Lörm,
der an- und abschwellende Lörm isl sogar schlimmer. Die „Millelung“ des Lörms muss
abgeschaffl werden, weil die slörende Unlerbrechungswírkung der Lörmspilzen so níchl
erfassl wird.
This is the result of the standalone Pdf Editor:
Schon heute ist der an- und abschwellende Güterzuglörm für uns sehr
störend und hat bereits stark zugenommen. Die vorgelegte Planung erfüllt nicht
einmal die gesetzlichen Grenzwerte. Schienenlörm ist nicht harmloser als anderer Lärm,
der an— und abschwellende Larm ist sogar schlimmer. Die „Mittelung“ des Lörms muss
abgeschafft werden, weil die störende Unterbrechungswirkung der LdrmSpitzen so nicht
erfasst wird.
I noticed that text recognition in the PDF Editor takes a little longer, whereas the OCR function in my code is much faster.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 1:51 pm
by Tracker Supp-Stefan
Hi Martin,

You can send the sample file to support@pdf-xchange.com and we will pass it along to Alex!

Cheers,
Stefan

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 2:05 pm
by MartinCS
Hi Stefan,

I sent you an email containing the PDF file. Please also forward my second email, which contains the screenshot of my PDF Editor settings.

Thank you!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 2:09 pm
by Tracker Supp-Stefan
Hi Martin,

Thanks, we got the files and are already passing them along!

Cheers,
Stefan

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Jan 18, 2018 3:24 pm
by MartinCS
:D Thank you!

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Fri Jan 19, 2018 11:03 am
by Sasha - Tracker Dev Team
Hello Martin,

I've reproduced some strange behavior - will investigate.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Fri Jan 19, 2018 3:57 pm
by MartinCS
Hi Alex,

thank you for the information! I'm hoping you will find the issue. Fingers crossed!

Will wait for your reply.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Sat Jan 20, 2018 7:46 am
by Sasha - Tracker Dev Team
Hello Martin,

I've found what caused this behavior: it is the license key. When OCR renders the page for recognition, it also renders the watermark. If you specify a valid dev key, the OCR will work the same as in the End-User Editor.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Mon Jan 22, 2018 5:13 am
by MartinCS
Hi Alex,

I'm happy you've been able to find the issue, although I don't understand why our license key should be the problem. Last year we sent your sales department a signed license agreement, together with the license key we received for the Editor SDK. I will send you the license agreement in a separate email. Could you please check the license and let me know if there is something wrong with it?

Thank you!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Mon Jan 22, 2018 8:07 am
by Sasha - Tracker Dev Team
Hello Martin,

That's the issue that I experienced (it occurs if you enter an invalid key and get a watermark on the page). In your case, it looks like some settings are off. Please try doing the OCR with these parameters (these should be identical to the screenshot that you provided earlier):

Code:

private void OCRPages(PDFXEdit.IPXV_Inst Inst, PDFXEdit.IPXV_Document Doc)
{
	int nID = Inst.Str2ID("op.document.OCRPages", false);
	PDFXEdit.IOperation Op = Inst.CreateOp(nID);
	PDFXEdit.ICabNode input = Op.Params.Root["Input"];
	input.v = Doc;
	PDFXEdit.ICabNode options = Op.Params.Root["Options"];
	options["PagesRange.Type"].v = "All"; //OCR all pages
	options["OutputType"].v = 0;
	options["OutputDPI"].v = 300;
	options["ExtParams.Language"].v = "deu+eng"; //separate the needed languages with +
	options["ExtParams.Accuracy"].v = 300;
	options["ExtParams.AutoDeskew"].v = false;
	Inst.AsyncDoAndWaitForFinish(Op);
}
Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 23, 2018 12:31 pm
by MartinCS
Alex,

I sent you an email today containing a link for a solution file. If you have any further questions, please let me know.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 23, 2018 1:02 pm
by Sasha - Tracker Dev Team
Hello Martin,

What email are you talking about exactly? The code that I provided should work the same as the OCR in the End-User Editor. Have you tried it?

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Jan 24, 2018 4:24 am
by MartinCS
Hi Alex,

I have to apologize! I just realized that I used the wrong email address when I sent my email yesterday. I will re-send it to the correct address this time. Yes, I did try the code, but it doesn't improve anything. I also noticed that this line of code takes far less time than OCR takes in the End-User Editor:

Code:

pdfCtl.Inst.AsyncDoAndWaitForFinish(pOp);
It also seems that the OCR stops right after the first page, and when I open "OCR-Processed.pdf" (see the code in the solution), the additional text layer has not been added.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Jan 24, 2018 8:18 am
by Sasha - Tracker Dev Team
Hello Martin,

Please email me directly at polaringu@tracker-software.com.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Jan 24, 2018 3:04 pm
by Sasha - Tracker Dev Team
Hello Martin,

It seems there is a problem with languages in your project - please read this post:
viewtopic.php?p=97913#p97951
Also, there is a problem with plugin loading and Instance usage:
1) The InitializeComponent method should be called after the Inst initialization and plugin loading
2) You should also call the Shutdown method for Inst in the FormClosed event.
For all of this see FullDemo.
Also, the strange OCR results are reoccurring here - will investigate. Are you sure that you have taken all of the latest files from the End-User Editor?

Cheers,
Alex
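
Note: the ordering from points 1) and 2) above can be illustrated with a small sketch. It deliberately does not use the real COM types: `IEditorInstance` is a hypothetical stand-in that mirrors only the method names mentioned in this thread, `EditorHost` is a hypothetical wrapper, and the instance initialization step itself is left out because it is not shown here. FullDemo remains the reference for the real calls.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK instance; not the real COM interface.
public interface IEditorInstance
{
    void StartLoadingPlugins();
    void AddPluginFromFile(string path);
    void FinishLoadingPlugins();
    void Shutdown();
}

// Enforces the order: load plugins first, build the UI second, shut down on close.
public sealed class EditorHost
{
    private readonly IEditorInstance _inst;
    private bool _closed;

    public EditorHost(IEditorInstance inst, IEnumerable<string> pluginPaths, Action initializeComponent)
    {
        if (inst == null) throw new ArgumentNullException(nameof(inst));
        if (pluginPaths == null) throw new ArgumentNullException(nameof(pluginPaths));
        if (initializeComponent == null) throw new ArgumentNullException(nameof(initializeComponent));
        _inst = inst;

        _inst.StartLoadingPlugins();
        foreach (var path in pluginPaths)
        {
            _inst.AddPluginFromFile(path);
        }
        _inst.FinishLoadingPlugins();

        // Stands in for InitializeComponent(), which creates the control.
        initializeComponent();
    }

    // Call from the form's FormClosed handler; safe to call more than once.
    public void OnFormClosed()
    {
        if (_closed) return;
        _closed = true;
        _inst.Shutdown();
    }
}

// Test double that records calls so the ordering can be inspected.
public sealed class RecordingInstance : IEditorInstance
{
    public List<string> Calls { get; } = new List<string>();
    public void StartLoadingPlugins() { Calls.Add("StartLoadingPlugins"); }
    public void AddPluginFromFile(string path) { Calls.Add("AddPluginFromFile:" + path); }
    public void FinishLoadingPlugins() { Calls.Add("FinishLoadingPlugins"); }
    public void Shutdown() { Calls.Add("Shutdown"); }
}
```

In a real form, the constructor body would follow the same sequence with the actual instance, and the FormClosed handler would call Shutdown once.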

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 30, 2018 5:47 am
by MartinCS
Hi Alex,

I addressed the two points you mentioned, and I'm absolutely positive that I'm using the latest files from the End-User Editor. I also tested the updated files from the version of the End-User Editor you published last week. But there is still no improvement.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Jan 30, 2018 7:55 am
by Sasha - Tracker Dev Team
Hello Martin,

We are holding a release in a couple of days - I will be able to tend to your problem afterwards.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Feb 01, 2018 4:48 am
by MartinCS
Hi Alex,

Thank you very much for the information!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Feb 01, 2018 11:45 am
by Tracker Supp-Stefan
:D

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Sat Feb 03, 2018 1:28 pm
by Sasha - Tracker Dev Team
Hello Martin,

Have you tried the new build? Please do so and see whether the problem still reproduces.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Feb 08, 2018 3:44 pm
by MartinCS
Hi Alex,

I sent you an email with detailed information.

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Feb 13, 2018 7:56 am
by MartinCS
Hi Alex,

have you received my email, and have you had a chance to look at it?

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Feb 13, 2018 3:57 pm
by Sasha - Tracker Dev Team
Hello Martin,

Got your E-Mail and will further investigate the problem tomorrow.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Feb 15, 2018 8:41 am
by Sasha - Tracker Dev Team
Hello Martin,

Yesterday I tested your sample and the End-User Editor; both give the same output.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Tue Feb 27, 2018 10:24 am
by MartinCS
Hi Alex,

it's still not working on my computer with the sample project I created for you. I also looked through the forum for similar OCR issues and for the correct way to load the plugin dependencies and files. To be honest, I'm confused about how to correctly set up a .NET project using C# and the ActiveX components of the Tracker Editor SDK.

This is my bin (output) folder, when I compile the sample project with Visual Studio 2017:
27-02-_2018_11-04-11.jpg
1. Contains the sample pdf files for this test project
2. Contains the same files as the corresponding folder of the End-User Editor installation, e.g. "C:\Program Files\Tracker Software\PDF Editor\Plugins.x86", with files like "OCRPlugin.pvp" etc.
3. Contains the same "OCRLanguages" folder as the End-User Editor installation, e.g. "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages", with files like "deu_pxvocr.dat", "deu_pxvocr.lng" etc.
4. These are the original DLL files from Tracker, which were referenced and published by Visual Studio when I put the ActiveX PDF control "AxPDFXEdit.AxPXV_Control" on a form.
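
Note: a layout like the one listed above can be sanity-checked at startup before running OCR. The following is a minimal sketch using only System.IO. The `OcrDeploymentCheck` class is hypothetical, and it assumes that every language follows the "<code>_pxvocr.dat" / "<code>_pxvocr.lng" naming seen for "deu" above, which the thread does not confirm for other languages.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrDeploymentCheck
{
    // Returns the expected paths that do not exist on disk.
    // If the languages folder itself is missing, only the folder is reported.
    public static List<string> FindMissing(string pluginPath, string languagesDir, IEnumerable<string> languageCodes)
    {
        if (languageCodes == null) throw new ArgumentNullException(nameof(languageCodes));

        var missing = new List<string>();
        if (!File.Exists(pluginPath))
        {
            missing.Add(pluginPath);
        }
        if (!Directory.Exists(languagesDir))
        {
            missing.Add(languagesDir);
            return missing;
        }
        foreach (var code in languageCodes)
        {
            // Assumed naming pattern, taken from the "deu" example files.
            foreach (var ext in new[] { ".dat", ".lng" })
            {
                var file = Path.Combine(languagesDir, code + "_pxvocr" + ext);
                if (!File.Exists(file))
                {
                    missing.Add(file);
                }
            }
        }
        return missing;
    }
}
```

For example, the plugin path could be `Path.Combine(startUpPath, "Plugins.x86", "OCRPlugin.pvp")` and the language codes `new[] { "deu", "eng" }`. An empty result only shows that the files are present, not that the SDK actually picks them up from that location.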

But I still have the issue that the following line of code does not seem to be executed:

pdfCtl.Inst.AsyncDoAndWaitForFinish(pOp);

Code:

var nId = pdfCtl.Inst.Str2ID("op.document.OCRPages", false);
var pOp = pdfCtl.Inst.CreateOp(nId);
var input = pOp.Params.Root["Input"];
input.v = doc;
ICabNode options = pOp.Params.Root["Options"];
options["PagesRange.Type"].v = "All";
options["OutputType"].v = 0;
options["OutputDPI"].v = 300;
options["ExtParams.Language"].v = "deu+eng";
options["ExtParams.Accuracy"].v = 300;
options["ExtParams.AutoDeskew"].v = false;

pdfCtl.Inst.AsyncDoAndWaitForFinish(pOp);
Is there something missing in my project setup to get this OCR function to work?

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Feb 28, 2018 7:30 am
by Sasha - Tracker Dev Team
Hello Martin,

Have you initialized the OCR plug-in in your project before the control initialization?
Have you tried doing pOp.Do()?

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Feb 28, 2018 7:59 am
by MartinCS
Hi Alex,

yes, I did.
Alex wrote: "Have you initialized the OCR plug-in in your project before the control initialization?"

Code:

public frmMain()
{
	_pxvInst = new PXV_Inst();

	InitializeComponent();

	_fsInst = (IAFS_Inst)pdfCtl.Inst.GetExtension("AFS");
	_pxcInst = (IPXC_Inst)pdfCtl.Inst.GetExtension("PXC");

	// set license key
	pdfCtl.SetLicKey(@"<ourvalidlicensecode>");

	startUpPath = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);

	// also see embedded folders 'Plugins.x*', 'PluginData'
	_pxvInst.StartLoadingPlugins();
	_pxvInst.AddPluginFromFile(
		Path.Combine(startUpPath, "Plugins.x86", "OCRPlugin.pvp"));
	_pxvInst.FinishLoadingPlugins();
}
Alex wrote: "Have you tried doing pOp.Do()?"
If this is the right place for this line of code:

Code:

var password = ""; // not needed in this sample project

var authCallback = new PdfEditorAuthCallback();

var doc = _pxcInst.OpenDocumentFromFile(filePathSource, authCallback);

if (authCallback.IsPasswodProtected)
{
    try
    {
        doc.AuthorizeWithPassword(password);
    }
    catch { }
}

var nId = pdfCtl.Inst.Str2ID("op.document.OCRPages", false);
var pOp = pdfCtl.Inst.CreateOp(nId);
var input = pOp.Params.Root["Input"];
input.v = doc;
ICabNode options = pOp.Params.Root["Options"];
options["PagesRange.Type"].v = "All";
options["OutputType"].v = 0;
options["OutputDPI"].v = 300;
options["ExtParams.Language"].v = "deu+eng";
options["ExtParams.Accuracy"].v = 300;
options["ExtParams.AutoDeskew"].v = false;

pOp.Do();

pdfCtl.Inst.AsyncDoAndWaitForFinish(pOp);
// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Feb 28, 2018 3:44 pm
by Sasha - Tracker Dev Team
Hello Martin,

If you call pOp.Do(), the operation is executed in the main thread. If you call AsyncDoAndWaitForFinish, the operation is executed in a different thread and the progress is displayed. I asked you to try running it in the main thread, so there is no need to call AsyncDoAndWaitForFinish afterwards.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Mar 01, 2018 5:24 am
by MartinCS
Alex,

I did try

Code:

pOp.Do();
But I don't get the OCR layer with the text recognition results in the PDF document. If I step over the above line, the operation takes just 1 second, whereas the OCR function in the PDF Editor takes at least 15 seconds.
I don't get any exceptions when I run my code, and I get the same behaviour with different test files. In one of my previous posts I asked whether my project (with all the necessary plugin data files and DLLs) is set up correctly. Did I configure everything the right way?

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Mar 01, 2018 7:26 am
by Sasha - Tracker Dev Team
Hello Martin,

The thing is that your project did work correctly for me - that's the strange thing.
Also, please add this line of code and try again:

Code:

options["OCRNoTextPagesOnly"].v = false;
Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Mar 01, 2018 9:58 am
by MartinCS
Alex,

I'm as puzzled as you are. I don't understand why my project runs on your computer but not on ours. Meanwhile I have tested it on different machines, all of which had the PDF Editor SDK installed.

I added the new line of code, but it didn't make a difference. Could there somehow be a problem with our license key that prevents using the OCR function via the SDK?

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Mar 01, 2018 10:02 am
by Sasha - Tracker Dev Team
Hello Martin,

There shouldn't be a problem with the license key at all. The only thing it does is make the result much more precise; even without it everything should work fine.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Thu Mar 01, 2018 10:08 am
by MartinCS
Alex,

since you confirm that I'm not doing anything wrong and that it should work as it does on your computer, would it be possible to do a short remote session via TeamViewer to investigate the issue together?

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Fri Mar 02, 2018 4:01 pm
by Sasha - Tracker Dev Team
Hello Martin,

I will make you a project using a manifest and we will see whether it will work for you.

Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Fri Mar 02, 2018 7:26 pm
by MartinCS
Hi Alex,

that's very kind of you! Thank you!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)  SOLVED

Posted: Sat Mar 03, 2018 7:59 am
by Sasha - Tracker Dev Team
Hello Martin,

Here's a sample with an extended German dictionary that uses a manifest:
OCRSample.zip
(42.41 MiB)
Cheers,
Alex

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Mar 07, 2018 5:54 am
by MartinCS
Hi Alex,

the sample worked right away, and I was able to set it up in my production implementation of the PDF Editor.

A big thank you for your help on this topic!!!

// Martin

Re: Retrieve text of 'OCRPages' function and correct use of languages (folder)

Posted: Wed Mar 07, 2018 7:46 am
by Sasha - Tracker Dev Team
:)