Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Fri Aug 31, 2018 8:37 pm
by michipapa
Hi Support,

I have run some tests with my brand-new OCR SDK.
Unfortunately, it's not usable for me:
The OCR'd files with the new text layer are 10x larger than both the original and the OCR'd file from the Editor.
What's wrong?

1. Original: 4.1 MB
2. OCR with Editor: 4.2 MB (that's what I expect)
3. OCR with SDK: 58.4 MB !!!

I use the high level function with 300 dpi.

I saw a thread from 2016 describing the same behavior. Did I download the wrong build? I took it from your download site.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 5:51 am
by Sasha - Tracker Dev Team
Hello Michael,

If you can, please provide 3 files (original, Editor OCR, SDK OCR) along with the settings that you were using in both cases.
If they are not meant to be public, you can mail them directly to me at polaringu@tracker-software.com
Also, please tell us what version of the SDK you are using.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 7:15 am
by michipapa
Hi Alex,

thx for the quick response. You have 2 emails.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 7:17 am
by Sasha - Tracker Dev Team
Hello Michael,

I've received those and forwarded them to the appropriate developer, and removed the file from the server.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 7:30 am
by Sasha - Tracker Dev Team
Hello Michael,

We've investigated the problem and have a solution for you.
For starters - here are the image parameters from the first page.
Editor: [image: Capture_Editor.PNG]
OCR SDK: [image: Capture_OCRSDK.PNG]
As you can see, the image was reformatted and replaced.
There is a way to suppress this with a flag. You are using PXO_Options to set the input structure. It has an ImageFlags field that takes the OCR_ImageProcessingFlags, and you will have to set the OCR_Image_SuppressOutput flag. This flag suppresses image output in the searchable PDF created with OCR_MakeSearchable(). It can be used to create an image-free page of invisible text that can be merged with the original content page.
That way, the original image won't be modified and the file size won't increase the way you are experiencing.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 12:34 pm
by michipapa
Hi Alex,

thx for your answer.

Your workaround means that I need additional code to put both PDFs together.
Could you please give me a hint as to which function of your API does that? Some links in your included help PDFs lead nowhere.

BTW, am I really the first user of your OCR SDK? If not, every user must have the same problem as I do!
Why don't you put an image flag or something similar in PXO_Options that adds only the text layer to the original file?

I expected a simple OCR solution that gives me the same result as the Editor GUI ...
Your included sample code snippets suggest that there is a simple way to do that ...

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 12:43 pm
by Sasha - Tracker Dev Team
Hello Michael,

I just spoke with that developer again. He said that the SDK was entirely renewed and is now based on the Editor OCR engine. From what I know, there are many users of the SDK, but you are probably the first one to ask this question about the new SDK version. Please try passing that flag to the method that you are using to OCR the document and it should just place the OCRed text on top of the existing image - no more code is needed.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 12:54 pm
by michipapa
Hi Alex,
>Please try passing that flag to the method that you are using to OCR the document and it should just place the OCRed text on top of the existing image - no more code is needed.
No, that doesn't work.

I set

Options.ImageFlags=BinaryOR(0x0001,0x0008) // Autorotate and Suppress Output

and got a PDF file that contains only the text layer. That's not what I want.
I would have to merge the input file and the text-layer file together manually (with the Editor) to get a searchable file.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 1:28 pm
by Sasha - Tracker Dev Team
Hello Michael,

OK, you are right here. Logically, the OCR should have worked the way you describe (base image + overlaid text). We'll review that logic at our next conference.
Can you please tell me the name of the product that you've bought or are planning to buy? Based on that, I can hopefully suggest a solution that works in this version.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 1:35 pm
by michipapa
Hi,

I bought the PDF-XChange PRO SDK, which includes the OCR SDK. And that is what causes the problem that we are talking about.

regards

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 1:43 pm
by Sasha - Tracker Dev Team
Hello Michael,

If so, then there is a solution for you right now. The PRO SDK bundle includes the Core API. What you will have to do is take the source PDF with the images and the resulting PDF with the text, and overlay the two page by page (basically copying the text content and placing it after the image). We have a sample repository for the Core API (it is updated frequently) that you can download from here:
https://github.com/tracker-software/PDFCoreSDKExamples
Basically you will have to open two IPXC_Documents and then for each page of the resulting one (with text) you will have to
https://sdkhelp.pdf-xchange.com/vi ... GetContent
and then
https://sdkhelp.pdf-xchange.com/vi ... aceContent
on the corresponding page of the image-only document.
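
The page-by-page pairing described above can be modelled without the SDK. The following Python sketch uses hypothetical stand-in names (Page, overlay_text_layer) and does not call the Core API at all; it only illustrates that each text-only page is matched to the image page with the same index and that its content is placed after the image content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for a PDF page: an ordered list of content items."""
    items: List[str] = field(default_factory=list)


def overlay_text_layer(image_pages: List[Page], text_pages: List[Page]) -> None:
    """Append each text-only page's items to the image page with the same index."""
    if len(image_pages) != len(text_pages):
        raise ValueError("both documents must have the same number of pages")
    for image_page, text_page in zip(image_pages, text_pages):
        # Text goes after the image so that the image content stays untouched.
        image_page.items.extend(text_page.items)


# Stand-in data: two scanned pages and the matching OCR text-only pages.
original = [Page(["scan-1"]), Page(["scan-2"])]
ocr_only = [Page(["text-1"]), Page(["text-2"])]
overlay_text_layer(original, ocr_only)
print([page.items for page in original])
```

With the real Core API, the two lists would correspond to the pages of the two opened documents, and the append step to the get-content and place-content calls linked above.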

Also, with the Core API, you can find out which pages contain only image content items and which contain text content items (from what I remember, you also needed this).

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 1:53 pm
by michipapa
Hi Sasha,

OK, I'll try to do that.

Is there any chance that you will add another option to your OCR SDK to do that automatically?

Believe me, with this problem in your software you will sell far fewer copies of the OCR SDK, because out of the box it is useless.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 2:22 pm
by Sasha - Tracker Dev Team
Hello Michael,

We'll think about this option, as it is a needed one. Also, the OCR SDK did behave like that in older versions (the size issue), though you could always lower the output DPI so that the size would not be as large. The image conversion still takes place (you can do this now for the same effect).

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sat Sep 01, 2018 2:32 pm
by michipapa
Hi,

"think about" means not in the next 3-4 weeks, right ?

OK, I'll try your workaround, but if it does not work with my language (WinDev), I'm afraid that I will have to look for another solution and return the licence to you.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sun Sep 02, 2018 8:21 am
by michipapa
Hi Support,

I just saw a thread in this forum from 2014 / 2016 with the topic
"Size of OCRed Files" viewtopic.php?f=42&t=21670

It looks like the same problem as mine.
In 2016 you promised a solution ... but it seems that nothing has been fixed in 2 years.
Could you please put the necessary code into the DLL to join the original PDF with the text layer, as you described?

Then we would all be satisfied. And I would be happy to be your beta tester for that.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 6:12 am
by Sasha - Tracker Dev Team
Hello Michael,

Just discussed this with our lead dev. - we'll add a flag that allows taking the old image and placing text on top of it. Hopefully this won't be too complex to implement so that we can give you a test dll before the release.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 7:26 am
by michipapa
Hi Alex,

I'm glad to read this.

I need it within the next 3 weeks. I'm sure that it's not so complex for you. You gave me the hint on how to do this, and for you it's probably much easier than for me.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 9:57 am
by Tracker Supp-Stefan
Hello Michael,

I've written to both Sasha and the lead dev and requested that we get a solution to you as quickly as possible! As soon as I have any further news - I will let you know!

Kind Regards,
Stefan

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 11:03 am
by Sasha - Tracker Dev Team
Hello Michael,

We are working on it right now, though it will require rewriting some internal logic. This is what we are discussing and implementing.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 2:15 pm
by Sasha - Tracker Dev Team
Hello Michael,

Please try this dll - we've added a custom flag OCR_Content_Original that should overlay the recognized text over the original content.
[attachment: OcrTools.x86.zip (8.82 MiB)]
Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 2:30 pm
by michipapa
Hi Alex,

What value does OCR_Content_Original have?

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 2:33 pm
by Sasha - Tracker Dev Team
Hello Michael,

Oops, my bad - forgot that this is a numeric value:

OCR_Content_Original = 0x0040 // output original content instead of image
Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 3:08 pm
by michipapa
Hi,
No, it doesn't work.

I used the example that I sent you on Saturday. The original file is 4.1 MB, the output file is now 62.6 MB, and the orientation is broken on all pages.
You can try it with my example.

I tried it with Options.ImageFlags=Binaryor(0x0001,0x0040)


regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 5:55 pm
by Serg - Tracker Dev
Hi, Michael

Try using all of the flags:
Options.ImageFlags=Binaryor(0x0001,0x0048)
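
For reference, the value 0x0048 in the line above is the bitwise OR of two flag values quoted earlier in this thread: 0x0008 (Suppress Output) and 0x0040 (OCR_Content_Original). A minimal Python sketch of the arithmetic follows; the Python constant names are illustrative only, the name used for 0x0001 is an assumption based on the earlier "Autorotate" comment, and nothing here calls the SDK.

```python
# Flag values as quoted in this thread; the Python names are illustrative only.
OCR_IMAGE_AUTOROTATE = 0x0001       # assumed name, from the "Autorotate" comment
OCR_IMAGE_SUPPRESS_OUTPUT = 0x0008  # "Suppress Output" in the same comment
OCR_CONTENT_ORIGINAL = 0x0040       # value given for OCR_Content_Original


def combine_flags(*flags: int) -> int:
    """Bitwise-OR any number of flag values into a single mask."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


image_flags = combine_flags(
    OCR_IMAGE_AUTOROTATE, OCR_IMAGE_SUPPRESS_OUTPUT, OCR_CONTENT_ORIGINAL
)
print(hex(image_flags))  # 0x49
```

So Binaryor(0x0001,0x0048) sets all three bits at once, whereas the earlier attempt with Binaryor(0x0001,0x0040) left the Suppress Output bit unset.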

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 03, 2018 7:28 pm
by michipapa
Hi Serg,

OK, thx. Now it's working.

I'll try this tomorrow with a couple of files and give you feedback.

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Tue Sep 04, 2018 5:22 am
by Sasha - Tracker Dev Team
Hello Michael,

Do try that and tell us if there are any problems.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Sun Sep 09, 2018 11:11 am
by michipapa
Hi Tracker,

The OCR API works so far.

But some files (which work with the OCR function in the Editor) cause a crash in the OCR API. I sent you the files a few days ago.
Have you investigated them?

regards Michael

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 10, 2018 10:24 am
by Sasha - Tracker Dev Team
Hello Michael,

We did experience some problems with one of your files, though I'm afraid that the problem is in deep parts of the Tesseract engine. We'll try to update it and see whether the problem reoccurs.

Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Mon Sep 10, 2018 1:56 pm
by Sasha - Tracker Dev Team
Hello Michael,

We've fixed this problem, here's a dll for you to try:
[attachment: OcrTools.x86.zip (8.82 MiB)]
Cheers,
Alex

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Thu Sep 13, 2018 8:18 pm
by michipapa
Hi Tracker,

I tested it, and it now also works for the rest of my files.
Thank you again for the excellent support.

Regards Michael

PS:

If you have some free resources in the future, it would be nice if you could implement another function like this:

For n = 1 to pdf..NumberOfPages
    If ContainsImages(pdf..Page[n]) = True
        Ocr(pdf..Page[n])
    Else
        DoNothing()
    End
End
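
The requested behaviour can also be sketched in Python. This is only an illustration of the loop above, not an SDK function: ocr_image_pages, contains_images and ocr are hypothetical names, and the page test and the OCR step are passed in as callables because the thread does not say how either would be implemented.

```python
from typing import Callable, Iterable, List, TypeVar

PageT = TypeVar("PageT")


def ocr_image_pages(
    pages: Iterable[PageT],
    contains_images: Callable[[PageT], bool],
    ocr: Callable[[PageT], None],
) -> List[int]:
    """Run `ocr` only on pages that contain images.

    Returns the 1-based numbers of the pages that were processed.
    """
    processed = []
    for number, page in enumerate(pages, start=1):
        if contains_images(page):
            ocr(page)
            processed.append(number)
        # Pages without images are skipped (the DoNothing() branch).
    return processed


# Stand-in data: each "page" only records how many images it has.
pages = [{"images": 1}, {"images": 0}, {"images": 2}]
ocred = []
result = ocr_image_pages(pages, lambda p: p["images"] > 0, ocred.append)
print(result)  # [1, 3]
```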

Re: Size of the OCR Files (SDK) 10 x larger as with the Editor

Posted: Fri Sep 14, 2018 6:51 am
by Sasha - Tracker Dev Team
Hello Michael,

Glad that works for you. As for your request - we will take that into consideration while working on the OCR improvement.

Cheers,
Alex