Size of the OCR Files (SDK) 10 x larger as with the Editor

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

michipapa
User
Posts: 41
Joined: Tue Dec 08, 2009 10:44 pm

Post by michipapa »

Hi Support,

I have run some tests with my brand-new OCR SDK.
Unfortunately, it's not usable for me:
The OCR'd files with the new text layer are 10 times larger than both the original and the OCR'd file from the Editor.
What's wrong?

1. Original: 4.1 MB
2. OCR with Editor: 4.2 MB (that's what I expect)
3. OCR with SDK: 58.4 MB!!!
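
As a quick check of the figures above, a short Python sketch (the sizes are the ones reported in this post; the variable names are only illustrative) shows that the SDK output is in fact closer to 14 times the size of the original:

```python
# File sizes in MB as reported in the post above.
sizes_mb = {"original": 4.1, "editor_ocr": 4.2, "sdk_ocr": 58.4}

# Growth factor of each file relative to the original scan.
growth = {name: size / sizes_mb["original"] for name, size in sizes_mb.items()}

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x the original size")
```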

I use the high-level function with 300 dpi.

I saw a thread from 2016 describing the same behavior. Did I download the wrong build? I took it from your download site.

regards Michael
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Post by Sasha - Tracker Dev Team »

Hello Michael,

If you can, please provide all three files (original, Editor OCR, SDK OCR) along with the settings that you used in both cases.
If they are not meant for the public, you can mail them directly to me at polaringu@tracker-software.com
Also, please tell us which version of the SDK you are using.

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ

Post by michipapa »

Hi Alex,

Thanks for the quick response. I have sent you two e-mails.

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

I've received those, forwarded them to the appropriate developer, and removed the files from the server.

Cheers,
Alex

Post by Sasha - Tracker Dev Team »

Hello Michael,

We've investigated the problem and have a solution for you.
To start with, here are the image parameters from the first page.
Editor:
[Attachment: Capture_Editor.PNG]
OCR SDK:
[Attachment: Capture_OCRSDK.PNG]
As you can see, the image was reformatted and replaced.
You can suppress this with a flag. You are using the PXO_Options structure to set the input options. Its ImageFlags field takes the OCR_ImageProcessingFlags values. You will have to set the OCR_Image_SuppressOutput flag, which suppresses image output in the searchable PDF created with OCR_MakeSearchable(). It can be used to create an image-free page of invisible text that can be merged with the original content page.
That way, the original image won't be modified and the file size won't grow the way you are seeing.

Cheers,
Alex

Post by michipapa »

Hi Alex,

Thanks for your answer.

Your workaround means that I need additional code to put the two PDFs together.
Could you please give me a hint as to which function of your API does that? Some of the links in your included help PDFs lead nowhere.

By the way, am I really the first user of your OCR SDK? If not, every user should have the same problem as I do!
Why don't you add an image flag or something similar to PXO_Options that adds only the text layer to the original file?

I expected a simple OCR solution that gives me the same result as the Editor GUI...
Your included sample code snippets suggest that there is a simple way to do that...

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

I just spoke with that developer again. He said that the SDK was completely rewritten and is now based on the Editor's OCR engine. As far as I know, there are many users of the SDK, but you are probably the first one to ask this question about the new SDK version. Please try passing that flag to the method that you are using to OCR the document; it should just place the OCRed text on top of the existing image, with no additional code needed.

Cheers,
Alex

Post by michipapa »

Hi Alex,
>Please try passing that flag to the method that you are using to OCR the document and it should just place the OCRed text on top of the existing image - no more code is needed.
No, that doesn't work.

I set

Options.ImageFlags=BinaryOR(0x0001,0x0008) // Autorotate and Suppress Output

and got a PDF file that contains only the text layer. That's not what I want.
I have to merge the input file and the text-layer file manually (with the Editor) to get a searchable file.

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

OK, you are right. Logically, the OCR should work the way you describe (base image plus overlaid text). We'll review that logic at our next meeting.
Can you please tell us the name of the product that you've bought or are planning to buy? Based on that, I can hopefully suggest a solution that works in this version.

Cheers,
Alex

Post by michipapa »

Hi,

I bought the PDF-XChange PRO SDK. It includes the OCR SDK, which is what causes the problem we are talking about.

regards

Post by Sasha - Tracker Dev Team »

Hello Michael,

In that case, there is a solution you can use right now. The PRO SDK bundle includes the Core API. Take the source PDF with the images and the resulting PDF with the text, and overlay the matching pages (basically, copy the text content and place it after the image). We have a set of samples for the Core API that is updated frequently; you can download it from here:
https://github.com/tracker-software/PDFCoreSDKExamples
Basically, you will have to open two IPXC_Documents and then, for each page of the resulting document (the one with the text), call
https://sdkhelp.pdf-xchange.com/vi ... GetContent
and then
https://sdkhelp.pdf-xchange.com/vi ... aceContent
on the corresponding page of the image-only document.

Also, with the Core API you can tell which pages contain only image content items and which contain text content items (from what I remember, you needed this as well).

Cheers,
Alex

Post by michipapa »

Hi Sasha,

OK, I will try that.

Is there any chance that you could add another option to your OCR SDK that does this automatically?

Believe me, with this problem in your software you will sell far fewer copies of the OCR SDK, because it is useless out of the box.

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

We'll consider adding this option, as it is clearly needed. The OCR SDK behaved the same way in older versions (the size issue), though you could always lower the output DPI so that the size increase is not as large; the image is still converted in that case. You can do the same now for the same effect.
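
To illustrate why lowering the output DPI helps: the pixel count of a rasterized page grows with the square of the DPI, so halving the DPI leaves roughly a quarter of the pixels before compression. A minimal Python sketch, assuming an A4 page (8.27 x 11.69 inches); the function name is illustrative and not part of the OCR SDK:

```python
def raster_megapixels(width_in: float, height_in: float, dpi: int) -> float:
    """Pixel count, in megapixels, of a page rasterized at the given DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


# Assumed page size: A4 in inches.
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

reference = raster_megapixels(A4_WIDTH_IN, A4_HEIGHT_IN, 300)
for dpi in (300, 200, 150):
    megapixels = raster_megapixels(A4_WIDTH_IN, A4_HEIGHT_IN, dpi)
    share = megapixels / reference
    print(f"{dpi} dpi: {megapixels:.2f} MP ({share:.0%} of the 300 dpi pixel count)")
```

The actual file size also depends on the image compression used, so this only shows the trend, not the resulting number of megabytes.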

Cheers,
Alex

Post by michipapa »

Hi,

"Think about" means not in the next 3-4 weeks, right?

OK, I will try your workaround, but if it does not work with my language (WinDev), I'm afraid I will have to look for another solution and return the licence to you.

regards Michael

Post by michipapa »

Hi Support,

I just saw a thread in this forum from 2014 / 2016 with the topic
"Size of OCRed Files" viewtopic.php?f=42&t=21670

It looks like the same problem as mine.
In 2016 you promised a solution... but it seems that nothing has been fixed in two years.
Could you please put the necessary code into the DLL to join the original PDF with the text layer, as you described?

Then we would all be satisfied. And I would be happy to be your beta tester for that.

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

I just discussed this with our lead developer: we'll add a flag that keeps the original image and places the text on top of it. Hopefully this won't be too complex to implement, so that we can give you a test DLL before the release.

Cheers,
Alex

Post by michipapa »

Hi Alex,

I'm glad to read this.

I need it within the next three weeks. I'm sure it's not that complex for you. You gave me the hint on how to do this, and it's probably much easier for you than for me.

regards Michael
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan »

Hello Michael,

I've written to both Sasha and the lead dev and requested that we get a solution to you as quickly as possible! As soon as I have any further news - I will let you know!

Kind Regards,
Stefan

Post by Sasha - Tracker Dev Team »

Hello Michael,

We are working on it right now. It requires rewriting some internal logic, which is what we are currently discussing and implementing.

Cheers,
Alex

Post by Sasha - Tracker Dev Team »

Hello Michael,

Please try this dll - we've added a custom flag OCR_Content_Original that should overlay the recognized text over the original content.
[Attachment: OcrTools.x86.zip (8.82 MiB)]
Cheers,
Alex

Post by michipapa »

Hi Alex,

What is the value of OCR_Content_Original?

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

Oops, my bad - forgot that this is a numeric value:

OCR_Content_Original = 0x0040 // output original content instead of image
Cheers,
Alex

Post by michipapa »

Hi,
No, it doesn't work.

I used the example that I sent you on Saturday. The original file is 4.1 MB, the output file is now 62.6 MB, and the orientation is broken on all pages.
You can try it with my example.

I tried it with Options.ImageFlags=Binaryor(0x0001,0x0040)


regards Michael
Serg - Tracker Dev
User
Posts: 14
Joined: Wed Sep 17, 2014 7:40 am
Location: Ukraine

Post by Serg - Tracker Dev »

Hi Michael,

Try using all the flags together:
Options.ImageFlags=Binaryor(0x0001,0x0048)
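
For reference, a small Python sketch of how the values in this thread combine (Python is used here only for illustration; the thread's own code is WinDev). The value 0x0040 for OCR_Content_Original is given above; 0x0008 for OCR_Image_SuppressOutput and 0x0001 for auto-rotation are inferred from the comment in the earlier post, and the Python names are made up for this example:

```python
from enum import IntFlag


class ImageFlags(IntFlag):
    # Values as quoted in this thread; the member names are illustrative only.
    AUTOROTATE = 0x0001        # assumed meaning, from the "Autorotate" comment
    SUPPRESS_OUTPUT = 0x0008   # OCR_Image_SuppressOutput (value inferred)
    CONTENT_ORIGINAL = 0x0040  # OCR_Content_Original (value given in the thread)


# 0x0048 is SUPPRESS_OUTPUT and CONTENT_ORIGINAL set together.
both = ImageFlags.SUPPRESS_OUTPUT | ImageFlags.CONTENT_ORIGINAL

# Adding auto-rotation gives the value the BinaryOR call above produces.
image_flags = ImageFlags.AUTOROTATE | both

print(hex(both))
print(hex(image_flags))
print(ImageFlags.CONTENT_ORIGINAL in image_flags)
```

In other words, the earlier attempt with 0x0040 alone was missing the suppress-output bit, which 0x0048 includes.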

Post by michipapa »

Hi Serg,

OK, thanks. Now it's working.

I'll try this tomorrow with a couple of files and give you feedback.

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

Do try that and tell us if there are any problems.

Cheers,
Alex

Post by michipapa »

Hi Tracker,

The OCR API works so far.

But some files (which work with the OCR function in the Editor) cause a crash in the OCR API. I sent you the files a few days ago.
Did you investigate them?

regards Michael

Post by Sasha - Tracker Dev Team »

Hello Michael,

We did experience some problems with one of your files, though I'm afraid that the problem lies deep inside the Tesseract engine. We'll try to update it and see whether the problem recurs.

Cheers,
Alex

Post by Sasha - Tracker Dev Team »

Hello Michael,

We've fixed this problem, here's a dll for you to try:
[Attachment: OcrTools.x86.zip (8.82 MiB)]
Cheers,
Alex

Post by michipapa »

Hi Tracker,

I tested it, and it now works for the rest of my files as well.
Thank you again for the excellent support.

Regards, Michael

PS:

If you have some free resources in the future, it would be nice if you could implement another function like this:

For n = 1 to pdf..NumberOfPages
    If ContainsImages(pdf..page[n]) = true
        Ocr(pdf..page[n])
    Else
        DoNothing()
    End
End
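
The requested logic could look roughly like the following Python sketch. `contains_images` and `ocr_page` are hypothetical callables standing in for whatever the SDK would provide; they are not existing OCR SDK functions.

```python
from typing import Callable, Iterable, List, TypeVar

Page = TypeVar("Page")


def ocr_image_pages(
    pages: Iterable[Page],
    contains_images: Callable[[Page], bool],
    ocr_page: Callable[[Page], None],
) -> List[int]:
    """Run OCR only on pages that contain images.

    Returns the 1-based numbers of the pages that were processed.
    """
    processed = []
    for number, page in enumerate(pages, start=1):
        if contains_images(page):
            ocr_page(page)
            processed.append(number)
        # Pages without images are left untouched.
    return processed


# Usage with stand-in page objects (plain dicts, purely for illustration).
sample_pages = [{"images": 1}, {"images": 0}, {"images": 2}]
recognized = []
done = ocr_image_pages(
    sample_pages,
    contains_images=lambda page: page["images"] > 0,
    ocr_page=recognized.append,
)
print(done)
```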

Post by Sasha - Tracker Dev Team »

Hello Michael,

Glad that works for you. As for your request - we will take that into consideration while working on the OCR improvement.

Cheers,
Alex