PDFXChange PRO OCR SDK issue: OCR text output is junk characters

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for optical character recognition tasks.

admin-emmeluth
User
Posts: 3
Joined: Tue Mar 17, 2015 6:16 am

Post by admin-emmeluth » Fri Oct 19, 2018 3:37 am

Hi,

We tried the latest PDFXChange PRO OCR SDK (main DLL: OcrTools.x64.dll) using the C# example demo provided with the SDK, but the text output in the PDF document after OCR is just junk characters. Attached are the C# demo code we used and the document that needs to be OCRed. Please see the attached image file, which contains a sample of the junk text copied from the PDF document after OCR. Please advise what needs to be changed in the code to fix this issue.
textAfterOCR.PNG
Thanks,
CL team.
Attachments
WO2014044840A1.pdf
(2.74 MiB)

Tracker Supp-Stefan
Site Admin
Posts: 13428
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan » Thu Nov 01, 2018 11:26 am

Hello admin-emmeluth,

Apologies for the delay in following up on this one!
I've passed it along to a colleague in the dev team who works with our OCR SDK, and as soon as we have any further advice we will post here!

Regards,
Stefan

Post by Tracker Supp-Stefan » Thu Nov 01, 2018 1:11 pm

Hello admin-emmeluth,

My colleague who reviewed your code says that you've specified 100 DPI for rasterization. Please try 200 or 300 DPI instead, and you will get much better results!

Regards,
Stefan

Post by admin-emmeluth » Fri Nov 02, 2018 2:58 am

Hi,

Thanks for your timely help.

As suggested, we tried 300 DPI, but the output file comes out at 13 MB for a 3 MB input file.
When we tried 200 DPI, the output was not correct: it still contains junk characters, and the file size is 8 MB.
Please help with this.
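
A note on the size increase reported above: a rasterized page's pixel count grows with the square of the DPI, so if the OCR output embeds the rasterized page image (an assumption here; the thread does not confirm how the SDK stores it), a larger file at a higher DPI is to be expected. The sketch below is plain arithmetic with no SDK calls; `RasterMath` and `PixelRatio` are hypothetical names used only for illustration.

```csharp
using System;

public static class RasterMath
{
    // Ratio of pixel counts for the same page rasterized at two resolutions.
    // Width and height both scale linearly with DPI, so the pixel count
    // scales with the square of the DPI ratio.
    public static double PixelRatio(int fromDpi, int toDpi)
    {
        if (fromDpi <= 0 || toDpi <= 0)
            throw new ArgumentException("DPI values must be positive.");
        double scale = (double)toDpi / fromDpi;
        return scale * scale;
    }

    public static void Main()
    {
        Console.WriteLine(PixelRatio(100, 200)); // prints 4
        Console.WriteLine(PixelRatio(100, 300)); // prints 9
    }
}
```

Compressed file size will not follow this ratio exactly (the sizes reported here are 8 MB at 200 DPI and 13 MB at 300 DPI), so treat it as a rough guide only.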

Post by Tracker Supp-Stefan » Fri Nov 02, 2018 1:12 pm

Hello admin-emmeluth,

Currently you have this flag:
Options.ImageFlags = (uint)PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_Autorotate;

Please try to also add this one:
OCR_Content_Original

And let us know the result!

Regards,
Stefan

Post by admin-emmeluth » Mon Nov 05, 2018 4:04 am

Hi,

Thanks for the prompt reply.

I could not find an OCR_Content_Original option in the sample code provided.
Browsing your forum, I saw that there is an image flag for it, so I added it as the last member of the OCR_ImageProcessingFlags enum in the PDFXOCR_Funcs class (second block below) and then applied it in the OCR options as follows:

Code: Select all

Options.ImageFlags = (uint)PDFXOCR.PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Content_Original;

Code: Select all

        public enum OCR_ImageProcessingFlags
        {
            OCR_Image_NoRotate = 0x0000,
            OCR_Image_Autorotate = 0x0001,
            OCR_Image_EdgeRefine = 0x0002,
            OCR_Image_GaussianBlur = 0x0004,
            OCR_Image_SuppressOutput = 0x0008, // only place text layer
            OCR_Image_FastAutorotate = 0x0011, // OCR_Image_Autorotate bit included
            OCR_Text_PlaceByLines = 0x0020, // smaller but less accurate output.
            OCR_Content_Original = 0x0040 // added by us, following the forum
        }

Is this the correct way to do it? I am getting the same results as described in my previous post: a larger file size, and junk characters when I reduce the DPI. Please help; we don't have many options left.
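
Regarding the question above: the earlier reply asked to add OCR_Content_Original alongside OCR_Image_Autorotate, whereas the assignment shown sets only the new flag. A minimal sketch of combining the two with bitwise OR follows. It assumes the values are independent bit flags, as the hex values in the quoted enum suggest. The enum here is a local copy of two of those values, `FlagCombineDemo` and `Combine` are hypothetical helpers, and whether combining the flags changes the OCR result is not confirmed in this thread.

```csharp
using System;

// Local copy of values from the enum quoted in this thread, for illustration
// only; a real project would use the enum in the SDK's PDFXOCR_Funcs class.
[Flags]
public enum OCR_ImageProcessingFlags : uint
{
    OCR_Image_NoRotate   = 0x0000,
    OCR_Image_Autorotate = 0x0001,
    OCR_Content_Original = 0x0040
}

public static class FlagCombineDemo
{
    // OR the flags together so that every requested bit stays set.
    public static uint Combine(params OCR_ImageProcessingFlags[] flags)
    {
        uint result = 0;
        foreach (OCR_ImageProcessingFlags flag in flags)
            result |= (uint)flag;
        return result;
    }

    public static void Main()
    {
        uint imageFlags = Combine(
            OCR_ImageProcessingFlags.OCR_Image_Autorotate,
            OCR_ImageProcessingFlags.OCR_Content_Original);
        Console.WriteLine("0x{0:X4}", imageFlags); // prints 0x0041

        // Assigning a single flag on its own leaves the autorotate bit unset.
        uint onlyOriginal = (uint)OCR_ImageProcessingFlags.OCR_Content_Original;
        Console.WriteLine((onlyOriginal & 0x0001) != 0); // prints False
    }
}
```

In the sample project this would correspond to assigning the OR of the two enum members, cast to uint, to Options.ImageFlags.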

Post by Tracker Supp-Stefan » Mon Nov 05, 2018 12:00 pm

Hello admin-emmeluth,

Please take a look at this topic from September:
viewtopic.php?f=42&t=31422

The discussion there is what led our developers to add the OCR_Content_Original parameter.

The fixes provided in the custom DLLs included there should already be included in the latest live SDK builds on our website, so please make sure to update to build 327.1 if you have not already!

Regards,
Stefan

P.S. I have also blocked your serial numbers, which you forgot to remove from your sample project, and I removed the project from your original post.
Please contact us at sales@tracker-software.com so that we can issue you replacement ones.
