Problem with - character

Posted: Fri Feb 24, 2012 6:54 pm
by Arnold
In a few documents the OCR feature seems to convert the "-" character to "?".

For example, RFP-S-0129-0-2011/EM - Disaster Debris becomes RFP?S?0129?0?2011/EM ? Disaster Debris.

Has anyone else noticed this?

Re: Problem with - character

Posted: Fri Feb 24, 2012 7:00 pm
by Walter-Tracker Supp
Could you send us a sample file (input)? OCR is always going to make mistakes from time to time, but diagnosing why (and whether or not we can come up with a solution) will require some insight into the file causing it.

If you don't want to attach it to this post, you can email it to support@pdf-xchange.com.

-Walter

Re: Problem with - character

Posted: Tue Feb 28, 2012 7:01 pm
by Walter-Tracker Supp
Hi Arnold, I already followed up with you by email, but I am posting this to the forum for completeness.

The documents you originally sent already contained a bad searchable text layer. I’m not sure whether it came from an older version of our viewer or from something else (e.g. a scanner’s OCR software), but if I remove this bad text, OCR works fine with our current viewer build (version 201). OCR also works fine with the fresh scan you just sent. In both cases I get the correct text, with hyphens as expected.

The settings I used were:
- English
- Accuracy: Medium
- Mode: Preserve original content and add text as layer

Therefore I would recommend upgrading to build 201 and trying again from a clean document. If you use “Preserve original content and add text as layer” with the documents you sent originally (that already have OCR text in them), that bad OCR text layer is retained, giving the impression that OCR failed. If you use “Convert page content to image only – add text as layer” then the original bad text is removed and you can see that OCR will work as expected.

But please do make sure you are using build 201 of the viewer to rule out bugs in previous versions that have already been fixed.
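
If you want to confirm whether a document's existing text layer shows this symptom before re-running OCR, a rough check is to copy a line of text out of the PDF and compare it with what you see on the page. The Python sketch below is only an illustration and is not part of the viewer: the function name hyphens_replaced is made up for this post, and it assumes the two strings have the same length so they can be compared character by character.

```python
def hyphens_replaced(expected: str, extracted: str) -> list[int]:
    """Return the positions where `expected` has '-' but `extracted` has '?'."""
    # Assumption: both strings are the same length, as in the example from
    # this thread. Text with missing or extra characters would need to be
    # aligned first, which this sketch does not attempt.
    if len(expected) != len(extracted):
        raise ValueError("strings differ in length; cannot compare by position")
    return [
        i
        for i, (want, got) in enumerate(zip(expected, extracted))
        if want == "-" and got == "?"
    ]


# The example reported at the start of this thread.
expected = "RFP-S-0129-0-2011/EM - Disaster Debris"
extracted = "RFP?S?0129?0?2011/EM ? Disaster Debris"

bad = hyphens_replaced(expected, extracted)
print(len(bad), "hyphen(s) replaced at positions", bad)
```

If the list is non-empty for text copied from the document before you run OCR, the bad text layer was already there, and the "Convert page content to image only" mode is the one to try.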

-Walter