Problem with - character

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Arnold
User
Posts: 869
Joined: Tue Jun 09, 2009 3:53 am
Location: Florida

Problem with - character

Post by Arnold »

In a few documents, the OCR feature seems to convert the "-" character to "?".

For example, RFP-S-0129-0-2011/EM - Disaster Debris becomes RFP?S?0129?0?2011/EM ? Disaster Debris.

Has anyone else noticed this?
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Problem with - character

Post by Walter-Tracker Supp »

Could you send us a sample input file? OCR will always make mistakes from time to time, but diagnosing why (and whether or not we can come up with a solution) requires a look at the file causing it.

If you don't want to attach it to this post, you can email it to support@pdf-xchange.com.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Problem with - character

Post by Walter-Tracker Supp »

Hi Arnold, I have already followed up with you by email, but I am posting this to the forum as well for completeness.

The documents you originally sent already contained bad searchable text. I’m not sure whether it came from an older version of our viewer or from something else (e.g. a scanner’s OCR software), but if I remove this bad text, OCR works fine with our current viewer build (version 201). OCR also works fine with the fresh scan you just sent. In both cases I get the correct text, with hyphens as expected.

The settings I used were:
- English
- Accuracy: Medium
- Mode: Preserve original content and add text as layer

Therefore I would recommend upgrading to build 201 and trying again from a clean document. If you use “Preserve original content and add text as layer” with the documents you sent originally (that already have OCR text in them), that bad OCR text layer is retained, giving the impression that OCR failed. If you use “Convert page content to image only – add text as layer” then the original bad text is removed and you can see that OCR will work as expected.

But please do make sure you are using build 201 of the viewer to rule out bugs in previous versions that have already been fixed.
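
As a rough way to spot a document that already carries a bad text layer before re-running OCR, the text copied out of the PDF could be scanned for question marks sitting where a hyphen would normally be. The sketch below is only an illustration and is not part of the viewer: the function names, the regular expression, and the threshold of 3 are all assumptions, and it expects text that has already been extracted by some other means.

```python
import re

# Assumed heuristic: a "?" wedged between two letters/digits, or standing
# alone between two spaces, is rare in normal prose and may be a hyphen
# that an earlier OCR pass failed to recognize.
SUSPECT = re.compile(r"(?<=[A-Za-z0-9])\?(?=[A-Za-z0-9])|(?<= )\?(?= )")


def count_suspect_marks(text: str) -> int:
    """Count question marks that sit where a hyphen would be plausible."""
    return len(SUSPECT.findall(text))


def looks_like_bad_text_layer(text: str, threshold: int = 3) -> bool:
    """Return True if the text has at least `threshold` suspect marks.

    The default threshold is arbitrary and would need tuning on real files.
    """
    return count_suspect_marks(text) >= threshold


if __name__ == "__main__":
    sample = "RFP?S?0129?0?2011/EM ? Disaster Debris"
    print(count_suspect_marks(sample), looks_like_bad_text_layer(sample))
```

If a document is flagged this way, that would suggest running OCR with “Convert page content to image only – add text as layer” so the old text is discarded first. An ordinary question mark at the end of a sentence is not counted, since it is not followed by a letter, digit, or a space-delimited gap.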

-Walter