Text that is neither text nor image, but totally another text and symbols  SOLVED

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


asocialis
User
Posts: 40
Joined: Mon Feb 11, 2019 12:29 am

Text that is neither text nor image, but totally another text and symbols

Post by asocialis »

What is this? When I try to copy this text, it behaves like text, but what I get is not what I see. Instead, I get totally random characters.
It is not an image, so I can't fix it with OCR. What can I do? How is it possible that the text displays correctly but is unreadable when I copy or edit it?
I can see in the Content pane that it really is text. How can I fix it? How is it possible for that to exist?
Sample file: https://mo.ks.gov.ba/sites/mo.ks.gov.ba/files/zakon_o_osnovama_sigurnosti_saobracaja_na_putevima_u_bosni_i_hercegovini_glasnik_broj_6_06.pdf
For example, the word ZAKON, which is the Bosnian word for LAW and appears on the first page in large bold letters near the bottom left, comes out as yz{|}. All words come out as random characters in the Content pane or when copied, so search is useless. Yet somehow everything is displayed perfectly.
TrackerSupp-Daniel
Site Admin
Posts: 7209
Joined: Wed Jan 03, 2018 6:52 pm

Re: Text that is neither text nor image, but totally another text and symbols  SOLVED

Post by TrackerSupp-Daniel »

Hello, asocialis

This indicates that the font was not embedded in the document properly, so its glyph association data is corrupt. As a result, the font assigns character codes to the on-screen glyphs arbitrarily (usually in order of appearance), so if you had a sentence that read:
"This is a test"
it may come out as:
"abcdecdefeagda"
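
As a rough illustration of the "in order of appearance" behaviour described above, here is a minimal Python sketch. The function name `garble` and the `first_code` parameter are hypothetical, made up for this example; this is not how any particular PDF producer or PDF-XChange Editor actually works, only a toy model of codes being handed out to glyphs as they are first used.

```python
def garble(visible_text: str, first_code: int = ord("a")) -> str:
    """Toy model: return what copy/paste might yield when each new glyph
    is simply given the next free character code (assumption, see above)."""
    code_for_glyph: dict[str, int] = {}  # glyph as displayed -> assigned code
    copied = []
    for glyph in visible_text:
        if glyph not in code_for_glyph:
            # First time this glyph appears: assign the next sequential code.
            code_for_glyph[glyph] = first_code + len(code_for_glyph)
        copied.append(chr(code_for_glyph[glyph]))
    return "".join(copied)


# The word from the sample file, assuming its five glyphs happened to be
# assigned codes starting at "y" (a hypothetical starting point).
print(garble("ZAKON", first_code=ord("y")))  # prints yz{|}
```

The page still renders correctly because the viewer draws the embedded glyph shapes; only the codes behind them are meaningless, which is why copying and searching fail.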

This is not an issue within our software, but an issue within the document itself.
That said, OCR is certainly an option for resolving this. Simply open the OCR Pages dialog and make sure you disable the option to ignore existing text. Then set the OCR to create editable text and images, and it will scan the pages and replace the original text with properly formatted text that uses a real font.

Kind regards,
Daniel McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Support: <Support@tracker-software.com>