Text that is neither text nor image, but totally another text and symbols

Post by **asocialis** » Tue May 23, 2023 8:25 pm

What is this? When I try to copy this text, I get totally random characters instead of what I see on screen.
It is not an image, so I can't fix it with OCR. What can I do? How is it possible that I see the text correctly, but when I copy or edit it, it is unreadable?
I can see in the Content pane that it really is text. How do I fix it? How is it possible for that to exist?
Sample file: https://mo.ks.gov.ba/sites/mo.ks.gov.ba/files/zakon_o_osnovama_sigurnosti_saobracaja_na_putevima_u_bosni_i_hercegovini_glasnik_broj_6_06.pdf
For example, the word ZAKON, which is the Bosnian word for LAW and appears on the first page in large bold letters near the bottom left, is shown as yz{|}. All words are shown as random characters in the Content pane or when copied, so search is useless. But somehow everything is displayed perfectly.

Post by **TrackerSupp-Daniel** » Tue May 23, 2023 8:52 pm

Hello, asocialis

This indicates that the font was not embedded in the document properly, so its glyph association data is corrupt. The glyph shapes themselves are intact, which is why the page displays correctly while copied or searched text comes out as garbage. As a result, the font assigns character codes to the glyphs on screen (usually in order of appearance), so if you had a sentence that read:
"This is a test"
it may come out as:
"abcdecdefeagda"
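To make the "in order of appearance" idea concrete, here is a minimal Python sketch. It is only an illustration of the effect, not how any PDF library or our software handles fonts; the function name `scramble` and its `start` and `fold_case` parameters are made up for this example.

```python
def scramble(text: str, start: str = "a", fold_case: bool = False) -> str:
    """Give each distinct glyph the next free code, in order of first appearance.

    Hypothetical helper for illustration only: `start` is the first code
    handed out, and `fold_case` treats upper and lower case as one glyph.
    """
    if fold_case:
        text = text.lower()
    codes = {}  # glyph -> assigned character code
    out = []
    for glyph in text:
        if glyph not in codes:
            # The next unused code is simply start + number of glyphs seen so far.
            codes[glyph] = chr(ord(start) + len(codes))
        out.append(codes[glyph])
    return "".join(out)


print(scramble("This is a test", fold_case=True))  # abcdecdefeagda
print(scramble("This is a test"))                  # abcdecdefeghdg
print(scramble("ZAKON", start="y"))                # yz{|}
```

The first call reproduces the string above, which treats "T" and "t" as the same glyph; with distinct codes for the two, the result is the second line. The third call assumes, purely as a guess, that the five glyphs of ZAKON happened to receive consecutive codes starting at "y", which would match the yz{|} seen in the sample file.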

This is not an issue within our software, but an issue within the document itself.
To resolve this, OCR is certainly an option. Open the OCR pages dialog and make sure you disable the option to ignore existing text. Then set the OCR to create editable text and images. It will scan the pages and replace the original text with properly formatted text that uses a real font.

Kind regards,
