How to recognize text saved in PDF as vector objects (curves)?

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


Post by Jensen Head »

Sample document - https://storagy-itero-production-eu.s3.amazonaws.com/download/ru-ru/План обслуживания iTero.pdf (first page)

After I open this document in PDF-XChange Editor, select the text, copy it to the clipboard and paste it into a word processor, the pasted text appears as unintelligible characters.
Copy text from PDF document
2021-11-17_14-16-16.png: Text copied from PDF document
I ran OCR with the settings shown in the screenshot below, but the result did not change.
"OCR Pages (Enhanced)" settings window
If I select "Type: Editable Text and Images", the copied text does not change either. I also tried "Optimize PDF" (ticking the objects in the "Discard Objects", "Discard User Data" and "Cleanup" tabs) and then repeated the recognition, but that did not help either.

I would prefer not to merge all the layers into a single raster layer, because the image quality of the original document is perfect (see the fourth screenshot), and merging the layers reduces the quality considerably (see https://forum.pdf-xchange.com/viewtopic.php?p=155414).
Letter quality of text saved as curves

Post by Tracker Supp-Stefan »

Hello Jensen Head,

The issue is in the file itself. It contains enough information for the text to be displayed correctly on screen, but not enough for it to be copied and pasted into other text editors. Other PDF viewers, including Adobe's, will also be unable to copy content from the original file.
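If you want a quick way to tell whether text copied from a PDF is usable, a rough check like the one below may help. It is only a sketch and not part of PDF-XChange Editor: the function names `script_share` and `looks_garbled` and the 0.5 threshold are assumptions made for illustration, and it assumes you know which script the document is written in (Cyrillic for this sample).

```python
import unicodedata


def script_share(text: str, script: str = "CYRILLIC") -> float:
    """Fraction of characters (ignoring spaces, digits and punctuation)
    whose Unicode name starts with `script`."""
    relevant = [
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in ("Z", "N", "P")
    ]
    if not relevant:
        return 0.0
    # Characters without a Unicode name (private use, control) never count as hits.
    hits = sum(1 for ch in relevant if unicodedata.name(ch, "").startswith(script))
    return hits / len(relevant)


def looks_garbled(text: str, script: str = "CYRILLIC", threshold: float = 0.5) -> bool:
    """Heuristic only: flag text in which the expected script is a minority."""
    return script_share(text, script) < threshold
```

Pasting the copied text into `looks_garbled(...)` gives a yes/no hint. A mostly Russian string should pass, while text made of private-use or unrelated Latin characters should be flagged.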

There is a way to achieve proper text extraction, but it involves a few steps.
Because your file has a background that you would probably like to keep, you will need to export those pages to images (300 DPI should be plenty), and then drag and drop each image onto the Editor so that it converts it to a PDF on the fly. You can then run OCR on the new image-based file, and the text recognized there will be selectable and extractable:
image.png
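To give a sense of what a 300 DPI export means in practice, here is a small arithmetic sketch. It relies on the fact that PDF page sizes are measured in points, with 72 points to the inch by default. The A4 page size used in the example is an assumption, since the actual page size of the sample file is not stated here, and the helper names are made up for illustration.

```python
def pixels_at_dpi(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Convert a page size in PDF points (1/72 inch) to pixels at the given DPI."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def uncompressed_rgb_bytes(width_px: int, height_px: int) -> int:
    """Size of the raw 8-bit RGB bitmap, before any image compression."""
    return width_px * height_px * 3


# Assumed A4 page (595.276 x 841.89 pt); replace with the real page size.
a4_pixels = pixels_at_dpi(595.276, 841.89)
a4_bytes = uncompressed_rgb_bytes(*a4_pixels)
```

For an A4 page this works out to 2480 x 3508 pixels, which is usually enough detail for OCR of normal body text.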
Kind regards,
Stefan
Attachments
План обслуживания iTero.pdf