Make text visible after OCR and export to Word?

hans jensen
User
Posts: 4
Joined: Fri Oct 28, 2016 11:16 am

Make text visible after OCR and export to Word?

Post by hans jensen »

When I OCR a scanned page in PDF-XChange and export the resulting PDF to MS Word, I get a file with the scanned page image and a text layer. The text is selectable, but invisible.

I would like to make the text visible (and delete the image) so that I can edit the content, but I cannot work out how (deleting the image is no problem). I can reset all formatting, and then the text becomes visible, but I lose font sizes and a lot of formatting, and the spaces disappear too(?)

It does not seem to be the font formatting, i.e. a hidden or white font color, so some other "trick" is used.

Also, I would like to know if it is possible to save the OCR'ed PDF with only the text layer, thus reducing the size of the PDF considerably.
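
Editor's note on the invisible text layer: in a PDF, text is commonly hidden by setting text rendering mode 3 (the "3 Tr" operator, which neither fills nor strokes the glyphs). Whether PDF-XChange writes its OCR layer this way, and what the Word export does with it, is not confirmed anywhere in this thread. The following is only a minimal sketch under that assumption. The function name make_text_visible is hypothetical, and it expects an already decompressed content stream as text.

```python
import re

# Matches the operand 3 followed by the Tr operator, but not "13 Tr" or "0.3 Tr".
_INVISIBLE_TR = re.compile(r"(?<![\w.+-])3(\s+)Tr\b")

def make_text_visible(content_stream: str) -> str:
    """Switch text rendering mode 3 (invisible) to 0 (fill).

    Assumption: the OCR layer is hidden with "3 Tr". This naive pattern
    would also rewrite a literal "3 Tr" inside a text string, so it is a
    sketch, not a PDF parser.
    """
    return _INVISIBLE_TR.sub(r"0\1Tr", content_stream)

sample = "BT /F1 12 Tf 3 Tr (Hello) Tj ET"
print(make_text_visible(sample))
```

In practice the KB article linked later in the thread is the supported route; the sketch only shows what "invisible" typically means at the file level.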
hans jensen
User
Posts: 4
Joined: Fri Oct 28, 2016 11:16 am

Re: Make text visible after OCR and export to Word?

Post by hans jensen »

Oh, I posted in the wrong forum; this should have been in the OCR forum. Sorry.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: Make text visible after OCR and export to Word?

Post by Will - Tracker Supp »

Hi Hans,

Thanks for the post, and no worries. Please see this KB article:
https://www.pdf-xchange.com/knowle ... the-Editor

HTH!
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
hans jensen
User
Posts: 4
Joined: Fri Oct 28, 2016 11:16 am

Re: Make text visible after OCR and export to Word?

Post by hans jensen »

Great! Just what I was looking for.

However ... I notice in the Contents pane that each text element (line, box, etc.) is handled separately. That's OK, but each element is also split into single words. When this is converted to MS Word, it seems the spaces between words are implemented as a "wide" last character, i.e. in the Advanced tab of the Font settings in Word, the character spacing of the last character is expanded by e.g. 3 pt. So if I reset the font formatting, all spaces disappear.

Is there any way to get the OCR to simply replace the gap between words with a space character? If I Save As... plain TXT, spaces are used.

EDIT: To add: in the PDF they appear to be real spaces, so this "conversion" must happen during the export to Word.
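
Editor's note: the "wide last character" described above matches how Word stores expanded character spacing, as a w:spacing element in a run's properties with a value in twentieths of a point (3 pt = 60). Assuming the exported file uses exactly that mechanism (not verified against the attached sample), a workaround can be sketched as follows. The function name spacing_to_spaces and the threshold are hypothetical; the input is the XML text of word/document.xml taken from the .docx archive.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)

def w(tag: str) -> str:
    """Return the namespace-qualified WordprocessingML tag name."""
    return f"{{{W_NS}}}{tag}"

def spacing_to_spaces(document_xml: str, min_twentieths: int = 20) -> str:
    """Replace expanded character spacing on a run with a trailing space.

    Assumption: the export encodes word gaps as <w:spacing w:val="N"/>
    in the run properties (N in twentieths of a point).
    """
    root = ET.fromstring(document_xml)
    for run in list(root.iter(w("r"))):
        rpr = run.find(w("rPr"))
        spacing = rpr.find(w("spacing")) if rpr is not None else None
        if spacing is None:
            continue
        try:
            value = int(spacing.get(w("val"), "0"))
        except ValueError:
            continue
        if value < min_twentieths:
            continue  # leave ordinary kerning tweaks alone
        rpr.remove(spacing)
        texts = run.findall(w("t"))
        if texts:
            texts[-1].text = (texts[-1].text or "") + " "
            # Without xml:space="preserve" Word drops trailing spaces.
            texts[-1].set(f"{{{XML_NS}}}space", "preserve")
    return ET.tostring(root, encoding="unicode")

SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:rPr><w:spacing w:val="60"/></w:rPr><w:t>5.</w:t></w:r>'
    '<w:r><w:t>Method</w:t></w:r>'
    '</w:p>'
)
print(spacing_to_spaces(SAMPLE))
```

Every expanded run gets exactly one space here, so wider gaps (tabs, column breaks) would be flattened. That is the formatting risk raised later in the thread.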
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: Make text visible after OCR and export to Word?

Post by Will - Tracker Supp »

Hi Hans,

Can you send a sample for us to take a look at? The spacing and overall layout of the text, after OCRing, will be specific to the document(s) that you're using.

Cheers,
hans jensen
User
Posts: 4
Joined: Fri Oct 28, 2016 11:16 am

Re: Make text visible after OCR and export to Word?

Post by hans jensen »

Thanks, Will

Here's an example of a scanned page from our Canon printer, the PDF-XChange Editor OCR'ed page, the PDF with only the visible text layer, and the exported MS Word file.

In the text layer PDF, you can select and edit the text just fine (the kerning etc. looks a little weird, but anyway ...) including spaces between words. After the export to MS Word, spaces are implemented as "wide" last characters. When selecting a text line, the Text property pane does not show any transformations applied, but just the basic font and size - which looks fine.

It would be nice to have an option to "normalize" the text after OCR, i.e. remove whatever spacing, tweaks and character distortions the OCR engine applied(?) to make the text look like the original, at least on the top-level text elements (lines). Preserving the overall layout and line spacing would be nice.
Attachments
Org scanned page.zip
Original scan PDF, OCR'ed PDF, text layer only PDF, exported MS Word
(500.17 KiB) Downloaded 169 times
Tracker Supp-Stefan
Site Admin
Posts: 17767
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Make text visible after OCR and export to Word?

Post by Tracker Supp-Stefan »

Hello Hans Jensen,

For the OCR tool, the most important thing (after recognizing the text itself ;) ) is to place the new text objects at the correct position, e.g. exactly on top of the word in the image. That is why there are so many text objects after the OCR process.
As for the spacing, I am afraid there isn't much I can suggest right now. In your sample, "5." and the word "Method" are separate objects, and each is placed independently on the PDF page with its own coordinates. That is perfectly fine for a PDF file, but in the Word document they have to "flow" one after another. I will speak with our devs to see if there are any settings that can convert the single wide spaces into multiple smaller ones, but there is a risk that the formatting will be broken, so it might not be possible in the end.

Regards,
Stefan
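
Editor's note: the difference between independently positioned PDF text objects and flowing text can be illustrated with a small sketch. It is not how PDF-XChange's exporter works. The Word class, the flow_words function, and the 3 pt space width are all hypothetical. It groups words by baseline, orders them left to right, and turns each horizontal gap into one or more ordinary spaces.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float      # left edge in points
    y: float      # baseline in points (PDF origin is bottom-left)
    width: float  # rendered width in points

def flow_words(words, space_width=3.0, line_tolerance=2.0):
    """Turn positioned words into lines of text joined by plain spaces."""
    lines = []  # each entry is a list of words sharing a baseline
    for word in sorted(words, key=lambda item: -item.y):
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    out = []
    for line in lines:
        line.sort(key=lambda item: item.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # At least one space; wider gaps become several smaller spaces.
            count = max(1, round(gap / space_width))
            parts.append(" " * count + cur.text)
        out.append("".join(parts))
    return "\n".join(out)

page = [
    Word("Method", x=30, y=700, width=40),
    Word("5.", x=10, y=700.5, width=11),
    Word("Results", x=10, y=680, width=40),
]
print(flow_words(page))
```

Because a space in a proportional font rarely matches the original gap exactly, text rebuilt this way drifts from the scanned layout, which is the formatting risk mentioned above.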