Make text visible after OCR and export to Word?

Posted: Thu Nov 03, 2016 3:07 pm
by hans jensen
When I OCR a scanned page in PDF-XChange and export the resulting PDF to MS Word, I get a file with the scanned page image and a text layer. The text is selectable, but invisible.

I would like to make the text visible (and delete the image) so that I can edit the content, but I cannot find out how (deleting the image is fine). I can reset all formatting, and then the text becomes visible, but I lose font sizes and a lot of formatting, and spaces are gone too(?)

It does not seem to be the font formatting, i.e. hidden or white font color, so some other "trick" is used.

Also, I would like to know if it is possible to save the OCR'ed PDF with only the text layer, thus reducing the size of the PDF considerably.

Re: Make text visible after OCR and export to Word?

Posted: Thu Nov 03, 2016 3:11 pm
by hans jensen
Oh, I posted in the wrong forum, should have been the OCR forum. Sorry.

Re: Make text visible after OCR and export to Word?

Posted: Thu Nov 03, 2016 3:16 pm
by Will - Tracker Supp
Hi Hans,

Thanks for the post and no worries - Please see this KB article:
https://www.pdf-xchange.com/knowle ... the-Editor

HTH!

Re: Make text visible after OCR and export to Word?

Posted: Fri Nov 04, 2016 1:19 pm
by hans jensen
Great! Just what I was looking for.

However ... I notice in the Contents pane, that each text element (line, box, etc.) is handled separately, that's ok, but each element is split into single words. When this is converted to MS Word, it seems the spaces between words are implemented as a "wide" last character, i.e. in the Advanced tab on the Font settings in Word, the width of the last character is increased by e.g. 3 pt. So if I reset the font formatting, all spaces disappear.

Is there any way to get the OCR just to replace the space between words by a space character? If I Save As... plain TXT, spaces are used.

EDIT: To add: In the PDF it seems to be spaces, so it must be in the export to Word this "conversion" is done.
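
A possible workaround for the "wide last character" spacing described above, as a minimal sketch only. It assumes the exported .docx stores the extra width as a positive `w:spacing` value (character spacing, in twentieths of a point, so 3 pt would be 60) in the run properties of the run that holds the last character of each word; that is a guess from the description, not something confirmed for the PDF-XChange export. The function name `spacing_to_spaces` and the helper `w` are made up for this example. It works on the text of `word/document.xml` and uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def w(tag: str) -> str:
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def spacing_to_spaces(document_xml: str) -> str:
    """Replace expanded character spacing on a run with a real space.

    Assumption: a positive w:spacing value on a run stands in for the gap
    after a word. Condensed (negative) or zero spacing is left untouched.
    """
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(document_xml)
    # Collect the runs first so the tree is not modified while iterating.
    for run in list(root.iter(w("r"))):
        rpr = run.find(w("rPr"))
        if rpr is None:
            continue
        spacing = rpr.find(w("spacing"))
        if spacing is None:
            continue
        val = spacing.get(w("val"), "0")
        if not val.lstrip("-").isdigit() or int(val) <= 0:
            continue
        texts = run.findall(w("t"))
        if not texts:
            continue
        # Drop the widening and append an ordinary space instead.
        rpr.remove(spacing)
        last = texts[-1]
        last.text = (last.text or "") + " "
        # Without xml:space="preserve" Word may drop the trailing space.
        last.set(XML_SPACE, "preserve")
    return ET.tostring(root, encoding="unicode")
```

A real `document.xml` declares many more namespaces than `w`, and ElementTree may rename their prefixes when writing the file back, so for actual files a namespace-preserving library is probably the safer choice. Always work on a copy of the exported document.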

Re: Make text visible after OCR and export to Word?

Posted: Fri Nov 04, 2016 1:33 pm
by Will - Tracker Supp
Hi Hans,

Can you send a sample for us to take a look at? The spacing and overall layout of the text, after OCRing, will be specific to the document(s) that you're using.

Cheers,

Re: Make text visible after OCR and export to Word?

Posted: Fri Nov 04, 2016 2:33 pm
by hans jensen
Thanks, Will

Here's an example of a scanned page from our Canon printer, the PDF-XChange Editor OCR'ed page, the PDF with only the visible text layer, and the exported MS Word file.

In the text layer PDF, you can select and edit the text just fine (the kerning etc. looks a little weird, but anyway ...) including spaces between words. After the export to MS Word, spaces are implemented as "wide" last characters. When selecting a text line, the Text property pane does not show any transformations applied, but just the basic font and size - which looks fine.

It would be nice to have an option to "normalize" the text after OCR, i.e. remove whatever spacing, tweaks and character distortions the OCR engine applied(?) to make the text look like the original, at least on the top-level text elements (lines). Preserving the overall layout and line spacing would be nice.

Re: Make text visible after OCR and export to Word?

Posted: Fri Nov 04, 2016 4:18 pm
by Tracker Supp-Stefan
Hello Hans Jensen,

For the OCR Tool the most important thing (after recognizing the text itself ;) ) is to place the new text objects at the correct position (e.g. exactly on top of the word on the image), so that is why there are so many text objects after the OCR Process.
As for the spacing - I am afraid that there isn't much I can suggest now. In your sample, e.g. "5." and the word "Method" are separate objects, and each is placed independently on the PDF page with its own coordinates. That is perfectly fine for a PDF file, but in the Word document they have to "flow" after one another. I will speak with our devs to see if there are any settings that can convert the single wide spaces into multiple smaller ones - but then there is a risk that the formatting will be broken, so it might not be possible in the end.

Regards,
Stefan
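
To illustrate the idea of turning one wide gap into several ordinary spaces, here is a rough sketch. It is not how PDF-XChange does the export. The `Word` class, the `flow_line` function, the coordinates and the space width are all hypothetical: each word on a line is assumed to come with its left and right edge in points, and the width of one space character in the target font is assumed to be known.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word on the page, in points
    x1: float  # right edge of the word on the page, in points


def flow_line(words: list[Word], space_width: float) -> str:
    """Join independently positioned words into one flowing line.

    The gap between two neighbouring words is approximated by a whole
    number of space characters, never fewer than one.
    """
    if space_width <= 0:
        raise ValueError("space_width must be positive")
    ordered = sorted(words, key=lambda word: word.x0)
    parts: list[str] = []
    previous = None
    for word in ordered:
        if previous is not None:
            gap = word.x0 - previous.x1
            count = max(1, round(gap / space_width))
            parts.append(" " * count)
        parts.append(word.text)
        previous = word
    return "".join(parts)


# Example with made-up coordinates: a wide gap after "5.", a normal one after "Method".
line = flow_line(
    [Word("5.", 72.0, 82.0), Word("Method", 100.0, 140.0), Word("used", 143.0, 165.0)],
    space_width=3.0,
)
```

The rounding is where the formatting risk mentioned above shows up: the width of a space depends on the font and size, so text rebuilt this way will only roughly line up with the original positions.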