When I OCR a scanned page in PDF-XChange and export the resulting PDF to MS Word, I get a file with the scanned page image and a text layer. The text is selectable, but invisible.
I would like to make the text visible and delete the image so that I can edit the content, but I cannot find out how (deleting the image itself is no problem). I can reset all formatting, and then the text becomes visible, but I lose font sizes and a lot of formatting, and the spaces seem to disappear too(?)
It does not seem to be font formatting, i.e. a hidden or white font color, so some other "trick" is used.
Also, I would like to know if it is possible to save the OCR'ed PDF with only the text layer, thus reducing the size of the PDF considerably.
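For the PDF side of the question, a minimal sketch of one possible "trick": OCR text layers are commonly drawn with PDF text rendering mode 3 (the `3 Tr` operator, meaning neither fill nor stroke). Whether PDF-XChange uses exactly this is an assumption here, and it does not explain how the Word export hides the text. The helper name `make_text_visible` is hypothetical, and it expects an already decompressed page content stream.

```python
import re

# Matches the text-rendering-mode operator with operand 3 ("3 Tr"),
# as a standalone token in a decompressed PDF content stream.
INVISIBLE_TR = re.compile(rb"(?<![\w.+-])3(\s+)Tr(?!\w)")


def make_text_visible(content_stream: bytes) -> bytes:
    """Switch invisible text (mode 3) to normal filled text (mode 0).

    Assumption: the OCR layer is hidden only via rendering mode 3.
    Limitation: a literal "3 Tr" inside a text string would also be changed.
    """
    return INVISIBLE_TR.sub(rb"0\g<1>Tr", content_stream)


if __name__ == "__main__":
    page = b"BT /F1 12 Tf 3 Tr 72 700 Td (Method) Tj ET"
    print(make_text_visible(page))
```

This only illustrates the idea; in practice a PDF library or the Editor itself should do the rewriting, since content streams are normally compressed.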
Make text visible after OCR and export to Word?
-
- User
- Posts: 4
- Joined: Fri Oct 28, 2016 11:16 am
-
- User
- Posts: 4
- Joined: Fri Oct 28, 2016 11:16 am
Re: Make text visible after OCR and export to Word?
Oh, I posted in the wrong forum, should have been the OCR forum. Sorry.
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
- Contact:
Re: Make text visible after OCR and export to Word?
Hi Hans,
Thanks for the post and no worries - Please see this KB article:
https://www.pdf-xchange.com/knowle ... the-Editor
HTH!
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
-
- User
- Posts: 4
- Joined: Fri Oct 28, 2016 11:16 am
Re: Make text visible after OCR and export to Word?
Great! Just what I was looking for.
However ... I notice in the Contents pane that each text element (line, box, etc.) is handled separately. That's OK, but each element is also split into single words. When this is converted to MS Word, the spaces between words seem to be implemented as a "wide" last character, i.e. on the Advanced tab of the Font settings in Word, the width of the last character is increased by e.g. 3 pt. So if I reset the font formatting, all spaces disappear.
Is there any way to get the OCR to simply replace the gap between words with a space character? If I Save As... plain TXT, spaces are used.
EDIT: In the PDF they seem to be real spaces, so this "conversion" must happen in the export to Word.
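As a possible workaround on the Word side, here is a rough sketch that turns the "wide" characters back into real spaces. It assumes the export stores the gap as expanded character spacing (`w:spacing` with a positive `w:val`, in twentieths of a point) on the run holding the last character of each word. That is an interpretation of what the Font dialog shows, not something confirmed for PDF-XChange. The function name `spacing_to_spaces` is hypothetical, and its input is the text of `word/document.xml` taken from inside the .docx archive.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)


def spacing_to_spaces(document_xml: str) -> str:
    """Replace expanded character spacing on a run with a trailing space."""
    root = ET.fromstring(document_xml)
    for run in root.iter(f"{{{W_NS}}}r"):
        props = run.find(f"{{{W_NS}}}rPr")
        if props is None:
            continue
        spacing = props.find(f"{{{W_NS}}}spacing")
        if spacing is None:
            continue
        try:
            value = int(spacing.get(f"{{{W_NS}}}val", "0"))
        except ValueError:
            continue
        if value <= 0:
            continue  # condensed or normal spacing: leave untouched
        texts = run.findall(f"{{{W_NS}}}t")
        if not texts:
            continue
        last = texts[-1]
        last.text = (last.text or "") + " "
        # Without xml:space="preserve" Word may drop the trailing space.
        last.set(f"{{{XML_NS}}}space", "preserve")
        props.remove(spacing)
    return ET.tostring(root, encoding="unicode")
```

After this, resetting the font formatting should no longer glue the words together, at the cost of the exact horizontal positions the export tried to keep.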
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
- Contact:
Re: Make text visible after OCR and export to Word?
Hi Hans,
Can you send a sample for us to take a look at? The spacing and overall layout of the text, after OCRing, will be specific to the document(s) that you're using.
Cheers,
-
- User
- Posts: 4
- Joined: Fri Oct 28, 2016 11:16 am
Re: Make text visible after OCR and export to Word?
Thanks, Will
Here's an example of a scanned page from our Canon printer, the PDF-XChange Editor OCR'ed page, the PDF with only the visible text layer, and the exported MS Word file.
In the text layer PDF, you can select and edit the text just fine (the kerning etc. looks a little weird, but anyway ...) including spaces between words. After the export to MS Word, spaces are implemented as "wide" last characters. When selecting a text line, the Text property pane does not show any transformations applied, but just the basic font and size - which looks fine.
It would be nice to have an option to "normalize" the text after OCR, i.e. remove whatever spacing tweaks and character distortions the OCR engine applied(?) to make the text look like the original, at least on the top-level text elements (lines). Preserving the overall layout and line spacing would be nice.
- Attachments
-
- Org scanned page.zip
- Original scan PDF, OCR'ed PDF, text layer only PDF, exported MS Word
- (500.17 KiB) Downloaded 175 times
- Tracker Supp-Stefan
- Site Admin
- Posts: 17906
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Make text visible after OCR and export to Word?
Hello Hans Jensen,
For the OCR tool, the most important thing (after recognizing the text itself) is to place the new text objects at the correct position (e.g. exactly on top of the word on the image), which is why there are so many text objects after the OCR process.
As for the spacing, I am afraid that there isn't much I can suggest now. In your sample, "5." and the word "Method" are separate objects, and each is placed independently on the PDF page with its own coordinates. That is perfectly fine for a PDF file, but in the Word document they have to "flow" after one another. I will speak with our devs to see if there are any settings that can convert the single wide spaces into multiple smaller ones, but then there is a risk that the formatting will be broken, so it might not be possible in the end.
Regards,
Stefan
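To illustrate the difference between positioned objects and flowing text described above, here is a small sketch of how independently placed words on one line could be joined into a single string, inserting a space wherever the horizontal gap exceeds a threshold. This is not how PDF-XChange's exporter works. The `Word` type, the `flow_line` function, the 1 pt threshold, and the coordinates are all made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Word:
    text: str
    x0: float  # left edge on the page, in points
    x1: float  # right edge on the page, in points


def flow_line(words: Iterable[Word], gap_threshold: float = 1.0) -> str:
    """Join positioned text objects from one line into flowing text.

    A literal space is inserted when the gap between two neighbouring
    objects is larger than gap_threshold (an arbitrary, illustrative value).
    """
    parts = []
    prev = None
    for word in sorted(words, key=lambda w: w.x0):
        if prev is not None and word.x0 - prev.x1 > gap_threshold:
            parts.append(" ")
        parts.append(word.text)
        prev = word
    return "".join(parts)


if __name__ == "__main__":
    line = [Word("Method", 90.0, 130.0), Word("5.", 72.0, 82.0)]
    print(flow_line(line))
```

The trade-off is the one mentioned in the reply: once gaps become ordinary spaces, the words no longer sit exactly where they were on the scanned page.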