extra white spaces from OCR of full justified text on scanned pages

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


asdf3sthdhawer53
User
Posts: 14
Joined: Wed Feb 03, 2021 10:21 am

extra white spaces from OCR of full justified text on scanned pages

Post by asdf3sthdhawer53 » Wed Feb 03, 2021 11:59 am

When I scan pages that are fully justified, so that some lines have big gaps between words, and then run OCR, the OCR (enhanced) result contains multiple line breaks and extra white space between words. So if I copy and paste the OCR text, I have to manually delete all the extra spaces and line breaks between the words.
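As a stopgap for the manual cleanup described above, the pasted text can be normalized with a short script. This is a minimal sketch and not a feature of PDF-XChange Editor: the function name `normalize_ocr_text` and the `keep_paragraphs` flag are hypothetical, and it assumes that blank lines mark real paragraph breaks while single line breaks are just line wraps. It does not rejoin words hyphenated at line ends.

```python
import re


def normalize_ocr_text(text: str, keep_paragraphs: bool = True) -> str:
    """Collapse extra spaces and line breaks in text copied from an OCR layer.

    Assumption: a blank line separates paragraphs; a single line break
    inside a paragraph is only a line wrap and becomes one space.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    if not keep_paragraphs:
        # Treat every whitespace run, including blank lines, as one space.
        return " ".join(text.split())

    # Split on blank lines (which may themselves contain spaces or tabs).
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for para in paragraphs:
        # str.split() with no argument splits on any run of whitespace.
        words = para.split()
        if words:
            cleaned.append(" ".join(words))
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The   quick\nbrown    fox\n\n\nNext   paragraph"
    print(normalize_ocr_text(sample))
```

If the OCR output puts blank lines between words inside a paragraph, calling the function with `keep_paragraphs=False` flattens everything into a single line instead.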

TrackerSupp-Daniel
Site Admin
Posts: 5172
Joined: Wed Jan 03, 2018 6:52 pm

Re: extra white spaces from OCR of full justified text on scanned pages

Post by TrackerSupp-Daniel » Thu Feb 04, 2021 12:10 am

Hi, asdf3sthdhawer53

Unfortunately, this is currently a limitation of OCR. In the future it may be able to detect justification and other paragraph settings, but I cannot make any promises or offer a timeline for this.

Kind regards,
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Support: <Support@tracker-software.com>
Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623

asdf3sthdhawer53
User
Posts: 14
Joined: Wed Feb 03, 2021 10:21 am

Re: extra white spaces from OCR of full justified text on scanned pages

Post by asdf3sthdhawer53 » Sat Nov 20, 2021 2:18 pm

Is there an update for fixing the white spaces between characters?
Thanks

Tracker Supp-Stefan
Site Admin
Posts: 14680
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: extra white spaces from OCR of full justified text on scanned pages

Post by Tracker Supp-Stefan » Mon Nov 22, 2021 11:25 am

Hello asdf3sthdhawer53,

I just tested with the attached file, and it does seem that while the words are recognized separately, the gaps between them are not filled in with e.g. tabs or multiple "space" characters, so when this text is copied from the PDF into e.g. Notepad, only single spaces are added:
[Screenshots: image.png, image1.png]

Kind regards,
Stefan
Attachment: New Document.pdf

asdf3sthdhawer53
User
Posts: 14
Joined: Wed Feb 03, 2021 10:21 am

Re: extra white spaces from OCR of full justified text on scanned pages

Post by asdf3sthdhawer53 » Mon Nov 22, 2021 3:07 pm

If you try German documents, you will see a lot of spaces in between characters inside of words. I get different results for each attempt, which can also change for the same scanned page depending on how many other pages are in the PDF.

I was going to attach a sample page, but when I deleted the other pages in the PDF, the problem with the spaces went away, and instead I got some gibberish. I don't want to include the other pages to reproduce the problem with the spaces, since there is personal data on them, so I won't share it.
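For the spaces inside words described above, one rough workaround on the copied text is to rejoin runs of single letters. This is only a heuristic sketch, not how the OCR engine works and not a fix from Tracker: the function name `join_spaced_letters` is hypothetical, it only handles words split into individual letters (for example "D o k u m e n t"), and it cannot repair a word split into larger pieces (for example "Doku ment"), which would need a dictionary. Two adjacent spaced-out words separated by a single space would also be merged into one.

```python
import re

# A run of at least three single letters, each separated by one space,
# not directly attached to another letter on either side.
# [^\W\d_] matches any Unicode letter, so umlauts and "ß" are covered.
_SPACED_LETTERS = re.compile(
    r"(?<![^\W\d_])(?:[^\W\d_] ){2,}[^\W\d_](?![^\W\d_])"
)


def join_spaced_letters(text: str) -> str:
    """Rejoin words that OCR split into single, space-separated letters."""
    return _SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(join_spaced_letters("Das D o k u m e n t ist da"))
```

The threshold of three letters is a design choice to avoid gluing together legitimate one-letter words; it should be checked against real output before relying on it.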

Paul - Tracker Supp
Site Admin
Posts: 5203
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: extra white spaces from OCR of full justified text on scanned pages

Post by Paul - Tracker Supp » Mon Nov 22, 2021 5:03 pm

Hi asdf3sthdhawer53,

Do you think you could sanitize the documents? If you redact the sensitive information, we could have a German document that reliably reproduces the issue.
_________________
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
