extra white spaces from OCR of full justified text on scanned pages

Post by **asdf3sthdhawer53** » Wed Feb 03, 2021 11:59 am

When I scan pages that are fully justified, so that some lines have words with big spaces in between, then run OCR, the OCR (enhanced) result is multiple line breaks and white spaces in between words. So if I copy and paste the OCR text, I have to go and manually delete all the spaces and line breaks in between the words.
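
As a stopgap for the manual cleanup described above, the copied text can be normalized with a few lines of Python. This is only a minimal sketch, not a feature of PDF-XChange Editor: the function name `collapse_whitespace` and the `keep_paragraphs` flag are illustrative assumptions, and the approach assumes that no whitespace in the OCR output carries meaning beyond separating words.

```python
import re


def collapse_whitespace(text: str, keep_paragraphs: bool = False) -> str:
    """Collapse runs of spaces, tabs and line breaks into single spaces.

    With keep_paragraphs=True, blank lines are treated as paragraph
    boundaries and kept as one empty line between paragraphs.
    """
    if not keep_paragraphs:
        # str.split() with no argument splits on any run of whitespace.
        return " ".join(text.split())
    # Hypothetical option: split on blank lines first, then clean each part.
    paragraphs = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)
```

Note that this only merges whitespace between words. It cannot tell whether a space inside a word was inserted by mistake, so it does not repair words that OCR has split apart.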

Post by **TrackerSupp-Daniel** » Thu Feb 04, 2021 12:10 am

Hi, asdf3sthdhawer53

Unfortunately, this is currently a limitation of OCR. In the future it may be able to detect justification and other paragraph settings, but I cannot make any promises or offer a timeline for this.

Kind regards,

Post by **asdf3sthdhawer53** » Sat Nov 20, 2021 2:18 pm

Is there an update for fixing the white spaces between characters?
Thanks

Mon Nov 22, 2021 11:25 am

Hello asdf3sthdhawer53,

I just tested with the attached file, and it does seem that while the words are recognized separately, the spaces between them are not filled in with e.g. tabs or multiple "space" characters, so when this text is copied from inside the PDF to e.g. Notepad, single spaces are added:

image.png

image1.png

Kind regards,
Stefan

Post by **asdf3sthdhawer53** » Mon Nov 22, 2021 3:07 pm

If you try German documents, you will see a lot of spaces in between characters inside of words. I get different results for each attempt, which can also change for the same scanned page depending on how many other pages are in the PDF.

I was going to attach a sample page, but when I delete the other pages in the PDF, the problem with the spaces went away, and instead I got some gibberish. I don't want to include the other pages in order to simulate the problem with the spaces, since there is personal data on them, so I won't share it.

Post by **Paul - Tracker Supp** » Mon Nov 22, 2021 5:03 pm

Hi asdf3sthdhawer53,

Do you think you could sanitize the documents? Maybe redact the sensitive information so that we could have a German document that reliably reproduces the issue?

Post by **rakunavi** » Fri Feb 25, 2022 4:40 am

=== UPDATE ===================================================================
The issue reported below has been resolved in Ver 10.1.0 build 380.
I appreciate all the hard work and efforts of the support and development team.
==============================================================================

Hello all,

When you use PDF-XChange Editor Plus to recognize Japanese documents by OCR, extra spaces will be recognized. Japanese, along with Chinese and Thai, is one of the few languages in the world that does not allow spaces between words. Therefore, if extra spaces are included in the recognition result, you will not get the expected result when you search. Is there any setting to prevent the recognition of extra spaces?

After initializing all settings by executing "Reset Settings" from the "Manage Settings" menu, I changed "Type" to "Searchable Image" in the output options of the "OCR Pages (Enhanced)" dialog and executed OCR. Along with the original PDF file, the output PDF file and text file are attached.

Original PDF file ("SAMPLE.pdf" in the ZIP archive file)
Output PDF file ("OCR_JPN.pdf" in the ZIP archive file)
Output text file ("OCR_JPN.txt" in the ZIP archive file)

Furthermore, in the OCR dialog, you can select "Japanese (Modern)" in addition to the default "Japanese" in the "Languages", but what is the difference between the two? Please refer to "dialog.png" in the attached ZIP archive file.

Thank you for taking the time to read this message.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version:9.2 build 359.0
- OS Version: Windows 10 Home 21H2 Build 19044.1526
- PC Model: Lenovo IdeaPad C340-15IWL / HP ProDesk 600G1

AttachedFiles.zip
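
For builds that predate the fix mentioned in the update note, the exported text file (not the searchable PDF itself) can be cleaned up afterwards. The following is a rough sketch under stated assumptions: the function name `strip_cjk_spaces` and the chosen Unicode ranges are illustrative, not something provided by PDF-XChange Editor or ABBYY, and the ranges cover only common Japanese text (kana, CJK ideographs, CJK punctuation, full-width forms).

```python
import re

# Assumed character ranges: CJK punctuation, Hiragana, Katakana,
# CJK Unified Ideographs, and half-width/full-width forms.
_CJK = r"\u3001-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uFF00-\uFFEF"

# Match spaces, tabs or ideographic spaces that sit between two CJK characters.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace that appears between two CJK characters.

    Spaces next to Latin letters or digits are left untouched, because
    they may be legitimate word separators.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)
```

This does not make the OCR text layer inside the PDF searchable. It only helps when the recognized text is exported and processed outside the editor.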

Post by **TrackerSupp-Daniel** » Fri Feb 25, 2022 5:30 pm

Hello, rakunavi

Thank you for the sample. I see what you mean here, and have created a ticket for the issue:
RT#5967: Bug - EOCR adds extra spaces in some languages

While I cannot offer a timeline or guarantee of a fix at the moment, I can promise you that the Dev team will take a look when they get the chance.

Regarding the difference between Japanese and Japanese (Modern): to be quite honest, I have no clue. The languages are provided to us by the creators of the EOCR engine (ABBYY), and we do not have any Japanese language specialists on hand to perform a comparison. I would hazard a guess that the modern variant has dictionary allowances for more common modern terms that traditional Japanese does not recognize.

Kind regards,

Post by **rakunavi** » Fri Feb 25, 2022 10:23 pm

Hi Daniel,

Thank you for issuing the ticket. If you change the OCR engine from Enhanced (FineReader) to the default, unnecessary white space is almost eliminated, but you will be annoyed by the poor performance of the original OCR. As a user who chose the Plus version with high expectations for the performance of the ABBYY engine, I hope the day will come when I can use the OCR feature in PDF-XChange Editor. Until then, I will continue to rely on Acrobat X, which was released more than 10 years ago.

Also, thank you for your answer about the OCR language. Thanks to your comment, I found the following explanation of the difference between 'Japanese' and 'Japanese (Modern)' on the ABBYY website.

IMPROVED: Japanese OCR accuracy
A new OCR language was added: Japanese (Modern). Due to alphabet corrections and dictionary enhancements, this OCR language offers higher recognition accuracy of Japanese and supports recognition of documents that as well contain English words and Greek characters α, β, θ, π. We recommend using the 'Japanese (Modern)' OCR language instead of the originally available 'Japanese' OCR - the 'Japanese' OCR language will not be further optimized in the future.

https://support.abbyy.com/hc/en-us/articles/360017625500-ABBYY-FineReader-Engine-12-What-is-new

This comment is written to help Japanese people who are considering whether they should choose the Plus version or not. However, considering the excellent support team and the amazing development speed of the development team, I don't regret choosing the Plus version.

Thank you for your understanding.
I'm looking forward to future updates.

Best regards,
rakunavi

Post by **TrackerSupp-Daniel** » Fri Feb 25, 2022 10:27 pm

Hello, rakunavi

And thank you for your patience and understanding. That excerpt from ABBYY's webpage is also quite handy, thank you for sharing your findings!

Kind regards,

Post by **rakunavi** » Tue Jun 07, 2022 9:44 pm

Hello Forum and Tracker Team,

https://forum.pdf-xchange.com/viewtopic.php?p=158203#p158203
RT#5967: Bug - EOCR adds extra spaces in some languages

Regarding the above issue reported previously, I have attached the OCR results (PDF and TXT) as recognized by Acrobat X. OCR was performed on the SAMPLE.PDF included in the attachment to the above posting. The default encoding for text output in Acrobat X is Shift-JIS, but I have included UTF-8 encoded text too. This result is normal from a Japanese speaker's point of view. I thought it would be better to have normal results as well as abnormal results, so that I could more accurately convey the situation.

OCR_AcrobatX.zip

Hoping that the above information will be of some help to you.
Thank you so much for your continued support.

Best regards,
rakunavi

Post by **TrackerSupp-Daniel** » Wed Jun 08, 2022 5:07 pm

Hello, rakunavi

The Shift-JIS version (the one you said was created by Adobe) looks like complete gibberish in Notepad on this end:

image.png

Meanwhile, the UTF-8 version, created by us, appears to be fully formed and functional.

image1.png

When using a more advanced application (Notepad++) which can handle both encoding systems, the two files appear to be identical, with no additional spacing or anything else of the like:

notepad++_b3mMEzbJhO.gif

I am unsure what problem you are trying to convey here, but, so far as I can tell, there is no difference, and no extra spaces in our file.
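
The same check can be made without relying on an editor's encoding detection. This is a small sketch, assuming the Shift-JIS file was written on Windows (so Python's `cp932` codec is used, which is an assumption about the file) and that only line endings and an optional UTF-8 byte order mark may differ. The helper name `decode_normalized` is illustrative.

```python
def decode_normalized(data: bytes, encoding: str) -> str:
    """Decode bytes with an explicit encoding and normalize line endings."""
    text = data.decode(encoding)
    return text.replace("\r\n", "\n")


def same_text(a: bytes, enc_a: str, b: bytes, enc_b: str) -> bool:
    """Return True if both byte strings decode to the same text."""
    return decode_normalized(a, enc_a) == decode_normalized(b, enc_b)


# Usage with the two attached text files (not run here):
# from pathlib import Path
# sjis = Path("OCR_TXT_JPN_Shift-JIS.txt").read_bytes()
# utf8 = Path("OCR_TXT_JPN_UTF-8.txt").read_bytes()
# print(same_text(sjis, "cp932", utf8, "utf-8-sig"))
```

The `utf-8-sig` codec accepts UTF-8 input with or without a byte order mark, so it is a reasonably safe choice when the origin of the file is unknown.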

Kind regards,

Post by **rakunavi** » Wed Jun 08, 2022 11:14 pm

Hi Daniel,

The purpose of the previous post was to have you compare "OCR_JPN.pdf" and "OCR_AcrobatX.pdf".

Attached file contents of the post that reported the issue (AttachedFiles.zip) - the problematic situation -
https://forum.pdf-xchange.com/viewtopic.php?p=158203#p158203
- dialog.png
- OCR_JPN.pdf
- OCR_JPN.txt
- SAMPLE.pdf

Attached file contents of the most recent post (OCR_AcrobatX.zip) - the normal situation -
https://forum.pdf-xchange.com/viewtopic.php?p=161034#p161034
- OCR_AcrobatX.pdf
- OCR_TXT_JPN_Shift-JIS.txt
- OCR_TXT_JPN_UTF-8.txt

The text files were included for reference, but since the default encoding of Acrobat X is Shift-JIS, I first output the file as it is, and then included the UTF-8 file. Since the environment of the person viewing the file is unknown, I have attached the text in multiple encodings so that he/she can understand it as much as possible. Therefore, I did not intend for "OCR_TXT_JPN_Shift-JIS.txt" and "OCR_TXT_JPN_UTF-8.txt" to be compared with each other, as their content is exactly the same.

Based on our exchange in another topic the other day, I thought it would be more complete for the report if I showed the problematic situation and the normal situation at the same time. In this regard, since this topic attached only the file in which the problem was occurring, I have additionally attached a set of files in the normal case.

https://forum.pdf-xchange.com/viewtopic.php?p=160976#p160976

Please understand that my posted messages are a bit verbose because we Japanese are always concerned about the risk that what we think the other party will naturally understand may not be conveyed to them as expected due to differences in the OS installation language environment, etc. When I try to write in short, sophisticated sentences like native speakers do, I feel frustrated because I am unable to convey my thoughts accurately.

Thank you for taking the time to read this message.

Best regards,
rakunavi

Post by **TrackerSupp-Daniel** » Thu Jun 09, 2022 5:22 pm

Hello, rakunavi

Understood, and sorry for the confusion. I originally thought that the new files were meant to be compared with each other, not against the previous files. In any case, this comparison adds little to the original report, as it was already established that the editor was adding extra spaces, and a ticket showcasing that difference had already been made.

I appreciate the diligence, but you do not need to offer so much additional information each time you are looking for an update on the progress of a ticket. It has been 3 months since this was created, so you are well within your rights to simply ask "has there been any new progress here?"

At the moment, I am afraid that this has not been resolved, but it is assigned to a developer now. Hopefully we will see this resolved sometime in the next few releases.

Kind regards,

Post by **rakunavi** » Thu Jun 09, 2022 10:47 pm

Hi Daniel,

Thank you for taking the time to reply.

In my case, I have no particular desire to check my progress. Since posting only status checks will only bother you busy Trackers, I'm just praying to God. I think it is proper for users to be secretly happy or sad when they see the content of updates that are made approximately once every three months or so.

I am more inclined to get all the relevant information out and organized for the benefit of support personnel, developers, users who have already purchased, forum participants who are considering purchasing, and myself. Most recently, I have been relating information related to the Japanese language. This is partly to convey information that I wanted to know before I purchased the product to the current forum participants who have not yet purchased the product.

When I am actively using an application with this in mind, I often notice the tiniest bit of application behavior oddities. If I used the application in a passive manner, content with simply checking the progress, I might not notice even the slightest change in behavior. The most recent bug report was also discovered while reviewing the contents of a reply to this very topic.

https://forum.pdf-xchange.com/viewtopic.php?t=38929

In addition, I try to tie related topics together as much as possible, as I often find cases of reference or unexpected discoveries when related topics are mentioned while searching the logs. As for the following recent topics, I can get the whole situation from the first post of koide_r two and a half years ago, but like koide_r, I would not have changed anything if I had simply checked the progress of the case. But it's up to Tracker to decide what to do with this one as well, and the rest is up to God. For my part, I feel better now that I have given all the information I know.

https://forum.pdf-xchange.com/viewtopic.php?t=33717
https://forum.pdf-xchange.com/viewtopic.php?t=38554

Even after my proposal is rejected, if I do my best to convey my enthusiasm, it may reach the developer before I know it, as in the following topic.

https://forum.pdf-xchange.com/viewtopic.php?p=157988#p157988

I think that a less-than-enthusiastic post can only get a less-than-enthusiastic response. I will continue to focus on information that seems to be of high importance, such as bug reports, and will continue to convey information with a lot of passion. As long as there are various forum participants and I don't know how things will turn out, I will pack in as much information as I can. I understand that this may be a nuisance to Trackers, but I appreciate your understanding.

Best regards,
rakunavi

Post by **TrackerSupp-Daniel** » Thu Jun 09, 2022 11:19 pm

Hello, rakunavi

We do appreciate your attention to detail and well-formatted reports; they are usually very helpful. I simply wanted to give you an update in case that was what you were looking for here. I am sorry that it is not yet the news you are looking for, but at the very least, you can be happy that it is not something which was rejected.

In any case, have an excellent day!

Kind regards,

Post by **rakunavi** » Thu Jun 29, 2023 10:09 pm

Hi Daniel,

TrackerSupp-Daniel wrote: ↑Fri Feb 25, 2022 5:30 pm RT#5967: Bug - EOCR adds extra spaces in some languages

The title of this ticket might lead one to assume that the problem is limited to EOCR (Fine Reader OCR Engine), but in fact it also occurs with the default Tesseract OCR engine.

CapturedVideoWithSampleFile.zip

Japanese character recognition has been practically non-functional for a long time. Plus edition users will be especially disappointed, as they purchase the Plus edition expecting the OCR feature to work. At the very least, any user who reads the product description by the Japanese distributor and buys the product should expect that it can recognize Japanese characters. I strongly hope that the problem will be resolved as soon as possible.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version: 10.0.1 build 371.0
- OS Version: Windows 11 Home 22H2 Build 22621.1848
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx

Post by **TrackerSupp-Daniel** » Thu Jun 29, 2023 11:24 pm

Hello, rakunavi

Thankfully, the two are linked to enough of a degree that both should be fixed at the same time. Do not worry, the team is aware that the same happens without the ABBYY OCR engine specifically being active. Unfortunately, after speaking with the team, it seems that this issue is proving to be quite difficult to resolve, and will still take quite some time... Allow me to extend our apologies for this.

Kind regards,

Post by **rakunavi** » Mon Jul 03, 2023 1:57 am

Hi Daniel, thank you for your reply.

As for the Tesseract engine, I understand that I have no choice but to wait for improvement of the Tesseract engine itself, especially in non-space delimited languages, since many problems have been reported regarding the recognition of extra spaces.

However, as for the FineReader engine, ABBYY is originally a Russian company and its development history is completely different from that of the Tesseract engine. Third-party PDF applications using the same version 12 FineReader engine as EOCR in PDF-XChange Editor Plus do not seem to have the same problem. Therefore, I personally think that there is a good chance that the problem can be solved by improvements on the PDF-XChange Editor side.

The following verification shows text recognition results with the latest version of Foxit PDF Editor Pro 12.0.1.12430 using the FineReader version 12 engine. The version of FREngine.dll is 12.4.7.63 in PDF-XChange Editor Plus, which is almost the same as the 12.5.6.0 in Foxit PDF Editor Pro.

Foxit_FREngine.png

The same sample file "SAMPLE.pdf" is used in the verification as previously reported. I installed Foxit PDF Editor Pro on both English and Japanese versions of Windows 10, and saved the text recognition results in the video. Both Windows installations are the original versions, with no language packs applied. In both cases, please check that little or no extra white space is recognized. As you can see, the operation is exactly the same in both videos: the text is simply recognized and embedded as transparent text, and no special operation is performed.

CapturedVideoWithSampleFiles.zip

The above video may not seem worth watching at first glance, but when the PDF files ("FOXIT.pdf" and "FOXIT_JPN.pdf") recognized by Foxit PDF Editor Pro are opened in PDF-XChange Editor, only "FOXIT_JPN.pdf", which was recognized by Foxit PDF Editor Pro on the Japanese version of Windows 10, is rendered unusually slowly. The following verification video shows the files in the following order on Japanese Windows 11 with the English language pack applied.

FOXIT.pdf in PDF-XChange Editor
FOXIT_JPN.pdf in PDF-XChange Editor
FOXIT_JPN.pdf in Microsoft Edge built-in PDF viewer
FOXIT_JPN.pdf in Acrobat Reader

Please note that the rendering speed is much worse only when "FOXIT_JPN.pdf" is displayed in PDF-XChange Editor.

FontCountsComparison.png

The font information in the document properties shows that the number of fonts in "FOXIT.pdf" is 4, while the number of fonts in "FOXIT_JPN.pdf" is 5004, which is unusually large, and I assume that this might be the cause. Of course, this is not something that should be reported to Tracker Software, since the cause of the large difference in font count is in Foxit PDF Editor Pro. However, the built-in PDF viewer in Microsoft Edge displays the PDF without any problem, and Acrobat Reader also displays the PDF much faster than the PDF-XChange Editor, although there is a little bit of lag.

SlowRendering.zip

The above verification was done using the trial version of Foxit PDF Editor Pro, so there is a slight possibility that the restriction of an abnormally large number of fonts is intentionally imposed as a limitation similar to the watermark in the trial version of PDF-XChange Editor. As I do not have a licensed version of Foxit PDF Editor Pro, I cannot verify this further, but I report this for your reference. The last mentioned problem is not related to the content of this topic, but there might possibly be room for improvement even on the PDF-XChange Editor side.

Thank you for taking the time to read this message.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version: 10.0.1 build 371.0
- OS Version: Windows 11 Home 22H2 Build 22621.1848
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx

Post by **TrackerSupp-Daniel** » Mon Jul 03, 2023 6:16 pm

Hello, rakunavi

Thank you for the slow rendering sample, I have created a ticket on that front for you:
RT#6504: Slow rendering of files with large font quantity.

As for the longstanding issue with extra whitespace, we do recognize that this is an issue within our software. Unfortunately it is, as I have mentioned before, a very complicated issue, and not many of our developers are familiar enough with the systems involved to make changes for it. This even more limited pool of resources than usual, combined with the complexity of the issue and the language barrier that interferes with troubleshooting, means it is likely going to be a more longstanding issue than I had originally hoped.

Kind regards,

Post by **rakunavi** » Mon Jul 03, 2023 11:45 pm

Hi Daniel, thank you for creating the ticket.

Each time I report a new bug, I report it with mixed feelings, concerned that it will further lower your already low priority on this issue. However, I will continue to wait and see with a glimmer of hope. Until then, I just hope that my current Acrobat X keeps working. Please give my regards to the developer.

Best regards,
rakunavi

Post by **TrackerSupp-Daniel** » Tue Jul 04, 2023 3:13 pm

Hello, rakunavi

That is understandable. Development is very much a balancing act, and some bugs may be more prominent, or easier to resolve, than others. I do hope we can resolve this in a shorter timeframe than it has been so far.

Kind regards,
