extra white spaces from OCR of full justified text on scanned pages
Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan
-
- User
- Posts: 33
- Joined: Wed Feb 03, 2021 10:21 am
extra white spaces from OCR of full justified text on scanned pages
When I scan pages that are fully justified, so that some lines have words with big spaces in between, and then run OCR, the OCR (Enhanced) result contains multiple line breaks and white spaces between words. So if I copy and paste the OCR text, I have to go and manually delete all the extra spaces and line breaks between the words.
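Until the OCR output itself improves, the manual cleanup described above can be scripted on the pasted text. The following is only a minimal sketch, not a PDF-XChange feature: the function name `collapse_ocr_whitespace` and the `keep_paragraphs` flag are hypothetical, and it assumes that a blank line marks a real paragraph break while single line breaks and runs of spaces are OCR noise.

```python
import re


def collapse_ocr_whitespace(text: str, keep_paragraphs: bool = True) -> str:
    """Squeeze the whitespace runs left behind when copying OCR text.

    Assumption: blank lines are genuine paragraph breaks; everything else
    (single newlines, tabs, repeated spaces) is treated as noise.
    """
    if not keep_paragraphs:
        # Flatten everything into one line of single-spaced words.
        return re.sub(r"\s+", " ", text).strip()
    # Split on blank lines, then clean each paragraph on its own.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "When   I  scan\npages  that are\n\n\nfully    justified"
    print(collapse_ocr_whitespace(sample))
```

If the OCR result also inserts spurious blank lines inside a paragraph, calling the function with `keep_paragraphs=False` flattens everything, at the cost of losing real paragraph breaks.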
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hi, asdf3sthdhawer53
Unfortunately, this is currently a limitation of OCR. In the future it may be able to detect justification and other paragraph settings, but I cannot make any promises or offer a timeline for this.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 33
- Joined: Wed Feb 03, 2021 10:21 am
Re: extra white spaces from OCR of full justified text on scanned pages
Is there an update for fixing the white spaces between characters?
Thanks
-
- Site Admin
- Posts: 17957
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: extra white spaces from OCR of full justified text on scanned pages
Hello asdf3sthdhawer53,
I just tested with the attached file, and it does seem that while the words are recognized separately, the spaces between them are not filled in with e.g. tabs or multiple "space" characters, so when this text is copied from inside the PDF to e.g. Notepad, single spaces are added.
Kind regards,
Stefan
-
- User
- Posts: 33
- Joined: Wed Feb 03, 2021 10:21 am
Re: extra white spaces from OCR of full justified text on scanned pages
If you try German documents, you will see a lot of spaces in between characters inside of words. I get different results for each attempt, which can also change for the same scanned page depending on how many other pages are in the PDF.
I was going to attach a sample page, but when I deleted the other pages in the PDF, the problem with the spaces went away, and instead I got some gibberish. I don't want to include the other pages in order to reproduce the problem with the spaces, since there is personal data on them, so I won't share it.
-
- Site Admin
- Posts: 6902
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: extra white spaces from OCR of full justified text on scanned pages
Hi asdf3sthdhawer53,
Do you think you could sanitize the documents? Maybe redact the sensitive information so that we could have a German document that reliably reproduces the issue?
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
=== UPDATE ===================================================================
The issue reported below has been resolved in Ver 10.1.0 build 380.
I appreciate all the hard work and efforts of the support and development team.
==============================================================================
Hello all,
When you use PDF-XChange Editor Plus to recognize Japanese documents by OCR, extra spaces will be recognized. Japanese, along with Chinese and Thai, is one of the few languages in the world that does not allow spaces between words. Therefore, if extra spaces are included in the recognition result, you will not get the expected result when you search. Is there any setting to prevent the recognition of extra spaces?
After initializing all settings by executing "Reset Settings" from the "Manage Settings" menu, I changed "Type" to "Searchable Image" in the output options of the "OCR Pages (Enhanced)" dialog and executed OCR. Along with the original PDF file, the output PDF file and text file are attached.
- Original PDF file ("SAMPLE.pdf" in the ZIP archive file)
- Output PDF file ("OCR_JPN.pdf" in the ZIP archive file)
- Output text file ("OCR_JPN.txt" in the ZIP archive file)
Thank you for taking the time to read this message.
Best regards,
rakunavi
- PDF-XChange Editor Plus Version:9.2 build 359.0
- OS Version: Windows 10 Home 21H2 Build 19044.1526
- PC Model: Lenovo IdeaPad C340-15IWL / HP ProDesk 600G1
Last edited by rakunavi on Thu Sep 07, 2023 11:35 am, edited 1 time in total.
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
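For readers on a build older than the fixed one mentioned in the update above, the extra spaces in Japanese output can at least be stripped from extracted text (for example the exported .txt file) before searching it. This is a rough sketch under stated assumptions, not something taken from the product: the function name `strip_cjk_spaces` is hypothetical, the Unicode ranges are a hand-picked approximation of "Japanese characters", and it only post-processes text, so it does not repair the text layer inside the PDF.

```python
import re

# Assumed character ranges: CJK punctuation, Hiragana, Katakana,
# CJK Unified Ideographs, and half/full-width forms. The ideographic
# space U+3000 is deliberately excluded so it is never used as context.
_CJK = r"\u3001-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Whitespace that has a CJK character directly before and after it.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces that sit between two CJK characters.

    Spaces next to Latin letters or digits are left alone, because they
    may be legitimate word separators in mixed-language text.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("日本 語 の 文章"))
    print(strip_cjk_spaces("PDF 編集 ソフト"))
```

Lookarounds are used so that the neighbouring characters are checked but not consumed, which lets consecutive single-character gaps be removed in one pass.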
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
Thank you for the sample. I see what you mean here, and have created a ticket for the issue:
RT#5967: Bug - EOCR adds extra spaces in some languages
While I cannot offer a timeline or guarantee of a fix at the moment, I can promise you that the Dev team will take a look when they get the chance.
Regarding the difference between Japanese and Japanese (modern): to be quite honest, I have no clue. The languages are provided to us by the creators of the EOCR engine (ABBYY), and we do not have any Japanese language specialists on hand to perform a comparison. I would hazard a guess that the modern variant has dictionary allowances for more common modern terms that traditional Japanese does not recognize.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel,
Thank you for issuing the ticket. If you change the OCR engine from Enhanced (FineReader) to the default, unnecessary white space is almost eliminated, but you will be annoyed by the poor performance of the original OCR. As a user who chose the Plus version with high expectations for the performance of the ABBYY engine, I hope the day will come when I can use the OCR feature in PDF-XChange Editor. Until then, I will continue to rely on Acrobat X, which was released more than 10 years ago.
This comment is written to help Japanese people who are considering whether they should choose the Plus version or not. However, considering the excellent support team and the amazing development speed of the development team, I don't regret choosing the Plus version.
Also, thank you for your answer about the OCR language. Thanks to your comment, I found the following explanation of the difference between 'Japanese' and 'Japanese (Modern)' on the ABBYY website.
IMPROVED: Japanese OCR accuracy
A new OCR language was added: Japanese (Modern). Due to alphabet corrections and dictionary enhancements, this OCR language offers higher recognition accuracy of Japanese and supports recognition of documents that as well contain English words and Greek characters α, β, θ, π. We recommend using the 'Japanese (Modern)' OCR language instead of the originally available 'Japanese' OCR - the 'Japanese' OCR language will not be further optimized in the future.
https://support.abbyy.com/hc/en-us/articles/360017625500-ABBYY-FineReader-Engine-12-What-is-new
Thank you for your understanding.
I'm looking forward to future updates.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
And thank you for your patience and understanding. That excerpt from ABBYY's webpage is also quite handy, thank you for sharing your findings!
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hello Forum and Tracker Team,
- https://forum.pdf-xchange.com/viewtopic.php?p=158203#p158203
RT#5967: Bug - EOCR adds extra spaces in some languages
Thank you so much for your continued support.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
The Shift-JIS version (the one you said was created by Adobe) looks like complete gibberish in Notepad on this end. Meanwhile, the UTF-8 version, created by us, appears to be fully formed and functional. When using a more advanced application (Notepad++), which can handle both encodings, the two files appear to be identical, with no additional spacing or anything else of the like. I am unsure what problem you are trying to convey here but, so far as I can tell, there is no difference, and no extra spaces in our file.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
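The Notepad result described above is presumably an encoding mismatch rather than a difference in content: an editor that assumes a Western code page will render Shift-JIS bytes as gibberish. A small script can confirm whether two exported text files hold the same characters regardless of encoding. This is a minimal sketch with hypothetical helper names (`decode_japanese`, `same_text`); it assumes each file is either UTF-8 (with or without a BOM) or Shift-JIS, for which Python's `cp932` codec (the Windows superset of Shift-JIS) is used.

```python
from pathlib import Path


def decode_japanese(raw: bytes) -> str:
    """Decode bytes assumed to be either UTF-8 or Shift-JIS (cp932) text."""
    # UTF-8 is tried first: Shift-JIS bytes for Japanese text are very
    # unlikely to form valid UTF-8, so a successful decode is a good sign.
    # "utf-8-sig" also drops a leading byte order mark if one is present.
    for encoding in ("utf-8-sig", "cp932"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("input is neither valid UTF-8 nor valid cp932")


def same_text(path_a: str, path_b: str) -> bool:
    """Return True if both files contain the same text after decoding."""
    a = decode_japanese(Path(path_a).read_bytes())
    b = decode_japanese(Path(path_b).read_bytes())
    # Ignore Windows vs Unix line endings when comparing.
    return a.replace("\r\n", "\n") == b.replace("\r\n", "\n")


if __name__ == "__main__":
    sample = "日本語のテキスト"
    # The same characters survive a round trip through either encoding.
    print(decode_japanese(sample.encode("cp932")) == sample)
    print(decode_japanese(sample.encode("utf-8")) == sample)
```

Comparing decoded strings instead of raw bytes is the point here: two files can differ byte for byte and still contain identical text.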
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel,
The purpose of the previous post was to have you compare "OCR_JPN.pdf" and "OCR_AcrobatX.pdf".
- Attached file contents of the post that reported the issue (AttachedFiles.zip) - the problematic situation -
  https://forum.pdf-xchange.com/viewtopic.php?p=158203#p158203
  - dialog.png
  - OCR_JPN.pdf
  - OCR_JPN.txt
  - SAMPLE.pdf
- Attached file contents of the most recent post (OCR_AcrobatX.zip) - the normal situation -
  https://forum.pdf-xchange.com/viewtopic.php?p=161034#p161034
  - OCR_AcrobatX.pdf
  - OCR_TXT_JPN_Shift-JIS.txt
  - OCR_TXT_JPN_UTF-8.txt
Based on our exchange in another topic the other day, I thought it would be more complete for the report if I showed the problematic situation and the normal situation at the same time. In this regard, since this topic attached only the file in which the problem was occurring, I have additionally attached a set of files in the normal case.
https://forum.pdf-xchange.com/viewtopic.php?p=160976#p160976
Please understand that my posted messages are a bit verbose because we Japanese are always concerned about the risk that what we think the other party will naturally understand may not be conveyed to them as expected due to differences in the OS installation language environment, etc. When I try to write in short, sophisticated sentences like native speakers do, I feel frustrated because I am unable to convey my thoughts accurately.
Thank you for taking the time to read this message.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
Understood, and sorry for the confusion. I originally thought that the new files were there to be compared among themselves, not against the previous files. In any case, this comparison adds little to the original report, as it was already established that the editor was adding extra spaces, and a ticket showcasing that difference had already been made.
I appreciate the diligence, but you do not need to offer so much additional information each time you are looking for an update on the progress of a ticket. It has been 3 months since this was created; you are well within your rights to simply ask "has there been any new progress here?"
At the moment, I am afraid that this has not been resolved, but it is assigned to a developer now. Hopefully we will see this resolved sometime in the next few releases.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel,
Thank you for taking the time to reply.
In my case, I have no particular desire to check my progress. Since posting only status checks will only bother you busy Trackers, I'm just praying to God. I think it is proper for users to be secretly happy or sad when they see the content of updates that are made approximately once every three months or so.
I am more inclined to get all the relevant information out and organized for the benefit of support personnel, developers, users who have already purchased, forum participants who are considering purchasing, and myself. Most recently, I have been relating information related to the Japanese language. This is partly to convey information that I wanted to know before I purchased the product to the current forum participants who have not yet purchased the product.
When I am actively using an application with this in mind, I often notice even the tiniest oddities in its behavior. If I used the application in a passive manner, content with simply checking the progress, I might not notice even the slightest change in behavior. The most recent bug report was also discovered while reviewing the contents of a reply to this very topic.
https://forum.pdf-xchange.com/viewtopic.php?t=38929
In addition, I try to tie related topics together as much as possible, as I often find cases of reference or unexpected discoveries when related topics are mentioned while searching the logs. As for the following recent topics, I can get the whole situation from the first post of koide_r two and a half years ago, but like koide_r, I would not have changed anything if I had simply checked the progress of the case. But it's up to Tracker to decide what to do with this one as well, and the rest is up to God. For my part, I feel better now that I have given all the information I know.
https://forum.pdf-xchange.com/viewtopic.php?t=33717
https://forum.pdf-xchange.com/viewtopic.php?t=38554
Even after my proposal is rejected, if I do my best to convey my enthusiasm, it may reach the developer before I know it, as in the following topic.
https://forum.pdf-xchange.com/viewtopic.php?p=157988#p157988
I think that a less-than-enthusiastic post can only get a less-than-enthusiastic response. I will continue to focus on information that seems to be of high importance, such as bug reports, and will continue to convey information with a lot of passion. As long as there are various forum participants and I don't know how things will turn out, I will pack in as much information as I can. I understand that this may be a nuisance to Trackers, but I appreciate your understanding.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
We do appreciate your attention to detail and well-formatted reports; they are usually very helpful. I simply wanted to give you an update in case that was what you were looking for here. I am sorry that it is not yet the news you are looking for, but at the very least, you can be happy that it is not something which was rejected.
In any case, have an excellent day!
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel,
TrackerSupp-Daniel wrote: Fri Feb 25, 2022 5:30 pm
RT#5967: Bug - EOCR adds extra spaces in some languages
The title of this ticket might lead one to assume that the problem is limited to EOCR (FineReader OCR engine), but in fact it also occurs with the default Tesseract OCR engine.
Best regards,
rakunavi
- PDF-XChange Editor Plus Version: 10.0.1 build 371.0
- OS Version: Windows 11 Home 22H2 Build 22621.1848
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
Thankfully, the two are linked closely enough that both should be fixed at the same time. Do not worry, the team is aware that the same happens even when the ABBYY OCR engine is not active. Unfortunately, after speaking with the team, it seems that this issue is proving quite difficult to resolve and will still take quite some time... Allow me to extend our apologies for this.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel, thank you for your reply.
As for the Tesseract engine, I understand that I have no choice but to wait for improvements to the Tesseract engine itself, especially for non-space-delimited languages, since many problems have been reported regarding the recognition of extra spaces.
However, as for the FineReader engine, ABBYY is originally a Russian company, and its development history is completely different from that of the Tesseract engine. Third-party PDF applications using the same version 12 FineReader engine as EOCR in PDF-XChange Editor Plus do not seem to have the same problem. Therefore, I personally think there is a good chance that the problem can be solved by improvements on the PDF-XChange Editor side.
The following verification shows text recognition results from the latest version of Foxit PDF Editor Pro, 12.0.1.12430, which uses the FineReader version 12 engine. FREngine.dll is version 12.4.7.63 in PDF-XChange Editor Plus, which is almost the same as the 12.5.6.0 in Foxit PDF Editor Pro.
- FOXIT.pdf in PDF-XChange Editor
- FOXIT_JPN.pdf in PDF-XChange Editor
- FOXIT_JPN.pdf in Microsoft Edge built-in PDF viewer
- FOXIT_JPN.pdf in Acrobat Reader
Thank you for taking the time to read this message.
Best regards,
rakunavi
- PDF-XChange Editor Plus Version: 10.0.1 build 371.0
- OS Version: Windows 11 Home 22H2 Build 22621.1848
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
Thank you for the slow rendering sample. I have created a ticket on that front for you:
RT#6504: Slow rendering of files with large font quantity.
As for the longstanding issue with extra whitespace, we do recognize that this is an issue within our software. Unfortunately, as I have mentioned before, it is a very complicated issue, and not many of our developers are familiar enough with the systems involved to make changes for it. This even more limited pool of resources than usual, combined with the complexity of the issue and the language barrier that interferes with troubleshooting, means it is likely to remain an issue for longer than I had originally hoped.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 908
- Joined: Sat Sep 11, 2021 5:04 am
Re: extra white spaces from OCR of full justified text on scanned pages
Hi Daniel, thank you for creating the ticket.
Each time I report a new bug, I report it with mixed feelings, concerned that it will further lower your already low priority on this issue. However, I will continue to wait and see with a glimmer of hope. Until then, I just hope that my current Acrobat X keeps working. Please give my regards to the developer.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 8620
- Joined: Wed Jan 03, 2018 6:52 pm
Re: extra white spaces from OCR of full justified text on scanned pages
Hello, rakunavi
That is understandable. Development is very much a balancing act, and some bugs may be more prominent or easier to resolve than others. I do hope we can resolve this in a shorter timeframe than it has taken so far.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com