Enhanced OCR
Enhanced OCR
Hi,
The Enhanced OCR:
1. does not recognize the currency symbol € (Euro), which appears on *every* invoice in Europe.
2. rarely recognizes the uppercase German umlaut "Ö".
3. has trouble with the digit "1". Europeans always write "1" with an upstroke, but the engine often confuses "1" with a slash (/), a period, an uppercase "I", a lowercase "l", and so on.
4. seems to be hindered by the German dictionary. The original word was "öffentlich", but the OCR turned it into "ordentlich", a completely different word with different characters.
Please improve
- Paul - Tracker Supp
- Site Admin
- Posts: 6900
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: Enhanced OCR
Hi Markus, and welcome to the Tracker Forums.
Thank you for that post. We would indeed like to investigate this. Could you please send us your document for testing? If it is too large for the forum, you can email it to support@pdf-xchange.com.
If it is too large even for email, you can upload it to https://useruploads.tracker-software.support/
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: Enhanced OCR
Just to add to Markus' comments, as shown below, I am also getting pretty poor OCR results even with the latest "Enhanced" version.
I have also attached the same file as a PDF for your team to test on your end.
BEFORE OCR:
AFTER OCR:
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hello Substorm,
Thank you for the examples. Looking this over, it seems that roughly 95% of the text came out correctly, and after some investigation and testing I found a few reasons for the remaining errors.
First and foremost, the quality of the image. Optical Character Recognition (OCR) relies heavily on image quality and operates best in the 300-600 dpi range (which most modern scanners can produce). This image is only 143 dpi; with that considered, these results are phenomenal.
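The resolution figure can be checked by hand when the physical page size is known. Below is a minimal Python sketch of that arithmetic. The function names and the 1216-pixel, 8.5-inch example are hypothetical, chosen only to give a value close to the 143 dpi mentioned above; they are not taken from the attached file. Note that resampling a low-resolution image does not restore lost detail, so the upscale factor is only an indication of how far the scan is from the recommended range.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Return the scan resolution in dots per inch for a page of known physical width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def upscale_factor(dpi: float, target_dpi: float = 300.0) -> float:
    """Factor needed to reach target_dpi by resampling (1.0 if already at or above it)."""
    return max(1.0, target_dpi / dpi)


# Hypothetical example: a page 8.5 inches wide stored as 1216 pixels across.
dpi = effective_dpi(1216, 8.5)
print(round(dpi))                     # 143
print(round(upscale_factor(dpi), 2))  # 2.1
```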
The second reason is the accuracy setting. The higher the accuracy setting, the more likely it is that artifacts and errors will appear in "imperfect" documents. For a document below the optimal dpi range, you will almost always want to use the "low accuracy" mode when performing OCR to achieve the best possible results. Doing this did further improve the character recognition, leaving only one very minor mistake: the $ in the first column was read as a capital S. As you can see, the remainder of the text does indeed match the original.
Finally, regarding the title bar (Apple, book, big, etc.): the Editor currently only fully supports black-on-white text during scanning (other combinations can work in many cases, but are not yet fully supported). This is why much of that row was missed, and why the portions that were recognized changed slightly in appearance. Once again, a higher-quality image would improve this situation significantly.
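For anyone who can preprocess the image before running OCR, one possible workaround for light-on-dark regions is to invert them first so the text becomes dark on a light background. The sketch below is an assumption about how that could be done outside the Editor, not a description of anything the Editor or its OCR engine does; the function names, the threshold of 128, and the tiny sample region are all hypothetical.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def is_light_on_dark(pixels: Gray, threshold: int = 128) -> bool:
    """Guess polarity: if most pixels are dark, the background is probably dark."""
    flat = [v for row in pixels for v in row]
    dark = sum(1 for v in flat if v < threshold)
    return dark > len(flat) / 2


def to_black_on_white(pixels: Gray) -> Gray:
    """Invert the region when it looks like light text on a dark background."""
    if is_light_on_dark(pixels):
        return [[255 - v for v in row] for row in pixels]
    return [row[:] for row in pixels]  # already dark on light: return a copy


# Hypothetical 3x4 region: a white stroke (255) on a black header row (0).
header = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
print(to_black_on_white(header)[1])  # [255, 0, 0, 255]
```

This only handles grayscale polarity. Colored combinations such as white on red would need a conversion to grayscale first, and low contrast would still be a problem.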
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Re: Enhanced OCR
Hi Daniel.
Sorry, I don't think I used the right word when I said "poor". Instead, I should have said that there is some room for improvement, especially with symbols like $ and colored backgrounds. Compared to the old version of your OCR, this enhanced release is definitely a big jump forward. I hope to see it take the podium one day by beating ABBYY.
Thanks!
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hello substorm,
We too hope that we can see that kind of improvement in the future. There are always teething troubles with new features like this, and we are certainly doing our best to resolve them as we go. We highly appreciate feedback like this, and will certainly make use of this file for future testing as we work on those features.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
- Timur Born
- User
- Posts: 874
- Joined: Tue Jun 26, 2012 1:50 pm
Re: Enhanced OCR
TrackerSupp-Daniel wrote: ↑Wed May 29, 2019 5:22 pm
Finally, regarding the title bar (Apple, book, big, etc.), Note that the Editor currently only fully supports Black on White text during scanning (others can work in many cases, but are not yet fully supported).
Are there any plans to support different combinations of text and background color anytime soon? White text on a black background in particular should really be possible.
Last edited by Timur Born on Tue Feb 11, 2020 7:07 pm, edited 1 time in total.
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hello Timur,
We do not have anything that I can guarantee coming soon, but it is something that we are aiming for in the future.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
- Timur Born
- User
- Posts: 874
- Joined: Tue Jun 26, 2012 1:50 pm
Re: Enhanced OCR
I did some testing. With enough contrast, the Editor's OCR is able to (partly) detect black text on a colored background, such as black on red. But it cannot compete with solutions like Abby Fineprint's OCR, which can additionally detect combinations such as white on black, red on black, or white on red.
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hello Timur,
Thank you for the input. As I mentioned earlier, this is not something that we claim to accomplish perfectly yet, and I cannot guarantee when it will be fully supported, but it is something we are aiming for in the future.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
Re: Enhanced OCR
Hi,
The Enhanced OCR plugin has problems not only with the German Ä, Ö, or Ü and the Euro sign (€). For me, the main problem is its difficulty distinguishing images from letters. It would therefore be great to have an "OCR proof mode" similar to the spellchecker, in which the user can decide in complicated cases which letters are recognized and whether something is an image or a letter/word. This feature is well known from most OCR tools, such as Abby and Nuance, and also PhantomPdf. My proposal for improving the "Enhanced OCR plugin" is to integrate such an "OCR proof mode". The German umlauts can probably be caught with the spell checker, but it would be great if the OCR were able to recognize them.
Cheers
Josef
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hi, jworms
This would depend on the settings you have selected, and the quality of your documents. On clean documents, using Medium accuracy and having the German language selected, these characters should be properly detectable.
With that said, there are also times when the output is less than ideal. We are aware of these and are working closely with LeadTools (the creators of our EOCR plugin) on improving this.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
Re: Enhanced OCR
Hi,
I'm aware that OCR improves with better scan resolution, but if you receive a PDF by email, you have no influence over the scan properties. Nevertheless, a text-based version of such a PDF file is often wanted and, as already mentioned, in many cases recognition turns words into completely different words. It would therefore help to be able to assist the OCR engine in difficult cases, as described in my post.
Kind regards
Josef
- TrackerSupp-Daniel
- Site Admin
- Posts: 8600
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR
Hi, jworms
That is understood. Unfortunately the Enhanced OCR is an engine created by a third party (LeadTools) and currently does not allow for direct interfacing or "correction" while running. While I do not believe it is likely they will decide to implement this in the near future, it may come eventually. I simply have no sway on the matter as they are a separate company.
For now, the only way to assist the OCR when it is struggling would be to edit the text after processing is complete.
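Some of that after-the-fact editing can be scripted once the recognized text has been exported. The Python sketch below is one hypothetical approach and is not part of the Editor or the LeadTools engine: it replaces characters that are commonly mistaken for digits, but only inside tokens that already contain a digit, so ordinary words are left alone. The look-alike table and the sample string are illustrative assumptions. The slash is deliberately excluded because it is legitimate in dates and fractions.

```python
import re

# Characters that are often confused with digits (illustrative, not exhaustive).
CONFUSABLE = str.maketrans({"l": "1", "I": "1", "|": "1", "O": "0", "o": "0"})

# A token counts as numeric if it contains at least one digit and otherwise
# only look-alike letters and number punctuation.
NUMERIC_TOKEN = re.compile(r"^(?=.*\d)[\dlI|Oo.,]+$")


def fix_numeric_tokens(text: str) -> str:
    """Replace digit look-alikes, but only inside tokens that already look numeric."""
    fixed = []
    for token in text.split(" "):
        if NUMERIC_TOKEN.match(token):
            token = token.translate(CONFUSABLE)
        fixed.append(token)
    return " ".join(fixed)


print(fix_numeric_tokens("Total: l2,5O EUR Invoice I07 Ihr Konto"))
# Total: 12,50 EUR Invoice 107 Ihr Konto
```

A rule like this cannot repair dictionary substitutions such as "öffentlich" becoming "ordentlich"; those still need a manual review of the text.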
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD