Multiple OCR Runs -> Duplicate Text Objects
I'm using v8.0.337.0 with the Enhanced OCR Plugin with the Output Type set to Searchable Image. I've noticed that multiple runs of the OCR feature add (instead of replace) the invisible text objects each time, which can be seen by viewing the Content pane. Is there an option available, or would development consider adding one if not, to have OCR remove all invisible text objects before running?
I realize that it might seem silly to run OCR multiple times, but this is handy when the original OCR was done by another program or device, such as a scanner with a poorer-quality OCR engine, or when I want to try different settings (such as Accuracy) using your engine.
- Anthony
- TrackerSupp-Daniel
- Site Admin
- Posts: 8613
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Multiple OCR Runs -> Duplicate Text Objects
Hi, ARM07470
You can set our OCR engine to ignore areas of the page that already contain text content (including invisible text); however, it is not possible to automatically remove that text before the new content is placed. The best way to achieve what you are looking for is to use the "Edit > text" tool, highlight the text on the page (or open the Content pane and use "Select text" to highlight the entire document's text), and delete it manually before running OCR.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Re: Multiple OCR Runs -> Duplicate Text Objects
The "ignore existing text on page" has no effect on the behavior I'm describing -- multiple successive runs of the OCR tool result in additional, duplicate, invisible text being added to the document every time. I can't quite figure out the point of this option in the first place. Why would one ever OCR something that is already stored as text?
Quick sidebar...the "Skip pages that already contain text" option is not mentioned in the documentation besides being shown in the screenshots. I was wondering if this only skipped pages with invisible text or if it also skipped pages with visible text and was hoping that the documentation would tell me. Through experimentation, I found that it skips pages with any text, visible or invisible.
I had already stumbled upon the workaround you described (deleting the invisible text before running OCR) and it works fine, but I'd still like to see an option added to do this for me automatically. If the developers are concerned that this might be too confusing to put on the dialog, then maybe it could be added to Preferences -> OCR Settings so it would be more out of the way. I'd call it "Remove invisible text before OCR" or something like that. By the way, this appears to be the default behavior of Acrobat, so you wouldn't be alone in implementing this functionality.
- Anthony
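For readers wondering what a "Remove invisible text before OCR" step would involve, here is a minimal sketch in Python. It is not how PDF-XChange Editor or Acrobat work internally. It assumes a page's content stream has already been parsed into (operands, operator) pairs, and it relies on one fact from the PDF specification: searchable-image OCR layers are normally drawn with text rendering mode 3 (operator Tr), which is invisible. The function name strip_invisible_text and the sample data are made up for illustration.

```python
from typing import List, Tuple

# One content-stream instruction: (operands, operator), e.g. ([3], "Tr").
Instruction = Tuple[list, str]

# PDF operators that actually paint text.
SHOW_TEXT = {"Tj", "TJ", "'", '"'}


def strip_invisible_text(instructions: List[Instruction]) -> List[Instruction]:
    """Return a copy of the stream without text shown in rendering mode 3."""
    render_mode = 0  # PDF default: fill glyphs (visible)
    saved = []       # rendering modes saved by q, restored by Q
    kept = []
    for operands, op in instructions:
        if op == "q":
            saved.append(render_mode)
        elif op == "Q" and saved:
            render_mode = saved.pop()
        elif op == "Tr":
            render_mode = operands[0]
        # Drop only the operators that paint text while the mode is invisible.
        # Limitation of this sketch: the dropped operator's text advance is
        # lost too, so later text positioned relative to it would shift.
        if op in SHOW_TEXT and render_mode == 3:
            continue
        kept.append((operands, op))
    return kept


# Hypothetical page: a scanned image plus two stacked OCR layers.
page = [
    ([], "q"),
    ([200, 0, 0, 300, 0, 0], "cm"),
    (["Im0"], "Do"),
    ([], "Q"),
    ([], "BT"), ([3], "Tr"), (["F1", 10], "Tf"), (["first OCR pass"], "Tj"), ([], "ET"),
    ([], "BT"), ([3], "Tr"), (["F1", 10], "Tf"), (["second OCR pass"], "Tj"), ([], "ET"),
]
cleaned = strip_invisible_text(page)
```

The scanned image (the Do instruction) survives and both hidden layers lose their text, which is the state a fresh OCR run would want to start from. A real implementation would also have to look inside Form XObjects and remove the now-empty BT/ET pairs.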
- TrackerSupp-Daniel
Re: Multiple OCR Runs -> Duplicate Text Objects
Hi, ARM07470
Could you please send us a copy of a document on which this is not working, along with a screenshot of your current OCR settings?
The Ignore Existing text function is designed for those who have a partially editable document and want to run OCR on it to pick up the missing text items. Say you just added a few images to your document, or scanned a few pages and added them in. It is handy to check that box and have every page processed: the engine ignores areas where text already exists and adds only the new items. The same applies to people like yourself, who run OCR multiple times.
For your second question: yes, Skip pages that already contain text will skip any page that contains any text content whatsoever. Even page numbers added with the Header/Footer tool will prevent OCR from running on a page.
As for the automated removal of text during OCR, I cannot make any promises, but I will pass it along to the Team and see what they think of it.
Kind regards,
Dan McIntyre - Support Technician
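The difference between the two options described in this reply can be modelled in a few lines of Python. This is only a sketch of the intended behaviour as Daniel explains it, not the engine's real logic, and the thread itself reports that actual results differed. The Page class, the rectangle format and the function regions_to_add are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Page:
    # Areas that already hold text, visible or invisible.
    existing_text: List[Rect] = field(default_factory=list)
    # Areas where the OCR engine would recognise text on this run.
    candidates: List[Rect] = field(default_factory=list)


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def regions_to_add(page: Page,
                   skip_pages_with_text: bool = False,
                   ignore_existing_text: bool = False) -> List[Rect]:
    """Return the regions for which new OCR text would be placed."""
    # "Skip pages that already contain text": any text at all skips the page.
    if skip_pages_with_text and page.existing_text:
        return []
    # "Ignore existing text": process the page, but leave covered areas alone.
    if ignore_existing_text:
        return [c for c in page.candidates
                if not any(overlaps(c, t) for t in page.existing_text)]
    # Neither option: everything recognised is added again.
    return list(page.candidates)


# Hypothetical page: a page number already exists as text, the body does not.
page_number = (0, 0, 100, 10)
body = (0, 20, 100, 200)
page = Page(existing_text=[page_number], candidates=[page_number, body])

added_default = regions_to_add(page)
added_skip = regions_to_add(page, skip_pages_with_text=True)
added_ignore = regions_to_add(page, ignore_existing_text=True)
```

In this model a single page number is enough for the skip option to leave the whole page untouched, while the ignore option still adds text for the body area only.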
Re: Multiple OCR Runs -> Duplicate Text Objects
I've attached a sample PDF and a screenshot of my OCR settings. Note that I ran OCR twice with these same settings; the resulting duplicate text content is highlighted at the left of the screenshot.
- Anthony
- Attachments
- 2X OCR Test.pdf (1.99 MiB)
- Dimitar - Tracker Supp
- Site Admin
- Posts: 1797
- Joined: Mon Jan 15, 2018 9:01 am
Re: Multiple OCR Runs -> Duplicate Text Objects
Hello Anthony,
Thank you for the provided file.
Could you please also provide us with the file as it was before the conversion?
Regards.
Re: Multiple OCR Runs -> Duplicate Text Objects
I didn't retain the original. It was created from scratch in PDF-XChange via the Scan function. If you remove everything but the Image element, you'll be back to where I started. Also, I don't think there is anything special about this document with regard to the behavior I'm describing. I've seen it with a variety of PDFs from different sources (Acrobat, MFP, etc.).
- Anthony
- Dimitar - Tracker Supp
Re: Multiple OCR Runs -> Duplicate Text Objects
Hello Anthony,
I was able to reproduce the problem at my end.
I forwarded this case to the developers of the EOCR for further investigation.
When there is a development on this issue, we will contact you.
Thank you for your report.
Regards.
Re: Multiple OCR Runs -> Duplicate Text Objects
Hello all,
I completely agree with Anthony's request for the ability to overwrite recognized OCR text without duplication when OCR is performed multiple times.
I have always felt that this is a major disadvantage of the OCR (EOCR) feature in PDF-XChange Editor when compared with Acrobat. Similar requests have been made on this forum from time to time and were ultimately turned down, partly for being too niche, but this functionality has been commonplace in Acrobat for over a decade.
My daily workflow is to scan and digitize paper documents, and when corrections to the original document are later discovered, I basically modify the image of the PDF base content to directly reflect those corrections.
Acrobat will automatically update the OCR text in the corrected area by simply performing OCR on all pages, even if the file has already been recognized by OCR. You can obtain OCR text without duplicates by simply performing OCR as usual without any special settings. However, with PDF-XChange Editor, you must first manually delete the previously recognized OCR text in the Contents pane, then select only the pages you want to correct and perform OCR. This is a rather tedious process and increases the risk of making mistakes if there are multiple areas to be corrected.
Hopefully someday developers will be more interested in improving this feature.
Best regards,
rakunavi
- PDF-XChange Editor Plus Version: 10.1.0 build 380
- OS Version: Windows 11 Home 22H2 Build 22621.2134
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx
P.S.
Although unrelated to this topic, the problem reported in the following topic, where unnecessary white space was recognized in OCR for Japanese and other languages, has been mostly resolved in build 380.
- extra white spaces from OCR of full justified text on scanned pages
viewtopic.php?p=158203#p158203
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
- Dimitar - Tracker Supp
Re: Multiple OCR Runs -> Duplicate Text Objects
Thanks for the feedback.
We are constantly improving our products and we will continue to do so.
Regards.
- MedBooster
- User
- Posts: 1069
- Joined: Mon Nov 15, 2021 8:38 pm
Re: Multiple OCR Runs -> Duplicate Text Objects
I support this! It would be nice if PDF-XCE could overwrite existing OCR results, since the old results for certain pages can become inaccurate once changes have been made to those pages.
Wishlist
Bookmarks with page numbers
Optional fixed small icon size in the toolbar
Shift to UNLOCK aspect ratio
Allow more "toolbars" to the title bar
AltGr issues with character input and keyboard shortcuts
- TrackerSupp-Daniel
Re: Multiple OCR Runs -> Duplicate Text Objects
Hello, MedBooster
We do overwrite the OCR content when performing an Editable text OCR process with the new Enhanced OCR engine. We will be looking into offering options for this with "searchable text" in the future, but I cannot make a promise for implementation at this time.
Kind regards,
Dan McIntyre - Support Technician