Multiple OCR Runs -> Duplicate Text Objects

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan

ARM07470
User
Posts: 4
Joined: Thu Apr 16, 2020 7:38 pm

Multiple OCR Runs -> Duplicate Text Objects

Post by ARM07470 »

I'm using v8.0.337.0 with the Enhanced OCR Plugin with the Output Type set to Searchable Image. I've noticed that multiple runs of the OCR feature add (instead of replace) the invisible text objects each time, which can be seen by viewing the Content pane. Is there an option available, or would development consider adding one if not, to have OCR remove all invisible text objects before running?

I realize that it might seem silly to run OCR multiple times, but it is handy when the original OCR was done by another program or device, such as a scanner with a poorer-quality OCR engine, or when I want to try different settings (such as Accuracy) with your engine.

- Anthony
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by TrackerSupp-Daniel »

Hi, ARM07470

You can set our OCR engine to ignore areas of the page which already contain text content (including invisible text); however, it is not possible to automatically remove that text prior to placing the new content.
PDFXEdit_yIivAvLcyZ.png
The best way to achieve what you are looking for is to use the "Edit > text" tool, highlight the text on the page (or enter the content pane and use "Select text" to highlight the entire document's text), and manually delete it before running OCR.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
ARM07470
User
Posts: 4
Joined: Thu Apr 16, 2020 7:38 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by ARM07470 »

The "ignore existing text on page" option has no effect on the behavior I'm describing -- multiple successive runs of the OCR tool result in additional, duplicate, invisible text being added to the document every time. I can't quite figure out the point of this option in the first place. Why would one ever OCR something that is already stored as text?

Quick sidebar...the "Skip pages that already contain text" option is not mentioned in the documentation besides being shown in the screenshots. I was wondering if this only skipped pages with invisible text or if it also skipped pages with visible text and was hoping that the documentation would tell me. Through experimentation, I found that it skips pages with any text, visible or invisible.

I already stumbled upon the workaround you described (deleting the invisible text before running OCR), and it works fine, but I'd still like to see an option added to do this for me automatically. If the developers are concerned that this might be too confusing to put on the dialog, then maybe it could be added to Preferences -> OCR Settings so it would be more out of the way. I'd call it "Remove invisible text before OCR" or something like that. By the way, this appears to be the default behavior of Acrobat, so you wouldn't be alone in implementing this functionality.
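
To make the request concrete, here is a minimal sketch of the difference between the append behavior described above and the proposed "Remove invisible text before OCR" option. It is a toy model only: `TextObject`, `Page`, `run_ocr` and the `remove_invisible_first` flag are hypothetical names made up for illustration, not part of PDF-XChange Editor or any real API, and the "OCR engine" is a stub that is simply handed a fixed list of words.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    text: str
    invisible: bool  # True for the hidden text layer that OCR places over the image


@dataclass
class Page:
    # Text objects on the page, in the order they appear in the content list.
    objects: list = field(default_factory=list)


def run_ocr(page, recognized_words, remove_invisible_first=False):
    """Toy stand-in for one OCR pass (hypothetical, not a real API).

    recognized_words is the list of words the stub "engine" found on the page.
    """
    if remove_invisible_first:
        # Proposed option: drop earlier hidden OCR text, leave visible text alone.
        page.objects = [obj for obj in page.objects if not obj.invisible]
    # Behavior described in this thread: new results are always appended.
    page.objects.extend(TextObject(word, invisible=True) for word in recognized_words)


words = ["Invoice", "Total"]

appended = Page()
run_ocr(appended, words)
run_ocr(appended, words)  # second pass duplicates the hidden layer

replaced = Page([TextObject("Page 1", invisible=False)])
run_ocr(replaced, words, remove_invisible_first=True)
run_ocr(replaced, words, remove_invisible_first=True)  # second pass replaces it

print(len(appended.objects))                   # 4
print([obj.text for obj in replaced.objects])  # ['Page 1', 'Invoice', 'Total']
```

The point of filtering on the `invisible` flag rather than clearing the page is that visible text (headers, page numbers, typed content) survives a re-run, while only the earlier hidden OCR layer is discarded.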

- Anthony
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by TrackerSupp-Daniel »

Hi, ARM07470

Could you please send us a copy of a document on which this is not working, along with a screenshot of your current OCR settings?

The Ignore Existing text function is designed for those who have a partially editable document and want to run OCR to pick up the missing text items. Say you just added a few images to your document, or scanned a few pages and added them in. It is handy to check that box and have every page processed: the engine ignores areas where text already exists and adds only the new items. It is also meant for people like yourself who run OCR multiple times.

For your second question: yes, "Skip pages that already contain text" will skip any page which contains any text content whatsoever. Even page numbers added with the Header/Footer tool will prevent OCR from running on a page.
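
As a rough illustration of how the two options differ in scope (one works per area, the other per page), the sketch below models them in Python. Everything in it is an assumption made for illustration: the `Region` type, the rectangle-overlap test and both function names are hypothetical and say nothing about how the OCR engine is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    # Axis-aligned rectangle in page units: (x0, y0) lower left, (x1, y1) upper right.
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other):
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def should_skip_page(existing_text_regions):
    """Page-level option: any text at all, visible or invisible, skips the page."""
    return len(existing_text_regions) > 0


def regions_to_recognize(candidate_regions, existing_text_regions):
    """Area-level option: keep only candidates that do not overlap existing text."""
    return [
        cand for cand in candidate_regions
        if not any(cand.overlaps(old) for old in existing_text_regions)
    ]


# A scanned page whose only existing text is a footer page number.
footer = Region(250, 20, 350, 40)
body = Region(50, 100, 550, 700)

print(should_skip_page([footer]))                      # True
print(regions_to_recognize([footer, body], [footer]))  # only the body region remains
```

In this model a single footer page number is enough for the page-level check to skip the whole page, whereas the area-level check would still let the body of the page be recognized.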

As for the automated removal of text during OCR, I cannot make any promises, but I will pass it along to the Team and see what they think of it.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

ARM07470
User
Posts: 4
Joined: Thu Apr 16, 2020 7:38 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by ARM07470 »

I've attached a sample PDF and a screenshot of my OCR settings. Note that I ran OCR twice with these same settings; the duplicate text content is highlighted at the left of the screenshot.

image.png

- Anthony
Attachments
2X OCR Test.pdf
(1.99 MiB) Downloaded 198 times
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by Dimitar - Tracker Supp »

Hello Anthony,

Thank you for the provided file.

Could you please also provide us with the file as it was before the conversion?


Regards.
ARM07470
User
Posts: 4
Joined: Thu Apr 16, 2020 7:38 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by ARM07470 »

I didn't retain the original. It was created from scratch in PDF-XChange via the Scan function. You can remove everything but the Image element and you'd be back to where I started. Also, I don't think there is anything special about this document with regard to the behavior I'm describing. I've seen it with a variety of PDFs from different sources (Acrobat, MFP, etc.).

- Anthony
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by Dimitar - Tracker Supp »

Hello Anthony,

I was able to reproduce the problem at my end.

I forwarded this case to the developers of the EOCR for further investigation.

When there is any progress on this issue, we will contact you.

Thank you for your report.

Regards.
rakunavi
User
Posts: 871
Joined: Sat Sep 11, 2021 5:04 am

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by rakunavi »

Hello all,

I completely agree with Anthony's request for the ability to overwrite recognized OCR text without duplication when OCR is performed multiple times.

I have always felt that this is a major disadvantage relative to Acrobat when comparing the OCR (EOCR) feature of PDF-XChange Editor with Acrobat's. Similar requests have been made on this forum from time to time and were ultimately turned down as too niche or for other reasons, but this functionality has been commonplace in Acrobat for over a decade.

My daily workflow is to scan and digitize paper documents, and when corrections to the original document are later discovered, I basically modify the image of the PDF base content to directly reflect those corrections.

Acrobat will automatically update the OCR text in the corrected area by simply performing OCR on all pages, even if the file has already been recognized by OCR. You can obtain OCR text without duplicates by simply performing OCR as usual without any special settings. However, with PDF-XChange Editor, you must first manually delete the previously recognized OCR text in the Contents pane, then select only the pages you want to correct and perform OCR. This is a rather tedious process and increases the risk of making mistakes if there are multiple areas to be corrected.

Hopefully someday developers will be more interested in improving this feature.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version: 10.1.0 build 380
- OS Version: Windows 11 Home 22H2 Build 22621.2134
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx

P.S.
Although unrelated to this topic, the problem reported in a separate topic, where OCR recognized unnecessary white space in Japanese and other languages, has been mostly resolved in build 380.

In addition, the rendering cache on Undo, the rendering cache limited to CapturePerfect, and the behavior of the stylus pen's Auto setting have been improved. In build 380, I have confirmed that there have been a total of 13 improvements to the issues I have reported. I would like to express my sincere appreciation to the developers and support staff for their hard work.
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by Dimitar - Tracker Supp »

Thanks for the feedback.

We are constantly improving our products and we will continue to do so.

Regards.
MedBooster
User
Posts: 1011
Joined: Mon Nov 15, 2021 8:38 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by MedBooster »

I support this! It would be nice if PDF-XChange Editor could overwrite existing OCR results, since the old results for certain pages can become inaccurate once changes have been made to those pages.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Multiple OCR Runs -> Duplicate Text Objects

Post by TrackerSupp-Daniel »

Hello, MedBooster

We do overwrite the OCR content when performing an Editable text OCR process with the new Enhanced OCR engine. We will be looking into offering options for this with "searchable text" in the future, but I cannot make a promise for implementation at this time.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
