Multiple OCR Runs -> Duplicate Text Objects

Post by **ARM07470** » Thu Apr 16, 2020 9:29 pm

I'm using v8.0.337.0 with the Enhanced OCR Plugin with the Output Type set to Searchable Image. I've noticed that multiple runs of the OCR feature add (instead of replace) the invisible text objects each time, which can be seen by viewing the Content pane. Is there an option available, or would development consider adding one if not, to have OCR remove all invisible text objects before running?

I realize that it might seem silly to run OCR multiple times, but this is handy when the original OCR was done by another program or device, such as a scanner with a poorer-quality OCR engine, or when I want to try different settings (such as Accuracy) using your engine.

- Anthony

Post by **TrackerSupp-Daniel** » Thu Apr 16, 2020 9:49 pm

Hi, ARM07470

You can set our OCR engine to ignore areas of the page which already contain text content (including invisible text); however, it is not possible to automatically remove that text prior to placing the new content.

The best method to achieve what you are looking for is to use the "Edit > text" tool, then highlight the text on the page (or enter the content pane and use "Select text" to highlight the entire document's text) and manually delete it before running OCR.

Kind regards,

Post by **ARM07470** » Sun Apr 19, 2020 3:39 am

The "ignore existing text on page" has no effect on the behavior I'm describing -- multiple successive runs of the OCR tool result in additional, duplicate, invisible text being added to the document every time. I can't quite figure out the point of this option in the first place. Why would one ever OCR something that is already stored as text?

Quick sidebar...the "Skip pages that already contain text" option is not mentioned in the documentation besides being shown in the screenshots. I was wondering if this only skipped pages with invisible text or if it also skipped pages with visible text and was hoping that the documentation would tell me. Through experimentation, I found that it skips pages with any text, visible or invisible.

I already stumbled upon the workaround you described to delete the invisible text before running OCR and it works fine, but I'd still like to see an option added to do this for me automatically. If the developers are concerned that this might be too confusing to put on the dialog, then maybe it could be added to Preferences -> OCR Settings so it would be more out of the way. I'd call it "Remove invisible text before OCR" or something like that. By the way, this appears to be the default behavior of Acrobat, so you wouldn't be alone in implementing this functionality.
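
To make the idea concrete, here is a rough sketch of what "remove invisible text before OCR" amounts to at the PDF level. It is not how PDF-XChange Editor or Acrobat implement it. Invisible OCR text is normally drawn with text rendering mode 3 (the `3 Tr` operator), so the sketch drops every `BT ... ET` text object that sets that mode. The function name `strip_invisible_text` is made up for illustration, and the code assumes an already-decompressed content stream, a `Tr` operator that appears inside each text object (it does not track a mode set outside `BT`/`ET`), and no string literal that itself contains a standalone `ET`.

```python
import re

# One text object: BT ... ET (simplified; a real parser would tokenize the stream).
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)
# The operator that selects text rendering mode 3, i.e. invisible text.
INVISIBLE_MODE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def strip_invisible_text(stream: bytes) -> tuple[bytes, int]:
    """Return the stream without invisible text objects, plus the number removed."""
    removed = 0

    def drop_if_invisible(match: re.Match) -> bytes:
        nonlocal removed
        if INVISIBLE_MODE.search(match.group(0)):
            removed += 1
            return b""           # discard the old OCR layer
        return match.group(0)    # keep visible text untouched

    cleaned = TEXT_OBJECT.sub(drop_if_invisible, stream)
    return cleaned, removed
```

Run on a page that has been through OCR twice, this would remove both invisible layers while leaving the scanned image and any visible text (page numbers, for example) in place, so a fresh OCR pass would then add exactly one layer.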

- Anthony

Post by **TrackerSupp-Daniel** » Mon Apr 20, 2020 8:29 pm

Hi, ARM07470

Can I ask you to please send us a copy of a document which this is not working on and a screenshot of your current OCR settings?

The Ignore Existing text function is designed for those who have a partially editable document and want to scan it to locate the missing text items. Say you just added a few images into your document, or scanned a few pages and added them in. It is handy to just check that box and have every page scanned, then have the engine ignore areas where text already exists and only add the new items in. It is also meant for people like yourself who decide to run OCR multiple times.

For your second question: yes, Skip pages that already contain text will skip any page which contains any text content whatsoever. Even the page numbers added with the Header/Footer tool will prevent OCR from running on a page.

As for the automated removal of text during OCR, I cannot make any promises, but I will pass it along to the Team and see what they think of it.

Kind regards,

Post by **ARM07470** » Mon Apr 20, 2020 8:53 pm

I've attached a sample PDF and a screen shot of my OCR settings. Note that I ran these same settings twice and see the duplicate text content I've highlighted at the left of the screen shot.

- Anthony

Tue Apr 21, 2020 1:36 pm

Hello Anthony,

Thank you for the provided file.

Could you please also provide us with the file as it was before the conversion?

Regards.

Post by **ARM07470** » Tue Apr 21, 2020 5:15 pm

I didn't retain the original. It was created from scratch in PDF X-Change via the Scan function. You can remove everything but the Image element and you'd be back to where I started. Also, I don't think there is anything special about this document with regard to the behavior I'm describing. I've seen it with a variety of PDFs from different sources (Acrobat, MFP, etc.).

- Anthony

Thu Apr 23, 2020 2:42 pm

Hello Anthony,

I was able to reproduce the problem at my end.

I forwarded this case to the developers of the EOCR for further investigation.

When there is a development on this issue, we will contact you.

Thank you for your report.

Regards.

Post by **rakunavi** » Thu Sep 07, 2023 11:37 pm

Hello all,

I completely agree with Anthony's request for the ability to overwrite recognized OCR text without duplication when OCR is performed multiple times.

I have always felt that this is a major disadvantage of the OCR (EOCR) feature of PDF-XChange Editor compared with Acrobat. Similar requests have been made on this forum from time to time and have ultimately been turned down as too niche or for other reasons, but this functionality has been commonplace in Acrobat for over a decade.

My daily workflow is to scan and digitize paper documents, and when corrections to the original document are later discovered, I basically modify the image of the PDF base content to directly reflect those corrections.

Acrobat will automatically update the OCR text in the corrected area by simply performing OCR on all pages, even if the file has already been recognized by OCR. You can obtain OCR text without duplicates by simply performing OCR as usual without any special settings. However, with PDF-XChange Editor, you must first manually delete the previously recognized OCR text in the Contents pane, then select only the pages you want to correct and perform OCR. This is a rather tedious process and increases the risk of making mistakes if there are multiple areas to be corrected.

Hopefully someday developers will be more interested in improving this feature.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version: 10.1.0 build 380
- OS Version: Windows 11 Home 22H2 Build 22621.2134
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx

P.S.
Although unrelated to this topic, the problem reported in the following topic, where unnecessary white space was recognized in OCR for Japanese and other languages, has been mostly resolved in build 380.

extra white spaces from OCR of full justified text on scanned pages
viewtopic.php?p=158203#p158203

In addition, the rendering cache on Undo, the rendering cache limited to CapturePerfect, and the behavior of the stylus pen's Auto setting have been improved. In build 380, I have confirmed that there have been a total of 13 improvements to the issues I have reported. I would like to express my sincere appreciation to the developers and support staff for their hard work.

Fri Sep 08, 2023 2:26 pm

Thanks for the feedback.

We are constantly improving our products and we will continue to do so.

Regards.

Post by **MedBooster** » Wed Nov 22, 2023 5:53 pm

I support this! It would be nice if PDF-XCE could overwrite existing OCR results, as the old OCR results for certain pages could be inaccurate if changes have been made to those pages.

Post by **TrackerSupp-Daniel** » Wed Nov 22, 2023 6:27 pm

Hello, MedBooster

We do overwrite the OCR content when performing an Editable text OCR process with the new Enhanced OCR engine. We will be looking into offering options for this with "searchable text" in the future, but I cannot make a promise for implementation at this time.

Kind regards,
