OCR option - ignore existing text on page

fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

OCR option - ignore existing text on page

Post by fletch »

https://help.pdf-xchange.com/pdfxt8/

What does the subject option do? That's not explicitly covered in the help above.
image.png
Ideally it would somehow see the existing text and not duplicate it - but that's not the case. So I'm confused about what ignore means in this instance.
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR option - ignore existing text on page

Post by TrackerSupp-Daniel »

Hi, fletch

You understand correctly. This option should (and in my tests at least, does) ignore areas where there are existing text objects on the page. Note that "shapes" and "images" that merely look like text will not trigger this setting. It applies only to existing "Base content text" objects (NOT annotations such as the typewriter) that happen to overlap a part of the document where such shapes or images could have been scanned.

To test this, please open a scanned document in the Editor and use the "Add text" tool to place text over a section of the scanned text. Then save and process the document with this option enabled. You should find that the section you wrote over has not been scanned or altered.

If you are finding this is not the case, please send us a copy of the file you are OCRing so that we can run some additional tests.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

Re: OCR option - ignore existing text on page

Post by fletch »

I'll see if I can find a document I can share. But I do see text layers being duplicated on subsequent OCR scans even though that option is enabled.
fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

Re: OCR option - ignore existing text on page

Post by fletch »

Is this the expected behavior given the options shown?

https://www.screencast.com/t/Pm6pFTAz
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR option - ignore existing text on page

Post by TrackerSupp-Daniel »

Hi, fletch

Thank you for the example. I see the issue now. It appears that this function only works on visible text; invisible (or "searchable") text is not counted as "existing text" for some reason.

I will inform our dev team of this and see what can be done.
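
For anyone following along: "invisible" text in a PDF is normally text drawn with text rendering mode 3 (the `3 Tr` operator), which is how an OCR layer stays searchable without being painted. Below is a rough Python sketch, not anything taken from PDF-XChange, that counts text-showing operators in an already-decompressed content stream by visibility. The tokenizer is deliberately simplified (no nested parentheses, no inline images), and the function name and sample stream are made up for illustration.

```python
import re

# Simplified content-stream tokenizer (assumption: well-formed, decompressed input).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)     # literal string (simplified: no nested parentheses)
  | <[0-9A-Fa-f\s]*>         # hex string
  | [\[\]]                   # array delimiters
  | /[^\s/\[\]()<>]*         # name, e.g. /F1
  | [^\s/\[\]()<>]+          # number or operator
""", re.VERBOSE)

# Operators that actually show text: Tj, TJ, ' and "
SHOW_OPS = {"Tj", "TJ", "'", '"'}


def count_text_by_visibility(content):
    """Count text-showing operators, split into visible and invisible (Tr == 3)."""
    render_mode = 0          # default text rendering mode is 0 (fill)
    saved = []               # q/Q save and restore the graphics state, Tr included
    counts = {"visible": 0, "invisible": 0}
    prev = None
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok == "q":
            saved.append(render_mode)
        elif tok == "Q" and saved:
            render_mode = saved.pop()
        elif tok == "Tr" and prev is not None and prev.isdigit():
            render_mode = int(prev)
        elif tok in SHOW_OPS:
            key = "invisible" if render_mode == 3 else "visible"
            counts[key] += 1
        prev = tok
    return counts


# Hypothetical page content: one visible line, then an invisible OCR-style layer.
sample = """
BT /F1 12 Tf 72 700 Td (Visible heading) Tj ET
BT 3 Tr /F1 10 Tf 72 650 Td (hidden OCR layer) Tj [(more) -20 (hidden)] TJ ET
"""

print(count_text_by_visibility(sample))
```

A page whose "invisible" count keeps growing after each OCR run would be one where the searchable layer is being duplicated.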

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: OCR option - ignore existing text on page

Post by Jensen Head »

What was the response from the development team to such strange behavior of the application?
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR option - ignore existing text on page

Post by TrackerSupp-Daniel »

Hello, Jensen Head

This topic was brought up roughly 2 years ago now, and my memory is not great, but if I recall correctly, the issue at the time was a limitation of the OCR engine in use. As we were only just testing the new engine, we were unsure what its capabilities were, so this suggestion was rejected at the time.

I have just had another talk with the devs, now that the new OCR engine is properly implemented and quite stable, to see what we can do. They have said that it will be extremely difficult, but it may be possible to offer an option that tries to avoid placing duplicate text, even if the existing text is invisible.

I cannot make any promises that this will be implemented, or when you might see it, but we now have an official feature request for the topic:
RT#6061: FR: OCR option to "prevent duplicating output text"

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

Re: OCR option - ignore existing text on page

Post by fletch »

Glad to hear this has been re-opened for consideration. I'd not thought to test it after updating to v9. I was holding off on OCR'ing my 20+ years of documents until it got resolved.

So, the old test case was still lying around and I gave it a go. DIFFERENT results now. I don't know if this was slipped into the v8 stream at some point or only v9.

It's almost like someone coded a workaround. Or, some other change was made that "sort of" eliminated the problem. In the PDF world, what is a "Path"?

When I OCR with the most recent v9 Pro pack installed I get Paths for EVERY piece of text, followed by the Text elements further down in the list.

BUT - when I OCR the SAME document the SECOND (or even 3rd) time NO NEW "Text" elements are added. Note that I'm overwriting the existing file with a new one of the same name - preserving date/time (thanks again for fixing that option long ago). I'm also using the FineReader OCR engine.

I don't know if the new Path elements somehow prevent the duplicate text from being added after each OCR run or if they have a different meaning and the updated OCR engine just works now - in that it doesn't duplicate text.

I grabbed another document that (in its native/original form) already had Text elements. I OCR'd it and those existing Text elements were NOT duplicated.

So, what is a Path, and why in v9 do I get a lot of Path entries after OCR'ing a document, whereas in v8 I did not? Is this just a new feature of the newer OCR engine?

Below is a snapshot of the Content elements I see using v9 with the SAME document I used in my original video above.

OCR-2022.png
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR option - ignore existing text on page

Post by TrackerSupp-Daniel »

Hello, fletch

This is the reply I got from the Dev team about this:
When we add visible text, we also try to remove the existing content under it, but this content can be anything (solid or not, raster, vector, text, or any combination of these). In many cases we cannot guarantee that this content will be removed perfectly; some pixels (raster) or particles (vector) may be left over.
OCR can report text positions that differ slightly from where the text originally appeared on the page, so in addition we cover the redacted region under the added text with a solid-color rectangle to hide these 'fluctuations'.
Those rectangles are the "paths" that you are seeing added to the page. If OCR is run multiple times, only part of a previous "path" may be redacted (leaving a section of it in place) before the new path is added. This looks like a duplicated path, but really one path has been modified and a new one added in the location of the partially removed object.

In many cases, since we are redacting the original section of the page, there will be nothing below it, so it is often safe to simply delete these path objects, especially if you plan to run OCR on the file another time after doing so.
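
To make the "Path" idea a bit more concrete: in a PDF content stream a path is a vector shape built with operators such as `re` (rectangle) or `m`/`l` (move/line) and then painted with `f` (fill), `S` (stroke), and so on. A solid cover rectangle like the ones described above would typically look something like `x y w h re f`. The Python sketch below lists filled rectangles in a decompressed content stream. It is only an illustration under assumptions: the tokenizer is simplified, the function names are hypothetical, and the sample stream is invented rather than real PDF-XChange output.

```python
import re

# Simplified content-stream tokenizer (assumption: well-formed, decompressed input).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)     # literal string (simplified: no nested parentheses)
  | <[0-9A-Fa-f\s]*>         # hex string
  | [\[\]]                   # array delimiters
  | /[^\s/\[\]()<>]*         # name, e.g. /F1
  | [^\s/\[\]()<>]+          # number or operator
""", re.VERBOSE)

FILL_OPS = {"f", "F", "f*", "B", "B*", "b", "b*"}   # operators that fill the path
PAINT_OPS = FILL_OPS | {"S", "s", "n"}              # every operator that ends a path


def _is_number(tok):
    try:
        float(tok)
        return True
    except ValueError:
        return False


def filled_rectangles(content):
    """Return (x, y, width, height) for each rectangle path that gets filled."""
    rects = []
    pending = []      # rectangles added to the current, not yet painted, path
    operands = []     # numeric operands seen since the last operator
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok == "re" and len(operands) >= 4:
            pending.append(tuple(float(v) for v in operands[-4:]))
            operands = []
        elif tok in PAINT_OPS:
            if tok in FILL_OPS:
                rects.extend(pending)
            pending = []          # "n" or a stroke ends the path without filling
            operands = []
        elif _is_number(tok):
            operands.append(tok)
        else:
            operands = []         # any other operator or operand resets the run
    return rects


# Hypothetical page: a white cover rectangle, the text over it, and a clip rectangle.
sample = """
q 1 1 1 rg 72 640 200 14 re f Q
BT /F1 10 Tf 72 643 Td (Recognized text) Tj ET
0 0 612 792 re W n
"""

print(filled_rectangles(sample))
```

The last rectangle in the sample is used only for clipping (`W n`), so it is not reported; only the filled one is.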

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

Re: OCR option - ignore existing text on page

Post by fletch »

Ok. There are too many to bother deleting, considering the number of PDFs I'll ultimately be indexing. Since those paths are not duplicated after multiple OCR runs, that's great. Whether or not that is somehow contributing to the EXISTING OCR text NO LONGER being duplicated is a mystery. But the original problem I illustrated in the video above seems fixed now!
Tracker Supp-Stefan
Site Admin
Posts: 17765
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR option - ignore existing text on page

Post by Tracker Supp-Stefan »

Hello fletch,

Glad to hear that the original issue is resolved!

Kind regards,
Stefan