How to remove previously OCRed text from scan [Paperless Office]

This Forum is for the use of End Users requiring help and assistance for Tracker Software's PDF-Tools.

patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

How to remove previously OCRed text from scan [Paperless Office]

Post by patrickm »

I previously OCRed a lot of scans with an OCR engine that wasn't so good.

How can I use PDF Tools to
- on pages which are covered by an image (ie a scanned page)
- remove the existing text layer
- OCR the page again and add a new text layer
- ideally if a page is already text based (ie any image is only a logo and not covering the page) the page would be skipped from OCR

Thanks,
Patrick
Last edited by patrickm on Mon Jan 23, 2023 10:29 pm, edited 2 times in total.
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: How to remove previously OCRed text from scan

Post by TrackerSupp-Daniel »

Hello, patrickm

The Enhanced OCR dialog already offers options for these functions. If you would like to overwrite (and remove) the existing text, be it invisible or otherwise, simply run OCR without checking any of the options to ignore text, then choose "editable text and images". This will replace sections of the image, AND any existing text items in those locations, with the new OCR'ed text. This effectively gets rid of accidental duplicate layers of text, provided they sit in a location where text actually appears to be. (If you have invisible text in a blank area of the page, OCR won't make changes there, so that text will not be removed.)
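One way to picture the replacement rule described above: existing text survives only where no newly recognised region overlaps it. The sketch below is purely illustrative. The `Box` type, the `surviving_text` function and the rectangle-overlap test are assumptions made for this example, not PDF-XChange internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap when they intersect on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def surviving_text(existing: list[Box], ocr_regions: list[Box]) -> list[Box]:
    """Return the existing text items that no OCR region overlaps."""
    return [t for t in existing
            if not any(t.overlaps(r) for r in ocr_regions)]


# Old invisible text sitting over a recognised line is replaced; a stray
# item in a blank corner of the page is left behind.
old_text = [Box(50, 700, 300, 715), Box(500, 20, 560, 30)]
recognised = [Box(48, 698, 305, 718)]
leftover = surviving_text(old_text, recognised)
```

Here `leftover` holds only the stray item in the blank corner, which matches the caveat about invisible text in blank areas.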

Likewise, you can run OCR and skip pages, or areas of pages, which contain text, as well as skip "graphics containing text" (such as a picture of your friend Steve wearing a shirt that says "live, laugh, love"). See the highlighted options below:
[Screenshot: image.png]
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

Hey Daniel,

thank you for your answer. "editable text and images" will alter the existing look of the PDF, right?
Since I'm using PDF for archiving documents, this feature won't work for me, and I'll need to stick to "searchable image" as it does not alter the image of the document.

So it looks like a feature to remove existing OCRed text from a searchable image doesn't exist, correct?
How can this be achieved with the current toolset, please? And if it can't be, will it be developed?

I imagine it would be a helpful feature for many customers to be able to re-OCR their collections of scans that were done with older OCR engines.

Thank you,
Patrick
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: How to remove previously OCRed text from scan

Post by John - Tracker Supp »

If I understand your objective, I suspect the best way to achieve this currently would be to simply reprint the file using PDF-XChange 'Standard', our print driver, as I believe the underlying text you wish to lose would then be removed.

Or do I misunderstand?

Currently the feature you require is not a priority and we have had no other requests for such a feature and we do have to prioritise our development plans according to demand and likely usage.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

My use case is a whole folder structure of files that were OCRed using old technologies. Some are scans, some are digitally created PDFs (ie text), some have bookmarks. So reprinting won't work. And just running OCR doubles the already existing text.

I imagine many of your customers don't know that either the OCR process is skipped altogether or text is just added on top of the already existing text (ie problems will occur when they try to copy and paste later)... Ie it's an invisible problem: most customers imagine it just works when in reality there are some problems.

And I get that your customers are unaware of it, so there doesn't seem to be demand for it.
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

Just in case you are considering it ;)

Maybe it could be added to "Optimize PDF" as "Discard Invisible Text Layer (for searchable images)"
And to keep it easy it would just work on scans (ie full page images with an invisible text layer behind them)
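For context on what "invisible text" means at the file level: in a PDF content stream the `Tr` operator sets the text rendering mode, and mode 3 draws text with neither fill nor stroke, which is how searchable-image OCR layers are commonly stored. The following is a minimal sketch, assuming an already decompressed content stream and a deliberately naive tokenizer (no nested parentheses, no `q`/`Q` graphics-state tracking). The function name `text_visibility` is made up for this example.

```python
import re

# Literal strings such as (Hello) are dropped first so that their contents
# are not mistaken for operators. Nested parentheses are not handled.
_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)")
_SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}


def text_visibility(stream: bytes) -> tuple[int, int]:
    """Count (visible, invisible) text-showing operators in a decoded stream."""
    tokens = _LITERAL.sub(b" ", stream).split()
    mode = 0  # default text rendering mode: fill
    visible = invisible = 0
    for i, tok in enumerate(tokens):
        if tok == b"Tr" and i > 0 and tokens[i - 1].isdigit():
            mode = int(tokens[i - 1])  # "<n> Tr" changes the rendering mode
        elif tok in _SHOW_OPS:
            if mode == 3:
                invisible += 1
            else:
                visible += 1
    return visible, invisible


sample = b"BT /F1 12 Tf 3 Tr 72 700 Td (old ocr text) Tj ET"
counts = text_visibility(sample)
```

A page whose streams yield zero visible and at least one invisible text operator, and which is covered by a full-page image, would be a candidate for the "scan with an old OCR layer" case. A real implementation would need a proper PDF parser rather than this token scan.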

I would find it very useful to re-run OCR with PDF Tools on all my scans with the enhanced OCR engine. It's a feature cool enough that people might be quite excited about it if it's advertised.

Thanks for considering it!
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: How to remove previously OCRed text from scan

Post by TrackerSupp-Daniel »

Hello, patrickm

Thank you for the suggestion. I have created a formal ticket for you on this matter. Though I cannot promise it will be implemented at this time, know that our Dev team will review this when next we are looking at new features:

RT#5773: Add "discard invisible text" to Cleanup/OCR options

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Paul - Tracker Supp
Site Admin
Posts: 6813
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: How to remove previously OCRed text from scan

Post by Paul - Tracker Supp »

Hi Patrick,

try the "Rasterize pages" feature. I just ran a test: I put invisible text on a document using the default OCR, saved it, then ran "Rasterize pages" over it, and it removed the invisible text, leaving only the images of the pages.

Can you test that and see if it does what you are looking for?
[Screenshot: image.png]
I ran the test in the Editor, but there is also a tool in PDF Tools to rasterize pages. I am keen to hear if that delivers what you are looking for:
[Screenshot: image1.png]
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

Thank you for trying to find a solution with me :)

Quick question: I have 3 types of pages (within documents):
- Scanned Pages without OCR
- Scanned Pages with OCR
- Digitally created Pages (ie no full page image just a text layer and sometimes embedded images (ie a typical book page))

What I desire is to:
- Scanned Pages without OCR -> Create OCR
- Scanned Pages with OCR -> Drop existing OCR and redo OCR
- Digitally created Pages -> Do not touch the text (don't care if the embedded image (ie not full page image but part of page image) is OCRed)
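The three cases and desired outcomes listed above amount to a small decision table. Here is a hedged sketch of that logic; the `PageInfo` fields and the `plan` function are hypothetical names for this illustration, and how the per-page facts would be obtained is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OCR = "run OCR"
    REDO_OCR = "drop old text layer, then run OCR"
    SKIP = "leave untouched"


@dataclass
class PageInfo:
    """Hypothetical per-page summary; how it is obtained is out of scope."""
    full_page_image: bool   # an image covers (nearly) the whole page
    invisible_text: bool    # text drawn with an invisible rendering mode
    visible_text: bool      # ordinary, visible text


def plan(page: PageInfo) -> Action:
    if page.visible_text or not page.full_page_image:
        return Action.SKIP          # digitally created page
    if page.invisible_text:
        return Action.REDO_OCR      # scan with an old OCR layer
    return Action.OCR               # scan without OCR


pages = [
    PageInfo(full_page_image=True, invisible_text=False, visible_text=False),
    PageInfo(full_page_image=True, invisible_text=True, visible_text=False),
    PageInfo(full_page_image=False, invisible_text=False, visible_text=True),
]
actions = [plan(p) for p in pages]
```

For the three example pages this yields OCR, redo OCR and skip, in that order. Treating any page with visible text as "digitally created" is a simplification chosen for the sketch.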

Would the raster option allow for this?

Thanks
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: How to remove previously OCRed text from scan

Post by TrackerSupp-Daniel »

Hello, patrickm

To a degree. It would not be automatic, but if you manually specify which pages you wish to rasterize (making it ignore the pages you do not want to change), it could be a helpful workaround in the meantime.

Let's say, as an example, that you have a 3 page document, ordered as per your above post, so pages 1 and 2 should be rasterized and have OCR applied, and page 3 should only have a second pass of OCR, ignoring the existing text.
In this case, you would use the "Rasterize pages" tool on pages 1 and 2 to flatten all content into a single image and remove the existing invisible text content. After that is complete, run the OCR function on all pages, ensuring you check the box to ignore existing text on the page (this will ensure you do not get a second layer over the existing text), and voilà! The process is done, with OCR applied to all three pages and your already existing text content unaffected.
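Since the pages to rasterize have to be specified by hand, a small helper can at least turn a list of scanned page numbers into a compact range string. This is only a sketch: the `to_page_range` name is invented here, and the "1-2, 5" output syntax is an assumption about what a page-range field accepts, so check it against the tool before relying on it.

```python
def to_page_range(pages: list[int]) -> str:
    """Collapse 1-based page numbers into a string such as "1-2, 5"."""
    runs: list[list[int]] = []
    for p in sorted(set(pages)):
        if runs and p == runs[-1][1] + 1:
            runs[-1][1] = p          # extend the current consecutive run
        else:
            runs.append([p, p])      # start a new run
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


# Pages 1 and 2 are scans in the example above; page 3 is left alone.
scanned = [1, 2]
rasterize_range = to_page_range(scanned)
```

The second pass (OCR on all pages with the option to ignore existing text checked) needs no range at all, which is why only the rasterize step is modelled.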

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: How to remove previously OCRed text from scan

Post by Jensen Head »

John - Tracker Supp wrote: Sun Oct 31, 2021 10:32 am
> we have had no other requests for such a feature and we do have to prioritise our development plans according to demand and likely usage.
I guess this is because people interested in this function simply found this topic, saw that there was no solution, and left. Moreover, these users did not even subscribe to updates on this topic, because you do not mark suggestions in topics as subsequently implemented. So even if you add this feature, threads like this, which don't end with an announcement that the requested feature has been implemented in a release, are anti-advertising.
Paul - Tracker Supp wrote: Mon Nov 01, 2021 4:47 pm
> try the "Rasterize pages" feature
This action does remove the text, but it also degrades all graphic elements. What we need is automatic deletion of text that does not affect other objects in the document.
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: How to remove previously OCRed text from scan

Post by TrackerSupp-Daniel »

Hello, Jensen Head

We do offer a "solved" tag that can be placed either by us, as administrators, when a feature is implemented (if the topic was recorded during ticket creation), or by the topic creator, as they deem necessary. I am not certain what more you want here. As John said, there is very little support for this ticket currently, and so it is not a high priority. The ticket already exists and will see action when/if it is time, not before then.

That can change if more users choose to post here showing their support, as they could have done at any time between the day the ticket was created and now.
TrackerSupp-Daniel wrote: Mon Nov 01, 2021 4:43 pm
> I have created a formal ticket for you on this matter, though I cannot promise it will be implemented at this time, know that our Dev team will review this when next we are looking at new features:
>
> RT#5773: Add "discard invisible text" to Cleanup/OCR options
It is not fair to assume that everyone looking for a new feature will search through the thousands of posts we have here, only to say "oh, it's not going to happen, oh well". Many people will open a topic without searching at all, or open a livechat with us, or send an email. We have millions of clients and a large majority of them do not know these forums exist (or prefer not to use them). As we have seen no one other than patrickm making this request until you yourself posted today (even through our other means of communication), this serves as a fairly good indication that the support for this is indeed lower than, say, improvements to 3D support, which we get requests for quite often.

It is important to note that we are not a mega-corporation like Adobe; we cannot simply throw more developers at problems or minor requests. We have a limited staff and need to allocate resources where they will be most efficient. Focusing on requests which have very little or no support is not a freedom we have.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

Hey, I wanted to share a few thoughts:

> We do offer a "solved" tag
I think it's more that his experience is that it's not being used in practice.

> very little support for this ticket currently
It's actually different. Because I'm a developer, I write very specific tickets in the hope that it makes them easier and quicker to implement. Most average users might not even know that the problem of PDF Tools doubling up the text layers exists.

But if I was to write the ticket differently "Using PDF Tools to fully process scanned documents for paperless office" then many people would comment as they would like PDF Tools to be a solution to their bigger problems.

The challenge with those tickets is that there would be 30 feature requests in them and then none of them would get implemented... But they would get more traction.

My ticket is a step on the way to make PDF Tools be able to process a whole folder or even hard drive of documents successfully (something many people just assume already works)...

Would all users love to right-click on a folder and say "make all my scanned PDFs in there perfect and searchable"? Sure they would! I feel it's a long-term vision which would be great to achieve. And then a ticket like mine makes sense even with few comments, because it helps with the bigger vision of supporting "Paperless Office Workflows".

Paperless Office is big for sure!

Why the few people that want 3D in PDF don't just buy Adobe, instead of making you put effort into implementing it, I have no idea.
patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan

Post by patrickm »

Does what was fixed in this ticket https://forum.pdf-xchange.com/viewtopic.php?p=166217#p166217 address the desired points below?
patrickm wrote: Mon Nov 01, 2021 9:27 pm Quick questions I have 3 types of pages (within documents)
- Scanned Pages without OCR
- Scanned Pages with OCR
- Digitally created Pages (ie no full page image just a text layer and sometimes embedded images (ie a typical book page))

What I desire is to:
- Scanned Pages without OCR -> Create OCR
- Scanned Pages with OCR -> Drop existing OCR and redo OCR
- Digitally created Pages -> Do not touch the text (don't care if the embedded image (ie not full page image but part of page image) is OCRed

Would the raster option allow for this?
Thanks
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

Re: How to remove previously OCRed text from scan [Paperless Office]

Post by TrackerSupp-Daniel »

Hello, patrickm

The fix you linked to would allow you to do this if the "scanned pages with OCR" only have "searchable" (invisible) text on them; it would not work for pages which have editable text in place. Rasterizing pages is not a necessary part of the process. If you wish to ensure that:
- Scanned Pages without OCR -> Create OCR
- Scanned Pages with (searchable) OCR -> Drop existing OCR and redo OCR
- Digitally created Pages -> Do not touch the text (don't care if the embedded image (ie not full page image but part of page image) is OCRed)
you would simply need to run Enhanced OCR on the file with "editable text and images" enabled, and check the option to ignore existing page text. Only the editable page text will be ignored; the pages with no OCR, or with only invisible text, will have their text replaced with editable text.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

patrickm
User
Posts: 25
Joined: Wed Oct 13, 2021 5:43 am
Location: Los Angeles, CA

Re: How to remove previously OCRed text from scan [Paperless Office]

Post by patrickm »

TrackerSupp-Daniel wrote: Tue Jan 24, 2023 12:15 am it would not work for pages which have editable text in place. Rasterizing pages is not a necessary part of the process.
Editable text is already searchable so that is great news. I want to be able to run PDF Tools on 100s of PDFs automatically and get the best results so this brings us a step closer! Thank you
TrackerSupp-Daniel
Site Admin
Posts: 8371
Joined: Wed Jan 03, 2018 6:52 pm

How to remove previously OCRed text from scan [Paperless Office]

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
