Remove existing OCR layer?

Forum for the PDF-XChange Editor - Free and Licensed Versions

4mc
User
Posts: 40
Joined: Tue Apr 27, 2021 12:42 am
Contact:

Remove existing OCR layer?

Post by 4mc »

I read and search dozens of PDFs per day from a reputable research website.

Occasionally the website's search identifies a PDF, but when I open it with PDF-XChange Editor, searching doesn't find the text in the PDF. When I try to select text on that page to copy it, the entire page's text seems to sit in the top-left quarter of the displayed page, completely out of alignment with the displayed text, which makes the found text unusable.

I've tried to re-OCR with PDF-XChange Editor Plus, and if I'm correct it adds another layer of text, so selecting text becomes messy.

Is there a way to completely remove the existing layer of OCR Text and then re-OCR with PDF-XChange Editor Plus?

At the moment, on some important documents, I've resorted to exporting all pages as images and then creating a new document from the images.

It's quite likely I'm completely confused here, and I would appreciate someone telling me what I've done wrong. Re-OCRing a 150-page document for no reason would be dumb!

PDF-XChange Editor Plus is a lifesaver and has made so much so easy.

++Mark.
https://ctproduced.com
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: Remove existing OCR layer?

Post by TrackerSupp-Daniel »

Hello, 4mc

Would it be possible to see a copy of one of these documents? I am keen to see the original state here so that we can determine if OCR is necessary or if there is another solution.

If the files are too sensitive to share on a public forum, there are two methods to remove the old text so you can perform a new OCR process. If you know that the file only contains invisible "searchable" text at this time, you can open the "content" panel (accessed from the left side menu) and then use the "options > select > text" function within that pane, to select all text in the document. Simply press the delete key on your keyboard to remove it. Then you can perform OCR as usual.
The alternative, if you do have some editable text which you do not wish to have deleted, is to "rasterize" the document (the rasterize pages tool is located on the Convert tab, near the OCR tool). Simply run this once to convert each page into a single image, and then run OCR on it afterwards.
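
For anyone who would rather check outside the Editor whether a page holds only invisible text: invisible OCR text is normally written with PDF text rendering mode 3 (`3 Tr`), so a rough check can be run on a page's decompressed content stream. This is only a sketch under assumptions: the stream is already decoded, each text object sets its own rendering mode, and no string literal happens to contain `BT` or `ET`. The function name and the sample stream are made up for illustration.

```python
import re

# One text object: everything from a BT operator to the next ET operator.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A text rendering mode operator such as "3 Tr".
RENDER_MODE = re.compile(rb"(\d)\s+Tr\b")


def count_text_objects(stream: bytes) -> tuple[int, int]:
    """Return (invisible, other) counts of text objects in a decoded stream.

    A text object counts as invisible only if it sets a rendering mode and
    every mode it sets is 3. Objects that set no mode are counted as "other",
    because this sketch does not track state inherited from earlier objects.
    """
    invisible = other = 0
    for match in TEXT_OBJECT.finditer(stream):
        modes = RENDER_MODE.findall(match.group(1))
        if modes and all(mode == b"3" for mode in modes):
            invisible += 1
        else:
            other += 1
    return invisible, other


# Hypothetical page: a full-page scan image, one hidden OCR text object,
# and one ordinary visible text object.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (hidden ocr words) Tj ET\n"
    b"BT /F1 12 Tf 72 650 Td (visible caption) Tj ET\n"
)
print(count_text_objects(sample))  # (1, 1)
```

If the second number is zero on every page, deleting all text should only remove the hidden layer; otherwise the rasterize route is probably the safer one.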

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
4mc
User
Posts: 40
Joined: Tue Apr 27, 2021 12:42 am
Contact:

Re: Remove existing OCR layer?

Post by 4mc »

Here is an example. The small yellow rectangle is the found text, and the gray lines are the selected text.
image.png
4mc
User
Posts: 40
Joined: Tue Apr 27, 2021 12:42 am
Contact:

Re: Remove existing OCR layer?

Post by 4mc »

Daniel, the files are not sensitive as such; they are copyrighted and stamped so that they can be traced if posted online.

I've just found another example and am happy to delete all pages except the example page and post it online as a pdf.

The circle at the top left is the OCR-indicated search result; the circle towards the bottom is the actual text being searched for.

You can get the single page pdf here > https://www.ctproduced.com/wp-content/uploads/2023/03/community.28040161-example.pdf
image.png
4mc
User
Posts: 40
Joined: Tue Apr 27, 2021 12:42 am
Contact:

Re: Remove existing OCR layer?

Post by 4mc »

Just to be clear, the OCR shown was NOT added by PDF-XChange; it's how the file was when downloaded. Any suggestions welcome.

TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: Remove existing OCR layer?

Post by TrackerSupp-Daniel »

Hello, 4mc

Looking at the position and format of the text here, it seems as though someone scanned the document into a paper size that was far too large for the scanned page, then tried to blow up the image to make it fit, without also rescaling the text.
Assuming that the text in this file is correct, you could simply select the image itself with the Edit tool, and shift+drag the corner of the image down to match the size of the text (or vice versa: select the text, and increase its size to match the image). Either of those options would retain the original text and ensure that the existing searchability is functional. Both would likely also be more work than simply running OCR again, but may be more (or less) reliable/true to the original, depending on the previous OCR engine's capabilities.
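
To make the rescaling idea concrete, here is a small sketch of the arithmetic only. The page size and both bounding boxes are invented numbers (a 612 x 792 point page with the text squeezed into the top-left quarter), and `Box`, `scale_to_match` and `map_point` are hypothetical helpers, not anything the Editor exposes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in PDF points, origin at the bottom-left of the page."""
    left: float
    bottom: float
    right: float
    top: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.top - self.bottom


def scale_to_match(text: Box, image: Box) -> tuple[float, float]:
    """Horizontal and vertical factors that stretch the text box to the image box."""
    return image.width / text.width, image.height / text.height


def map_point(x: float, y: float, text: Box, image: Box) -> tuple[float, float]:
    """Move a point from the text layer to where it belongs on the image.

    Both boxes are anchored at their top-left corners, matching the way the
    misplaced text hugs the top-left of the page.
    """
    sx, sy = scale_to_match(text, image)
    return (image.left + (x - text.left) * sx,
            image.top - (text.top - y) * sy)


# Assumed values: the image fills the page, the text fills the top-left quarter.
image_box = Box(0, 0, 612, 792)
text_box = Box(0, 396, 306, 792)

print(scale_to_match(text_box, image_box))            # (2.0, 2.0)
print(map_point(153, 594, text_box, image_box))       # (306.0, 396.0)
```

With these numbers the text would need to be doubled in both directions (or the image halved) for the two layers to line up.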

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

4mc
User
Posts: 40
Joined: Tue Apr 27, 2021 12:42 am
Contact:

Re: Remove existing OCR layer?

Post by 4mc »

I appreciate the feedback. Many of these pages come from large books or reports. Dragging images around one page at a time isn't practical in many/most cases.

As stated in the original question, is there a way to delete, remove or void the original OCR text?

Wouldn't it be more productive to do this and then re-convert the document using PDF-XChange? I'm willing to take a chance on the OCR text not being as accurate.
Willy Van Nuffel
User
Posts: 2347
Joined: Wed Jan 18, 2006 12:10 pm

Re: Remove existing OCR layer?

Post by Willy Van Nuffel »

- Open the PDF with the OCRed text in PDF-XChange Editor
- Activate the "Content"-pane, via the View-ribbon > Panes > Content
- In the Content-pane, in the toolbar, click Options... > Select > Text
- Press the Delete-key on your keyboard

Now, all the text is removed from the PDF, and you can run OCR again.
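
For readers curious what "removing all the text" amounts to at the file level, the steps above can be sketched as dropping every `BT ... ET` text object from a page's content stream while leaving the scanned image in place. This is a rough illustration, not how the Editor does it: it assumes the stream is already decompressed and that no string literal contains `BT` or `ET`, and `strip_text_objects` and the sample stream are made up.

```python
import re

# One text object (BT ... ET) plus any whitespace that follows it.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b\s*", re.DOTALL)


def strip_text_objects(stream: bytes) -> bytes:
    """Return the content stream with every text object removed.

    Note that this removes visible text as well as hidden OCR text,
    exactly like selecting all text and pressing Delete.
    """
    return TEXT_OBJECT.sub(b"", stream)


# Hypothetical page: a scanned image followed by an old OCR text object.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (old ocr) Tj ET\n"
)
print(strip_text_objects(sample))  # b'q 612 0 0 792 0 0 cm /Im0 Do Q\n'
```

Because visible text is removed too, this is only appropriate for scans where the image carries everything you need to see.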

Kind regards.
Tracker Supp-Stefan
Site Admin
Posts: 17818
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Remove existing OCR layer?

Post by Tracker Supp-Stefan »

:)
nathaleen
User
Posts: 2
Joined: Fri May 05, 2023 8:24 am

Re: Remove existing OCR layer?

Post by nathaleen »

I ran OCR on my pdf. Now the pdf is almost unreadable. Unfortunately, I already saved the pdf. How can I go back to the original state from before the OCR?

I tried some of the suggestions presented here. That is, I went into Panes -> Content -> Select Text and pressed Delete. But this actually deletes the whole text of the PDF!

I just want to go back to the original state of the document before the OCR.
Willy Van Nuffel
User
Posts: 2347
Joined: Wed Jan 18, 2006 12:10 pm

Re: Remove existing OCR layer?

Post by Willy Van Nuffel »

It is a little bit difficult to judge the situation without seeing the pdf itself.

Most probably you have used the "Enhanced OCR" option and removed the original scanned image or text?

Myself, I am afraid that it will be almost impossible to restore the original file based on the pdf you last saved.

I guess the pdf was made/delivered by a scanner or was created from another file (Word, Excel, ...)?
Are you able to get a new copy of the pdf from elsewhere, or can it be recreated?

In the worst case you could send a copy of the pdf to Tracker Support and ask if they see a possibility to restore the original.

Kind regards.
Tracker Supp-Stefan
Site Admin
Posts: 17818
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: Remove existing OCR layer?

Post by Tracker Supp-Stefan »

Hello Willy Van Nuffel,

Thanks for the help! Indeed if the Enhanced OCR was used and the original image altered - @nathaleen will need to find a copy of the source document somehow, as the PDF already saved is likely not recoverable to the original state.

In any case - seeing a sample will help us all to assist you further nathaleen!

Kind regards,
Stefan
nathaleen
User
Posts: 2
Joined: Fri May 05, 2023 8:24 am

Re: Remove existing OCR layer?

Post by nathaleen »

Yes, I scanned the document a long time ago because it was too much hassle to carry around a full book. I will have to find a new copy and rescan it. Might be difficult though. Thanks anyway.

Cheers
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: Remove existing OCR layer?

Post by Dimitar - Tracker Supp »

Hello nathaleen,

If you can't get a fresh copy, please try to improve the image by using this tool:

image.png

Regards.