How can one correct the OCR'd text in a document?

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

coffent
User
Posts: 19
Joined: Sun Aug 10, 2008 9:07 pm

How can one correct the OCR'd text in a document?

Post by coffent »

I've copied and pasted into Notepad the text from a document I OCR'd and see there are numerous errors. How can these be corrected in the PDF document? I've looked under both the Convert and the Review headings and don't see anything relevant. Thanks.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

Hello Coffent,

Text resulting from the OCR operation is "base content" text and thus can be edited with the "Edit text" tool on the Home tab:
[Attached image: image.png]
I hope this helps!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
SJH
User
Posts: 11
Joined: Fri Jul 10, 2020 2:17 am

Re: How can one correct the OCR'd text in a document?

Post by SJH »

Hello,
I was attracted to your product because of the OCR, and also because it is available as 'Portable':

1) For my needs, an OCR product should have strong text editing ability and, in particular, robust 'Find and Replace'. I'm hoping you have a good 'Find and Replace' (for OCR text), but I couldn't find it.

Example: I used your OCR on a good-quality test scan (good resolution, easily recognizable font). It was an easy test and the recognition was excellent, except for one error. Every instance of the letter 'a' with a space on either side was recognized as the letter 'o' (" a boy" was recognized as " o boy"). One error with *innumerable* instances. Not an issue with 'Find and Replace' :) . Tedious, time-consuming editing without :( .

2) Am I correct that PDF-XChange Editor and Editor Plus are available in a portable version, but PDF-XChange Editor Pro is not (a portable version download was not an option)?

Thank-You
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

Hi, SJH

Unfortunately, we do not currently offer a find and replace function, but it has been a topic of discussion and consideration recently. I cannot make any promises about a timeline, as it is a much more complex function than you might expect, but it is something that we hope to eventually be able to offer.
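
In the meantime, for text that has already been copied out of the PDF (for example, pasted into Notepad), a repeated misread like the " o boy" case above can be corrected in bulk outside the Editor. The following is only a minimal Python sketch, not a PDF-XChange feature: `fix_standalone_o` is a made-up helper name, and it changes the exported text only, not the text layer inside the PDF.

```python
import re

# Matches a lone lowercase "o" that has whitespace (or the start/end of the
# text) on both sides, so the "o" inside words such as "boy" is left alone.
STANDALONE_O = re.compile(r"(?<!\S)o(?!\S)")


def fix_standalone_o(text: str) -> str:
    """Replace every standalone 'o' with 'a' (hypothetical helper)."""
    return STANDALONE_O.sub("a", text)


if __name__ == "__main__":
    sample = "Once there was o boy who found o book."
    print(fix_standalone_o(sample))
```

The pattern uses whitespace lookarounds rather than a plain search for "o", so only the isolated letter is touched. It assumes the misread is always a lowercase "o" and that the text contains no legitimate standalone "o"; both assumptions should be checked against the actual document before relying on the result.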

As for the portable versions, there is no product called "Editor PRO". The PDF-XChange PRO bundle which you see on the website is a package including all three of our most popular products: the PDF-Tools batch processing utility, the PDF-XChange Standard printer, and the PDF-XChange Editor Plus. From this package, only the Editor is able to run as a portable app, so we do not offer a "portable" download for the PRO bundle. You can, however, download the portable Editor from the product page, and if you hold a PRO key, it will correctly cover the portable version as well.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

SJH
User
Posts: 11
Joined: Fri Jul 10, 2020 2:17 am

Re: How can one correct the OCR'd text in a document?

Post by SJH »

Thank-you for your reply Daniel,

> it has been a topic of discussion

Thanks for pointing that out. I had tried several search strings but nothing came up. After reading your reply, I tried others and found the threads that you referenced.

> it is something that we hope to eventually be able to offer

I gather that hope goes back quite a few years ;-) But at least I can stop looking for 'Replace' in the program and revert to finding work-arounds.

> You can however, download the portable Editor from the product page, and if you hold a PRO key, it will correctly cover the portable version as well.

Thank-you. I will assume (until you tell me differently) that 'PDF-Tools' and 'PDF-XChange Standard Printer' are independent of 'PDF-XChange Editor Plus', so that it makes no difference to the former two whether 'PDF-XChange Editor' is system installed or portable.

Sorry, it was undoubtedly the wrong forum to ask that question, but I wanted to finish the thought.

SJH
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

Hi, SJH

No worries, it's part of the discussion, so there are no "off topic" questions. You are correct that Tools and Standard are separate products, though the PRO installer would include all three. I cannot say that the others would be unaffected, however, as it is not possible to install Tools without also installing the Editor; they are intrinsically intertwined and reference many of the same functions and files at this stage.

The advantage of the Portable Editor is that it does not require installation. You can actually have multiple versions of it present on a single machine (and still have the installed version present, separate from those), as I do for testing purposes.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

SJH
User
Posts: 11
Joined: Fri Jul 10, 2020 2:17 am

Re: How can one correct the OCR'd text in a document?

Post by SJH »

Thank-you for the reply Daniel,

I purchased a license for PDF-XChange Editor Plus + Enhanced OCR Plugin + 3 year Maintenance and decided to do the system install, since the suggested download was an all-in-one bundle (including 'Printer Lite'), easy to install, and likely the easiest way to keep updated.

As per your suggestion, I will use the Portable Editor for added functionality.

The 3 year Maintenance was a no-brainer. Sounds like great things to come. I watched some of your build vids and was impressed by your development program to date.

Thanks
SJH
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jochn
User
Posts: 3
Joined: Sat Dec 05, 2020 8:49 pm

Re: How can one correct the OCR'd text in a document?

Post by Jochn »

TrackerSupp-Daniel wrote: Wed Dec 18, 2019 8:35 pm Text resulting from the OCR operation is "base content" text and thus can be edited with the "Edit text" tool, on the home tab:
Hi Daniel,

this helps, but the text is invisible, which makes it hard to correct. I discovered that I can view the text in the Content pane, which is helpful. It would be nice if I could edit the text directly in the Content pane.

In my experience, documents I want to OCR contain not only text but also images and some kind of logos. Mostly, parts of images or logos are recognized as some character, which is not an issue as long as the nearby text does not get messed up.

Sometimes a word is divided into multiple parts of text. Is there a way to combine these parts? If not, in addition to the text editing in the Content pane, it would be nice if one could combine text parts in the Content pane.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

Hi, Jochn

If you are running the latest version, the invisible text will become temporarily visible during editing. Yes, it would overlay the original and make it hard to read, but due to the layout in PDF, text wrapping and the like, it is not possible to edit text content directly, such as through the Content pane. The only way to do this is directly within the document.

As for combining text parts, the simplest way is to delete the extras and then edit the main portion of the text and add them back in manually.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jochn
User
Posts: 3
Joined: Sat Dec 05, 2020 8:49 pm

Re: How can one correct the OCR'd text in a document?

Post by Jochn »

TrackerSupp-Daniel wrote: Tue Dec 08, 2020 5:49 pm If you are running the latest version, the invisible text will become temporarily visible during editing.
Hi Daniel,

please have a look at the pictures of what I do, and the version and settings used.
I do not see the text I'm editing. Do I have to change some settings?
[Attached image: OCR_Setting.png]
These are the settings. I use the enhanced OCR engine in the high accuracy setting and have the editor create a new file with a text overlay as a searchable image. Why I do not use the other settings becomes clear in the result.
[Attached image: OCR_Result.png]
In the new file with the text overlay I can choose "Edit text"; the cursor position should be visible, but the text layer is still invisible to me.
Even if I double-click so that a selection is marked, the text layer stays invisible to me.
I can copy it, yes.
Pasted into a text editor, one can see why I want to keep the original image and have an invisible text layer.

The OCR replaced the 8 with a 6.

That is a big deal, especially if there is no reference like the original image. But this is a different topic.

Regarding manual OCR correction: I've worked with a competitor's software where one can click on so-called suspects and change them. While changing them, the recognized text and the original text are both displayed. A nice solution IMHO.
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Re: How can one correct the OCR'd text in a document?

Post by Dimitar - Tracker Supp »

Hello Jochn,

From the screenshot, I can see that you have set the Output to Searchable image. This is what makes the text layer invisible.

Please set it to "Editable Text and Images" or "Fine Content".

[Attached image: image.png]

Also, if the image quality is poor, please use the Enhance Scans option to try to improve it, in order to avoid most of the mistakes.

Regards.
Jochn
User
Posts: 3
Joined: Sat Dec 05, 2020 8:49 pm

Re: How can one correct the OCR'd text in a document?

Post by Jochn »

Dimitar - Tracker Supp wrote: Tue Dec 22, 2020 3:08 pm Please set it to "Editable Text and Images" or "Fine Content".
Hi Dimitar,

thank you for the reply. I think I explained why anything other than searchable image is not an option.

In addition to incorrect OCR, the other options cannot always handle logos and tables.
Sometimes tables are recognized, sometimes cells will be different from the original, logos are partly recognized as text symbols, ...
In short, anything other than plain text leads to a basically unusable document in my experience.

Please take the manual OCR correction of an invisible text layer as a request for future improvements.
Willy Van Nuffel
User
Posts: 2347
Joined: Wed Jan 18, 2006 12:10 pm

Re: How can one correct the OCR'd text in a document?

Post by Willy Van Nuffel »

So, a possible work-around is to 'temporarily' change the color-properties of the invisible text-layer, to make editing possible:
- activate the Content pane (View-menu > Other Panes > Content /or/ View-ribbon > Panes > Content)
- activate the Properties pane (View-menu > Other Panes > Properties Pane /or/ View-ribbon > Panes > Properties Pane)
- select all the (invisible) text in the Content pane
- in the Properties pane, change the Fill Color from "none" to "red" (for example)
- in the toolbar or in the Home-ribbon, click the Edit-icon, and click "Text"
- it should now be possible to make visible corrections in the text
(click once and then once again to put the cursor at the desired position in the text)
- once your corrections have been done, you can change the Fill Color of all the text back from "red" to "none"

Is this helpful for you?
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: How can one correct the OCR'd text in a document?

Post by TrackerSupp-Daniel »

Hi, Willy Van Nuffel

Thank you for providing a possible solution to Jochn's report here, I hope that it helps him.

Jochn, in the meantime, I think that Willy's suggestion might be the best option for you. I do need to note that with the release of V9 we are planning some fairly substantial updates to our Enhanced OCR capabilities, so this should hopefully be an interim method until that is available. V9 is planned for release in mid-January, so please do look forward to that update, and please check whether it handles your files better once it is available.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
