Bulk OCR Existing Files in Folder  SOLVED

This Forum is for the use of End Users requiring help and assistance for Tracker Software's PDF-Tools.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

bqxmprij
User
Posts: 162
Joined: Tue Dec 18, 2012 3:51 am

Bulk OCR Existing Files in Folder

Post by bqxmprij »

I have a lot of PDF files to review in a folder and its subfolders, and I want to OCR everything in them. I don't understand the save option in PDF-Tools. The "Save Document" part of the OCR tool doesn't seem to have an option to OCR each document and save it without renaming it or saving a new document. How do I OCR an existing file, save it in place, and move on to the next?
Ovg
User
Posts: 461
Joined: Tue Sep 05, 2017 4:56 pm

Re: Bulk OCR Existing Files in Folder

Post by Ovg »

20210502_183622.png
Last edited by Ovg on Sun May 02, 2021 3:41 pm, edited 1 time in total.
It's impossible to lead us astray for we don't care even to choose the way.
PDF-XChange PRO, 10.1.1 (Build 381) / W7 SP1 x64
bqxmprij
User
Posts: 162
Joined: Tue Dec 18, 2012 3:51 am

Re: Bulk OCR Existing Files in Folder

Post by bqxmprij »

Ovg,

Thank you for your post. I agree. That is the window and the option in the bottom right. See how it will save a new file with an OCR name? I don't want that. I want PDF-Tools to open the file, OCR it, save it, and move on without creating new files or changing the file name.
Ovg
User
Posts: 461
Joined: Tue Sep 05, 2017 4:56 pm

Re: Bulk OCR Existing Files in Folder  SOLVED

Post by Ovg »

20210502_185334.png
bqxmprij
User
Posts: 162
Joined: Tue Dec 18, 2012 3:51 am

Re: Bulk OCR Existing Files in Folder

Post by bqxmprij »

Ovg, you are the best! For some reason I didn't think of just saving it with the same file name.

Now, I am wondering why some documents didn't OCR, but that is another issue.
Ovg
User
Posts: 461
Joined: Tue Sep 05, 2017 4:56 pm

Re: Bulk OCR Existing Files in Folder

Post by Ovg »

bqxmprij wrote: Sun May 02, 2021 7:52 pm Now, I am wondering why some documents didn't OCR, but that is another issue.

Hi, bqxmprij
Check OCR settings:

20210503_102850.png
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London
Contact:

Re: Bulk OCR Existing Files in Folder

Post by Tracker Supp-Stefan »

Hello Ovg,

Many thanks for the help! Indeed, that might be the reason why some files were skipped for bqxmprij.

@bqxmprij - please let us know whether Ovg's suggestion helped you sort everything out.

Kind regards,
Stefan
bqxmprij
User
Posts: 162
Joined: Tue Dec 18, 2012 3:51 am

Re: Bulk OCR Existing Files in Folder

Post by bqxmprij »

Of the three options, I used "do not OCR but continue processing." I don't know why some were not OCR'd.

I think there are 3 types of documents:
1. Documents with full text (e.g., computer-generated PDFs) or any text.
2. Documents with no text (e.g., a scan).
3. Documents with some text and also some areas that have no text but could be OCR'd.

I think the options only contemplate 1 and 2. How do you OCR a document in category 3? In other words, I think we need (or let me know of) an option that reviews a document and OCRs non-text areas that could be OCR'd and ignores areas that already have text.
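
To make the three categories concrete, here is a minimal Python sketch. It is only an illustration of the distinction described above, not anything PDF-Tools exposes: the `Page` class, its two flags, and the `classify` function are hypothetical names, and the sketch assumes you already know, per page, whether it has a text layer and whether it has scanned content without text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical flags; how you obtain them is outside this sketch.
    has_text: bool          # the page already carries text
    has_scanned_area: bool  # the page has image content with no text over it


def classify(pages: List[Page]) -> int:
    """Return 1, 2 or 3, matching the categories listed in the post."""
    any_text = any(p.has_text for p in pages)
    any_scan = any(p.has_scanned_area for p in pages)
    if not any_text:
        return 2  # no text anywhere, e.g. a plain scan
    if any_scan:
        return 3  # some text, plus areas that could still be OCR'd
    return 1      # text only, nothing left to OCR


# A born-digital cover page followed by a scanned attachment is category 3.
mixed = [Page(has_text=True, has_scanned_area=False),
         Page(has_text=False, has_scanned_area=True)]
print(classify(mixed))  # 3
```

A single page that has both text and an untexted scanned region also lands in category 3, which is the case the question is really about.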
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: Bulk OCR Existing Files in Folder

Post by TrackerSupp-Daniel »

Hi, bqxmprij

To accomplish that, you would need to use the "OCR document" option instead of the "do not OCR" option. "Do not OCR" automatically skips any document that contains any text-based content at all, whereas "OCR document" processes every file, including those that already contain text, so the tool will take extra time.
With the "OCR document" option enabled, click "more options" and check the options you need:
image.png
-The "skip pages" option skips any page that contains any text-based content at all, so enabling it would likely skip some pages in category 3 documents.
-The "Ignore existing text on page" option instead processes the entire page and skips areas where text already exists (meaning you will not get overlapping text). It is the slowest of the options presented, but it also gives the most complete result.
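
As a rough illustration of why a mixed document is skipped in one mode but not the other, the behaviour described above can be modelled in a few lines of Python. This is a sketch of the logic as explained in this post, not the tool's actual implementation; the function name, the mode strings, and the `skip_pages_with_text` flag are invented for the example, and only the two modes named here are modelled.

```python
from typing import List


def pages_sent_to_ocr(page_has_text: List[bool],
                      mode: str,
                      skip_pages_with_text: bool = False) -> List[int]:
    """Return the indexes of the pages that would be OCR'd.

    page_has_text[i] is True when page i already contains any text.
    """
    all_pages = list(range(len(page_has_text)))
    if mode == "do_not_ocr":
        # Any text anywhere in the document means the whole file is skipped.
        return [] if any(page_has_text) else all_pages
    if mode == "ocr_document":
        if skip_pages_with_text:
            # Pages that contain any text at all are left alone.
            return [i for i, has_text in enumerate(page_has_text) if not has_text]
        # Otherwise every page is processed (with "Ignore existing text on
        # page", areas that already have text are skipped within each page).
        return all_pages
    raise ValueError(f"unknown mode: {mode}")


# A three-page document where only the middle page is a bare scan.
mixed = [True, False, True]
print(pages_sent_to_ocr(mixed, "do_not_ocr"))                               # []
print(pages_sent_to_ocr(mixed, "ocr_document", skip_pages_with_text=True))  # [1]
print(pages_sent_to_ocr(mixed, "ocr_document"))                             # [0, 1, 2]
```

The first result is the situation from the earlier post: with "do not OCR", a document that has text on any page is never OCR'd at all.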

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
bqxmprij
User
Posts: 162
Joined: Tue Dec 18, 2012 3:51 am

Re: Bulk OCR Existing Files in Folder

Post by bqxmprij »

So, operator error.

Thank you!
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Bulk OCR Existing Files in Folder

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: Bulk OCR Existing Files in Folder

Post by Jensen Head »

TrackerSupp-Daniel wrote: Mon May 03, 2021 7:13 pm The "Ignore existing text on page" option will instead process the entire page, and skip areas where text already exists (meaning you will not get overlapping text).
I would add that at the moment the "Ignore existing text on page" option does not take invisible text into account, i.e. text produced with the "Output Options" / "Type: Searchable Image" setting. The application therefore treats the text in the images as unrecognized and recognizes it again, duplicating the existing text blocks. This can be undesirable for two reasons. First, when you copy several paragraphs and paste them into another application, you can end up with consecutively repeated pieces of text. You may not notice this (very bad), or you may spend time fixing the broken text fragment (bad). Second, after indexing by some search engines, the preview in web search results shows "iiggnnoorree eexxiissttiinngg tteexxtt oonn ppaaggee" instead of "ignore existing text on page".

This problem was discussed in the topics "OCR option — ignore existing text on page" (#35211) and "Multiple OCR Runs —> Duplicate Text Objects" (#34214).
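
If you want to check extracted text for the doubled-character symptom, something like the following Python sketch could help. It is not part of PDF-Tools; the helper names `is_doubled` and `undouble` are my own, and it assumes every character in an affected word is repeated exactly twice, as in the search-preview example.

```python
def is_doubled(word: str) -> bool:
    """True when the word is made only of adjacent identical pairs, e.g. 'tteexxtt'."""
    if not word or len(word) % 2 != 0:
        return False
    return all(word[i] == word[i + 1] for i in range(0, len(word), 2))


def undouble(text: str) -> str:
    """Collapse doubled words back to single characters; leave other words alone."""
    words = text.split(" ")
    return " ".join(w[::2] if is_doubled(w) else w for w in words)


print(undouble("iiggnnoorree eexxiissttiinngg tteexxtt oonn ppaaggee"))
# ignore existing text on page
```

Treat this as a detection aid only: a genuine token made entirely of pairs (for example "oo") would be collapsed too, so it is safer to flag such text for review than to rewrite it automatically.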
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: Bulk OCR Existing Files in Folder

Post by TrackerSupp-Daniel »

Hello, Jensen Head

Are you running the latest release (366.0)? That issue should already have been fixed. I will need to run some tests, but last I checked, it was properly ignoring all areas of the page that contain text content, visible or otherwise.

[Update: I ran that test in the current release, and it seems the handling was changed. If there is invisible text, it is entirely replaced with the editable text. So the duplication you were worried about does not occur, but you are correct that the invisible text is not ignored; it is simply replaced. It seems only visible text areas are actually ignored.]

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Jensen Head
User
Posts: 412
Joined: Mon Sep 13, 2021 8:12 am

Re: Bulk OCR Existing Files in Folder

Post by Jensen Head »

You're right, in version 366.0, recognizing a "Searchable Image" document with the "Ignore existing text on page" checkbox disabled does not result in duplicate text blocks. Thank you!
Dimitar - Tracker Supp
Site Admin
Posts: 1778
Joined: Mon Jan 15, 2018 9:01 am

Bulk OCR Existing Files in Folder

Post by Dimitar - Tracker Supp »

:)