Bulk OCR Existing Files in Folder SOLVED
Bulk OCR Existing Files in Folder
I have a lot of PDF files to review in a folder and its subfolders, and I want to OCR everything in them. I don't understand the save option in PDF-Tools. The "Save Document" part of the OCR tool doesn't seem to have an option to OCR each document and save it without renaming it or creating a new document. How do I OCR an existing file, save it, and move on to the next?
Re: Bulk OCR Existing Files in Folder
Last edited by Ovg on Sun May 02, 2021 3:41 pm, edited 1 time in total.
It's impossible to lead us astray for we don't care even to choose the way.
PDF-XChange PRO, 10.1.1 (Build 381) / W7 SP1 x64
Re: Bulk OCR Existing Files in Folder
Ovg,
Thank you for your post. I agree. That is the window and the option in the bottom right. See how it will save a new file with an OCR name? I don't want that. I want PDF-Tools to open the file, OCR it, save it, and move on without creating new files or changing the file name.
Re: Bulk OCR Existing Files in Folder SOLVED
Re: Bulk OCR Existing Files in Folder
Ovg, you are the best! For some reason I didn't think of just saving it with the same file name.
Now, I am wondering why some documents didn't OCR, but that is another issue.
Re: Bulk OCR Existing Files in Folder
Hi, bqxmprij
Check OCR settings:
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Bulk OCR Existing Files in Folder
Hello Ovg,
Many thanks for the help! Indeed, that might be the reason why some files were skipped for bqxmprij.
@bqxmprij - please let us know whether Ovg's suggestion helped you sort everything out.
Kind regards,
Stefan
Re: Bulk OCR Existing Files in Folder
Of the three options, I used "do not OCR but continue processing." I don't know why some were not OCR'd.
I think there are 3 types of documents:
1. Documents with full text (e.g., computer generated pdfs) or any text.
2. Documents with no text (e.g., a scan).
3. Documents with some text and also some areas that could be OCR'd but have no text.
I think the options only contemplate 1 and 2. How do you OCR a document in category 3? In other words, I think we need (or let me know of) an option that reviews a document and OCRs non-text areas that could be OCR'd and ignores areas that already have text.
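The three categories above can be stated a little more precisely with a small Python sketch. It is only an illustration: the `Page` record and its two flags are hypothetical stand-ins for whatever a PDF library would report, not anything PDF-Tools exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical per-page facts; a real PDF library would have to supply them.
    has_text: bool           # the page carries some text content
    has_untexted_scan: bool  # the page has image areas with no text behind them


def classify(pages: List[Page]) -> int:
    """Return 1, 2, or 3, following the categories listed in the post."""
    any_text = any(p.has_text for p in pages)
    any_scan = any(p.has_untexted_scan for p in pages)
    if any_text and any_scan:
        return 3  # some text, plus areas that could still be OCR'd
    if any_text:
        return 1  # text is present and nothing is left to OCR
    return 2      # no text at all, e.g. a plain scan
```

A document made of one computer-generated page and one scanned page would land in category 3 here, which is the case the question is about.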
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Bulk OCR Existing Files in Folder
Hi, bqxmprij
To accomplish that, you would need to use the "OCR document" option instead of the "do not OCR" option, which automatically skips any document containing any text-based content at all. Yes, this does mean that all files, even those already containing text, will be processed, so the tool will take extra time.
With the OCR document function enabled, click "more options" and check the options you need:
- The "skip pages" option will skip processing any page which contains any text-based content at all, so enabling this would likely result in you skipping some pages in category 3 documents.
- The "Ignore existing text on page" option will instead process the entire page and skip areas where text already exists (meaning you will not get overlapping text). This process is the longest of the options presented to you, but will also give the most complete result.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
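As a rough way to compare the three behaviours described in the post above, here is a minimal Python sketch that models which pages would be sent to OCR in each case. It only mirrors the description given in this thread; the mode labels and the `pages_sent_to_ocr` function are hypothetical names for illustration, not identifiers or settings from PDF-Tools.

```python
from typing import List, Set

# Hypothetical labels for the behaviours described in the post above;
# they are not identifiers from PDF-Tools.
DO_NOT_OCR = "do not OCR"
SKIP_PAGES = "OCR document + skip pages"
IGNORE_EXISTING = "OCR document + ignore existing text on page"


def pages_sent_to_ocr(page_has_text: List[bool], mode: str) -> Set[int]:
    """Return the 0-based page indexes that get OCR'd under each mode."""
    if mode == DO_NOT_OCR:
        # Any text anywhere in the document means the whole file is skipped.
        if any(page_has_text):
            return set()
        return set(range(len(page_has_text)))
    if mode == SKIP_PAGES:
        # Pages that already contain any text are left untouched.
        return {i for i, has_text in enumerate(page_has_text) if not has_text}
    if mode == IGNORE_EXISTING:
        # Every page is processed; only text-free areas gain new text.
        return set(range(len(page_has_text)))
    raise ValueError(f"unknown mode: {mode}")
```

For a three-page document where only the middle page is a bare scan (`[True, False, True]`), the first mode processes nothing, the second processes only the middle page, and the third processes all three pages, which is why it takes the longest.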
Re: Bulk OCR Existing Files in Folder
So, operator error.
Thank you!
- Jensen Head
- User
- Posts: 412
- Joined: Mon Sep 13, 2021 8:12 am
Re: Bulk OCR Existing Files in Folder
TrackerSupp-Daniel wrote (Mon May 03, 2021 7:13 pm):
The "Ignore existing text on page" option will instead process the entire page, and skip areas which text already exists (meaning you will not get overlapping text).
I would add that at the moment the "Ignore existing text on page" option does not take into account invisible text, i.e. text obtained using the "Output Options" / "Type: Searchable Image" setting. Thus, the application treats the text in the images as unrecognized and recognizes it again, duplicating the already existing text blocks. This may be undesirable for two reasons. First, when copying several paragraphs and pasting them into another application, you can end up with consecutively repeating pieces of text. You may not notice this (very bad), or spend time fixing a broken text fragment (bad). Second, after the file is indexed by some search engines, instead of the text "ignore existing text on page" in the preview in the web search results, you will get "iiggnnoorree eexxiissttiinngg tteexxtt oonn ppaaggee".
This problem was discussed in the topics "OCR option — ignore existing text on page" (#35211) and "Multiple OCR Runs —> Duplicate Text Objects" (#34214).
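The doubled-letter effect described above is easy to reproduce and to screen for. The sketch below is a small Python illustration with hypothetical helper names: `double_letters` rebuilds the garbled preview string from the post, and `looks_doubled` is a rough heuristic for spotting such strings in extracted text. It is not how any search engine or PDF-Tools actually works, and the heuristic will also flag a genuine word made only of repeated pairs.

```python
def double_letters(text: str) -> str:
    """Repeat every character of every word, keeping single spaces."""
    return " ".join("".join(ch * 2 for ch in word) for word in text.split())


def looks_doubled(text: str) -> bool:
    """Heuristic: every word consists of identical adjacent character pairs."""
    words = text.split()
    if not words:
        return False
    return all(
        len(w) % 2 == 0 and all(w[i] == w[i + 1] for i in range(0, len(w), 2))
        for w in words
    )


sample = double_letters("ignore existing text on page")
print(sample)                 # iiggnnoorree eexxiissttiinngg tteexxtt oonn ppaaggee
print(looks_doubled(sample))  # True
```

Running a check like this over text copied from a few processed files is one way to notice duplicated text layers before they reach a search index.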
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Bulk OCR Existing Files in Folder
Hello, Jensen Head
Are you running the latest release (366.0)? That issue should have been fixed already. I will need to run some tests, but last I checked, it was properly ignoring all areas of the page that contain text content, visible or otherwise.
[Update: I ran that test in the current release, and it seems the handling for this was changed: if there is invisible text, it is entirely replaced with the editable text. So no duplication occurs as you were worried about, but you are correct that it is not ignored; it is simply replaced. It seems only visible text areas are actually ignored.]
Kind regards,
Dan McIntyre - Support Technician
- Jensen Head
- User
- Posts: 412
- Joined: Mon Sep 13, 2021 8:12 am
Re: Bulk OCR Existing Files in Folder
You're right, in version 366.0, recognizing a "Searchable Image" document with the "Ignore existing text on page" checkbox disabled does not result in duplicate text blocks. Thank you!