OCR pages contain text option  SOLVED

fletch
User
Posts: 79
Joined: Wed Mar 11, 2020 2:53 am

OCR pages contain text option

Post by fletch »

Trying to understand this option. Presumably SOME of my PDFs might contain SOME text, since they are sourced from a variety of places. So without examining EACH of them, I can't use that as an indicator of whether a file needs to be OCR'd or not.

The problem is that while processing 5 GB worth overnight, it apparently crashed: the UI was gone. So now I have to do small batches, since I don't know what remains unprocessed.
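
One way to keep "small batches" manageable outside the tool is to split the file list into groups capped by total size, so that each run is bounded and the remaining groups are known in advance. The following is a minimal Python sketch, assuming all PDFs live under one root folder. The names `find_pdfs` and `batches_by_size` and the size cap are illustrative only; none of this is a PDF-Tools feature.

```python
from pathlib import Path
from typing import Iterable, Iterator


def find_pdfs(root: Path) -> list[Path]:
    """Return every .pdf file under root (recursively), in sorted order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def batches_by_size(paths: Iterable[Path], max_bytes: int) -> Iterator[list[Path]]:
    """Yield lists of files whose combined size stays at or under max_bytes.

    A single file larger than max_bytes is yielded as a batch of its own.
    """
    batch: list[Path] = []
    total = 0
    for path in paths:
        size = path.stat().st_size
        # Close the current batch before it would exceed the cap.
        if batch and total + size > max_bytes:
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch
```

Each yielded batch could then be written to a text file or copied into its own input folder for one run, so a crash only leaves that one batch in doubt.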

Would be useful if a keyword could somehow be added during processing to indicate the document had been OCR'd.
TrackerSupp-Daniel
Site Admin
Posts: 8436
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR pages contain text option

Post by TrackerSupp-Daniel »

Hi, fletch

The OCR pages action now offers an option to skip documents already containing text.
image.png
If you set this option to "skip processing the document", the tool stops processing a document as soon as it finds any text content in it prior to OCR, which means it will not re-process any files you have already OCR'd.

Beyond that, the OCR pages tool will by default add "_(OCR-ed)" to the name of any files which have been processed. If you are using the default tool, you will be able to use this as an indicator.
image1.png
Note that when Tools is in batch processing mode (the default mode), all files are processed through each step before the next step begins. This means that if Tools crashed during the OCR step, it would never have begun the Save step.

To get around this, you can disable "batch processing mode" with the checkbox at the top of the tool:
image2.png
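
The difference between the two modes can be illustrated with a toy model. This is only a sketch of the ordering described above, not PDF-Tools' actual implementation: the `ocr` and `save` steps, the simulated crash, and all function names are hypothetical.

```python
from typing import Callable

Step = Callable[[str], str]
Runner = Callable[[list[str], list[Step]], list[str]]


def run_batch_mode(files: list[str], steps: list[Step]) -> list[str]:
    """Step-major order: all files pass through one step before the next step starts."""
    current = list(files)
    for step in steps:
        current = [step(item) for item in current]
    return current


def run_per_file_mode(files: list[str], steps: list[Step]) -> list[str]:
    """File-major order: each file passes through every step before the next file starts."""
    results = []
    for item in files:
        for step in steps:
            item = step(item)
        results.append(item)
    return results


def make_steps(saved: list[str], crash_on: str) -> list[Step]:
    """Build a hypothetical OCR step that fails on one file, plus a save step."""
    def ocr(name: str) -> str:
        if name == crash_on:
            raise RuntimeError(f"simulated crash while OCRing {name}")
        return name

    def save(name: str) -> str:
        saved.append(name)  # record that this file reached the save step
        return name

    return [ocr, save]


def saved_before_crash(runner: Runner, files: list[str], crash_on: str) -> list[str]:
    """Run the pipeline and report which files were saved before the crash."""
    saved: list[str] = []
    try:
        runner(files, make_steps(saved, crash_on))
    except RuntimeError:
        pass
    return saved
```

In this model, a crash on the third of three files leaves nothing saved in batch mode, while per-file mode has already saved the first two.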
I hope this helps!

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
fletch

Re: OCR pages contain text option

Post by fletch »

I've not looked, but I suspect SOME of my PDFs DO have text present in some form, yet not the FULL set of text that would be present after being OCR'd. So the safe method seemed to be to FORCE OCR on every document. Ok, so I just sampled a document at random, and "technically" it DOES seem to contain "text". So am I correct in assuming that this document would be skipped? If so, then you see my predicament.

image.png

I saw the default output name offered, but I just want to replace my existing files with OCR'd files. Now, if the tool would allow me to EXCLUDE files with OCR-ed in the name, THEN that would be useful. I also just noticed an option to alter the file properties, for example adding a keyword. If I could "program" the process to ignore files with a certain keyword, then I could re-do a mass scan at a later date and it would only OCR files that had not previously been OCR'd. Granted, once I get past this initial pain point, the proper thing to do, which I intend to make a habit of, is to OCR documents as they are added.


Regarding the batch processing option: in my case, what STEPS are there aside from the OCR step? Maybe Show Files is considered a "step". Either way, I think turning off Batch Processing makes sense for my purposes. After testing that, I think I see in the progress bar that it's doing each document one at a time. Ok, as I continue to ramble about Batch Processing, it's now clear why it probably crashed. It probably ran out of system resources, since 5 GB is a lot of OCR to cache somewhere before "saving" it. Just gotta get used to this tool. So I'll turn off Batch and re-run tonight. Though in my testing just now, when I hit stop during processing, the tool closed/crashed. Not terribly important though.
fletch

Re: OCR pages contain text option

Post by fletch »

Turned off batch and tried a small set of 119 files; when I came back, the app was closed.
TrackerSupp-Daniel

Re: OCR pages contain text option

Post by TrackerSupp-Daniel »

Hi, fletch

Regarding running OCR, you are correct that it would skip that document because there is that small amount of text present. As an alternative, you can change the naming scheme so that "OCR_" appears at the beginning of the file name, allowing you to sort those files alphabetically and quickly select all the files which do not have OCR at the beginning of the name. For the moment at least, special sorting or filtering as you have described is not available, but we may consider it in the future.

For batch processing, each "action" that you see in the list in the right column of Tools is a "step", for example "Choose input > OCR > Properties (skipped) > Save > View":
image1.png
I hope this helps.

As for the crash, to be certain, are you currently running the latest release of PDF-Tools? You can check which version and build are running via Help > About:
image.png
Check the build number you see there, and if it is not build 341.0, please update the software and then try again.

Once you are certain you are running the latest version, check whether the crash still occurs. If it does, please follow the steps here to set up crash logging, and then recreate the crash while processing once more. Once that is done, please send us a link to the file that was created, or upload it to our useruploads server here.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

fletch

Re: OCR pages contain text option

Post by fletch »

Ah, "steps" - I understand that better now - thanks.

Naming the files and sorting them so I can exclude them is not feasible. Because...

5,681,409,443 bytes in 17,872 file(s),
in 1,210 directories

:mrgreen:

I'm using the latest version. I'd installed it in a clean machine to test with my data before modifying the originals. If I'm able to repro the crash, I'll submit the details.
TrackerSupp-Daniel

OCR pages contain text option

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

fletch

Re: OCR pages contain text option

Post by fletch »

TrackerSupp-Daniel wrote: Thu Sep 24, 2020 5:42 pm
As an alternative, you can change the naming scheme so that "OCR_" appears at the beginning of the file name, allowing you to sort those files alphabetically and quickly select all the files which do not have OCR at the beginning of the name. For the moment at least, special sorting or filtering as you have described is not available, but we may consider it in the future.
As I mentioned in my later reply, that scheme will not work because I'm potentially scanning thousands of files across multiple folders, not just files in a single folder that I could sort by name to avoid selecting those named OCR_.
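
For a tree of this size, the list of files still lacking a marker could be built outside PDF-Tools with a recursive scan. The following is a minimal Python sketch under stated assumptions: the marker is either the "OCR_" prefix suggested above or the default "_(OCR-ed)" text placed at the end of the name before the extension, and a file counts as done if it carries a marker or has a marked copy in the same folder. The function names are illustrative, not part of any Tracker product.

```python
from pathlib import Path

OCR_PREFIX = "OCR_"       # naming scheme suggested in this thread
OCR_SUFFIX = "_(OCR-ed)"  # default marker; its position before ".pdf" is an assumption


def is_marked(path: Path) -> bool:
    """True if the file name carries either OCR marker."""
    stem = path.stem
    return stem.startswith(OCR_PREFIX) or stem.endswith(OCR_SUFFIX)


def source_stem(path: Path) -> str:
    """Strip the markers from a marked file's stem to recover the original stem."""
    stem = path.stem
    if stem.startswith(OCR_PREFIX):
        stem = stem[len(OCR_PREFIX):]
    if stem.endswith(OCR_SUFFIX):
        stem = stem[:-len(OCR_SUFFIX)]
    return stem


def unprocessed_pdfs(root: Path) -> list[Path]:
    """Return PDFs under root that are unmarked and have no marked copy beside them."""
    pdfs = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    # (folder, original stem) pairs for which an OCR'd output already exists.
    done = {(p.parent, source_stem(p).lower()) for p in pdfs if is_marked(p)}
    return sorted(
        p for p in pdfs
        if not is_marked(p) and (p.parent, p.stem.lower()) not in done
    )
```

The resulting list could be written to a file or used to copy the pending PDFs into a staging folder that is then given to the OCR tool as its input.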

So, I just now accidentally discovered more capabilities while "editing" a cloned tool. The Actions Library is not something I'd seen before. I thought I'd found a solution in Filter Files, but it only allows me to select files, not exclude them. A checkbox to exclude matching files would do it. Maybe a future enhancement.

https://help.pdf-xchange.com/pdfxt8/filter-files.html?zoom_highlightsub=filter+files
fletch

Re: OCR pages contain text option

Post by fletch »

image.png
TrackerSupp-Daniel

Re: OCR pages contain text option  SOLVED

Post by TrackerSupp-Daniel »

Hi, fletch

Sorry about the delay in replying here. I have created a formal feature request on this for you:

#5377: FR: Tools "filter" to skip documents containing keywords

As usual, while I cannot make any guarantees of implementation or timelines, I can promise that the dev team will seriously consider this option when we are next looking at new features.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

fletch

Re: OCR pages contain text option

Post by fletch »

Understood and appreciated.
TrackerSupp-Daniel

OCR pages contain text option

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
