How to filter files you don't want / filter type "leave all except this type" [SOLVED]

Post by **cunha00** » Wed Nov 06, 2019 4:14 am

I would like to filter out the files I have already run OCR on. They all have the suffix -OCR.PDF

I would like to do something like: *.* NOT *-OCR.PDF
Or: -*-OCR.PDF

Is there some wildcard I can use to achieve this?

Wed Nov 06, 2019 10:44 am

Hello, cunha00

Unfortunately, for now it is not possible to exclude (i.e. skip processing of) files that contain specific characters in the file name.
You can only filter for files whose names match a pattern.

It is not clear exactly what kind of work you want to do with PDF-Tools (only filter files, or apply some action too).

In general, I can suggest some possible workarounds for you:
1. If you are using the "OCR Pages" action from PDF-Tools and your documents that have not been OCR-ed do not contain text, you can use this option of the "OCR Pages" action: "If document contains text - Skip processing the document".

2. You can filter the files in advance using the Windows Command Line and build a *.pdtfl file that can be passed as input to any Tool.
For example, you can use this command:

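dir /s /b /a-d *.pdf | findstr /i /v /e /c:"-OCR.pdf" > input.pdtfl

where "-OCR.pdf" is the ending of the file names you want to ignore. Here /i makes the comparison case-insensitive (so -OCR.PDF is excluded too), /e matches only at the end of the line, /c: treats the text as a literal string, and /v keeps only the lines that do not match. The single ">" overwrites input.pdtfl, so re-running the command does not append duplicate entries.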

After you have created input.pdtfl, you can pass it as input to any Tool in PDF-Tools (either in the UI or on the Command Line):
- In the UI, turn on the "Select file list" option in the "Choose Input Files" action.
- On the Command Line, pass it as an argument of the /RunTool command, e.g. "PDFXTools.exe /RunTool <tool-id> input.pdtfl"
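
If the folder tree is easier to handle from a script than from the Command Line, the same file list could be built with something like the Python sketch below. It is only an illustration, not part of PDF-Tools: the function name build_file_list is made up for this example, and it assumes the file list is plain text with one full path per line (which is what the redirected dir output produces). The UTF-8 encoding is also an assumption, so adjust it if the Tool does not read the list correctly.

```python
import os
from pathlib import Path


def build_file_list(root, list_path, skip_suffix="-ocr.pdf", encoding="utf-8"):
    """Collect every PDF under root whose name does NOT end with skip_suffix.

    Hypothetical helper: assumes the list file is plain text, one path per line.
    Returns the list of paths that were written.
    """
    root = Path(root).resolve()
    skip_suffix = skip_suffix.lower()
    kept = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            lower = name.lower()  # compare case-insensitively, like findstr /i
            if lower.endswith(".pdf") and not lower.endswith(skip_suffix):
                kept.append(str(Path(folder) / name))
    # One absolute path per line, mirroring the output of "dir /s /b".
    Path(list_path).write_text("".join(p + "\n" for p in kept), encoding=encoding)
    return kept
```

For example, build_file_list(r"D:\Cases", "input.pdtfl") would scan the (hypothetical) folder D:\Cases and write input.pdtfl to the current directory, ready to be passed to /RunTool as described above.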

Best regards,
vmgoshko

Post by **cunha00** » Wed Nov 06, 2019 12:13 pm

I think I'll take your suggestions.

In my case it is a very large nested folder with legal documents in it, and I don't want the OCR to be redone. I can't use the first option because some PDFs have pages with OCR and other pages without it, but creating the file list from the command line will work for me!

Thanks for the feedback and available options

Post by **TrackerSupp-Daniel** » Wed Nov 06, 2019 8:27 pm

If some documents already contain text, you can set up the OCR action itself to "skip pages with text content" or "ignore existing text on page", so that none of the text is duplicated.

Using that in conjunction with the prior suggested command line options should help you work through all of the files there without altering the existing text.
I hope this helps!
