Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendations?

This Forum is for the use of End Users requiring help and assistance for Tracker Software's PDF-Tools.

NoteApplicabble
User
Posts: 9
Joined: Sat Oct 02, 2021 8:52 pm

Post by NoteApplicabble » Tue Oct 12, 2021 8:18 pm

I am trying to digitally sort ballot images by precinct, for locations where voters from multiple precincts voted at a single site (early voting and absentee ballots, where they were all mixed together). Ballot images were not saved during the election, and the pages are not being presorted when they are re-scanned.

I'm hoping someone might have a recommendation on how to tighten up the accuracy.

Steps I'm attempting to use PDF-Tools to facilitate this process:
  1. Split 100-400 page batches of ballot scans into individual pages and place them in a single folder.
  2. Run OCR on the individual pages after they have been split.
  3. After OCR has been applied, search the contents of the PDF documents within Windows 10, with 100% accuracy, for a precinct number and ballot variation.
    (Example: searching for "1615-A", where 1615 is the precinct number in the top right corner of the ballot and A is the variation of the ballot items for different voters within that one precinct.)

    By doing this I want to find every ballot page image for that precinct to be able to move it into an individual Windows folder with contents limited to only that precinct.
About 1% of the ballots are being missed, either because the OCR process misreads the precinct number and ballot variation, or because other applications can't seem to search the document at all.

Some precinct numbers and ballot variations are being interpreted like the following: "2 117 -A" instead of "2117-A", and "1 4 1 9 -A" instead of "1419-A".
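One way to make a search tolerant of stray spaces like those above is to normalize the OCR text before matching. The following is a minimal sketch, not a PDF-Tools feature: it assumes the page text has already been extracted to a string, and that every ID has the shape seen in this thread (four digits, a hyphen, one capital letter). The names `PRECINCT_RE` and `find_precinct_ids` are hypothetical.

```python
import re

# Four digits, a hyphen, and one capital letter, allowing stray whitespace
# between any two characters (as in "2 117 -A" or "1 4 1 9 -A").
# ASSUMPTION: the 4-digit + single-letter shape is taken from the examples
# in this thread; adjust it if real IDs differ.
PRECINCT_RE = re.compile(
    r"(?<!\d)(\d)\s*(\d)\s*(\d)\s*(\d)\s*-\s*([A-Z])(?![A-Za-z])"
)

def find_precinct_ids(text: str) -> list[str]:
    """Return normalized IDs such as '2117-A' found in OCR text."""
    return [
        "".join(m.groups()[:4]) + "-" + m.group(5)
        for m in PRECINCT_RE.finditer(text)
    ]
```

A pattern this loose can also merge unrelated digits that happen to sit next to each other on the page, so its matches are probably best treated as candidates to confirm rather than as final answers.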

Some documents are searchable in PDF-XChange Editor but not in Adobe Acrobat Reader DC. I wonder whether whatever is preventing Adobe from searching the document is also the reason Windows 10 can't find its contents.

I'm currently testing with 564 pages across three large batches, and on each run about 4 ballots fail to be identified.
(Not a big deal? It is if I am trying to get the kinks worked out in preparation to handle +/-200,000 ballots, approx 4 pages each.)
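To put that in scale, here is a rough linear projection using only the figures from this post (about 4 misses per 564 pages, and roughly 200,000 ballots at about 4 pages each). It is a sketch that assumes the miss rate stays constant across the full job, which may not hold; `projected_misses` is a hypothetical helper name.

```python
def projected_misses(missed: int, sample_pages: int, total_pages: int) -> int:
    """Scale a sample miss count linearly to a larger job (whole pages)."""
    return round(missed / sample_pages * total_pages)

# ASSUMPTION: about 200,000 ballots at roughly 4 pages each, per the post.
total_pages = 200_000 * 4

print(projected_misses(4, 564, total_pages))  # prints 5674
```

At that rate, several thousand pages would need manual handling, which is why even a miss rate under 1% matters here.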
COVID STE BATCH 2_203_(OCR).pdf
(165.67 KiB) Downloaded 7 times
EOCR - Selected Options.png
COVID STE BATCH 1_151_(OCR).pdf
(87.13 KiB) Downloaded 4 times
PDF-Tools (Tool Info Arrangement).png

TrackerSupp-Daniel
Site Admin
Posts: 4989
Joined: Wed Jan 03, 2018 6:52 pm

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by TrackerSupp-Daniel » Tue Oct 12, 2021 9:57 pm

Hello, NoteApplicabble

First, I should note that the quality of these scans is quite sub-par, and so any OCR engine will struggle to scan all the contents properly, though the titles should be coming out just fine. If at all possible, you should try to get these re-scanned at a higher quality, as that will be the most easily noticeable improvement to the OCR output.

Second, choosing "Accuracy: High" is not what should be done here. The Accuracy setting describes the document quality, so "High" accuracy means we will assume that the file is pristine, with no blemishes and with clear, easily identified characters (no anti-aliasing or other image effects; nine times out of ten, this means an image which has never left the digital format). You should be using "Auto" for these documents or, if that does not work, possibly even "Low" accuracy, to see which gives the best results.

Third, regarding search after OCR, you will want to double check that our "search" handler is enabled, by selecting our "IFilter handler" in our shell extensions setup utility, as detailed here, and then restarting the PC.
Note that in some rare cases, Windows indexing can interfere with IFilter and prevent us from utilizing our features. If you find that you cannot search even after doing this, please disable Windows indexing for the folder you are searching, and see if that helps at all.

Kind regards,
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Support: <Support@tracker-software.com>
Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623

NoteApplicabble
User
Posts: 9
Joined: Sat Oct 02, 2021 8:52 pm

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by NoteApplicabble » Wed Oct 13, 2021 4:55 pm

Getting the same issue with original 600dpi resolution. See attached.

I batched those through last night and have 13 pages where the "Enhanced (FineReader)" EOCR did not identify the word "Precinct", any individual digit of the precinct number, or the "-" or "A" at the end.
The OCR "Recognition Options" "Accuracy" setting was set to "High".

Using the PDF to TXT option, I can't find any of those characters in the output either: neither the precinct number nor the ballot variation "A".
(I can't upload .txt files to show that the text output also fails to pick up the precinct number or ballot identifier letter.)

In part, I'm getting better results (no phantom spaces etc. inserted into the middle of my precinct numbers), but in other respects the results are worse: on some pages the OCR completely misses the word "Precinct", all four digits of the precinct number, and the ballot variation "-(letter)".

Any help or advice would be appreciated from community or developers, etc.
COVID STE BATCH 1_29_(OCR).pdf
(1.43 MiB) Downloaded 2 times
Attachments
COVID STE BATCH 1_45_(OCR).pdf
(1.43 MiB) Downloaded 3 times

NoteApplicabble
User
Posts: 9
Joined: Sat Oct 02, 2021 8:52 pm

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by NoteApplicabble » Wed Oct 13, 2021 9:39 pm

TrackerSupp-Daniel wrote:
Tue Oct 12, 2021 9:57 pm

First, I should note that the quality of these scans is quite sub-par, and so any OCR engine will struggle to scan all the contents properly, though the titles should be coming out just fine...

Second, Choosing "Accuracy: High" is not what should be done here...

Third, regarding search after OCR, you will want to double check that our "search" handler is enabled, by selecting our "IFilter handler" in our shell extensions setup utility, as detailed "... URL...", and then restarting the PC...
Note that in some rare cases, windows indexing can interfere with Ifilter and prevent us from utilizing our features. If you find that you cannot search even after doing this, please disable windows indexing for the folder you are searching, and see if that helps at all.

Testing now with 600dpi.
Essentially, the consistency has increased. I now reliably get issues in 13 of the 564 pages no matter which OCR settings I try. My first test, with Accuracy left at High for the 600dpi scans (against TrackerSupp-Daniel's advice), gave similar problems finding text.

I then tested the same 600dpi scans with the OCR Accuracy set to Auto. I still get 13 files whose OCR text doesn't include the precinct number or ballot variation letter, and which therefore can't be found by searching.
COVID STE BATCH 1__65__(OCR).pdf
(1.43 MiB) Downloaded 5 times
I'm attaching a few of the pages where searching for even a single digit of the precinct number finds nothing.

I've migrated over to demoing File Juggler ... for my file sorting efforts, provided I can get OCR to detect all of the precinct numbers and ballot variation letters. (I did run the "XCShInfoSetup.exe" utility and chose "Select All", but I did not see a drop-down or box for a "Search" handler specifically. They all had other names.)

(I need to find a thread around here where someone wanted to know if PDF-Tools etc. could search the contents of files and sort by them, in case File Juggler might be a useful temporary companion for them while that feature is not included in Tracker Software's products to date.)
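For anyone wanting to script the content-based sorting described above, here is a minimal sketch. It is not how File Juggler or PDF-Tools work internally. It assumes each page PDF has a same-named `.txt` file next to it holding the extracted text (for example from a PDF to TXT export), and that IDs look like "1615-A". The function name, folder layout, and `_unmatched` folder are all hypothetical.

```python
import re
import shutil
from pathlib import Path

# ASSUMPTION: IDs are four digits, a hyphen, one capital letter ("1615-A").
ID_RE = re.compile(r"(?<!\d)(\d{4})-([A-Z])(?![A-Za-z])")

def sort_by_precinct(src: Path, dest: Path) -> dict[str, list[str]]:
    """Move each PDF in src into dest/<ID>/ based on its sidecar .txt file.

    Pages with no ID, or with more than one distinct ID, are moved to
    dest/_unmatched/ so they can be reviewed by hand.
    """
    moved: dict[str, list[str]] = {}
    for pdf in sorted(src.glob("*.pdf")):
        txt = pdf.with_suffix(".txt")  # hypothetical sidecar text export
        text = txt.read_text(encoding="utf-8", errors="ignore") if txt.exists() else ""
        ids = {f"{m.group(1)}-{m.group(2)}" for m in ID_RE.finditer(text)}
        folder = ids.pop() if len(ids) == 1 else "_unmatched"
        target = dest / folder
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))
        moved.setdefault(folder, []).append(pdf.name)
    return moved
```

Routing ambiguous pages to a review folder, rather than guessing, keeps OCR misses visible instead of silently leaving them out of a precinct.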
Attachments
COVID STE BATCH 1__45__(OCR).pdf
(1.43 MiB) Downloaded 3 times
COVID STE BATCH 1__29__(OCR).pdf
(1.43 MiB) Downloaded 2 times

NoteApplicabble
User
Posts: 9
Joined: Sat Oct 02, 2021 8:52 pm

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by NoteApplicabble » Wed Oct 13, 2021 10:31 pm

Using the 600dpi PDF pages and setting the OCR Accuracy to "Low", it only missed one page in the search. I'm attaching that page.
COVID STE BATCH 2__153__(OCR).pdf
(1.9 MiB) Downloaded 2 times
If I open that file in "PDF-XChange Editor" and run "Convert" with "OCR Page(s)" using the "Low" accuracy setting, I can search for "1516-A" and locate the precinct number and the ballot variation letter.
COVID STE BATCH 2__153__(OCR) Searchable.pdf
(1.9 MiB) Downloaded 3 times
When I experimented with running the previously missed (and attached) PDF page back through the PDF-Tools OCR with the "Low" accuracy setting a second time, I still could not search for and find the "1516-A" precinct number.

Then I tried opening the document from that unsuccessful OCR attempt in "PDF-XChange Editor" and repeating the OCR that had succeeded before. This second try also failed: searching could not find any digit of "1516-A".

Bottom line now, I can find the precinct on one of the attached versions of the same PDF document but not the other one. I just don't know how to make sure I get the searchable result every single time.

I'm going to go back and repeat the batch using the "Low" accuracy OCR setting on all 564 pages and see if it misses the exact same and only page on the first batch attempt.

NoteApplicabble
User
Posts: 9
Joined: Sat Oct 02, 2021 8:52 pm

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by NoteApplicabble » Wed Oct 13, 2021 10:56 pm

I went back and took the PDF file from the 564-page EOCR batch that was previously not searchable for the precinct, and ran it through a separate batch using the Default OCR. I was then able to search for and find the "1516-A" precinct number both in PDF-XChange Editor and in Windows.

I guess I need to run back through and see if the Default OCR is better for all 600dpi pages or just the pages with text that was missed by the EOCR. The batch that used the EOCR was taking under 30 minutes. I'll see how long and how successful the Default OCR is on the 564 pages on Low accuracy.
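The "second engine only for the missed pages" idea can be sketched as a simple fallback. This assumes the text from each OCR run has already been extracted into a dictionary keyed by page file name. Nothing here calls PDF-Tools itself, and `extract_id` and `merge_ocr_runs` are hypothetical names.

```python
import re
from typing import Optional

# ASSUMPTION: IDs are four digits, a hyphen, one capital letter ("1516-A").
ID_RE = re.compile(r"(?<!\d)(\d{4})-([A-Z])(?![A-Za-z])")

def extract_id(text: str) -> Optional[str]:
    """Return the first ID found in the text, or None if there is none."""
    m = ID_RE.search(text)
    return f"{m.group(1)}-{m.group(2)}" if m else None

def merge_ocr_runs(
    primary: dict[str, str], fallback: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Use the primary run's ID per page; consult the fallback run only
    when the primary run found nothing. Returns (resolved, unresolved)."""
    resolved: dict[str, str] = {}
    unresolved: list[str] = []
    for page, text in primary.items():
        pid = extract_id(text) or extract_id(fallback.get(page, ""))
        if pid is None:
            unresolved.append(page)  # still needs manual review
        else:
            resolved[page] = pid
    return resolved, unresolved
```

With this arrangement the slower engine only has to be run on the pages the faster one missed, and anything both runs miss ends up in one short list for manual checking.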

I'm attaching the PDF from the 2nd attempt to OCR it via the Default OCR.
COVID STE BATCH 2__153__(OCR).pdf
(1.91 MiB) Downloaded 3 times

Vasyl-Tracker Dev Team
Site Admin
Posts: 2081
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendation

Post by Vasyl-Tracker Dev Team » Thu Oct 21, 2021 12:43 am

Hi NoteApplicabble.

We reproduced your issue (redundant whitespaces in numbers) with EOCR.Accuracy=High and will investigate it...

Cheers.
Vasyl Yaremyn
Tracker Software Products
Project Developer
