Attempting to use OCR to allow Sorting pages Using Windows Searching Contents, OCR Number Accuracy... Recommendations?

NoteApplicabble · Post by **NoteApplicabble** » Tue Oct 12, 2021 8:18 pm

I am trying to digitally sort ballot images by precinct from locations where voters from multiple precincts voted at a single location (early voting and absentee ballots, where they were all mixed together). Ballot images were not saved during the election, and no presorting happens when the pages are re-scanned.

I'm hoping someone might have a recommendation on how to tighten up the accuracy.

These are the steps I'm attempting in PDF-Tools to facilitate this process:

Split 100-400 page batches of ballot scans into individual pages and place them into a single folder.
Add OCR to the individual pages after they have been split.
After OCR has been applied, search the contents of the PDF documents within Windows 10, with 100% accuracy, for a precinct number and ballot variation.
(Example search: "1615-A". 1615 is the precinct number in the top right corner of the ballot, and A is the variation of the ballot items for different voters within that one precinct.)

By doing this I want to find every ballot page image for that precinct to be able to move it into an individual Windows folder with contents limited to only that precinct.

About 1% of the ballots are being missed, either because the OCR process misreads the precinct number and ballot variation or because other applications can't seem to search the document at all.

Some ballot variations are being misread. For example, "2117-A" comes out as "2 117 -A", and "1419-A" comes out as "1 4 1 9 -A".
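As a rough illustration of how the stray spaces in the examples above could be tolerated after the fact, here is a minimal Python sketch. It assumes the identifier is always four digits, a hyphen, and one capital letter (inferred only from the examples in this post), and it assumes the OCR text has already been exported to plain text. The function name `normalize_precinct_ids` is made up for this example and is not part of PDF-Tools.

```python
import re

# Assumed ID shape: four digits, a hyphen, one capital letter, with optional
# whitespace between any of those characters (e.g. "2 117 -A", "1 4 1 9 -A").
# The lookbehind stops the match from starting in the middle of a longer number.
SPACED_ID = re.compile(r"(?<!\d)(\d)\s*(\d)\s*(\d)\s*(\d)\s*-\s*([A-Z])\b")


def normalize_precinct_ids(text):
    """Return text with whitespace removed from inside precinct IDs."""
    def rebuild(match):
        digits = "".join(match.group(1, 2, 3, 4))
        return digits + "-" + match.group(5)

    return SPACED_ID.sub(rebuild, text)
```

For example, `normalize_precinct_ids("Precinct 2 117 -A")` returns `"Precinct 2117-A"`. This only repairs spacing; it cannot recover digits that the OCR pass dropped entirely.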

Some documents are searchable in PDF-XChange Editor but not in Adobe Acrobat Reader DC. I wonder whether whatever prevents Adobe from searching the document is also the reason Windows 10 can't find its contents.

I'm currently testing with 564 pages across three large batches, and each run leaves about 4 ballots unidentified.
(Not a big deal? It is if I am trying to get the kinks worked out in preparation to handle +/-200,000 ballots, approx 4 pages each.)

COVID STE BATCH 2_203_(OCR).pdf: (165.67 KiB) Downloaded 69 times

COVID STE BATCH 1_151_(OCR).pdf: (87.13 KiB) Downloaded 65 times

Post by **TrackerSupp-Daniel** » Tue Oct 12, 2021 9:57 pm

Hello, NoteApplicabble

First, I should note that the quality of these scans is quite sub-par, and so any OCR engine will struggle to scan all the contents properly, though the titles should be coming out just fine. If at all possible, you should try to get these re-scanned at a higher quality, as that will be the most easily noticeable improvement to the OCR output.

Second, choosing "Accuracy: High" is not what should be done here. The Accuracy setting is a definition of the document quality, so "High" accuracy means we will assume that the file is pristine, with no blemishes and with clear, easily identified characters (no anti-aliasing or other image effects; nine times out of ten, this means an image which has never left the digital format). You should be using "Auto" for these documents or, if that does not work, possibly even "Low" accuracy, to see which gives the best results.

Third, regarding search after OCR, you will want to double-check that our "search" handler is enabled by selecting our "IFilter handler" in our shell extensions setup utility, as detailed here, and then restarting the PC.
Note that in some rare cases, Windows indexing can interfere with IFilter and prevent us from utilizing our features. If you find that you cannot search even after doing this, please disable Windows indexing for the folder you are searching, and see if that helps at all.

Kind regards,

NoteApplicabble · Post by **NoteApplicabble** » Wed Oct 13, 2021 4:55 pm

Getting the same issue with original 600dpi resolution. See attached.

I batched those through last night, and I have 13 pages where the "Enhanced (FineReader)" EOCR did not identify the word "Precinct", any single digit of the precinct number, the "-", or the "A" at the end.
The OCR "Recognition Options" "Accuracy" setting was set to "High".

Using the PDF to TXT option, I can't find the precinct number or the ballot variation "A" in the text output either.
(I can't upload .txt files to show that the text output also fails to pick up the precinct number or ballot variation letter.)

In part I'm getting better results (no phantom spaces inserted into the middle of my precinct numbers), but in other respects I'm getting worse results, in that it completely misses the word "Precinct", the 4 digits of the precinct number, or the ballot variation "-(letter)".

Any help or advice would be appreciated from community or developers, etc.

COVID STE BATCH 1_29_(OCR).pdf: (1.43 MiB) Downloaded 60 times

NoteApplicabble · Post by **NoteApplicabble** » Wed Oct 13, 2021 9:39 pm

TrackerSupp-Daniel wrote: ↑Tue Oct 12, 2021 9:57 pm
First, I should note that the quality of these scans is quite sub-par, and so any OCR engine will struggle to scan all the contents properly, though the titles should be coming out just fine...

Second, Choosing "Accuracy: High" is not what should be done here...

Third, regarding search after OCR, you will want to double check that our "search" handler is enabled, by selecting our "IFilter handler" in our shell extensions setup utility, as detailed "... URL...", and then restarting the PC...
Note that in some rare cases, Windows indexing can interfere with IFilter and prevent us from utilizing our features. If you find that you cannot search even after doing this, please disable Windows indexing for the folder you are searching, and see if that helps at all.

Testing now with 600dpi
Essentially, the consistency has increased. I now reliably get issues in 13 of the 564 pages no matter which OCR settings I try. My first test, with Accuracy left at High for the 600dpi scans (against TrackerSupp-Daniel's advice), gave a similar problem finding text.

Tested the same 600dpi scans using the OCR Accuracy at the Auto setting. I still get (13 files) that don't include the precinct number or ballot variation letter in OCR and therefore can't be searched for.

COVID STE BATCH 1__65__(OCR).pdf: (1.43 MiB) Downloaded 63 times

I'm attaching a few of the pages where searching for even a single digit of the precinct number finds nothing.

I've migrated over to demoing File Juggler ... for my file sorting efforts, provided I can get OCR to detect all of the precinct numbers and ballot variation letters. (I did run the "XCShInfoSetup.exe" utility and chose "Select All". I did not see a drop-down or box for a "Search" handler specifically. They all had other names.)

(I need to find a thread around here where someone wanted to know if PDF-Tools etc. could search the contents of files and sort by them, in case File Juggler might be a useful temporary companion for them while that is not an included feature in Tracker Software's products to date.)

NoteApplicabble · Post by **NoteApplicabble** » Wed Oct 13, 2021 10:31 pm

Using 600dpi PDF pages... and putting the OCR Accuracy to "Low" it only missed one page in the search. I'm attaching that page.

COVID STE BATCH 2__153__(OCR).pdf: (1.9 MiB) Downloaded 61 times

If I open that file in "PDF-XChange Editor" and use "Convert" with "OCR Page(s)" and the "Low" accuracy setting, I can search for "1516-A" and locate the precinct number and the ballot variation letter.

COVID STE BATCH 2__153__(OCR) Searchable.pdf: (1.9 MiB) Downloaded 60 times

When I ran the previously missed (attached) PDF page back through a PDF-Tools OCR with the "Low" accuracy setting a second time, I still could not search for and find the "1516-A" precinct number.

Then I opened the unsuccessfully OCR'd document in "PDF-XChange Editor" and repeated the OCR that had succeeded once before. On this second try it failed to make any digit in "1516-A" searchable.

Bottom line now, I can find the precinct on one of the attached versions of the same PDF document but not the other one. I just don't know how to make sure I get the searchable result every single time.

I'm going to go back and repeat the batch using the "Low" accuracy OCR setting on all 564 pages and see if it misses the exact same and only page on the first batch attempt.

NoteApplicabble · Post by **NoteApplicabble** » Wed Oct 13, 2021 10:56 pm

Going back and running the PDF file that was previously not searchable for the precinct using the EOCR in the 564 page batch through a separate batch using the Default OCR... I was able to search for and find the "1516-A" precinct number both in the PDF-XChange Editor and Windows.

I guess I need to run back through and see if the Default OCR is better for all 600dpi pages or just the pages with text that was missed by the EOCR. The batch that used the EOCR was taking under 30 minutes. I'll see how long and how successful the Default OCR is on the 564 pages on Low accuracy.

I'm attaching the PDF from the 2nd attempt to OCR it via the Default OCR.

COVID STE BATCH 2__153__(OCR).pdf: (1.91 MiB) Downloaded 60 times

UPDATE: (2021-11-06)
I found that the documents where EOCR missed my precinct numbers and ballot variations on the Low Accuracy setting could be batched through EOCR again on the Auto Accuracy setting to pick up the remaining precinct numbers and ballot variations. (I did not need to jump to the Default OCR.)
I also verified that I had fewer failures identifying the precinct number, etc. if I ran the pages through on the Low Accuracy option first rather than on Auto first.
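The "Low first, then Auto for whatever is left" order described in this update can be sketched as a simple fallback loop. This is only an illustration in Python: PDF-Tools is driven through its own batch interface, so the `ocr` argument below is a hypothetical stand-in for "run one page through OCR with this Accuracy setting and return the recognized text", and the `Precinct NNNN` pattern is an assumption based on the ballots discussed in this thread.

```python
import re

# Assumed layout: the word "Precinct" followed by a four-digit number.
PRECINCT = re.compile(r"Precinct\s+(\d{4})\b")


def first_successful_pass(page, settings, ocr):
    """Try each Accuracy setting in order until a precinct number is found.

    `ocr(page, setting)` is a hypothetical callable that returns the text
    recognized for `page` at the given setting.
    Returns (setting, precinct), or (None, None) if every pass fails.
    """
    for setting in settings:
        match = PRECINCT.search(ocr(page, setting))
        if match:
            return setting, match.group(1)
    return None, None
```

Calling it with `settings=["Low", "Auto"]` mirrors the order that gave the fewest failures here. Pages that come back as `(None, None)` are the ones left over for the Default OCR engine or manual review.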

Thu Oct 21, 2021 12:43 am

Hi NoteApplicabble.

We reproduced your issue (redundant whitespaces in numbers) with EOCR.Accuracy=High and will investigate it...

Cheers.

Jensen Head · Post by **Jensen Head** » Wed Nov 03, 2021 11:47 am

TrackerSupp-Daniel wrote: ↑Tue Oct 12, 2021 9:57 pm
The Accuracy setting is a definition of the document quality, so "high" accuracy means we will assume that the file is pristine, with no blemishes, and clear characters to easily identify

TrackerSupp-Daniel wrote: ↑Mon Nov 01, 2021 4:33 pm
Our OCR engine always processes at the same "output quality"; the "accuracy" setting you see in these options is more so related to the quality of the current document before OCR. If your document is already pristine and has never left the digital format, using Auto or "high" accuracy may be ideal. In most other cases, Auto or "medium" should be used. In cases where the document is poor quality, or was scanned at a low DPI setting, then either Auto or the "low" accuracy setting should be used.

If I hadn't accidentally stumbled upon several similar explanations in various topics of this forum, I would never have guessed that the "Accuracy" setting with the possible values "Auto", "Low", "Medium" and "High" is not a characteristic of the recognition algorithm and its speed to the detriment of quality, but is an assessment by the user of the recognizable material.
This is not at all obvious to those who previously used only Abbyy products for recognition. At least in ABBYY FineReader PDF 15 OCR Editor, this setting is called "OCR speed and accuracy" and has options for "Thorough recognition" and "Fast recognition". And this formulation, it seems to me, is much more successful.

Post by **Paul - Tracker Supp** » Wed Nov 03, 2021 2:47 pm

Hi Jensen Head,

I am going to go out on a limb here and say I agree with you that the clarity here could be improved. We will discuss that internally.

As always, thanks for the thoughtful and pertinent feedback.

NoteApplicabble · Post by **NoteApplicabble** » Sun Nov 07, 2021 12:53 pm

A quick Update.

When I started out testing with 100dpi PDF pages and had trouble consistently finding the precinct number and ballot variation letter in their contents, I was using compressed files, not the original source files. The original source files were 600dpi.

The program I used to compress my original 7,402 test pages into 100dpi PDF files was Libre Office. The result was spaces added where there were none, and the word "Precinct", the entire precinct number, or the ballot variation letter going missing completely.

When I went back to write down the failure rate for identifying the precinct number using EOCR at 100dpi vs. 600dpi vs. 300dpi, I was shocked to see a significant drop in failures when I OCR'd 300dpi pages produced by "Recompress Images" from the original scanned 600dpi images. No matter whether I'd been using 100dpi or 600dpi pages, I'd always had fewer failures using the Low Accuracy setting for EOCR.

This led me to want to retest 100dpi pages.
When I used "Recompress Images" in PDF-Tools instead of Libre Office, with "Jpeg quality" set to "Maximum" and "Bilinear" downsampling to 100dpi for images above 120dpi, I had no failures identifying a single precinct number or ballot variation letter on the first attempt. That was a 100% success rate.

It seems that not all compressed images give the same result when processed by EOCR. The Libre Office output tended to have lots of little gray specks around the characters, which were not present when I batched the pages in PDF-Tools with the settings I laid out above.

I'm wondering if this is a typical software correlation: staying within a single software product line works best, since a vendor is likely to discover and fix bugs while testing its own products, but is unlikely to detect issues with documents partially processed by other vendors' software.

Hmmm...

Currently I'm thankful I've found a setup that is working 100% of the time with the original 7,402 pages of 600dpi PDF documents I received.
The down side is that the process that is known to work requires slower scanning at 600dpi at the county followed up by the extra step of Recompressing Images in PDF-Tools. All of this adds time to the process.

I don't know whether the pages would EOCR with or without failures if I asked the county to scan the documents at 100dpi.
(The county gave me one batch of 294 pages at 100dpi. I could EOCR those pages 100% of the time, even though these scans had specks around the letter characters like the documents compressed in Libre Office, which now appear to be the source of much of my original problems. I doubt the county would even consider scanning the documents again at the lower 100dpi to see whether PDF-Tools works successfully when the county itself is the source of the 100dpi compression / creation.)

These are the latest observations I've made in this digital ballot sorting effort.

Post by **Tracker Supp-Stefan** » Mon Nov 08, 2021 10:50 am

Hello NoteApplicabble,

I believe the issue here is the image downsampling/compression algorithms used by Libre Office. JPEG is a lossy format, so noise is introduced to the images when processed, and the algorithms used by Libre Office are apparently adding lots of it that then confuses our OCR engine.

We and probably Libre Office are using third-party libraries for such image manipulations, and it would appear the one they chose does a worse job than ours (wink). Our EOCR engine is based on ABBYY's FineReader, so that is also in a way an external tool. So it has nothing to do with "keeping operations in the same environment" or deliberately giving worse results when a third-party tool was used (we don't really detect who compressed an image, or how, before 'feeding' it to the EOCR engine for processing).

So there's nothing sinister happening here, and it's just the selection of tools and libraries and the combination that you encountered that gave better or worse results!

As for the scanning at the county office: 300 DPI scanning is going to be 2-4 times faster than 600 DPI and will generate significantly smaller files, which would still be of sufficient quality for you to process and OCR. Also, even if there is 'noise' in the scans, it will be significantly less than in 100 DPI scans, again ensuring you get better final results.
Then - converting from 300 to 100 DPI might not be necessary and you could directly run the OCR, or even if you still need to down sample to 100 DPI - that process will again be much faster than from 600.

Kind regards,
Stefan

NoteApplicabble · Post by **NoteApplicabble** » Thu Dec 30, 2021 4:42 pm

I now have received different ballot images from Ada County here in Idaho for the entire November 2, 2021 election. I'm trying to sort these ballots into individual precinct folders. These ballot images were created by Ada county's election equipment.

After applying OCR with EOCR, I am able to search for and find the word "Precinct" in every single PDF using the File Juggler app or Windows 10 "File Explorer". This is the good news. The bad news is that Windows, even with the advanced "File contents" search option enabled, can't find the precinct number following the word "Precinct". This is happening on 3.6% of the PDF ballot images.

I'll attach a few samples of the PDFs that are giving me trouble. If I open the PDF in any PDF reader application (PDF-XChange Editor, for example), I can search for and find "Precinct 1601" exactly as expected. This indicates that PDF-Tools correctly interpreted the text in the image, added the correct text in the invisible layer of the PDF Searchable Image, and saved the PDF properly. But if I search a folder with the "File Contents" search option enabled, it never finds the "Precinct 1601" file content. As I said, this happens in 3.6% of the ballot images from this new batch. I'm currently working on sorting 26,696 ballot images based on content.

The odd thing is that in the same run of the PDF-Tools batch process I instructed PDF-Tools to include a PDF to TXT step, and it created .txt documents reflecting the content of the same PDF images. With these TXT documents, searches have so far found the precinct numbers 100% of the time. I'm halfway through testing this on all 26,696 documents, searching groups of 900 ballots at a time, and have not seen it fail once yet. Doing the same sorting with the PDF documents created by PDF-Tools, I typically got 30-60 documents where Windows could not find the known file contents when searched for.
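Since the TXT files have been reliable so far, one way to sidestep Windows content search is to read each TXT file directly and move the matching PDF. The following is a minimal Python sketch under stated assumptions, not a tested workflow: it assumes each PDF has a TXT file with the same base name in the same folder, that the precinct appears as the word "Precinct" followed by a four-digit number, and that one folder per precinct number is wanted. The function name `sort_by_precinct` is made up for this example.

```python
import re
import shutil
from pathlib import Path

# Assumed layout: the word "Precinct" followed by a four-digit number.
PRECINCT = re.compile(r"\bPrecinct\s+(\d{4})\b")


def sort_by_precinct(src, dest):
    """Move each PDF in src into dest/<precinct>/ using its same-named .txt.

    Returns the list of PDFs that could not be sorted (no TXT file, or no
    precinct number found in it) so they can be reviewed by hand.
    """
    unsorted = []
    for pdf in sorted(Path(src).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        match = None
        if txt.exists():
            # errors="replace" keeps odd bytes in the text from raising.
            text = txt.read_text(encoding="utf-8", errors="replace")
            match = PRECINCT.search(text)
        if match is None:
            unsorted.append(pdf)
            continue
        folder = Path(dest) / match.group(1)
        folder.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(folder / pdf.name))
    return unsorted
```

Matching on the full phrase rather than the bare number follows the same reasoning as searching for "Precinct 1601": it avoids hits on a 1601 that appears elsewhere on the ballot. It would be safer to try this on a copy of the folder first, since the PDFs are moved rather than copied.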

Any suggestions or insights into this inconsistency? The TXT documents derived from the PDFs appear to prove the contents exist, yet Windows 10 can't find those contents in the PDFs.

0b1b8feb-7500-4890-870e-f64ce2aeecd3_Back__OCR.pdf: (27.9 KiB) Downloaded 51 times

Zip of 3 TXT from PDF Created Doc's.zip: (3.26 KiB) Downloaded 48 times

Post by **Tracker Supp-Stefan** » Fri Dec 31, 2021 3:54 pm

Hello NoteApplicabble,

Thanks for the sample files.
I had no issues using the Windows Search to find the 1601 in your sample file:

Can you please try to repeat your search with only that file - and see if you are able to locate it? Maybe it is the Windows Search at your end that is not 100% consistent?

Are you running Windows 10 or 11?

Kind regards,
Stefan

NoteApplicabble · Post by **NoteApplicabble** » Fri Dec 31, 2021 7:22 pm

Windows 10

I'm going to attempt to move my license to a different computer and see if I have trouble there.

In all of my searches recently I search for "Precinct 1601" to try to avoid the numbers 1601 showing up somewhere other than identifying a precinct number.

Post by **TrackerSupp-Daniel** » Fri Dec 31, 2021 8:06 pm

Hello, NoteApplicabble

You should not need to move your license to a new device to use the search functions (searching is free); you would only need to install the software. I am curious to know what happens on the other device, though. Like Stefan, I do not have any issues searching for the word "Precinct", the number "1601", or the phrase "Precinct 1601" on this end. If the issue can still be reproduced on the new device, could I ask you to record a video of the issue in action? Perhaps we are missing a step in our tests.

Kind regards,

NoteApplicabble · Post by **NoteApplicabble** » Mon Jan 10, 2022 12:05 am

I have been assuming in all of my tests that the EOCR was supposed to be the better of the two OCR engines.
Is this true in any respect?

I'm wondering now if the slower Default engine could be any better for the extra wait time (though I have not tested it as thoroughly). I've gotten some goofy errors with the EOCR, like OCR text that is all gibberish, but running the exact same files through a 2nd time (with fewer files in the batch) often gives me the expected readable OCR text. Often some will still have issues. Then I'll go back and use the Default OCR engine on the few that still have issues after multiple attempts with EOCR, and I often get good results with the Default OCR engine.
Any thoughts on this either way?

Post by **Tracker Supp-Stefan** » Mon Jan 10, 2022 10:31 am

Hello NoteApplicabble,

The EOCR can utilize more than one thread/core of your CPU - so that is the main reason it is faster.
The truth is that the Standard and Enhanced OCR tools use different engines, and they could produce different results with some sample files. If you do not get good results with the EOCR consistently on a file, but manage to process the same with the Standard OCR - can you please send us such a sample document (via e-mail so that it's not posted here in the forums if confidential), and the exact settings you use to process it - so that we can also run some tests on it at our side?

Kind regards,
Stefan

NoteApplicabble · Post by **NoteApplicabble** » Tue Jan 11, 2022 11:38 pm

Tracker Supp-Stefan wrote: ↑Mon Jan 10, 2022 10:31 am If you do not get good results with the EOCR consistently on a file, but manage to process the same with the Standard OCR - can you please send us such a sample document...

On further review, I think the gibberish was triggered by a bad PNG-to-PDF conversion at the beginning, which was followed by processing with EOCR (Low Accuracy) and then EOCR (Auto Accuracy) to add a second layer, in the hope of catching a few more precinct numbers on the pages.
Essentially, the various tools I'm sending the PNG images through in PDF-Tools create 3 different PDF versions of the original PNG. The only one I intended to sort, and have been trying to sort, is the last of the 3 PDF files for each PNG, which should contain 2 EOCR layers. My ability to successfully read and then sort the precinct numbers via the Searchable Image varies with the Accuracy setting of the OCR engine, so I'm doing both. I would like to process once with EOCR and a 2nd time with Default, to use different engines, but I can't figure out how to do this in one pass.

I don't believe I have any of the Gibberish files now - I replaced them.

I believe I replaced all of the unhelpful EOCR engine PDF files when the Default OCR engine made successful results. But as I just mentioned, I may have been blaming the EOCR engine for bad PNG to PDF conversion steps.

I'll definitely save some specific files when I can specify what steps created the defective outcomes.

I will say that I often get errors on some files the first time I process them, but when I run the failed files from scratch through the exact same tools and settings, PDF-Tools reports all as processed successfully.

I attached a few screenshots of annoying and common errors that occur in most large batches, but if the same files are processed a 2nd time they work perfectly. The errors don't seem to be file-based or settings-based; they are inconsistent processing errors that I can't prevent. I just have to manually isolate the failed files and process them again with the exact same settings.
I don't expect any cause to be determined from these screenshots, but just posting where I'm seeing the Errors listed / identified.

Post by **Tracker Supp-Stefan** » Wed Jan 12, 2022 11:09 am

Hello NoteApplicabble,

Glad to hear that you've figured out what is causing the issues. Getting two layers of OCR text in the same file can indeed be a bit confusing, and the initial image quality (and its conversion to PDF) is very important.

As for the errors - it seems like your folder is on a network storage. Getting 3 errors out of 133K files over 8 hours is a low enough error rate and might be caused by a myriad of factors e.g.: connection lost for just a fraction of a second, file being inaccessible because another process locks it (Unlikely but possible), etc. so just redoing those 3 that failed in a separate batch should be the quickest way to complete the whole batch.
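For anyone who scripts around their batches, the "redo only the ones that failed" approach can be sketched generically in Python. This is an illustration only: `process` is a hypothetical stand-in for whatever step handles one file (it is not a PDF-Tools API), and the limit of three rounds is an arbitrary choice for the example.

```python
def run_with_retries(files, process, max_rounds=3):
    """Run process(file) for every file, re-running only the failures.

    `process` is a hypothetical callable that raises an exception when a
    file fails. Returns a dict mapping each file that still fails after
    the last round to the exception it raised (empty if all succeeded).
    """
    pending = list(files)
    errors = {}
    for _ in range(max_rounds):
        errors = {}
        for name in pending:
            try:
                process(name)
            except Exception as exc:  # e.g. a brief network or lock error
                errors[name] = exc
        if not errors:
            break
        # Only the files that failed go into the next, smaller round.
        pending = list(errors)
    return errors
```

A file that fails once because of a momentary lost connection or file lock would normally pass in the second round. Anything still present in the returned dict after the last round probably has a real problem worth looking at individually.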

Kind regards,
Stefan
