If the OCR target is limited to numbers, Japanese (Modern) is recommended as the OCR language.

Post by **rakunavi** » Fri Jun 10, 2022 10:36 pm

Hello all,

I found a tip that might be useful for PDF-XChange Editor Plus users, so I will share it.

I frequently scan books for digitization. To check for omitted or duplicated pages, I crop each scan down to just the page number, run OCR on the cropped image, and pass the recognized numbers to a custom script. This process depends on the OCR result being completely reliable. However, I have found that when I specify a Western language such as English, German, or French as the OCR language for EOCR (FineReader), there are significant omissions, especially for single-digit numbers. When Japanese is specified as the OCR language instead, the same numbers are recognized correctly without any omissions.
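The custom script itself is not part of this post. As a rough illustration only, a check of this kind could look like the Python sketch below. It assumes the OCR text of each cropped page-number image is already available as a list of strings in scan order; the function names `parse_page_number` and `check_pages` are hypothetical.

```python
import re
from collections import Counter


def parse_page_number(ocr_text):
    """Return the first run of digits in the OCR text as an int, or None if there is none."""
    match = re.search(r"\d+", ocr_text or "")
    return int(match.group()) if match else None


def check_pages(ocr_texts, first_page=1):
    """Report scans with no readable number, duplicated numbers, and missing numbers.

    ocr_texts: OCR output for each scanned image, in scan order (assumed input format).
    """
    numbers = [parse_page_number(text) for text in ocr_texts]

    # 0-based indices of scans where OCR returned no digits at all.
    unread = [index for index, number in enumerate(numbers) if number is None]

    found = [number for number in numbers if number is not None]
    counts = Counter(found)

    # Page numbers that were recognized more than once.
    duplicates = sorted(number for number, count in counts.items() if count > 1)

    # Page numbers between first_page and the highest number seen that never appeared.
    last_page = max(found, default=first_page - 1)
    missing = [n for n in range(first_page, last_page + 1) if n not in counts]

    return {"unread": unread, "duplicates": duplicates, "missing": missing}


if __name__ == "__main__":
    print(check_pages(["1", "2", "", "4", "4", "6"]))
```

With a check like this, an OCR omission shows up as a missing page even when the page was scanned correctly, which is why omissions are so damaging for this workflow.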

In this forum, I have often reported issues that occur with Japanese but not with Western languages. This is a rare case of the opposite pattern.

I believe that the idea of specifying Japanese is not likely to occur to Westerners. Incidentally, if you use the standard Tesseract OCR engine instead of EOCR, the numbers are recognized correctly even with English as the OCR language. However, the Tesseract OCR engine is so slow that it is not practical.

Since the recognition rate varies depending on the font used, font size, trimming size, etc., a sample file is attached as an example.

Attachment: SampleFiles.zip (78.33 KiB)

A blank PDF file of 100 pages in A4 size was created.
Page numbers are placed at the top center of the header. The font is Times New Roman, the size is 10.0 pt, and the margins are 12.7 mm on all four sides.
Rasterized in monochrome, 600 dpi, CCITT Group 4.

Sample.pdf in the attached file was created under these very simple conditions. The OCR results with English as the EOCR language are EOCR_English.pdf and EOCR_English.txt. The results with Japanese (Modern) as the EOCR language are EOCR_JapaneseModern.pdf and EOCR_JapaneseModern.txt. In both cases, EOCR accuracy was set to High, and the "Detect skew of page content" option was disabled.

The results show that Japanese is 100% accurate and English has 9 omissions and 5 misrecognitions, as shown in the page numbers below.

Page numbers where omissions occurred:
1, 5, 7, 8, 9, 11, 15, 31, 51
Page numbers where misrecognitions occurred:
22, 37, 67, 73, 76
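For anyone who wants to repeat this comparison on their own files, the tally can be sketched in Python as follows. This is only an illustration: it assumes the OCR text for each page has already been extracted into a dictionary keyed by the expected page number, and `classify` and `summarize` are hypothetical helper names. The small `sample` dictionary is made-up data, while the two page lists at the end are the figures reported above.

```python
def classify(expected, ocr_text):
    """Classify one OCR result against the page number that should be there."""
    text = ocr_text.strip()
    if not text:
        return "omission"        # nothing was recognized at all
    if text != str(expected):
        return "misrecognition"  # something was recognized, but it is wrong
    return "correct"


def summarize(ocr_by_page):
    """Group expected page numbers by how their OCR result was classified."""
    summary = {"correct": [], "omission": [], "misrecognition": []}
    for page in sorted(ocr_by_page):
        summary[classify(page, ocr_by_page[page])].append(page)
    return summary


# Made-up OCR output for four pages, only to show the classification.
sample = {1: "", 2: "2", 3: "3", 4: "A"}
result = summarize(sample)

# Figures reported in this post for the English run on the 100-page sample.
omissions = [1, 5, 7, 8, 9, 11, 15, 31, 51]
misrecognitions = [22, 37, 67, 73, 76]
total_pages = 100
correct = total_pages - len(omissions) - len(misrecognitions)

if __name__ == "__main__":
    print(result)
    print(f"English: {correct} of {total_pages} pages recognized correctly")
```

Keeping omissions and misrecognitions in separate lists matters here, because the two kinds of error have very different consequences for a page-sequence check.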

A misrecognition is the lesser problem, because at least we can tell that something is there, but omissions are unacceptable for the application I described at the beginning of this post. Recall that the original data is not a scanned paper document but digital data rasterized at 600 dpi in black and white. The test was done under conditions extremely favorable to the OCR engine.

As I mentioned above, if your OCR target is limited to numbers, I recommend setting the OCR language to Japanese (Modern). For your information, there are two Japanese OCR languages: Japanese and Japanese (Modern). As I have posted in this forum before, I recommend Japanese (Modern), which has an improved dictionary.

https://forum.pdf-xchange.com/viewtopic.php?p=158219#p158219

I hope the above information is of some help to you.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version:9.3 build 361.0
- OS Version: Windows 10 Home/Pro 21H2 Build 19044.1706
- PC Model: Lenovo IdeaPad C340-15IWL / HP ProDesk 600G1

Post by **TrackerSupp-Daniel** » Mon Jun 13, 2022 5:06 pm

Hello, rakunavi

Thank you for the information! OCR engines are not 100% reliable, and when it comes to items that look similar to other characters in the language set, it is not uncommon for them to be confused. This is an area that is improving over time, but for now, I hope that your suggestion to use Japanese helps someone who needs this reliability.

Kind regards,
