I found a tip that may be useful for PDF-XChange Editor Plus users, so I would like to share it.
I frequently scan and capture books for digitization. As part of this work, I trim each page down to just the page number, run OCR on it, and use a custom script to check for missing and duplicated pages. This check relies on the OCR recognizing the page numbers reliably. However, I have found that when I specify a Western language such as English, German, or French as the OCR language for EOCR (FineReader), there are significant omissions, especially for single-digit numbers. When Japanese is specified as the OCR language instead, the same numbers are recognized correctly without any omissions.
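A check of this kind can be sketched as follows. This is only an illustration, not the script actually used; the function name `check_page_numbers` and the input format (a list of recognized page numbers in scan order, with `None` where OCR returned nothing) are assumptions.

```python
from collections import Counter


def check_page_numbers(recognized, first, last):
    """Report missing, duplicated, and out-of-range page numbers.

    recognized: page numbers read by OCR in scan order (hypothetical input
                format); None marks a page where nothing was recognized.
    first, last: the expected page range, inclusive.
    """
    counts = Counter(n for n in recognized if n is not None)
    # Expected numbers that never appeared in the OCR output.
    missing = [n for n in range(first, last + 1) if counts[n] == 0]
    # Numbers that appeared more than once.
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    # Numbers outside the expected range, which usually indicate a misread.
    unexpected = sorted(n for n in counts if n < first or n > last)
    return missing, duplicated, unexpected


# Made-up example: page 3 was skipped, page 2 was scanned twice,
# and one page number was not recognized at all.
missing, duplicated, unexpected = check_page_numbers([1, 2, 2, 4, None, 6], 1, 6)
print("missing:", missing)        # missing: [3, 5]
print("duplicated:", duplicated)  # duplicated: [2]
print("unexpected:", unexpected)  # unexpected: []
```

Note that a page number the OCR fails to read is reported as missing, exactly like a page that was never scanned, which is why OCR omissions matter so much for this kind of check.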
In this forum, I have often reported issues that occur with Japanese but not with Western languages; this is a rare case of the opposite pattern. I suspect the idea of specifying Japanese is unlikely to occur to Western users. Incidentally, if you use the normal Tesseract OCR engine instead of EOCR, it works correctly even when English is specified as the OCR language. However, the Tesseract OCR engine is too slow to be practical for this work.
Since the recognition rate varies depending on the font, font size, trimming size, etc., a sample file is attached as an example. It was prepared as follows:
- A blank PDF file of 100 pages in A4 size was created.
- Page numbers were placed at the top center, in the header. The font is Times New Roman at 10.0 pt, and the margins are 12.7 mm on all four sides.
- The pages were rasterized in monochrome at 600 dpi with CCITT Group 4 compression.
The results show that recognition with Japanese is 100% accurate, whereas with English there are 9 omissions and 5 misrecognitions, at the page numbers listed below.
- Page numbers where an omission occurred:
1, 5, 7, 8, 9, 11, 15, 31, 51
- Page numbers where a false recognition occurred:
22, 37, 67, 73, 76
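Because page N of the sample file should read exactly N, omissions and false recognitions can be told apart mechanically. The following is a minimal sketch, assuming the OCR text of each page has already been extracted into a dictionary; the function name `classify_results`, the input format, and the sample values are hypothetical and are not the actual results.

```python
def classify_results(ocr_text_by_page):
    """Split OCR failures into omissions and false recognitions.

    ocr_text_by_page: maps the 1-based page index to the text that OCR
                      returned for that page (hypothetical input format);
                      an empty string means nothing was recognized.
    """
    omissions, misrecognitions = [], []
    for page, text in sorted(ocr_text_by_page.items()):
        text = text.strip()
        if not text:
            # Nothing was recognized on this page.
            omissions.append(page)
        elif text != str(page):
            # Something was recognized, but not the expected number.
            misrecognitions.append(page)
    return omissions, misrecognitions


# Made-up data for illustration only.
sample = {1: "", 2: "2", 3: "8", 4: "4"}
omissions, misrecognitions = classify_results(sample)
print("omissions:", omissions)              # omissions: [1]
print("misrecognitions:", misrecognitions)  # misrecognitions: [3]
```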
As mentioned above, if your OCR target is limited to numbers, I recommend setting the OCR language to Japanese (Modern). For your information, there are two Japanese OCR languages: Japanese and Japanese (Modern). As I have posted in this forum before, I recommend Japanese (Modern), which has an improved dictionary.
https://forum.pdf-xchange.com/viewtopic.php?p=158219#p158219
I hope the above information is of some help to you.
Best regards,
rakunavi
- PDF-XChange Editor Plus Version:9.3 build 361.0
- OS Version: Windows 10 Home/Pro 21H2 Build 19044.1706
- PC Model: Lenovo IdeaPad C340-15IWL / HP ProDesk 600G1