Printed Japanese Character Recognition Using Multiple Commercial OCRs
Hidetoshi Miyao*, Yasuaki Nakano**, Atsuhiko Tani***, Hirosato Tabaru****, and Toshihiro Hananoi**
*Shinshu University, 4-17-1, Wakasato, Nagano 380-8553, Japan
**Kyushu Sangyo University, 2-3-1 Matsukadai, Higashi-ku, Fukuoka 813-8503, Japan
***Hitachi Software Engineering Co., Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama 244-8555, Japan
****Fuji Xerox Co., Ltd., 2-3-1 Matsukadai, Higashi-ku, Fukuoka 813-8503, Japan
This paper proposes two algorithms for maintaining the correspondence between lines and characters in the text output by multiple commercial optical character readers (OCRs): (1) a line matching algorithm based on dynamic programming (DP) matching, and (2) a character matching algorithm based on character string division and standard character strings. It also proposes a method that introduces majority logic and reject processing into character recognition. To demonstrate the feasibility of the method, we conducted experiments on line matching and recognition for 127 document images using five commercial OCRs. The results showed that the method extracted character areas more accurately than a single OCR and matched lines appropriately. In experiments using the character matching algorithm and character recognition, the proposed method raised the recognition rate from 97.61% for a single OCR to 98.83%. The method is expected to be highly useful for correcting locations where unwanted lines or characters appear or required lines or characters are missing.
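The two ideas named above, DP-based line matching and majority logic with rejection, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `align_lines` and `vote`, the similarity measure (`difflib.SequenceMatcher.ratio`), the gap penalty, the vote threshold, and the reject symbol are all assumptions chosen for illustration, and the character matching step (string division and standard character strings) is not covered.

```python
from collections import Counter
from difflib import SequenceMatcher

GAP = None  # marks a line present in one OCR output but missing in the other


def align_lines(a, b, gap_penalty=0.3):
    """Align two lists of text lines by DP, maximizing summed line similarity.

    Illustrative only: the similarity measure and gap_penalty are assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = SequenceMatcher(None, a[i - 1], b[j - 1]).ratio()
            score[i][j] = max(
                score[i - 1][j - 1] + sim,      # match the two lines
                score[i - 1][j] - gap_penalty,  # line only in a
                score[i][j - 1] - gap_penalty,  # line only in b
            )
    # Trace back from the bottom-right cell to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sim = SequenceMatcher(None, a[i - 1], b[j - 1]).ratio()
            if score[i][j] == score[i - 1][j - 1] + sim:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            pairs.append((a[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def vote(candidates, min_votes=3, reject="?"):
    """Majority logic with rejection over one character position.

    candidates holds one character per OCR (None if that OCR has no output
    there). The winner is accepted only with at least min_votes votes;
    otherwise the reject symbol is returned.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return reject
    char, votes = counts.most_common(1)[0]
    return char if votes >= min_votes else reject
```

As a usage illustration, `align_lines(["abc", "def", "ghi"], ["abc", "ghi"])` pairs `"def"` with `GAP`, which corresponds to a line that one OCR produced and another dropped. With five OCRs, `vote` under the assumed threshold of three accepts a character only when a strict majority agrees and rejects the position otherwise.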
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.