Research Paper:
Handwritten Character String Recognition Using a String Recognition Transformer
Shunya Rakuka, Kento Morita, and Tetsushi Wakabayashi
Graduate School of Engineering, Mie University
1577 Kurimamachiya-cho, Tsu, Mie 514-8507, Japan
Accurate handwritten character string recognition allows handwritten documents to be converted into digital text. This facilitates camera-based text input and enables robotic process automation to handle documentation tasks. Although this field has seen significant progress, recognizing handwritten Japanese remains particularly challenging due to the difficulty of character segmentation, the wide variety of character types, and the absence of clear word boundaries. These factors make unconstrained handwritten Japanese string recognition especially difficult for conventional approaches. Moreover, Transformer-based models typically require large amounts of annotated training data. This study proposes and investigates a new String Recognition Transformer (SRT) model capable of recognizing unconstrained handwritten Japanese character strings without relying on explicit character segmentation or a large number of training images. The SRT model integrates a convolutional neural network backbone for robust local feature extraction, a Transformer encoder-decoder architecture, and a sliding window strategy that generates overlapping patches. Comparative experiments show that the proposed method achieved a character error rate (CER) of 0.067, significantly outperforming the convolutional recurrent neural network, Transformer-based optical character recognition, and handwritten text recognition with Vision Transformer baselines, which achieved CERs of 0.664, 0.165, and 0.106, respectively, confirming the effectiveness and robustness of the approach.
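To make the pipeline described above concrete, the following is a minimal sketch in PyTorch, not the authors' implementation: a CNN backbone extracts local features from a string image, a sliding window over the feature map produces overlapping patch tokens, and a Transformer encoder-decoder predicts the character sequence. All module names, layer counts, feature dimensions, window and stride sizes, and the vocabulary size are illustrative assumptions.

```python
# A minimal sketch of a CNN + sliding-window + Transformer encoder-decoder
# string recognizer. Hyperparameters and layer layout are assumptions, not
# the configuration reported in the paper.
import torch
import torch.nn as nn


class SRTSketch(nn.Module):
    def __init__(self, vocab_size=3000, d_model=256, window=4, stride=2):
        super().__init__()
        # Small CNN backbone for local feature extraction (assumed layout).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, None)),  # fix feature-map height to 4
        )
        self.window, self.stride = window, stride
        self.proj = nn.Linear(d_model * 4 * window, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 1, H, W); tgt_tokens: (B, T) previously decoded ids.
        feats = self.backbone(images)                         # (B, C, 4, W')
        # Sliding window along the width axis -> overlapping patches.
        patches = feats.unfold(3, self.window, self.stride)   # (B, C, 4, N, window)
        patches = patches.permute(0, 3, 1, 2, 4).flatten(2)   # (B, N, C*4*window)
        src = self.proj(patches)                               # (B, N, d_model)
        tgt = self.embed(tgt_tokens)                           # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(dec)                                   # (B, T, vocab) logits


# Example: one training-style forward pass on a dummy string image.
model = SRTSketch()
logits = model(torch.randn(2, 1, 64, 512), torch.randint(0, 3000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 3000])
```

For evaluation, the reported CER corresponds to the edit distance between the predicted and ground-truth strings divided by the number of ground-truth characters.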
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.