
JACIII Vol.30 No.1, pp. 15-23 (2026)
doi: 10.20965/jaciii.2026.p0015

Research Paper:

Handwritten Character String Recognition Using a String Recognition Transformer

Shunya Rakuka, Kento Morita, and Tetsushi Wakabayashi

Graduate School of Engineering, Mie University
1577 Kurimamachiya-cho, Tsu, Mie 514-8507, Japan

Received: May 19, 2025
Accepted: June 25, 2025
Published: January 20, 2026
Keywords: handwritten recognition, character string recognition, Transformer

Abstract

Improving the accuracy of handwritten character string recognition allows handwritten documents to be converted into digital text. This facilitates camera-based text input and enables robotic process automation to handle documentation tasks. Although this field has seen significant progress, recognizing handwritten Japanese remains particularly challenging due to the difficulty of character segmentation, the wide variety of character types, and the absence of clear word boundaries. These factors make unconstrained handwritten Japanese string recognition especially difficult for conventional approaches. Moreover, Transformer-based models typically require large amounts of annotated training data. This study proposes and investigates a new String Recognition Transformer (SRT) model capable of recognizing unconstrained handwritten Japanese character strings without relying on explicit character segmentation or a large number of training images. The SRT model integrates a convolutional neural network (CNN) backbone for robust local feature extraction, a Transformer encoder-decoder architecture, and a sliding window strategy that generates overlapping patches. Comparative experiments show that our method achieved a character error rate (CER) of 0.067, significantly outperforming a convolutional recurrent neural network (CRNN), Transformer-based optical character recognition (TrOCR), and handwritten text recognition with Vision Transformer (HTR-VT), which achieved CERs of 0.664, 0.165, and 0.106, respectively, thereby confirming the effectiveness and robustness of the approach.
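The abstract names three architectural ingredients of the SRT model: a CNN backbone for local feature extraction, a sliding window that cuts the feature map into overlapping patches, and a Transformer encoder-decoder that maps the patch sequence to a character sequence. The following PyTorch sketch shows one plausible way these pieces could fit together; the window size, stride, layer counts, feature dimensions, and vocabulary size are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch of an SRT-style pipeline. All hyperparameters below
# (window, stride, d_model, layer counts, vocab_size) are illustrative
# assumptions; the abstract does not give the authors' exact settings.
import torch
import torch.nn as nn

class SRTSketch(nn.Module):
    def __init__(self, vocab_size=3000, d_model=256, window=32, stride=16):
        super().__init__()
        # CNN backbone: two conv+pool stages, so spatial size shrinks 4x.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Window/stride are defined on the input image; divide by 4 to get
        # the equivalent extent on the downsampled feature map.
        self.window, self.stride = window // 4, stride // 4
        self.proj = nn.LazyLinear(d_model)  # one token per window
        # Learned positional embedding (up to 512 source tokens).
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=4,
            num_decoder_layers=4, batch_first=True)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 1, H, W) grayscale text-line images.
        f = self.backbone(images)                            # (B, C, H', W')
        # Sliding window along the horizontal (reading) axis; because
        # stride < window, consecutive patches overlap.
        p = f.unfold(3, self.window, self.stride)            # (B, C, H', N, w)
        B, C, Hp, N, w = p.shape
        tokens = p.permute(0, 3, 1, 2, 4).reshape(B, N, -1)  # (B, N, C*H'*w)
        src = self.proj(tokens) + self.pos[:, :N]
        tgt = self.tok_emb(tgt_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)      # (B, T, d_model)
        return self.head(out)                                # (B, T, vocab)

model = SRTSketch()
logits = model(torch.randn(2, 1, 64, 256), torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 3000])

The overlap (stride smaller than the window) is the point of the sliding-window strategy: adjacent tokens share context across uncertain character boundaries, which is what lets the model avoid explicit segmentation. The CER reported in the abstract is the standard edit-distance metric, i.e., (substitutions + deletions + insertions) divided by the number of reference characters.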

Cite this article as:
S. Rakuka, K. Morita, and T. Wakabayashi, “Handwritten Character String Recognition Using a String Recognition Transformer,” J. Adv. Comput. Intell. Intell. Inform., Vol.30 No.1, pp. 15-23, 2026.
