
JACIII Vol.30 No.1, pp. 15-23 (2026)
doi: 10.20965/jaciii.2026.p0015

Research Paper:

Handwritten Character String Recognition Using a String Recognition Transformer

Shunya Rakuka, Kento Morita, and Tetsushi Wakabayashi

Graduate School of Engineering, Mie University
1577 Kurimamachiya-cho, Tsu, Mie 514-8507, Japan

Received: May 19, 2025
Accepted: June 25, 2025
Published: January 20, 2026
Keywords: handwritten recognition, character string recognition, Transformer

Abstract

Improving the accuracy of handwritten character string recognition allows handwritten documents to be converted into digital text. This facilitates camera-based text input and enables robotic process automation to handle documentation tasks. Although this field has seen significant progress, recognizing handwritten Japanese remains particularly challenging due to the difficulty of character segmentation, the wide variety of character types, and the absence of clear word boundaries. These factors make unconstrained handwritten Japanese string recognition especially difficult for conventional approaches. Moreover, Transformer-based models typically require large amounts of annotated training data. This study proposes and investigates a new String Recognition Transformer (SRT) model capable of recognizing unconstrained handwritten Japanese character strings without relying on explicit character segmentation or a large number of training images. The SRT model integrates a convolutional neural network (CNN) backbone for robust local feature extraction, a Transformer encoder-decoder architecture, and a sliding window strategy that generates overlapping patches. Comparative experiments show that our method achieved a character error rate (CER) of 0.067, significantly outperforming a convolutional recurrent neural network (CRNN), Transformer-based optical character recognition (TrOCR), and handwritten text recognition with Vision Transformer (HTR-VT), which achieved CERs of 0.664, 0.165, and 0.106, respectively, thereby confirming the effectiveness and robustness of the approach.
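The abstract names three architectural ingredients of the SRT model: a CNN backbone for local feature extraction, a sliding window that cuts the feature map into overlapping patches, and a Transformer encoder-decoder that maps the patch sequence to a character sequence. The following PyTorch sketch shows one plausible way these pieces could fit together; the window size, stride, layer counts, feature dimensions, and vocabulary size are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch of an SRT-style pipeline. All hyperparameters below
# (window, stride, d_model, layer counts, vocab_size) are illustrative
# assumptions; the abstract does not give the authors' exact settings.
import torch
import torch.nn as nn

class SRTSketch(nn.Module):
    def __init__(self, vocab_size=3000, d_model=256, window=32, stride=16):
        super().__init__()
        # CNN backbone: two conv+pool stages, so spatial size shrinks 4x.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Window/stride are defined on the input image; divide by 4 to get
        # the equivalent extent on the downsampled feature map.
        self.window, self.stride = window // 4, stride // 4
        self.proj = nn.LazyLinear(d_model)  # one token per window
        # Learned positional embedding (up to 512 source tokens).
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=4,
            num_decoder_layers=4, batch_first=True)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 1, H, W) grayscale text-line images.
        f = self.backbone(images)                            # (B, C, H', W')
        # Sliding window along the horizontal (reading) axis; because
        # stride < window, consecutive patches overlap.
        p = f.unfold(3, self.window, self.stride)            # (B, C, H', N, w)
        B, C, Hp, N, w = p.shape
        tokens = p.permute(0, 3, 1, 2, 4).reshape(B, N, -1)  # (B, N, C*H'*w)
        src = self.proj(tokens) + self.pos[:, :N]
        tgt = self.tok_emb(tgt_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)      # (B, T, d_model)
        return self.head(out)                                # (B, T, vocab)

model = SRTSketch()
logits = model(torch.randn(2, 1, 64, 256), torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 3000])

The overlap (stride smaller than the window) is the point of the sliding-window strategy: adjacent tokens share context across uncertain character boundaries, which is what lets the model avoid explicit segmentation. The CER reported in the abstract is the standard edit-distance metric, i.e., (substitutions + deletions + insertions) divided by the number of reference characters.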

Cite this article as:
S. Rakuka, K. Morita, and T. Wakabayashi, “Handwritten Character String Recognition Using a String Recognition Transformer,” J. Adv. Comput. Intell. Intell. Inform., Vol.30 No.1, pp. 15-23, 2026.
