
JACIII Vol.29 No.6 pp. 1273-1282
doi: 10.20965/jaciii.2025.p1273
(2025)

Research Paper:

An Improved Byte Pair Encoding Method for Tibetan

Kalzang Gyatso*1, Sonam Tshering*2, Tashi Norbu*3, Nyima Tashi*4,†, Tong Xiao*5, Jingbo Zhu*5, Garma Tashi*4, and Gaden Luosang*4

*1School of Intelligence Science and Engineering, Qinghai Minzu University
No.3 Bayi Middle Road, Xining, Qinghai 810007, China

*2School of Literature, Tibet University
No.36 Jiangsu Road, Chengguan District, Lhasa, Tibet 850000, China

*3College of Engineering, Yanbian University
No.977 Gongyuan Road, Yanji, Jilin 133002, China

*4School of Information Science and Technology, Tibet University
No.36 Jiangsu Road, Chengguan District, Lhasa, Tibet 850000, China

†Corresponding author

*5Northeastern University
No.3-11 Wenhua Road, Heping District, Shenyang, Liaoning 110819, China

Received:
February 27, 2025
Accepted:
June 9, 2025
Published:
November 20, 2025
Keywords:
Tibetan byte pair encoding (BPE), Tibetan-Chinese machine translation, Tibetan agglutinative words
Abstract

Byte pair encoding (BPE) plays a crucial role in natural language processing tasks by effectively reducing vocabulary redundancy and alleviating the out-of-vocabulary problem. However, when applied to Tibetan language tasks, the standard BPE method fails to deliver its full advantages because of the unique characteristics of the Tibetan script. As a result, some subwords in the vocabulary violate standard Tibetan orthographic conventions, introducing noise into the model and degrading downstream task performance. To address this issue, this paper investigates the agglutinative nature of Tibetan words and proposes an improved BPE approach specifically designed for Tibetan. We apply the method to a Tibetan-Chinese machine translation system and evaluate its effectiveness through a series of experiments. The results demonstrate that the proposed method not only corrects malformed subwords and enhances translation quality, but also significantly reduces vocabulary size, laying a solid foundation for future research in Tibetan word representation and downstream natural language processing applications. Our method achieves consistent improvements in BLEU scores across most test sets, with gains exceeding 2 points in the best case.
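For readers unfamiliar with the baseline, the following is a minimal sketch of standard BPE merge learning in the style of Sennrich et al., not the improved Tibetan method proposed in this paper. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy English corpus are all assumptions chosen for illustration. The sketch shows that merges are selected purely by pair frequency, with no knowledge of orthographic units, which is how malformed subwords can enter a vocabulary.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Start from characters plus an end-of-word marker (an assumed convention).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an assumption, not part of the algorithm).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


if __name__ == "__main__":
    # Hypothetical toy corpus; real training would use a large Tibetan corpus.
    toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy, 3)
    print(merges)
    print(vocab)
```

Because the selection criterion is frequency alone, nothing in this baseline prevents a merge from producing a symbol that is not a valid unit of the script. The abstract's proposal is to constrain this process for Tibetan; the specific constraints are described in the full paper and are not reproduced here.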

Cite this article as:
K. Gyatso, S. Tshering, T. Norbu, N. Tashi, T. Xiao, J. Zhu, G. Tashi, and G. Luosang, “An Improved Byte Pair Encoding Method for Tibetan,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.6, pp. 1273-1282, 2025.

