An Improved Byte Pair Encoding Method for Tibetan

Kalzang Gyatso; Sonam Tshering; Tashi Norbu; Nyima Tashi; Tong Xiao; Jingbo Zhu; Garma Tashi; Gaden Luosang

doi:10.20965/jaciii.2025.p1273

JACIII Vol.29 No.6 pp. 1273-1282 (2025)

Research Paper:

Kalzang Gyatso^*1, Sonam Tshering^*2, Tashi Norbu^*3, Nyima Tashi^*4,†, Tong Xiao^*5, Jingbo Zhu^*5, Garma Tashi^*4, and Gaden Luosang^*4

^*1School of Intelligence Science and Engineering, Qinghai Minzu University
No.3 Bayi Middle Road, Xining, Qinghai 810007, China

^*2School of Literature, Tibet University
No.36 Jiangsu Road, Chengguan District, Lhasa, Tibet 850000, China

^*3College of Engineering, Yanbian University
No.977 Gongyuan Road, Yanji, Jilin 133002, China

^*4School of Information Science and Technology, Tibet University
No.36 Jiangsu Road, Chengguan District, Lhasa, Tibet 850000, China

^*5Northeastern University
No.3-11 Wenhua Road, Heping District, Shenyang, Liaoning 110819, China

^†Corresponding author

Received: February 27, 2025
Accepted: June 9, 2025
Published: November 20, 2025

Keywords: Tibetan byte pair encoding (BPE), Tibetan-Chinese machine translation, Tibetan agglutinative words

Abstract

Byte pair encoding (BPE) plays a crucial role in natural language processing tasks by effectively reducing vocabulary redundancy and alleviating the out-of-vocabulary problem. However, when applied to Tibetan, the standard BPE method fails to deliver these advantages fully because of the unique characteristics of the Tibetan script. As a result, some subwords in the vocabulary violate standard Tibetan orthographic conventions, introduce noise into the model, and degrade downstream task performance. To address this issue, this paper investigates the agglutinative nature of Tibetan words and proposes an improved BPE approach designed specifically for Tibetan. We apply the method to a Tibetan-Chinese machine translation system and evaluate its effectiveness through a series of experiments. The results show that the proposed method corrects malformed subwords, improves translation quality, and significantly reduces vocabulary size, laying a solid foundation for future research in Tibetan word representation and downstream natural language processing applications. The method achieves consistent BLEU improvements across most test sets, with gains exceeding 2 points in the best case.
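
The abstract refers to the standard BPE procedure without spelling it out. As background only, a minimal sketch of standard BPE merge learning on a toy English word list might look like the following. The function name `learn_bpe`, the toy word counts, and the tie-breaking rule are assumptions made for illustration; this is not the improved Tibetan method the paper proposes, and it does not model Tibetan orthography in any way.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dict (toy standard BPE)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus, for illustration only.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, 3)
print(merges)
print(segmented)
```

Because each merge is chosen purely by pair frequency, nothing in this procedure checks whether a merged unit is a well-formed piece of the writing system, which is the kind of gap the abstract says the proposed method targets for Tibetan.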

Cite this article as:

K. Gyatso, S. Tshering, T. Norbu, N. Tashi, T. Xiao, J. Zhu, G. Tashi, and G. Luosang, “An Improved Byte Pair Encoding Method for Tibetan,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.6, pp. 1273-1282, 2025.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
