An Improved Byte Pair Encoding Method for Tibetan

Kalzang Gyatso; Sonam Tshering; Tashi Norbu; Nyima Tashi; Tong Xiao; Jingbo Zhu; Garma Tashi; Gaden Luosang

doi:10.20965/jaciii.2025.p1273

Regular Papers


Keywords: Tibetan byte pair encoding (BPE), Tibetan-Chinese machine translation, Tibetan agglutinative words

JOURNAL OPEN ACCESS

2025 Volume 29 Issue 6 Pages 1273-1282



Abstract

Byte pair encoding (BPE) plays a crucial role in natural language processing by reducing vocabulary redundancy and alleviating the out-of-vocabulary problem. However, when applied to Tibetan, standard BPE fails to deliver these advantages fully because of the unique characteristics of the Tibetan script. As a result, some subwords in the vocabulary violate standard Tibetan orthographic conventions, which introduces noise into the model and degrades downstream task performance. To address this issue, this paper investigates the agglutinative nature of Tibetan words and proposes an improved BPE approach designed specifically for Tibetan. We apply the method to a Tibetan-Chinese machine translation system and evaluate its effectiveness through a series of experiments. The results show that the proposed method corrects malformed subwords, improves translation quality, and significantly reduces vocabulary size, laying a foundation for future research on Tibetan word representation and downstream natural language processing applications. The method achieves consistent BLEU improvements across most test sets, with gains exceeding 2 points in the best case.
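For readers unfamiliar with the baseline, the following is a minimal sketch of standard BPE merge learning, the procedure the abstract takes as its starting point. It is not the improved Tibetan method, whose details the abstract does not give. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the English toy corpus are illustrative assumptions.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Greedy step: merge the most frequent pair (ties go to the first seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Hypothetical toy corpus, for illustration only.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges, toy_vocab = learn_bpe(toy_corpus, 3)
print(toy_merges)
```

Because the greedy step considers only pair frequency, nothing prevents a merge from producing a unit that is not a well-formed piece of the script. This is the kind of malformed subword the abstract attributes to standard BPE on Tibetan.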


Funder information

1. Fund name: National Key R&D Program of China

2. Fund name: Natural Science Foundation of Liaoning Province of China

3. Fund name: Major Science and Technology Special Plan Projects of Yunnan Province

4. Fund name: Fundamental Research Funds for the Central Universities

5. Fund name: Fundamental Research Funds for the Central Universities
