2025 Volume 29 Issue 6 Pages 1273-1282
Byte pair encoding (BPE) plays a crucial role in natural language processing tasks by effectively reducing vocabulary redundancy and alleviating the out-of-vocabulary problem. However, when applied to Tibetan language tasks, the standard BPE method fails to fully exploit its advantages due to the unique characteristics of the Tibetan script. As a result, some subwords in the vocabulary violate standard Tibetan orthographic conventions, introducing noise into the model and degrading downstream task performance. To address this issue, this paper investigates the agglutinative nature of Tibetan words and proposes an improved BPE approach specifically designed for Tibetan. We apply the method to a Tibetan-Chinese machine translation system and evaluate its effectiveness through a series of experiments. The results demonstrate that the proposed method not only corrects malformed subwords and enhances translation quality, but also significantly reduces vocabulary size, laying a solid foundation for future research in Tibetan word representation and downstream natural language processing applications. Our method achieves consistent improvements in BLEU scores across most test sets, with gains exceeding 2 points in the best case.