2021 Volume 28 Issue 2 Pages 632-650
This paper proposes a new subword segmentation method for neural machine translation, called bilingual subword segmentation, which tokenizes sentences to minimize the difference between the number of subword units in a sentence and that in its translation. While existing methods tokenize a sentence without considering its translation, the proposed method tokenizes a sentence using subword units obtained from bilingual sentences and is thus suitable for machine translation. The method was evaluated on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese, Japanese-to-English, English-to-Chinese, and Chinese-to-English translation tasks and on the WMT14 English-to-German and German-to-English translation tasks. The evaluation results reveal that the proposed method improves the performance of Transformer neural machine translation (up to +0.81 BLEU (%)).
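The selection criterion stated above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how candidate segmentations are produced or searched, so the sketch assumes each side already has a short list of candidates (for example, an n-best list from some subword segmenter) and simply picks the pair whose subword counts differ least. The function name `pick_bilingual_segmentation`, the tie-breaking rule, and the toy candidates are all hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Segmentation = List[str]


def pick_bilingual_segmentation(
    src_candidates: Sequence[Segmentation],
    tgt_candidates: Sequence[Segmentation],
) -> Tuple[Segmentation, Segmentation]:
    """Return the (source, target) candidate pair whose subword counts differ least.

    Assumption: each candidate list is ordered from most to least preferred,
    so ties are broken in favor of earlier candidates (min() keeps the first
    minimal pair it encounters).
    """
    if not src_candidates or not tgt_candidates:
        raise ValueError("both candidate lists must be non-empty")
    return min(
        product(src_candidates, tgt_candidates),
        key=lambda pair: abs(len(pair[0]) - len(pair[1])),
    )


# Toy candidates, invented for illustration only.
src_candidates = [["un", "believ", "able"], ["unbeliev", "able"]]
tgt_candidates = [["信じ", "られない"], ["信", "じ", "られ", "ない"]]

src_best, tgt_best = pick_bilingual_segmentation(src_candidates, tgt_candidates)
print(src_best, tgt_best)
```

In this toy case the chosen pair has two subwords on each side, so the length difference is zero. Because the selection needs both sides of a sentence pair, a translation system would have to handle inference differently, since the target sentence is not available then; the abstract does not describe that step.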