Bilingual Subword Segmentation for Neural Machine Translation

Hiroyuki Deguchi; Masao Utiyama; Akihiro Tamura; Takashi Ninomiya; Eiichiro Sumita

doi:10.5715/jnlp.28.632

Abstract

This paper proposes a new subword segmentation method for neural machine translation, called bilingual subword segmentation, which tokenizes sentences so as to minimize the difference between the number of subword units in a sentence and the number in its translation. Whereas existing methods tokenize a sentence without considering its translation, the proposed method tokenizes a sentence using subword units obtained from bilingual sentences and is thus suited to machine translation. The method was evaluated on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese, Japanese-to-English, English-to-Chinese, and Chinese-to-English translation tasks and on the WMT14 English-to-German and German-to-English translation tasks. The evaluation results show that the proposed method improves the performance of Transformer neural machine translation by up to +0.81 BLEU (%).
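
The selection criterion stated above can be illustrated with a minimal sketch. It assumes that several candidate segmentations are already available for each side of a sentence pair; the function name `pick_length_matched_pair` and the candidate lists are hypothetical and made up for illustration. How candidates are generated, and how sentences are tokenized when no translation is available, is not described in this abstract and is not covered here.

```python
from itertools import product


def pick_length_matched_pair(src_candidates, tgt_candidates):
    """Return the (source, target) segmentation pair whose subword counts differ least.

    Ties keep the earliest pair in iteration order, so candidates listed
    first are preferred.
    """
    return min(
        product(src_candidates, tgt_candidates),
        key=lambda pair: abs(len(pair[0]) - len(pair[1])),
    )


# Hypothetical candidate segmentations for one sentence pair (illustrative only).
src_candidates = [
    ["▁un", "believ", "able"],
    ["▁unbeliev", "able"],
]
tgt_candidates = [
    ["▁信じ", "られない"],
    ["▁信", "じ", "られ", "ない"],
]

src, tgt = pick_length_matched_pair(src_candidates, tgt_candidates)
print(src, tgt)
```

The exhaustive search over all pairs is chosen only for clarity; it is quadratic in the number of candidates and is not claimed to be the authors' procedure.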

Licensed under CC BY 4.0
https://creativecommons.org/licenses/by/4.0/
