Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper
Bilingual Subword Segmentation for Neural Machine Translation
Hiroyuki Deguchi, Masao Utiyama, Akihiro Tamura, Takashi Ninomiya, Eiichiro Sumita
JOURNAL FREE ACCESS

2021 Volume 28 Issue 2 Pages 632-650

Abstract

This paper proposes a new subword segmentation method for neural machine translation, called bilingual subword segmentation, which tokenizes sentences so as to minimize the difference between the number of subword units in a sentence and the number in its translation. Whereas existing methods tokenize a sentence without considering its translation, the proposed method tokenizes a sentence using subword units obtained from bilingual sentences and is therefore suited to machine translation. The method was evaluated on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese, Japanese-to-English, English-to-Chinese, and Chinese-to-English translation tasks and on the WMT14 English-to-German and German-to-English translation tasks. The evaluation results show that the proposed method improves the performance of Transformer neural machine translation by up to +0.81 BLEU (%).
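
The selection criterion stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pick_length_matched_pair`, the toy candidate lists, and the assumption that several candidate segmentations are already available for each side are all hypothetical. Given those candidates, the sketch simply returns the source/target pair whose subword counts differ least.

```python
from itertools import product
from typing import List, Sequence, Tuple


def pick_length_matched_pair(
    src_candidates: Sequence[List[str]],
    tgt_candidates: Sequence[List[str]],
) -> Tuple[List[str], List[str]]:
    """Return the (source, target) pair whose subword counts differ least.

    Hypothetical helper for illustration only. Ties are broken by the
    order in which candidates are supplied, source side first.
    """
    if not src_candidates or not tgt_candidates:
        raise ValueError("need at least one candidate segmentation per side")
    return min(
        product(src_candidates, tgt_candidates),
        key=lambda pair: abs(len(pair[0]) - len(pair[1])),
    )


# Toy candidates (assumed, not taken from the paper's data).
src = [["trans", "lation"], ["t", "rans", "la", "tion"], ["translation"]]
tgt = [["翻", "訳"], ["翻訳"]]

best_src, best_tgt = pick_length_matched_pair(src, tgt)
print(best_src, best_tgt)
```

Because `min` keeps the first minimal pair it encounters, supplying candidates in order of preference (for example, most likely first) would make that preference act as the tie-breaker; how candidates are produced and ranked is left open here.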

© 2021 The Association for Natural Language Processing