自然言語処理 (Journal of Natural Language Processing)
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper
Sub-Subword N-Gram Features for Subword-Level Neural Machine Translation
Ander Martinez, Katsuhito Sudoh, Yuji Matsumoto
Free access

2021, Volume 28, Issue 1, pp. 82-103

Abstract

Neural machine translation (NMT) systems often use subword segmentation to limit vocabulary sizes. This type of segmentation is particularly useful for morphologically complex languages because their vocabularies can grow prohibitively large. This method can also replace infrequent tokens with more frequent subwords. Fine segmentation with short subword units has been shown to produce better results for smaller training datasets. Character-level NMT, which can be considered an extreme case of subword segmentation in which each subword consists of a single character, can provide enhanced transliteration results, but also tends to produce grammatical errors. We propose a novel approach to this problem that combines subword-level segmentation with character-level information in the form of character n-gram features to construct embedding matrices and softmax output projections for a standard encoder-decoder model. We use a custom algorithm to select a small number of effective binary character n-gram features. Through four sets of experiments, we demonstrate the advantages of the proposed approach for processing resource-limited language pairs. Our proposed approach yields better performance in terms of BLEU score compared to subword- and character-based baseline methods under low-resource conditions. In particular, the proposed approach increases the vocabulary size for small training datasets without reducing translation quality.
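Below is a minimal sketch, in Python/NumPy, of the idea described in the abstract: a subword's embedding is composed from a binary vector of character n-gram features through a learned feature-embedding matrix, and the same construction can be applied to the softmax output projection. The feature set, dimensions, and function names here are illustrative assumptions, not the authors' implementation; in the paper, the effective binary n-gram features are chosen by a custom selection algorithm.

# Minimal illustrative sketch (not the authors' code): compose a subword
# embedding from binary character n-gram features via a feature-embedding
# matrix. The feature set and dimensions are hypothetical.
import numpy as np

def char_ngrams(subword, n_max=3):
    # Enumerate character n-grams (n = 1..n_max), with boundary markers so
    # that prefixes and suffixes are distinguishable from inner n-grams.
    s = "<" + subword + ">"
    return {s[i:i + n] for n in range(1, n_max + 1) for i in range(len(s) - n + 1)}

# Hypothetical selected feature set; the paper selects a small set of
# effective binary n-gram features with a custom algorithm.
selected_features = ["<l", "lo", "ow", "w>", "l", "o", "w"]
feature_index = {f: i for i, f in enumerate(selected_features)}

def binary_feature_vector(subword):
    # Binary indicator vector over the selected n-gram features.
    v = np.zeros(len(selected_features), dtype=np.float32)
    for g in char_ngrams(subword):
        if g in feature_index:
            v[feature_index[g]] = 1.0
    return v

# A learned feature-embedding matrix (random here, for illustration) maps the
# binary feature vector to a dense subword embedding for the encoder-decoder.
rng = np.random.default_rng(0)
feature_embeddings = rng.normal(size=(len(selected_features), 8)).astype(np.float32)

subword_embedding = binary_feature_vector("low") @ feature_embeddings
print(subword_embedding.shape)  # (8,)

Because embeddings are built from shared n-gram features rather than one independent row per subword, rare subwords that share character n-grams with frequent ones receive informative representations, which is why the approach can support larger vocabularies under low-resource conditions.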

© 2021 The Association for Natural Language Processing