自然言語処理 (Journal of Natural Language Processing)
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper
Sub-Subword N-Gram Features for Subword-Level Neural Machine Translation
Ander Martinez, Katsuhito Sudoh, Yuji Matsumoto
Free access

2021, Volume 28, Issue 1, pp. 82-103

Abstract

Neural machine translation (NMT) systems often use subword segmentation to limit vocabulary sizes. This type of segmentation is particularly useful for morphologically complex languages because their vocabularies can grow prohibitively large. This method can also replace infrequent tokens with more frequent subwords. Fine segmentation with short subword units has been shown to produce better results for smaller training datasets. Character-level NMT, which can be considered an extreme case of subword segmentation in which each subword consists of a single character, can provide enhanced transliteration results, but also tends to produce grammatical errors. We propose a novel approach to this problem that combines subword-level segmentation with character-level information in the form of character n-gram features to construct embedding matrices and softmax output projections for a standard encoder-decoder model. We use a custom algorithm to select a small number of effective binary character n-gram features. Through four sets of experiments, we demonstrate the advantages of the proposed approach for processing resource-limited language pairs. Our proposed approach yields better performance in terms of BLEU score compared to subword- and character-based baseline methods under low-resource conditions. In particular, the proposed approach increases the vocabulary size for small training datasets without reducing translation quality.
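Below is a minimal sketch, in Python/NumPy, of the idea described in the abstract: a subword's embedding is composed from a binary vector of character n-gram features through a learned feature-embedding matrix, and the same construction can be applied to the softmax output projection. The feature set, dimensions, and function names here are illustrative assumptions, not the authors' implementation; in the paper, the effective binary n-gram features are chosen by a custom selection algorithm.

# Minimal illustrative sketch (not the authors' code): compose a subword
# embedding from binary character n-gram features via a feature-embedding
# matrix. The feature set and dimensions are hypothetical.
import numpy as np

def char_ngrams(subword, n_max=3):
    # Enumerate character n-grams (n = 1..n_max), with boundary markers so
    # that prefixes and suffixes are distinguishable from inner n-grams.
    s = "<" + subword + ">"
    return {s[i:i + n] for n in range(1, n_max + 1) for i in range(len(s) - n + 1)}

# Hypothetical selected feature set; the paper selects a small set of
# effective binary n-gram features with a custom algorithm.
selected_features = ["<l", "lo", "ow", "w>", "l", "o", "w"]
feature_index = {f: i for i, f in enumerate(selected_features)}

def binary_feature_vector(subword):
    # Binary indicator vector over the selected n-gram features.
    v = np.zeros(len(selected_features), dtype=np.float32)
    for g in char_ngrams(subword):
        if g in feature_index:
            v[feature_index[g]] = 1.0
    return v

# A learned feature-embedding matrix (random here, for illustration) maps the
# binary feature vector to a dense subword embedding for the encoder-decoder.
rng = np.random.default_rng(0)
feature_embeddings = rng.normal(size=(len(selected_features), 8)).astype(np.float32)

subword_embedding = binary_feature_vector("low") @ feature_embeddings
print(subword_embedding.shape)  # (8,)

Because embeddings are built from shared n-gram features rather than one independent row per subword, rare subwords that share character n-grams with frequent ones receive informative representations, which is why the approach can support larger vocabularies under low-resource conditions.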

© 2021 The Association for Natural Language Processing