Recent advances in the pre-training and fine-tuning paradigm have brought substantial gains to many natural language processing tasks, including machine translation (MT), particularly in low-resource settings. However, leveraging out-of-domain data has been reported to be less effective, and sometimes even harmful, for MT in high-resource settings, where further improvement is still needed. In this study, we focus on domain-specific, dedicated neural machine translation (NMT) models, which retain advantages in high-resource settings with respect to translation quality and inference cost. Given the large impact of the domain discrepancy between pre-training and fine-tuning (or training) in MT, we revisit in-domain pre-training of the embedding layers of Transformer-based NMT models, in which the embeddings are pre-trained on the same training data as the target translation task. Experiments on two translation tasks, ASPEC English-to-Japanese and WMT2017 English-to-German, demonstrate that in-domain pre-training of the embedding layers of a Transformer-based NMT model improves translation quality without any negative impact and leads to earlier convergence in training. Additional experiments confirm that pre-training the embedding layer of the encoder is more important than pre-training that of the decoder, and that the benefit does not vanish as the training data size increases. An analysis of the embeddings reveals that pre-training the embedding layers has a particularly large impact on low-frequency tokens.
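The following is a minimal sketch of the general idea described above, not the authors' implementation: embedding matrices pre-trained on the same in-domain corpus used for the translation task are copied into the embedding layers of a Transformer NMT model before training. All names, sizes, and the placeholder pre-trained matrix are hypothetical; positional encoding and other details are omitted.

```python
# Sketch only: assumes PyTorch and an embedding matrix `pretrained_src`
# obtained elsewhere (e.g., trained on the in-domain source-side data).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 8000, 8000, 512  # hypothetical sizes


class SimpleNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)   # encoder embeddings
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)   # decoder embeddings
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_in = self.src_embed(src_ids)
        dec_in = self.tgt_embed(tgt_ids)
        out = self.transformer(enc_in, dec_in)
        return self.generator(out)


model = SimpleNMT(SRC_VOCAB, TGT_VOCAB, D_MODEL)

# Placeholder for embeddings pre-trained on the *same* in-domain training
# data as the translation task (here just random for illustration).
pretrained_src = torch.randn(SRC_VOCAB, D_MODEL)

with torch.no_grad():
    # Initialize the encoder embedding layer with the in-domain embeddings;
    # it is then fine-tuned together with the rest of the NMT model.
    model.src_embed.weight.copy_(pretrained_src)
```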