Journal of Information Processing
Online ISSN : 1882-6652
Inflating a Small Parallel Corpus into a Large Quasi-parallel Corpus Using Monolingual Data for Chinese-Japanese Machine Translation
Wei Yang, Hanfei Shen, Yves Lepage

Volume 25 (2017) Pages 88-99

Abstract

Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method that constructs a quasi-parallel corpus by inflating a small Chinese-Japanese parallel corpus, with the aim of improving statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two methods: one based on BLEU and the other based on N-sequences. We add the resulting aligned quasi-parallel corpus to a small Chinese-Japanese parallel corpus and perform SMT experiments. We obtain significant improvements over a baseline system.
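
The abstract does not spell out how the N-sequence filter works. As a rough illustration only, the sketch below assumes one plausible reading: a generated sentence is kept only if every character sequence of length N that it contains also occurs in a monolingual reference corpus. The function names, the character-level treatment, and the default value of N are assumptions made for this example and are not taken from the paper.

```python
def char_ngrams(sentence, n):
    """Return the set of all contiguous character sequences of length n."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def build_reference(corpus, n):
    """Collect every character n-gram attested in the reference corpus."""
    attested = set()
    for sentence in corpus:
        attested |= char_ngrams(sentence, n)
    return attested


def keep_sentence(candidate, attested, n):
    """Keep a candidate only if all of its n-grams are attested.

    Candidates shorter than n have no n-grams; rejecting them is a
    conservative choice made for this sketch, not a documented rule.
    """
    grams = char_ngrams(candidate, n)
    return bool(grams) and grams <= attested


def filter_candidates(candidates, corpus, n=3):
    """Hypothetical helper: filter over-generated sentences against a corpus."""
    attested = build_reference(corpus, n)
    return [c for c in candidates if keep_sentence(c, attested, n)]


if __name__ == "__main__":
    reference = ["the cat sat", "the dog ran"]
    generated = ["the cat ran", "the cow ran"]
    print(filter_candidates(generated, reference, n=2))
```

In this toy run with N = 2, "the cow ran" is discarded because the sequence "co" never appears in the reference sentences, while "the cat ran" is kept. Characters are used as units here because Chinese and Japanese text is not space-delimited, which is again an assumption of the sketch.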

© 2017 by the Information Processing Society of Japan