Paraphrasing Training Data for Statistical Machine Translation

Eric Nichols; Francis Bond; D. Scott Appling; Yuji Matsumoto

doi:10.11185/imt.5.950

Information Systems and Applications

Keywords: Natural Language Processing, Machine Translation, Paraphrasing, HPSG

2010, Volume 5, Issue 2, pp. 950-971

Abstract

Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is produced by parsing each sentence and then generating from the parse with an open-source, precise HPSG-based grammar. This yields sentences with the same meaning but with minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
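
The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the HPSG parse-then-generate step is replaced by a hypothetical `toy_paraphrase` lookup, and the names `expand_parallel_corpus` and `max_paraphrases` are assumptions introduced here. The sketch only shows how each English paraphrase might be paired with the unchanged Japanese sentence so that the original pair is kept and new pairs are added.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (Japanese sentence, English sentence)


def expand_parallel_corpus(
    pairs: Iterable[SentencePair],
    paraphrase: Callable[[str], List[str]],
    max_paraphrases: int = 2,  # assumed cap, not a value from the paper
) -> List[SentencePair]:
    """Keep every original pair and add (Japanese, paraphrase) pairs."""
    expanded: List[SentencePair] = []
    for japanese, english in pairs:
        expanded.append((japanese, english))
        seen = {english}  # skip paraphrases identical to the input
        added = 0
        for candidate in paraphrase(english):
            if added >= max_paraphrases:
                break
            if candidate in seen:
                continue
            seen.add(candidate)
            expanded.append((japanese, candidate))
            added += 1
    return expanded


# Hypothetical stand-in for parsing and generating with an HPSG grammar.
TOY_PARAPHRASES = {
    "I will go home tomorrow.": [
        "Tomorrow I will go home.",
        "I will go home tomorrow.",  # duplicate of the input, filtered out
    ],
}


def toy_paraphrase(sentence: str) -> List[str]:
    # Sentences with no entry simply contribute no paraphrases.
    return TOY_PARAPHRASES.get(sentence, [])


corpus = [
    ("明日家に帰ります。", "I will go home tomorrow."),
    ("彼は学生です。", "He is a student."),
]
expanded = expand_parallel_corpus(corpus, toy_paraphrase)
```

Sentences for which the paraphraser returns nothing pass through unchanged, so the expanded corpus is never smaller than the original one.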
