Information and Media Technologies
Online ISSN : 1881-0896
ISSN-L : 1881-0896
Information Systems and Applications
Paraphrasing Training Data for Statistical Machine Translation
Eric Nichols, Francis Bond, D. Scott Appling, Yuji Matsumoto
Journal, free access

2010, Volume 5, Issue 2, pp. 950-971

Abstract

Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is made by parsing and then generating with an open-source, precise HPSG-based grammar. This yields sentences with the same meaning but with minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
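
The expansion step described in the abstract can be illustrated with a minimal sketch. It assumes a paraphraser is available as a plain function that maps one English sentence to a list of alternatives; in the paper that role is played by parsing and generating with an HPSG-based grammar, which is not reproduced here. The names `expand_parallel_corpus`, `toy_paraphrase`, and `max_paraphrases` are hypothetical and chosen only for illustration, and the example sentences are made up.

```python
from typing import Callable, Iterable, List, Tuple

# One training example: (Japanese sentence, English sentence).
SentencePair = Tuple[str, str]


def expand_parallel_corpus(
    pairs: Iterable[SentencePair],
    paraphrase: Callable[[str], List[str]],
    max_paraphrases: int = 2,
) -> List[SentencePair]:
    """Return the original pairs plus extra pairs built from paraphrases.

    Only the English side is paraphrased; each paraphrase is paired with
    the unchanged Japanese sentence. `paraphrase` is an assumed interface,
    standing in for a parse-then-generate step.
    """
    expanded: List[SentencePair] = []
    for japanese, english in pairs:
        # Always keep the original pair.
        expanded.append((japanese, english))
        seen = {english}
        added = 0
        for candidate in paraphrase(english):
            if added >= max_paraphrases:
                break
            # Skip candidates identical to the original or already added.
            if candidate in seen:
                continue
            seen.add(candidate)
            expanded.append((japanese, candidate))
            added += 1
    return expanded


def toy_paraphrase(sentence: str) -> List[str]:
    """Hypothetical stand-in paraphraser backed by a fixed lookup table."""
    table = {
        "I often go there.": [
            "I go there often.",
            "Often I go there.",
            "I often go there.",
        ],
    }
    return table.get(sentence, [])


if __name__ == "__main__":
    corpus = [
        ("私はよくそこに行く。", "I often go there."),
        ("猫だ。", "It is a cat."),
    ]
    for japanese, english in expand_parallel_corpus(corpus, toy_paraphrase):
        print(japanese, "|||", english)
```

Sentences for which the paraphraser returns nothing simply contribute their original pair, so the expanded corpus is never smaller than the input. The cap on paraphrases per sentence is a design choice of this sketch, not a setting taken from the paper.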

© 2010 by The Association for Natural Language Processing