Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Article
Paraphrasing Training Data for Statistical Machine Translation
Eric Nichols, Francis Bond, D. Scott Appling, Yuji Matsumoto

2010 Volume 17 Issue 3 Pages 3_101-3_122

Abstract
Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is produced by parsing each sentence and then generating from the parse with an open-source, precise HPSG-based grammar. This yields sentences with the same meaning but with minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training data sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
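The expansion step described in the abstract can be illustrated with a minimal sketch. It assumes a paraphraser is available as a plain function from a sentence to candidate paraphrases. The names `expand_corpus`, `toy_paraphrase`, and `max_paraphrases` are hypothetical and do not come from the paper, and the lookup-table paraphraser is only a stand-in for the HPSG parse-then-generate step, which is not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple

# (Japanese source sentence, English target sentence)
SentencePair = Tuple[str, str]


def expand_corpus(
    pairs: Iterable[SentencePair],
    paraphrase: Callable[[str], Iterable[str]],
    max_paraphrases: int = 2,
) -> List[SentencePair]:
    """Return the original pairs plus pairs whose English side is paraphrased.

    Each source sentence is kept unchanged and re-paired with up to
    `max_paraphrases` distinct paraphrases of its English translation.
    """
    expanded: List[SentencePair] = []
    for source, target in pairs:
        expanded.append((source, target))  # always keep the original pair
        seen = {target}                    # skip paraphrases identical to known text
        added = 0
        for candidate in paraphrase(target):
            if added >= max_paraphrases:
                break
            if candidate in seen:
                continue
            seen.add(candidate)
            expanded.append((source, candidate))
            added += 1
    return expanded


def toy_paraphrase(sentence: str) -> List[str]:
    """Hypothetical stand-in for a grammar-based paraphraser (assumption)."""
    table = {"I like dogs.": ["I like dogs.", "Dogs, I like."]}
    return table.get(sentence, [])


if __name__ == "__main__":
    corpus = [("犬が好きです。", "I like dogs.")]
    for src, tgt in expand_corpus(corpus, toy_paraphrase):
        print(src, "|||", tgt)
```

Keeping the original pair and filtering out paraphrases identical to it means the expanded corpus is always a superset of the original, so any change in translation quality can be attributed to the added variants.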
© 2010 The Association for Natural Language Processing