Generative word alignment models, such as the IBM Models, are restricted to one-to-many alignments and cannot explicitly represent the many-to-many relationships found in bilingual texts. This problem is partially solved either by introducing heuristics or by imposing agreement constraints that force the two directional word alignments to agree with each other. However, such constraints cannot take into account the grammatical differences between language pairs. In particular, function words are not trivial to align for grammatically divergent language pairs, such as Japanese and English. In this paper, we focus on the posterior regularization framework (Ganchev, Graca, Gillenwater, and Taskar 2010), which can force the two directional word alignment models to agree with each other during training, and propose new constraints that take into account the difference between function words and content words. We discriminate between function words and content words using word frequency, in the same way as Setiawan, Kan, and Li (2007). Experimental results show that our proposed constraints achieved better alignment quality, as measured by AER and F-measure, on the French-English Hansard task and the Japanese-English Kyoto free translation task (KFTT). In translation evaluations, we achieved statistically significant gains in BLEU score on the Japanese-English NTCIR10 task and the Spanish-English WMT06 task.
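The frequency-based distinction between function words and content words described above can be sketched as follows. This is a minimal illustration only: the `top_k` cutoff and the classify-by-frequency-rank scheme are assumptions for the sketch, not the paper's exact procedure.

```python
from collections import Counter

def split_function_content(corpus_tokens, top_k=50):
    """Frequency-based heuristic: treat the top_k most frequent word
    types as function words and all remaining types as content words.
    (top_k is an assumed parameter for this sketch.)"""
    counts = Counter(corpus_tokens)
    function_words = {w for w, _ in counts.most_common(top_k)}
    content_words = set(counts) - function_words
    return function_words, content_words

# Toy corpus: very frequent words ("the", "on", ...) surface as function words.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
fw, cw = split_function_content(tokens, top_k=3)
```

The intuition is that closed-class function words dominate the high-frequency end of the vocabulary, so a simple frequency cutoff separates the two classes well enough for alignment constraints.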
In syntax-based machine translation, it is known that parsing accuracy greatly affects translation accuracy. Self-training, which uses parser output as training data, is one method for improving parser accuracy. However, because automatically generated parse trees often include errors, they do not always contribute to improving accuracy. In this paper, we propose a method for removing noisy, incorrect parse trees from the training data to improve the effect of self-training, using automatic evaluation metrics of translations. Specifically, we perform syntax-based machine translation with n-best parse trees, and then re-score the parse trees based on the automatic evaluation scores of their translations. By using the highest-scoring parse trees among the candidates for self-training, we can improve parsing and machine translation accuracy using parallel corpora that have no syntactic annotation. In experiments using the higher-scoring parse trees for self-training, our self-trained parsers significantly improved a state-of-the-art syntax-based machine translation system for two language pairs, and also significantly improved the accuracy of parsing itself.
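The re-scoring step above can be sketched as follows. This is a hypothetical sketch, not the paper's implementation: `translate` (parse tree to translation string) and the metric are assumed interfaces, and a toy unigram-overlap function stands in for a real MT metric such as BLEU.

```python
def unigram_overlap(hyp, ref):
    # Toy stand-in for an automatic MT evaluation metric such as BLEU:
    # fraction of hypothesis tokens that are matched in the reference.
    h, r = hyp.split(), ref.split()
    return sum(min(h.count(w), r.count(w)) for w in set(h)) / max(len(h), 1)

def best_parse(nbest_parses, translate, reference, metric=unigram_overlap):
    # Translate with each candidate parse tree and keep the parse whose
    # translation scores highest under the automatic metric; the selected
    # parses then serve as self-training data.
    return max(nbest_parses, key=lambda p: metric(translate(p), reference))

# Toy example: two candidate parses with (hypothetical) translations.
translations = {"P1": "the cat sleeps", "P2": "cat the sleeping"}
best = best_parse(["P1", "P2"], translations.get, "the cat sleeps")
```

The selected parse is the one whose translation best matches the reference, which is how translation quality is used as a proxy signal for parse quality.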