Abstract
Automatic evaluation of Machine Translation (MT) quality is essential for developing high-quality MT systems. Various evaluation metrics have been proposed, and among them, BLEU is widely used as the de facto standard. BLEU counts N-grams common to the reference and hypothesis translations, while ROUGE-L measures their longest common subsequence. However, these methods have problems. Human evaluators give high scores to Rule-based MT (RBMT), but these metrics do not, because RBMT tends to use alternative words. Conventional metrics penalize word differences severely, whereas human evaluators accept them as long as the translation conveys the same meaning. In addition, Statistical MT (SMT) tends to translate “A because B” as “B because A” when translating between Japanese and English; since BLEU is insensitive to global word order, this severe mistake is penalized only slightly. To take global word order into account, this paper proposes a lenient automatic evaluation metric based on the rank correlation of word order. By focusing only on words common to the two translations, the method is lenient toward the use of alternative words. Word differences are measured by word precision, whose weight is controlled by a parameter. In experiments with submissions to the NTCIR-7 and NTCIR-9 Patent Translation tasks, the proposed metric outperforms conventional metrics in system-level comparison.
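To make the idea concrete, the following is a minimal sketch of such a metric, assuming a normalized Kendall's tau as the rank correlation and a parameter `alpha` weighting word precision; the choice of correlation, the greedy word alignment, and the parameter name are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations

def rank_correlation_score(reference: str, hypothesis: str, alpha: float = 0.25) -> float:
    """Sketch of a word-order metric: rank correlation over common words,
    weighted by word precision raised to a parameter (assumed form)."""
    ref, hyp = reference.split(), hypothesis.split()

    # Align each hypothesis word to the first unused matching position
    # in the reference (a simplified alignment for illustration).
    used = [False] * len(ref)
    ranks = []
    for w in hyp:
        for i, r in enumerate(ref):
            if r == w and not used[i]:
                used[i] = True
                ranks.append(i)
                break

    if len(ranks) < 2:
        return 0.0

    # Normalized Kendall's tau: fraction of common-word pairs whose
    # relative order agrees between hypothesis and reference, in [0, 1].
    pairs = list(combinations(range(len(ranks)), 2))
    concordant = sum(1 for i, j in pairs if ranks[i] < ranks[j])
    nkt = concordant / len(pairs)

    # Word precision, with its influence controlled by alpha.
    precision = len(ranks) / len(hyp)
    return nkt * precision ** alpha
```

On the abstract's own example, swapping clause order is penalized heavily: `rank_correlation_score("he ate lunch because he was hungry", "because he was hungry he ate lunch")` yields a low score even though every word matches, whereas substituting a few alternative words only mildly reduces the precision term.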