Grammatical error correction (GEC) systems have typically been evaluated on a single corpus, the CoNLL-2014 benchmark. However, such evaluation is incomplete because task difficulty varies with the properties of the test corpus, including the writers' proficiency levels and the essay topics. This study explores the necessity of cross-corpora evaluation for GEC systems, based on the hypothesis that evaluation on a single corpus is insufficient. Specifically, we evaluated the performance of four GEC models (based on LSTM, CNN, Transformer, and SMT) on six corpora (CoNLL-2013, CoNLL-2014, FCE, JFLEG, KJ, and BEA-2019). The results revealed that model rankings vary considerably across corpora, indicating that single-corpus evaluation is insufficient for GEC models. Moreover, cross-sectional evaluation is useful not only as a meta-evaluation method but also for practical applications. As a case study of its usefulness, we investigated cross-sectional evaluation using the writer's proficiency level, a typical property of GEC input, as the unit of evaluation. The results showed a large divergence in evaluation between the beginner-to-intermediate and advanced proficiency levels.