In recent years, various types of learner corpora have been compiled and used for linguistic and educational research. With the development of web-based applications for language learners, large amounts of learner output can now be collected on the web. These learner corpora contain incorrect sentences as well as correct ones, and we aim to take advantage of the incorrect sentences for linguistic and educational research. To this end, this study uses a machine-learning method to automatically classify incorrect sentences written by learners of Japanese according to error type (or class). First, we annotate a corpus of learners’ writing with error types defined in a tree-structured class set. Second, we implement a hierarchical error-type classification model that uses the tree-structured class set. The proposed method outperforms a flat-structured multiclass classification baseline model by 13 points on the error-classification task. Third, we explore features for error-type classification. As baseline features, we use contextual information and syntactic information, such as dependency relations. In addition, because a learner corpus contains both correct and incorrect sentences, we propose two extended features: the edit distance between correct and incorrect usages, and the substitution probability with which characters in a sequence change to other characters. Although performance varies by error type, the proposed model with all features outperforms the model with only the baseline features by six points.
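The two extended features can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes that each incorrect usage is available together with its correction, it uses the standard Levenshtein distance with unit costs, and it estimates substitution probabilities only from equal-length pairs aligned position by position. The function names `edit_distance` and `substitution_probabilities` and the toy pairs are hypothetical.

```python
from collections import Counter


def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings, with unit costs (an assumption)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def substitution_probabilities(pairs):
    """Estimate P(correct character | learner character) by relative frequency.

    Simplifying assumption: only equal-length (incorrect, correct) pairs are
    used, and characters are aligned position by position.
    """
    counts = Counter()
    totals = Counter()
    for incorrect, correct in pairs:
        if len(incorrect) != len(correct):
            continue  # pairs that need insertions or deletions are skipped here
        for learner_char, correct_char in zip(incorrect, correct):
            counts[(learner_char, correct_char)] += 1
            totals[learner_char] += 1
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Toy (incorrect, correct) pairs, invented for illustration only.
toy_pairs = [
    ("私わ", "私は"),
    ("こんにちわ", "こんにちは"),
    ("わたし", "わたし"),
]
probabilities = substitution_probabilities(toy_pairs)
distance = edit_distance("私わ学生です", "私は学生です")
```

In this sketch, `distance` and the entries of `probabilities` would be appended to the baseline feature vector of each instance. A fuller version would derive the character alignment from the edit-distance backtrace so that pairs of different lengths also contribute.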