Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 23, Issue 2
Preface
Paper
  • Masaaki Nishino, Jun Suzuki, Shunji Umetani, Tsutomu Hirao, Masaaki Na ...
    2016 Volume 23 Issue 2 Pages 175-194
    Published: March 14, 2016
    Released on J-STAGE: June 14, 2016
    JOURNAL FREE ACCESS
    Sequence alignment, which involves aligning the elements of two given sequences, arises in many natural language processing (NLP) tasks such as sentence alignment. Previous approaches to sequence alignment problems in NLP fall into two groups: the first assumes monotonicity of alignments, while the second neither assumes monotonicity nor considers the continuity of alignments. However, in aligning the sentences of parallel legal documents, for example, it is desirable to use a sentence alignment method that does not assume monotonicity but can consider continuity. Herein, we present a method for aligning sequences in which block-wise changes in the order of sequence elements exist. Our method formalizes a sequence alignment problem as a set partitioning problem, a type of combinatorial optimization problem, and solves that problem to obtain an alignment. We also propose an efficient algorithm that solves the optimization problem by applying column generation. (A toy set-partitioning formulation is sketched after this entry.)
    Download PDF (963K)
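
The following is a minimal sketch, not the authors' implementation, of casting block-wise sequence alignment as a set partitioning problem. It uses the PuLP ILP modeller; the score() function, the span-length limit, and the toy inputs are illustrative assumptions, and all candidate columns are enumerated up front rather than generated lazily by column generation as in the paper.

import itertools
import pulp

def spans(n, max_len=3):
    """All contiguous spans (i, j) with 0 <= i < j <= n and length <= max_len."""
    return [(i, j) for i in range(n) for j in range(i + 1, min(n, i + max_len) + 1)]

def score(a_block, b_block):
    """Toy similarity: token overlap between the two blocks (assumed scorer)."""
    return len(set(" ".join(a_block).split()) & set(" ".join(b_block).split()))

def align(a, b):
    # Each candidate column pairs a contiguous span of a with a contiguous span of b.
    columns = [((ai, aj), (bi, bj), score(a[ai:aj], b[bi:bj]))
               for (ai, aj), (bi, bj) in itertools.product(spans(len(a)), spans(len(b)))]

    prob = pulp.LpProblem("block_alignment", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{k}", cat="Binary") for k in range(len(columns))]
    prob += pulp.lpSum(c[2] * x[k] for k, c in enumerate(columns))

    # Set partitioning constraints: every element of a and of b is covered by
    # exactly one selected column, so non-monotonic block orders are allowed.
    for p in range(len(a)):
        prob += pulp.lpSum(x[k] for k, c in enumerate(columns) if c[0][0] <= p < c[0][1]) == 1
    for q in range(len(b)):
        prob += pulp.lpSum(x[k] for k, c in enumerate(columns) if c[1][0] <= q < c[1][1]) == 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [c[:2] for k, c in enumerate(columns) if x[k].value() == 1]

if __name__ == "__main__":
    a = ["article one", "article two", "article three"]
    b = ["article three", "article one", "article two"]  # block order swapped
    print(align(a, b))

On this toy input the solver pairs each sentence of a with its counterpart in b even though the block order differs, which is the non-monotonic, continuity-preserving behaviour the abstract describes.
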
  • Hiromi Oyama, Mamoru Komachi, Yuji Matsumoto
    2016 Volume 23 Issue 2 Pages 195-225
    Published: March 14, 2016
    Released on J-STAGE: June 14, 2016
    JOURNAL FREE ACCESS
    Recently, various types of learner corpora have been compiled and utilized for linguistic and educational research. As web-based application programs have been developed for language learners, we can now collect a large amount of language learners’ output on the web. These learner corpora include not only correct sentences but also incorrect ones, and we aim to take advantage of the latter for linguistic and educational research. To this end, this study automatically classifies incorrect sentences written by learners of Japanese according to error types (or classes) using a machine-learning method. First, we annotate a corpus of learners’ writing with error types defined in a tree-structured class set. Second, we implement a hierarchical error-type classification model using the tree-structured class set. The proposed method outperforms a flat-structured multiclass classification baseline on the error-classification task by 13 points. Third, we explore features for error-type classification. We use contextual information and syntactic information, such as dependency relations, as baseline features. In addition, because a learner corpus contains not only correct sentences but also incorrect ones, we propose two extended features: the edit distance between correct and incorrect usages, and the probability with which characters in a sequence are substituted by other characters. Although performance varies across error types, the proposed model with all features outperforms the model with only the baseline features by six points. (The two extended features are sketched after this entry.)
    Download PDF (669K)
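
Below is a rough sketch, under stated assumptions rather than the authors' code, of the two extended features described above for a learner sentence paired with its correction: (1) the character-level edit distance between the incorrect and correct forms, and (2) a substitution probability estimated from character-level differences observed in a toy collection of correction pairs. The alignment used for counting substitutions is a simplistic position-wise comparison, and the exact feature definitions in the paper may differ.

from collections import Counter

def edit_distance(s, t):
    """Standard Levenshtein distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution / match
        prev = cur
    return prev[-1]

def substitution_counts(pairs):
    """Count character substitutions over (incorrect, correct) pairs using a
    position-wise comparison (a real system would backtrace the DP table)."""
    counts = Counter()
    for wrong, right in pairs:
        for cw, cr in zip(wrong, right):
            if cw != cr:
                counts[(cw, cr)] += 1
    return counts

def substitution_probability(wrong, right, counts):
    """Average relative frequency of the character substitutions needed to turn
    `wrong` into `right`, estimated from `counts`."""
    diffs = [(cw, cr) for cw, cr in zip(wrong, right) if cw != cr]
    if not diffs:
        return 0.0
    total = sum(counts.values()) or 1
    return sum(counts[d] / total for d in diffs) / len(diffs)

if __name__ == "__main__":
    # Toy correction pairs (incorrect, correct); real input would be learner corpus text.
    pairs = [("学校に行きます", "学校へ行きます"), ("学校に向かう", "学校へ向かう")]
    counts = substitution_counts(pairs)
    wrong, right = "駅に行きます", "駅へ行きます"
    print({"edit_distance": edit_distance(wrong, right),
           "substitution_prob": substitution_probability(wrong, right, counts)})

Features of this kind can then be concatenated with the contextual and dependency-based baseline features before training the hierarchical classifier.
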