It is well known that directly parsing a long Japanese sentence that contains many conjunctive clauses is extremely difficult. It is therefore preferable to segment such a sentence into shorter, simpler sentences before parsing. Several methods for sentence segmentation have been reported. However, because these conventional methods rely on handmade segmentation patterns or rules, it is difficult to keep the patterns consistent and to decide the optimal order in which to apply the rules. This paper proposes a new method of sentence segmentation using a decision tree, which automatically acquires the optimal segmentation patterns and the optimal order of their application from a corpus, taking both linguistic phenomena and their occurrence frequencies into account. A decision tree for sentence segmentation was generated and evaluated on the EDR corpus. For 400 evaluation sentences, precision and recall were both 84%, and 77% of the sentences were segmented correctly. It was also confirmed that pruning significantly reduces the tree size without degrading performance.