Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
The Application of Classification Trees to Bunsetsu Segmentation of Japanese Sentences
Yujie Zhang, Kazuhiko Ozeki

1998 Volume 5 Issue 4 Pages 17-33

Abstract
In conventional bunsetsu segmentation methods for Japanese sentences, the segmentation rules are written by hand. This makes it difficult to keep the rules consistent and to decide on an efficient order in which to apply them. This paper proposes a method of automatic bunsetsu segmentation using a classification tree, in which knowledge about bunsetsu boundaries is acquired automatically from a corpus and an efficient order of rule application is obtained automatically. The method can adapt quickly to a new part-of-speech system and to a new task domain without any change to the algorithm. Classification trees for bunsetsu segmentation were generated and evaluated on an ATR corpus and an EDR corpus. A segmentation accuracy of 98.9% was achieved on the ATR corpus and 96.2% on the EDR corpus. The method was compared with a simple rule-based method and with the Bayes decision rule on the ATR corpus. It outperformed the rule-based method when the training data contained more than about 20 sentences, and it outperformed the Bayes decision rule over the whole range of training data sizes. Its advantage over the rule-based method was more evident with larger training sets, and its advantage over the Bayes decision rule was more evident with smaller training sets.
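To make the idea concrete, the following is a minimal sketch of how a classification tree might learn bunsetsu boundaries from labeled gaps between adjacent morphemes. It is not the authors' implementation. The abstract does not state which attributes or splitting criterion the paper uses, so the feature names (`left_pos`, `right_pos`), the part-of-speech tags, the entropy-based yes/no questions, and the eight toy training examples are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_question(rows, labels):
    """Return the (feature, value) yes/no question with the largest entropy reduction."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for feat in rows[0]:
        for val in sorted({r[feat] for r in rows}):  # sorted for a deterministic tree
            yes = [l for r, l in zip(rows, labels) if r[feat] == val]
            no = [l for r, l in zip(rows, labels) if r[feat] != val]
            if not yes or not no:
                continue
            remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
            gain = base - remainder
            if gain > best_gain + 1e-12:  # keep the first question on ties
                best, best_gain = (feat, val), gain
    return best

def grow(rows, labels):
    """Grow a binary tree; a node is (feature, value, yes_subtree, no_subtree) or a leaf label."""
    question = best_question(rows, labels)
    if question is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    feat, val = question
    yes_idx = [i for i, r in enumerate(rows) if r[feat] == val]
    no_idx = [i for i, r in enumerate(rows) if r[feat] != val]
    return (feat, val,
            grow([rows[i] for i in yes_idx], [labels[i] for i in yes_idx]),
            grow([rows[i] for i in no_idx], [labels[i] for i in no_idx]))

def classify(tree, row):
    """Follow the yes/no questions down to a leaf; True means 'bunsetsu boundary here'."""
    while isinstance(tree, tuple):
        feat, val, yes, no = tree
        tree = yes if row[feat] == val else no
    return tree

# Hypothetical toy training set: each item is the gap between two adjacent morphemes,
# described by their parts of speech, with True if a bunsetsu boundary falls in the gap.
examples = [
    (("noun", "particle"), False),
    (("particle", "noun"), True),
    (("particle", "verb"), True),
    (("verb", "aux"), False),
    (("aux", "noun"), True),
    (("noun", "noun"), False),
    (("adjective", "noun"), True),
    (("verb", "particle"), False),
]
rows = [{"left_pos": left, "right_pos": right} for (left, right), _ in examples]
labels = [label for _, label in examples]

tree = grow(rows, labels)
```

Calling `classify(tree, {"left_pos": "particle", "right_pos": "adjective"})` then applies the learned questions to a gap that did not appear in the training data. The order in which the tree asks its questions is determined by the data rather than by hand, which corresponds to the automatically obtained order of rule application described in the abstract.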
© The Association for Natural Language Processing