This paper proposes two improvements to a stochastic Japanese morphological analyzer: morpheme clustering and a refined unknown word model. For morpheme clustering, we propose a method that converts a morpheme-based n-gram model into a class-based n-gram model using a cross-entropy criterion. For the unknown word model, we propose a method for incorporating a given morpheme set, such as a dictionary, into the model. In experiments on the EDR corpus, we observed improvements in accuracy. The analyzer that adopts both methods achieved higher accuracy than a previously reported part-of-speech-based trigram model. In addition to these experiments, we compared our analyzer with an analyzer based on a grammarian's intuition. The results show that the error rate of the stochastic analyzer was significantly lower than that of the heuristic analyzer. The stochastic approach to Japanese morphological analysis thus has an advantage over the ad hoc approach, both in accuracy and in the ease of further systematic improvement.