Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Statistical Measure for Detecting Over-Segmentations in Results of Japanese Morphological Analysis
MASAO UTIYAMA

1999 Volume 6 Issue 7 Pages 3-28

Abstract

This paper proposes a statistical measure for detecting over-segmentations in the results of Japanese morphological analysis. An over-segmentation is a segmentation error in which a morphological analyzer splits a string at a position where it should not be split. Such a measure is useful because detected over-segmentations can be used to create error correction rules or to remove errors that remain in manually debugged corpora. The proposed measure is based on the ratio of the probability of a whole string to the probability of that string being segmented into two parts. The value of the measure is therefore high when a given string is rarely segmented into two parts, so an occurrence in which a highly rated string has been segmented is likely to be an over-segmentation. In the experiments, the measure detected over-segmentations in the results of rule-based morphological analyzers with high precision, and it also detected over-segmentations remaining in manually debugged corpora. These results show that the proposed measure is useful for developing high-quality Japanese morphological analyzers and for developing and debugging corpora.
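
The abstract describes the measure only as a ratio of two probabilities and does not give the estimator. The following is a minimal sketch of that idea, not the paper's formulation. It assumes both probabilities are estimated from raw counts in a segmented corpus: how often the concatenation of two adjacent tokens appears as a single token, against how often it appears split at that boundary. The function name `over_segmentation_scores` and the smoothing constant `alpha` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def over_segmentation_scores(sentences, alpha=0.5):
    """Score each adjacent token pair by a whole-vs-split count ratio.

    ASSUMPTION: this count-based estimator is illustrative only; the
    abstract does not specify how the probabilities are estimated.

    sentences: iterable of token lists (one list per segmented sentence).
    alpha: additive smoothing constant (hypothetical parameter).
    Returns a dict mapping (left, right) -> score.
    """
    whole = Counter()  # occurrences of each string as a single token
    split = Counter()  # occurrences of each adjacent token pair
    for tokens in sentences:
        whole.update(tokens)
        split.update(zip(tokens, tokens[1:]))

    scores = {}
    for (left, right), n_split in split.items():
        # How often the concatenated string appears unsegmented.
        n_whole = whole[left + right]
        # High when the string is usually kept whole and rarely split.
        scores[(left, right)] = (n_whole + alpha) / (n_split + alpha)
    return scores


# Tiny hand-made corpus: "東京都" is usually one token, but is split once.
corpus = [
    ["東京", "都"],
    ["東京都", "に", "住む"],
    ["東京都", "は"],
]
ranked = sorted(over_segmentation_scores(corpus).items(),
                key=lambda item: item[1], reverse=True)
```

Sorting the pairs by score in descending order puts the candidate over-segmentations first, so the top of `ranked` is what one would inspect by hand when deriving correction rules or cleaning a corpus.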

© The Association for Natural Language Processing