A Statistical Measure for Detecting Over-Segmentation in Morphological Analysis Results

内山 将夫

doi:10.5715/jnlp.6.7_3

Abstract

This paper proposes a statistical measure for detecting over-segmentations in the results of Japanese morphological analysis. An over-segmentation is a segmentation error in which a morphological analyzer places a boundary where none should occur. Such a measure is useful because detected over-segmentations can be used to create error-correction rules or to remove errors remaining in manually debugged corpora. The proposed measure is based on the ratio of the probability of a string occurring as a whole to the probability of that string being segmented into two parts. Its value is therefore high when a given string is rarely segmented into two parts, so a segmented string that the measure rates high is likely to contain an over-segmentation. In experiments, the measure detected over-segmentations in the output of rule-based morphological analyzers with high precision, and it also detected over-segmentations remaining in manually debugged corpora. These results show that the proposed measure is useful for developing high-quality Japanese morphological analyzers and for developing and debugging corpora.
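
The ratio described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula or its probability estimates, so everything here is an assumption made for illustration: the function names `boundary_counts` and `over_segmentation_score`, the use of raw counts from an already segmented corpus, and the additive smoothing constant `alpha`. The sketch counts how often the string `left + right` occurs with no boundary at the split point versus with a boundary there, and takes the smoothed ratio of the two counts.

```python
def boundary_counts(corpus, left, right):
    """Count occurrences of left+right without and with a boundary between them.

    corpus is a list of sentences, each a list of tokens (a segmented corpus).
    Returns (joined, split): joined counts occurrences with no token boundary
    at the left/right split point, split counts occurrences with one.
    """
    target = left + right
    split_at = len(left)
    joined = 0
    split = 0
    for tokens in corpus:
        text = "".join(tokens)
        # Character offsets at which a token boundary falls in this sentence.
        boundaries = set()
        pos = 0
        for tok in tokens:
            pos += len(tok)
            boundaries.add(pos)
        # Visit every occurrence of the target string, including overlapping ones.
        start = text.find(target)
        while start != -1:
            if start + split_at in boundaries:
                split += 1
            else:
                joined += 1
            start = text.find(target, start + 1)
    return joined, split


def over_segmentation_score(corpus, left, right, alpha=0.5):
    """Smoothed ratio of unsegmented to segmented occurrences (illustrative only)."""
    joined, split = boundary_counts(corpus, left, right)
    # alpha is a hypothetical smoothing constant that avoids division by zero.
    return (joined + alpha) / (split + alpha)


if __name__ == "__main__":
    # Toy corpus, invented for this example.
    toy_corpus = [
        ["東京", "都"],
        ["東京都", "に", "住む"],
        ["東京都", "は"],
        ["東京都", "庁"],
    ]
    print(boundary_counts(toy_corpus, "東京", "都"))
    print(over_segmentation_score(toy_corpus, "東京", "都"))
```

Under these assumptions, a pair that is usually written as one token gets a score above 1, so the rare occurrences where it is split are candidates for over-segmentation. The measure in the paper may use different probability estimates.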
