Morphological Analysis Based on a Maximum Entropy Model: An Approach to the Unknown Word Problem

内元 清貴; 関根 聡; 井佐原 均

doi:10.5715/jnlp.8.127

Abstract

Morphological analysis is one of the basic techniques used in Japanese sentence analysis. A morpheme is defined as a minimal grammatical unit, such as a word or a suffix. Morphological analysis is the process of segmenting a given sentence into a sequence of morphemes and assigning grammatical attributes, such as a part-of-speech (POS) and an inflection type, to each morpheme. One of the most important issues in morphological analysis has recently become how to deal with unknown words, that is, words that are not found in a dictionary or a training corpus. So far, there have been two main statistical approaches to this issue. One is to acquire unknown words from corpora and incorporate them into a dictionary. The other is to estimate a model that can recognize unknown words correctly. We would like to make good use of both approaches. If words acquired by the former method were added to a dictionary, and a model developed by the latter method could consult the amended dictionary, the resulting statistical model would have the potential to overcome the unknown word problem. In this paper, we propose a method for Japanese morphological analysis based on a maximum entropy (ME) model. The model can consult a dictionary containing a large amount of lexical information, and it can also recognize unknown words by learning their characteristics. To learn these characteristics, we focused on information such as the types of characters used in a string. On the Kyoto University corpus, the recall and precision for identifying morpheme segments together with their major parts-of-speech were 95.80% and 95.09%, respectively.
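The abstract mentions character-type information as a cue for recognizing unknown words. The following is a minimal Python sketch of what such features might look like; it is an illustration under stated assumptions, not the feature set used in the paper. The function names `char_type` and `char_type_features`, the category labels, and the Unicode ranges chosen are all hypothetical choices made for this example.

```python
from typing import Dict, Optional


def char_type(ch: str) -> str:
    """Return a coarse character class for one character.

    The categories and code point ranges are illustrative assumptions,
    not the classes defined in the paper.
    """
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs (basic block only)
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def char_type_features(s: str) -> Dict[str, Optional[object]]:
    """Summarize the character types used in a candidate string.

    Returns the type of the first and last character, the sequence of
    type changes (consecutive repeats collapsed), and the string length.
    """
    types = [char_type(c) for c in s]
    # Collapse runs of the same type, e.g. katakana x6 + kanji -> katakana-kanji
    collapsed = [t for i, t in enumerate(types) if i == 0 or t != types[i - 1]]
    return {
        "first": types[0] if types else None,
        "last": types[-1] if types else None,
        "transitions": "-".join(collapsed),
        "length": len(s),
    }
```

For example, `char_type_features("コンピュータ化")` summarizes a string of six katakana characters followed by one kanji. Features of this kind could be passed to a statistical model alongside dictionary lookups, although how the paper actually encodes them is not stated in the abstract.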
