Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Morphological Analysis Based on A Maximum Entropy Model
An Approach to The Unknown Word Problem
KIYOTAKA UCHIMOTO, SATOSHI SEKINE, HITOSHI ISAHARA
JOURNAL FREE ACCESS

2001 Volume 8 Issue 1 Pages 127-141

Abstract
Morphological analysis is one of the basic techniques used in Japanese sentence analysis. A morpheme is defined as the minimal grammatical unit, such as a word or a suffix. Morphological analysis is the process of segmenting a given sentence into a sequence of morphemes and assigning to each morpheme grammatical attributes such as a part-of-speech (POS) and an inflection type. Recently, one of the most important issues in morphological analysis has become how to deal with unknown words, that is, words which are not found in a dictionary or a training corpus. So far, there have been two main statistical approaches for coping with this issue. One is to acquire unknown words from corpora and incorporate them into a dictionary. The other is to estimate a model which can recognize unknown words correctly. We would like to make good use of both approaches. If words acquired by the former method were added to a dictionary, and a model developed by the latter method could consult the amended dictionary, then that model could be the best statistical model with the potential to overcome the unknown word problem. In this paper, we propose a method for Japanese morphological analysis based on a maximum entropy (ME) model. This method uses a model which can not only consult a dictionary with a large amount of lexical information but also recognize unknown words by learning certain characteristics. To learn these characteristics, we focused on information such as what types of characters are used in a string. The model has the potential to overcome the unknown word problem. The recall and precision of the identification of a morpheme segment and its major part-of-speech were 95.80% and 95.09%, respectively, when using the Kyoto University corpus.
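The abstract mentions character-type information as a cue for recognizing unknown words. As a rough illustration only, the following Python sketch shows how such cues might be extracted from a candidate string. The function names, the Unicode ranges used to classify characters, and the particular feature set are all assumptions made for this example; the abstract does not specify the features used in the paper's model.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify one character into a coarse script class (hypothetical scheme)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # basic CJK block only; a simplification
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alphabet"
    return "symbol"


def char_type_features(candidate: str) -> dict:
    """Return illustrative character-type features for a candidate morpheme."""
    if not candidate:
        raise ValueError("candidate string must be non-empty")
    types = [char_type(c) for c in candidate]
    # Collapse runs of the same type, e.g. kanji, hiragana, hiragana -> kanji>hiragana
    pattern = ">".join(t for t, _ in groupby(types))
    return {
        "first": types[0],
        "last": types[-1],
        "uniform": len(set(types)) == 1,
        "length": len(candidate),
        "pattern": pattern,
    }


if __name__ == "__main__":
    for word in ["インターネット", "食べる"]:
        print(word, char_type_features(word))
```

Features of this kind could be passed, alongside dictionary lookups, as binary indicators to a maximum entropy classifier that decides whether a string is a morpheme and which part-of-speech it carries. A string written entirely in katakana, for instance, is often a loanword noun, even if it is absent from the dictionary.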
© The Association for Natural Language Processing