Japanese Word Segmentation by a Character Class Model

小田 裕樹; 森 信介; 北 研二

doi:10.5715/jnlp.6.7_93

Abstract

Word segmentation, which divides an input sentence into words, is the most fundamental process in Japanese language processing. In this paper, we present a new method for Japanese word segmentation based on a character class model. The character class model is more robust than a character-based model because it has fewer parameters. Japanese characters are clustered using entropy as the criterion, measured on a corpus different from the one used for model estimation, and the search is based on a greedy algorithm. As a result, the clustering method yields an optimum character classification without requiring the number of classes to be specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than a character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
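The clustering criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a class bigram model with additive smoothing (the `alpha` parameter), a search that greedily merges pairs of classes, and hypothetical function names `cross_entropy` and `greedy_cluster`. The abstract specifies only that entropy is measured on a corpus separate from the estimation corpus and that the search is greedy. Word boundaries and the variable-length n-gram model are left out.

```python
import math
from collections import Counter
from itertools import combinations

BOS = "<s>"  # sentence-start context (an assumption of this sketch)

def cross_entropy(train, heldout, cls, alpha=0.5):
    """Per-character cross entropy (bits) of `heldout` under a class bigram
    model estimated from `train`. `cls` maps each character to a class id."""
    classes = set(cls.values())
    members = Counter(cls.values())      # number of characters in each class
    bigram, context = Counter(), Counter()
    class_count, char_count = Counter(), Counter()
    for sent in train:
        prev = BOS
        for ch in sent:
            k = cls[ch]
            bigram[prev, k] += 1
            context[prev] += 1
            class_count[k] += 1
            char_count[ch] += 1
            prev = k
    total, n = 0.0, 0
    for sent in heldout:
        prev = BOS
        for ch in sent:
            k = cls[ch]
            # P(class | previous class) * P(char | class), both smoothed
            p_class = (bigram[prev, k] + alpha) / (context[prev] + alpha * len(classes))
            p_char = (char_count[ch] + alpha) / (class_count[k] + alpha * members[k])
            total -= math.log2(p_class * p_char)
            n += 1
            prev = k
    return total / n

def greedy_cluster(train, heldout):
    """Start with one class per character and accept any pairwise merge that
    lowers held-out cross entropy; stop when no merge helps."""
    chars = sorted({ch for s in train + heldout for ch in s})
    cls = {ch: i for i, ch in enumerate(chars)}
    best = cross_entropy(train, heldout, cls)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(set(cls.values())), 2):
            trial = {ch: (a if k == b else k) for ch, k in cls.items()}
            h = cross_entropy(train, heldout, trial)
            if h < best:
                cls, best, improved = trial, h, True
                break  # restart the scan with the merged class set
    return cls, best
```

Because merging stops as soon as no candidate lowers the held-out entropy, the number of classes is a result of the search and not an input. This is the property the abstract highlights.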
