Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
A Japanese Word Segmenter by a Character Class Model
HIROKI ODA, SHINSUKE MORI, KENJI KITA
JOURNAL FREE ACCESS

1999 Volume 6 Issue 7 Pages 93-108

Abstract

Word segmentation, which divides an input sentence into words, is the most fundamental process in Japanese language processing. In this paper, we present a new method for Japanese word segmentation based on a character class model. The character class model is more robust than a character-based model because it has fewer parameters. Japanese characters are clustered using, as the criterion, the entropy measured on a corpus different from the one used for model estimation, and the search is carried out with a greedy algorithm. As a result, the clustering method gives an optimum character classification without requiring the number of classes to be specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than a character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
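
The clustering criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class bigram model, the additive smoothing constant `alpha`, the bottom-up pairwise merging, the toy strings, and the function names `cross_entropy` and `greedy_cluster` are all assumptions made for illustration. The only parts taken from the abstract are that entropy is measured on a corpus separate from the one used for estimation and that the search is greedy.

```python
import math
from collections import Counter


def cross_entropy(train, heldout, cls, alpha=0.5):
    """Per-character cross-entropy (bits) of `heldout` under a class bigram
    model estimated on `train`. `cls` maps each character to a class label.
    Additive smoothing with `alpha` is an assumption of this sketch."""
    n_classes = len(set(cls.values()))
    members = Counter(cls.values())            # characters per class
    char_cnt = Counter(train)                  # character counts
    class_cnt = Counter(cls[c] for c in train)
    prev_cnt = Counter(cls[a] for a in train[:-1])
    bigram_cnt = Counter((cls[a], cls[b]) for a, b in zip(train, train[1:]))

    total = 0.0
    for a, b in zip(heldout, heldout[1:]):
        ka, kb = cls[a], cls[b]
        # P(class of b | class of a), smoothed over all classes
        p_trans = (bigram_cnt[ka, kb] + alpha) / (prev_cnt[ka] + alpha * n_classes)
        # P(b | class of b), smoothed over the members of that class
        p_emit = (char_cnt[b] + alpha) / (class_cnt[kb] + alpha * members[kb])
        total -= math.log2(p_trans * p_emit)
    return total / (len(heldout) - 1)


def greedy_cluster(train, heldout):
    """Start with one class per character and repeatedly apply the single
    pairwise merge that lowers held-out cross-entropy the most. Stop when
    no merge helps, so the number of classes is never given as input."""
    cls = {c: c for c in set(train) | set(heldout)}
    best = cross_entropy(train, heldout, cls)
    while True:
        labels = sorted(set(cls.values()))
        candidate = None
        for i, x in enumerate(labels):
            for y in labels[i + 1:]:
                # Tentatively merge class y into class x.
                trial = {c: (x if k == y else k) for c, k in cls.items()}
                h = cross_entropy(train, heldout, trial)
                if h < best:
                    best, candidate = h, trial
        if candidate is None:
            return cls, best
        cls = candidate


if __name__ == "__main__":
    # Toy strings standing in for the estimation and held-out corpora.
    train_text = "abababcdcdcdabcd"
    heldout_text = "abcdabcdcdab"
    classes, entropy = greedy_cluster(train_text, heldout_text)
    print(len(set(classes.values())), "classes, held-out entropy", round(entropy, 3))
```

Because a merge is accepted only when it lowers the entropy of the held-out text, the search stops on its own, which is one way to read the abstract's remark that the number of classes need not be specified.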

© The Association for Natural Language Processing