Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Paper
Automatic Word Segmentation using Three Types of Dictionaries
Shinsuke Mori, Hiroki Oda

2011 Volume 18 Issue 2 Pages 139-152

Abstract
In this paper we propose a new method for automatically segmenting a Japanese sentence into a word sequence. The main advantage of our method is that the segmenter, built on a maximum entropy framework, can refer to a list of compound words, i.e. word sequences without boundary information. This yields higher segmentation accuracy in the many real situations where the only available electronic dictionaries have entries that are inconsistent with the word segmentation standard. Our method can also exploit a list of word sequences, which allows a far greater accuracy gain at low manual annotation cost.
We prepared segmented corpora, a compound word list, and a word sequence list, and then conducted experiments comparing automatic word segmenters that refer to various types of dictionaries. The results showed that the proposed word segmenter can exploit lists of compound words and word sequences to achieve higher accuracy in realistic situations.
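As an illustration only, the sketch below shows one way the three dictionary types could supply evidence at a single decision point between two characters. It is not the feature set used in the paper. The pointwise formulation, the feature names, the `|` separator for word sequences, and the choice to expand word sequences into their component words are all assumptions made for this example. The maximum entropy classifier that would consume such features is omitted.

```python
from typing import Dict, Iterable, Set


def occurrences(text: str, entry: str) -> Iterable[int]:
    """Yield every start offset at which entry occurs in text (overlaps included)."""
    start = text.find(entry)
    while start != -1:
        yield start
        start = text.find(entry, start + 1)


def dictionary_features(text: str, i: int, entries: Set[str],
                        name: str, interior_known: bool) -> Dict[str, bool]:
    """Hypothetical features for the point between text[i-1] and text[i].

    interior_known is True for a word dictionary (a point strictly inside an
    entry is evidence against a boundary) and False for a compound word list,
    whose internal boundaries are unknown.
    """
    feats = {f"{name}:starts_here": False, f"{name}:ends_here": False}
    if interior_known:
        feats[f"{name}:inside"] = False
    for entry in entries:
        for start in occurrences(text, entry):
            end = start + len(entry)
            if start == i:
                feats[f"{name}:starts_here"] = True
            if end == i:
                feats[f"{name}:ends_here"] = True
            if interior_known and start < i < end:
                feats[f"{name}:inside"] = True
    return feats


def expand_word_sequences(sequences: Set[str], sep: str = "|") -> Set[str]:
    """Assumed treatment: split each annotated word sequence into its words."""
    return {w for seq in sequences for w in seq.split(sep) if w}


def point_features(text: str, i: int, words: Set[str],
                   compounds: Set[str], sequences: Set[str]) -> Dict[str, bool]:
    """Combine evidence from all three dictionary types for one decision point."""
    all_words = set(words) | expand_word_sequences(sequences)
    feats = dictionary_features(text, i, all_words, "word", True)
    feats.update(dictionary_features(text, i, compounds, "compound", False))
    return feats


if __name__ == "__main__":
    # Toy data invented for this example, not taken from the paper.
    text = "京都大学病院"
    for i in range(1, len(text)):
        print(i, point_features(text, i, {"病院"}, {"大学病院"}, {"京都|大学"}))
```

In this toy setup the compound entry contributes only start and end evidence, while points inside it stay undecided unless a word entry or an expanded word sequence covers them. That reflects the distinction the abstract draws between entries with and without boundary information.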
© 2011 The Association for Natural Language Processing