Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Paper
Language Model Improvement by a Pseudo-Stochastically Segmented Corpus
Shinsuke Mori, Hiroki Oda

2009 Volume 16 Issue 5 Pages 5_7-5_21

Abstract
Language model (LM) building requires a corpus whose sentences are segmented into words. For languages in which words are not delimited by whitespace, an automatic word segmenter built from a general-domain corpus is used. Automatically segmented sentences, however, contain many segmentation errors, especially around words and expressions belonging to the target domain. To cope with segmentation errors, the concept of stochastic segmentation has been proposed. In this framework, a corpus is annotated with word boundary probabilities, i.e., the probability that a word boundary exists between each pair of adjacent characters. In this paper, we first propose a method for estimating word boundary probabilities based on a maximum entropy model. We then propose a method for simulating a stochastically segmented corpus with a segmented corpus and show that the computational cost is drastically reduced without performance degradation.
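The following is a minimal sketch, not the authors' implementation, of the two ideas summarized above: a maximum-entropy-style binary classifier (here scikit-learn's LogisticRegression, which is equivalent to a binary maximum entropy model) assigns each character gap a word-boundary probability from simple character-window features, and hard boundaries are then sampled from those probabilities to obtain one pseudo-stochastically segmented version of a sentence. The feature template and the use of scikit-learn are illustrative assumptions; the paper's actual features and estimation details may differ.

    # Sketch only: maximum-entropy-style word boundary probabilities plus
    # sampling of a pseudo-stochastic segmentation. Features are assumed.
    import random
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression


    def gap_features(chars, i):
        """Features for the gap between chars[i] and chars[i + 1]."""
        left, right = chars[i], chars[i + 1]
        return {"left=" + left: 1, "right=" + right: 1, "pair=" + left + right: 1}


    def train_boundary_model(segmented_sentences):
        """Train on sentences given as lists of words (general-domain corpus)."""
        feats, labels = [], []
        for words in segmented_sentences:
            chars = list("".join(words))
            # Gold boundaries lie after the last character of each non-final word.
            boundaries, pos = set(), 0
            for w in words[:-1]:
                pos += len(w)
                boundaries.add(pos - 1)
            for i in range(len(chars) - 1):
                feats.append(gap_features(chars, i))
                labels.append(1 if i in boundaries else 0)
        vec = DictVectorizer()
        model = LogisticRegression(max_iter=1000)
        model.fit(vec.fit_transform(feats), labels)
        return vec, model


    def boundary_probabilities(sentence, vec, model):
        """Word-boundary probability for every gap between adjacent characters."""
        chars = list(sentence)
        X = vec.transform([gap_features(chars, i) for i in range(len(chars) - 1)])
        return model.predict_proba(X)[:, 1]


    def pseudo_stochastic_segment(sentence, probs, rng=random):
        """Sample one hard segmentation from the boundary probabilities."""
        words, start = [], 0
        for i, p in enumerate(probs):
            if rng.random() < p:
                words.append(sentence[start:i + 1])
                start = i + 1
        words.append(sentence[start:])
        return words

Sampling several such segmentations of a target-domain corpus approximates training on the stochastically segmented corpus while keeping ordinary, hard-segmented LM training pipelines unchanged.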
© 2009 The Association for Natural Language Processing