Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Language Model Adaptation with a Word List and a Raw Corpus
SHINSUKE MORI
JOURNAL FREE ACCESS

2006 Volume 13 Issue 4 Pages 33-47

Abstract
In this paper, we discuss methods for adapting a stochastic language model given a word list and a raw corpus. In this situation, the usual approach is to segment the raw corpus with a word segmenter equipped with the word list, correct the word boundaries in the output sentences by hand, and build a model from the resulting segmented corpus. In this sentence-by-sentence error correction, however, the annotator encounters points that are difficult to judge, which lowers productivity. In addition, it is not clear that correcting errors sentence by sentence from the beginning of the corpus is the best way to use a limited workforce. We propose instead to take the word as the unit of correction and to concentrate corrections on the positions where words in the list appear. This method avoids the above difficulty and directly captures the statistical behavior of specific words in the application field. In the experiments, we compared language models built from the corpora by several methods in terms of predictive power and Kana-kanji conversion accuracy. The results showed that by concentrating error correction around the words in the list, we can build a better language model with less effort.
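As a rough illustration of the word-centred correction idea described in the abstract, the sketch below locates every position in an unsegmented corpus where a word from the list occurs, so that an annotator could check only those positions. It is not the paper's implementation: the function name `find_correction_sites`, the plain substring matching, and the toy ASCII data are all assumptions made for illustration.

```python
from typing import List, Tuple

def find_correction_sites(
    sentences: List[str], word_list: List[str]
) -> List[Tuple[int, int, int, str]]:
    """Return (sentence index, start, end, word) for each occurrence of a listed word.

    Hypothetical helper: plain substring search stands in for whatever
    matching a real word segmenter would perform.
    """
    sites = []
    for i, sent in enumerate(sentences):
        for word in word_list:
            if not word:
                continue  # skip empty entries, which would match everywhere
            start = sent.find(word)
            while start != -1:
                sites.append((i, start, start + len(word), word))
                # advance by one character so overlapping occurrences are found
                start = sent.find(word, start + 1)
    return sorted(sites)

# Toy, unsegmented "sentences" and a two-entry word list (illustrative only).
raw_corpus = ["abcab", "xyz", "cab"]
word_list = ["ab", "z"]
sites = find_correction_sites(raw_corpus, word_list)
```

Each returned tuple marks one candidate position; under this sketch the annotator would confirm or reject the word boundaries at those positions only, instead of reading every sentence from the start of the corpus.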
© The Association for Natural Language Processing