In this paper, we discuss stochastic language model adaptation methods given a word list and a raw corpus. In this situation, a general method is to segment the raw corpus with a word segmenter equipped with the word list, manually correct the output sentences annotated with word boundary information, and build a model from the segmented corpus. In this sentence-by-sentence error correction method, however, the annotator encounters difficult passages, which decreases productivity. In addition, it is not clear that sentence-by-sentence error correction from the beginning is the best way to deploy a limited workforce. In this paper, we propose to take the word as the unit of correction and to concentrate the correction effort on the positions where words in the list appear. This method allows us to avoid the above difficulty and to directly capture the statistical behavior of the specific words in the application field. In the experiments, we compared the language models built by several methods from the corpora in terms of predictive power and kana-kanji conversion accuracy. The results showed that by concentrating the error correction around the words in the list, we can build a better language model with less effort.