Abstract
Language model (LM) building requires a corpus whose sentences are segmented into words. For languages in which words are not delimited by whitespace, an automatic word segmenter built from a general-domain corpus is used. Automatically segmented sentences, however, contain many segmentation errors, especially around words and expressions belonging to the target domain. To cope with these segmentation errors, the concept of stochastic segmentation has been proposed. In this framework, a corpus is annotated with word boundary probabilities, i.e., the probability that a word boundary exists between two characters. In this paper, we first propose a method for estimating word boundary probabilities based on a maximum entropy model. Next, we propose a method for simulating a stochastically segmented corpus with a segmented corpus, and show that the computational cost is drastically reduced without performance degradation.