This paper deals with a word-based language model of Japanese. In Japanese, word boundaries are not clearly defined, and grammatical units do not necessarily coincide with the units that humans intuitively perceive. For accurate segmentation it is therefore necessary to create a vocabulary set that covers human utterance units. In our word-segmentation method, a word-boundary model is described by morphological parameters (i.e., parts of speech), which are learned by comparing the results of human segmentation with the output of a Japanese morphological analyzer. Then, using the model and pseudo-random numbers, the method decides whether each morpheme transition is a word boundary. As a result, we automatically obtain a vocabulary set and training data for a Japanese language model. In our experiments using articles from three newspapers and texts posted to network-based forums, about 44,000 words cover 94-98% of all words in the test data, and the average number of words per sentence is 12-19% smaller than the average number of morphemes. The parameters of the word-segmentation model and of the language model differ considerably between newspaper articles and forum texts. However, the difference lies not in the probabilities of events common to both, but in the kinds of events that occur. Therefore, a language model built from both newspaper articles and forum texts gave satisfactory results on both test sets.
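The segmentation step described above can be illustrated with a minimal sketch. It assumes the boundary model is a table of probabilities indexed by the part-of-speech pair on either side of a morpheme transition, estimated by relative frequency from human-segmented data. The abstract does not give the actual parameterization, so the function names, the pair-based table, and the default of treating unseen pairs as boundaries are all hypothetical.

```python
import random
from collections import defaultdict


def learn_boundary_model(samples):
    """Estimate P(boundary | left POS, right POS) by relative frequency.

    samples: iterable of (left_pos, right_pos, is_boundary) triples, where
    is_boundary says whether the human segmentation put a word boundary at
    that morpheme transition. (Hypothetical input format.)
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [boundary count, total]
    for left_pos, right_pos, is_boundary in samples:
        entry = counts[(left_pos, right_pos)]
        entry[0] += int(is_boundary)
        entry[1] += 1
    return {pair: b / t for pair, (b, t) in counts.items()}


def segment(morphemes, model, rng, default=1.0):
    """Merge morphemes into words by sampling each transition.

    morphemes: list of (surface, pos) pairs from a morphological analyzer.
    rng: a random.Random instance supplying the pseudo-random numbers.
    default: boundary probability for POS pairs unseen in training
             (assumption: keep the morpheme boundary).
    """
    if not morphemes:
        return []
    words = [morphemes[0][0]]
    for (_, left_pos), (surface, right_pos) in zip(morphemes, morphemes[1:]):
        p = model.get((left_pos, right_pos), default)
        if rng.random() < p:
            words.append(surface)   # transition sampled as a word boundary
        else:
            words[-1] += surface    # transition is word-internal: merge
    return words
```

For example, with a model in which noun-particle transitions are never boundaries and particle-verb transitions always are, `segment([("tokyo", "noun"), ("ni", "particle"), ("iku", "verb")], model, random.Random(0))` merges the first two morphemes into one word and keeps the verb separate. Running this over a whole corpus would yield both a vocabulary and word-segmented training text.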