Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Mixture Probabilistic Context-Free Grammar
An Improvement of a Probabilistic Context-Free Grammar Using Cluster-Based Language Modeling
Kenji Kita

1996 Volume 3 Issue 4 Pages 103-113

Abstract
This paper proposes an improved probabilistic CFG (Context-Free Grammar), called the mixture probabilistic CFG, based on the idea of cluster-based language modeling. This model assumes that the language model parameters have different probability distributions in different topics or domains. In order to perform topic- or domain-dependent language modeling, we first divide the training corpus into a number of subcorpora according to their topics or domains, and then estimate a separate probability distribution from each subcorpus. A mixture probabilistic CFG therefore has several different probability distributions for the CFG productions. The language model probability of a sentence is calculated as a mixture of these probability distributions. The mixture probabilistic CFG enables us to build a context- or topic-dependent language model, and thus more accurate language modeling becomes possible. The proposed model was evaluated by calculating test-set perplexity using the ADD (ATR Dialogue Database) corpus and a Japanese intra-phrase grammar. The mixture probabilistic CFG had a test-set perplexity of 2.47 per phone, while the simple probabilistic CFG had a test-set perplexity of 2.77 per phone. We also conducted speech recognition experiments using three language models: a pure CFG (without probabilities), a simple probabilistic CFG, and the mixture probabilistic CFG. In our experiments, the mixture probabilistic CFG attained the best performance. The proposed model was also evaluated using sentence-level clustering. This evaluation used a dialogue corpus in which each utterance is annotated with an utterance type called an IFT (Illocutionary Force Type). Using these IFTs, we divided the corpus into 9 clusters and then estimated production probabilities from these clusters. Without IFT clustering, the perplexity was 2.18 per phone, but with IFT clustering, it was reduced to 1.82 per phone.
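The mixture computation described above can be written schematically as follows. This is a minimal sketch; the notation and the form of the mixture weights are assumed here rather than quoted from the paper:

\[
P(s) \;=\; \sum_{i=1}^{k} \lambda_i \, P_i(s), \qquad \sum_{i=1}^{k} \lambda_i = 1,
\]

where \(P_i\) denotes the probabilistic CFG whose production probabilities were estimated from the \(i\)-th subcorpus (topic, domain, or IFT cluster), and \(\lambda_i\) is the mixture weight assigned to that cluster.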
© The Association for Natural Language Processing