2017 Volume 12 Pages 291-315
Language modeling is a fundamental research problem with wide application across NLP tasks. To estimate the probabilities of natural language sentences, most research on language modeling uses n-gram based approaches to factor sentence probabilities. However, the assumption underlying n-gram models is not robust enough to cope with the data sparseness problem, which degrades the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, in which different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram model, which factors sentence probability strictly from left to right, our model factors it using a more flexible strategy. For evaluation, we compare our rearranged word sequences with normal n-gram word sequences. Both intrinsic and extrinsic experiments show that our language model achieves better performance, indicating that our method can be considered a better alternative to n-gram language models.