In this paper, we propose a new framework for statistical language modeling that integrates syntactic statistics and lexical statistics. Our model consists of two submodels: the syntactic model and the lexical model. The syntactic model reflects syntactic statistics, such as structural preferences, whereas the lexical model reflects lexical statistics, such as the occurrence of each word and word collocations. One characteristic of our model is that it learns the two types of statistics separately, whereas many previous models learn them simultaneously. Learning each submodel separately enables us to use a different language source for each submodel and makes each submodel's behavior much easier to understand. We conducted a preliminary experiment in which our model was applied to the disambiguation of dependency structures of Japanese sentences. The syntactic model achieved 73.38% in bunsetsu phrase accuracy, 11.70 points above the baseline, and incorporating the lexical model with the syntactic model yielded a further 10.96-point gain, to 84.34%. Thus the contribution of lexical statistics to disambiguation is as great as that of syntactic statistics in our framework.
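To illustrate the kind of combination the abstract describes, the following is a minimal sketch, not the authors' actual method: two independently trained submodels each assign a probability to a candidate dependency structure, and the candidates are ranked by the product of the two scores (computed in log space). All structure names and probability values below are invented for illustration.

```python
import math

# Hypothetical scores for two candidate dependency structures of the
# same sentence. The syntactic submodel scores structural preferences;
# the lexical submodel scores word-dependency plausibility. The numbers
# are made up purely to show the combination step.
syntactic_model = {"structure_A": 0.6, "structure_B": 0.4}
lexical_model = {"structure_A": 0.2, "structure_B": 0.7}

def combined_log_score(structure: str) -> float:
    """Combine the separately learned submodels by adding log probabilities."""
    return math.log(syntactic_model[structure]) + math.log(lexical_model[structure])

# The syntactic model alone prefers structure_A (0.6 > 0.4), but the
# lexical evidence reverses the ranking: log(0.4 * 0.7) > log(0.6 * 0.2).
best = max(syntactic_model, key=combined_log_score)
print(best)  # structure_B
```

Because each submodel is trained on its own data source, either factor can be swapped or retrained without touching the other, which is the separability property the abstract emphasizes.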