Toward the realization of a natural language understanding system for clinical records, the authors have analyzed a large number of discharge summaries (a kind of clinical record). Many Japanese compound nouns appear in these records as a result of ellipsis, so it is essential for the understanding system to cope with them. This paper describes a system that paraphrases compound nouns by restoring their elided constructions using their semantic categories (Yokota, Nishimura, Shiraishi and Ryu 1994) according to the Mental-image directed semantic theory (Yokota 1988; Yokota, Shiraishi, Ryu, and Oda 1991b). The system consists of four major processors: “Word segmentation processor,” “Restoration processor,” “Hierarchical relation detector” and “Sentence generator,” and it has two dictionaries: “Word dictionary” and “Hierarchy dictionary.” The former assigns a semantic category and other information to each noun, and the latter contains the hierarchical relations among the concepts of objects (one of the semantic categories of nouns). Experimental results show the system to be fairly successful.
We address the problem of automatically constructing a thesaurus (hierarchically clustering words) based on corpus data. We view the problem of clustering words as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose an estimation algorithm using simulated annealing with an energy function based on the Minimum Description Length (MDL) Principle. We empirically compared the performance of our method based on the MDL Principle against a method based on the Maximum Likelihood Estimator, and found that the former outperforms the latter. We also evaluated the method by conducting pp-attachment disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that we can improve accuracy in disambiguation by using such a thesaurus.
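The MDL-based energy function mentioned in the preceding abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hard-clustering model in which each (noun class, verb class) cell has a probability estimated by relative frequency, words are uniformly distributed within their classes, and the model cost is (k/2) log2 N bits for k free parameters. The function name `description_length` and the toy data are hypothetical.

```python
import math
from collections import Counter


def description_length(pairs, noun_part, verb_part):
    """Total description length in bits of the data under a joint partition.

    pairs     -- list of observed (noun, verb) co-occurrences
    noun_part -- dict mapping each noun to a class id
    verb_part -- dict mapping each verb to a class id
    Assumption: P(n, v) = P(Cn, Cv) / (|Cn| * |Cv|), i.e. uniform within a class.
    """
    n_total = len(pairs)
    noun_sizes = Counter(noun_part.values())   # class id -> number of nouns
    verb_sizes = Counter(verb_part.values())   # class id -> number of verbs
    cell_counts = Counter((noun_part[n], verb_part[v]) for n, v in pairs)

    # Data description length: negative log-likelihood of the observations.
    data_bits = 0.0
    for (cn, cv), count in cell_counts.items():
        p_cell = count / n_total
        p_pair = p_cell / (noun_sizes[cn] * verb_sizes[cv])
        data_bits -= count * math.log2(p_pair)

    # Model description length: (k / 2) * log2(N), with k free cell probabilities.
    free_params = len(noun_sizes) * len(verb_sizes) - 1
    model_bits = 0.5 * free_params * math.log2(n_total)
    return data_bits + model_bits


# Toy co-occurrence data (hypothetical).
PAIRS = [("wine", "drink"), ("beer", "drink"),
         ("bread", "eat"), ("rice", "eat")] * 2
VERBS = {"drink": 0, "eat": 1}
FINE = {"wine": 0, "beer": 1, "bread": 2, "rice": 3}      # one class per noun
GROUPED = {"wine": 0, "beer": 0, "bread": 1, "rice": 1}   # beverages vs. foods
```

On this toy data the grouped partition costs 20.5 bits and the one-class-per-noun partition 26.5 bits, so the simpler model would be preferred. A simulated annealing search would use this value as the energy to minimize while proposing changes to the partitions.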
Word sense disambiguation has recently been approached with corpus-based methods, reflecting the growth in the number of machine-readable texts. One category of approaches disambiguates the sense of an input verb based on the similarity between its governing case fillers and those in given examples. In this paper, we introduce the degree of case contribution to verb sense disambiguation into this existing method. Under this notion, the more diverse the semantic range of a case's filler examples, the more that case contributes to verb sense disambiguation. We also report the result of a comparative experiment, in which disambiguation performance is improved by considering this notion of semantic contribution.
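The idea of weighting each case by its contribution can be sketched as follows. This is only an illustration under stated assumptions, not the measure defined in the paper. The contribution of a case is approximated by the average pairwise disjointness (one minus Jaccard overlap) of its filler sets across senses, similarity is a toy exact-match test, and all names and example data are hypothetical.

```python
from itertools import combinations

# Toy example database (hypothetical):
# verb sense -> case marker -> set of filler semantic classes.
EXAMPLES = {
    "take_transport": {"subj": {"human"}, "obj": {"vehicle"}},
    "take_medicine": {"subj": {"human"}, "obj": {"drug"}},
}


def case_contribution(examples, case):
    """Stand-in contribution weight: average disjointness of filler sets across senses."""
    sets = [fillers.get(case, set()) for fillers in examples.values()]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        if union:  # skip pairs with no examples at all for this case
            total += 1.0 - len(a & b) / len(union)
    return total / len(pairs)


def similarity(filler, example_fillers):
    """Toy similarity: 1.0 on an exact class match, otherwise 0.0."""
    return 1.0 if filler in example_fillers else 0.0


def disambiguate(examples, input_fillers):
    """Score each sense by the contribution-weighted similarity of its case fillers."""
    weights = {case: case_contribution(examples, case) for case in input_fillers}
    norm = sum(weights.values()) or 1.0  # avoid division by zero
    scores = {}
    for sense, fillers in examples.items():
        weighted = sum(
            weights[case] * similarity(filler, fillers.get(case, set()))
            for case, filler in input_fillers.items()
        )
        scores[sense] = weighted / norm
    best = max(scores, key=scores.get)
    return best, scores
```

In the toy data the subject case has identical fillers for both senses, so it receives zero weight, and the decision rests entirely on the object case. A realistic system would replace the exact-match similarity with a thesaurus-based measure.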