1998, Volume 13, Issue 3, pp. 470-479
This paper examines, analytically and experimentally, the optimal structure of a large-scale knowledge base of words that is automatically constructed from machine-readable dictionaries. In this knowledge base, each word is represented by a series of weighted keywords. Each keyword is related to the word in some way, and its weight represents the strength of that relationship. In constructing this kind of knowledge base, it is important to select the optimal set of keywords used to represent every word, judged by how well the knowledge base measures the semantic similarity between words. Our analysis, using a simplified model of the knowledge base based on probability theory, has shown that a smaller keyword set drawn from higher-level keywords in the conceptual hierarchy becomes optimal as the size of the knowledge base grows, that is, as the total number of words in it or the average number of keywords per word becomes large. In addition, an experiment using six knowledge bases derived from a previously constructed knowledge base of 40,000 everyday Japanese words has verified the existence of an optimal keyword set. This means that the above analysis is useful in the design of any knowledge base in which each word is represented by a vector. We have also found, from both a subjective evaluation based on human judgment and a newly proposed objective evaluation using a published synonym dictionary, that a set of about 2,000 keywords is optimal for constructing a knowledge base of this size.
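To make the representation concrete, the following is a minimal sketch of a word knowledge base in which each word is a set of weighted keywords, with similarity computed between two such vectors. The abstract does not name the similarity measure used, so cosine similarity is an assumption here, and the words, keywords, and weights in `KNOWLEDGE_BASE` are hypothetical toy values rather than data from the paper.

```python
import math

# Hypothetical toy knowledge base: word -> {keyword: weight}.
# The entries and weights are illustrative only, not taken from the paper.
KNOWLEDGE_BASE = {
    "dog": {"animal": 0.9, "pet": 0.7},
    "cat": {"animal": 0.9, "pet": 0.8},
    "car": {"vehicle": 0.9, "machine": 0.6},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse weighted-keyword vectors.

    Assumption: cosine is used as the similarity measure; the source
    abstract does not specify which measure the authors use.
    """
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def word_similarity(word_a, word_b, kb=KNOWLEDGE_BASE):
    """Look up both words and compare their keyword vectors."""
    return cosine_similarity(kb[word_a], kb[word_b])
```

In this toy setup, `word_similarity("dog", "cat")` is high because the two words share both keywords, while `word_similarity("dog", "car")` is zero because they share none. This illustrates why the choice of keyword set matters: words can only be judged similar through keywords they have in common.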