2004 年 19 巻 6 号 p. 502-510
This paper proposes a new method for selecting fundamental vocabulary. We are presently constructing the Fundamental Vocabulary Knowledge-base of Japanese that contains integrated information on syntax, semantics and pragmatics, for the purposes of advanced natural language processing. This database mainly consists of a lexicon and a treebank: Lexeed (a Japanese Semantic Lexicon) and the Hinoki Treebank. Fundamental vocabulary selection is the first step in the construction of Lexeed. The vocabulary should include sufficient words to describe general concepts for self-expandability, and should not be prohibitively large to construct and maintain. There are two conventional methods for selecting fundamental vocabulary. The first is intuition-based selection by experts. This is the traditional method for making dictionaries. A weak point of this method is that the selection strongly depends on personal intuition. The second is corpus-based selection. This method is superior in objectivity to intuition-based selection, however, it is difficult to compile a sufficiently balanced corpora. We propose a psychologically-motivated selection method that adopts word familiarity as the selection criterion. Word familiarity is a rating that represents the familiarity of a word as a real number ranging from 1 (least familiar) to 7 (most familiar). We determined the word familiarity ratings statistically based on psychological experiments over 32 subjects. We selected about 30,000 words as the fundamental vocabulary, based on a minimum word familiarity threshold of 5. We also evaluated the vocabulary by comparing its word coverage with conventional intuition-based and corpus-based selection over dictionary definition sentences and novels, and demonstrated the superior coverage of our lexicon. Based on this, we conclude that the proposed method is superior to conventional methods for fundamental vocabulary selection.