Abstract
To reduce the dimension of the VSM (Vector Space Model) for information retrieval and clustering, this paper proposes a new method, Semantic-VSM, which uses the Semantic Attribute System defined by “A-Japanese-Lexicon” in place of the literal words used in conventional VSM. The attribute system is a tree structure of 2,710 attributes that covers 400 thousand literal words. With this attribute system, vector elements can be generalized easily based on the upper-lower relationships among semantic attributes, so the dimension can be reduced at very low cost. Synonyms are automatically accounted for through semantic attributes, which improves the recall of retrieval systems. Experiments on the BMIR-J2 database of 5,079 newspaper articles showed that the dimension can be reduced from 2,710 to 300 or 600 with only a small degradation in performance. The method also showed high recall compared with conventional VSM.
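The generalization idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy tree `PARENT`, the word-to-attribute map `LEXICON`, and the depth-cutoff rule in `generalize` are hypothetical and are not taken from “A-Japanese-Lexicon” or from the paper's actual reduction procedure. The sketch only shows how replacing words with attributes, and attributes with their ancestors, shrinks the vector space and lets synonyms match.

```python
import math
from collections import Counter

# Hypothetical toy attribute tree: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "bird": "animal",
    "vehicle": "artifact",
}

# Hypothetical lexicon: literal word -> semantic attribute.
# Synonyms map to the same attribute.
LEXICON = {
    "puppy": "dog",
    "hound": "dog",
    "sparrow": "bird",
    "car": "vehicle",
    "automobile": "vehicle",
}


def depth(attr):
    """Number of edges between an attribute and the root."""
    d = 0
    while PARENT[attr] is not None:
        attr = PARENT[attr]
        d += 1
    return d


def generalize(attr, max_depth):
    """Replace an attribute by its ancestor at depth max_depth (assumed rule)."""
    while depth(attr) > max_depth:
        attr = PARENT[attr]
    return attr


def attribute_vector(words, max_depth):
    """Count generalized attributes for the words found in the lexicon."""
    return Counter(
        generalize(LEXICON[w], max_depth) for w in words if w in LEXICON
    )


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0
```

In this toy setting, "puppy" and "hound" fall on the same attribute and therefore match even though the literal words differ, and lowering `max_depth` merges "dog" and "bird" into "animal", which reduces the number of distinct vector elements.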