A New Document Clustering Method Based on Comparative Advantage

Jie Ji; Qiangfu Zhao; Ryouhei Shindo; Yousuke Kunishi

doi:10.14864/softscis.2008.0.1084.0

Abstract

Document clustering is the process of partitioning a set of unlabelled documents into categories, or clusters. For the clustering results to be useful in analyzing the documents, all documents in a cluster are expected to share some concept, which is often represented by the cluster centroid. K-means is a well-known algorithm for unsupervised clustering that partitions the document set so as to minimize the mean squared error (MSE). Intuitively, however, the centroid may not represent a concept clearly, because it is just the average of all documents in the cluster. To represent a cluster more clearly, each cluster should have a small set of representative key terms. Although many document clustering methods have been proposed in the literature, few of them deal with key terms explicitly. In this study, we propose a new method for classifying documents based on the concept of comparative advantage, and a new clustering algorithm for extracting important key terms. Experimental results show that the proposed method generates better results, in the sense that the overlap between the sets of representative terms of the clusters is smaller.
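As a rough illustration of the quantities discussed above, the following sketch computes cluster centroids as plain averages of term-weight vectors, takes the highest-weighted terms of each centroid as its representative key terms, and measures how much those term sets overlap. It is not the proposed comparative-advantage method, which the abstract does not specify. The function names, the toy vocabulary and weights, and the use of Jaccard overlap as the overlap measure are all assumptions made for illustration.

```python
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of equal-length term-weight vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def top_terms(center, vocab, k):
    """Return the k vocabulary terms with the largest centroid weight."""
    ranked = sorted(range(len(vocab)), key=lambda i: center[i], reverse=True)
    return {vocab[i] for i in ranked[:k]}


def mean_pairwise_overlap(term_sets):
    """Average Jaccard overlap over all pairs of term sets (an assumed measure)."""
    pairs = list(combinations(term_sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: two clusters of term-count vectors over four terms.
vocab = ["model", "data", "market", "price"]
clusters = [
    [[3, 2, 0, 0], [2, 2, 0, 1]],
    [[0, 2, 3, 2], [0, 2, 3, 1]],
]

centers = [centroid(docs) for docs in clusters]
key_terms = [top_terms(center, vocab, k=2) for center in centers]
overlap = mean_pairwise_overlap(key_terms)
print(key_terms, overlap)
```

In this toy case the term "data" ranks highly in both centroids, so the two key-term sets share one term out of three distinct terms. This is the kind of overlap that averaging-based centroids can produce and that the abstract reports the proposed method reduces.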
