Clustering Large Sparse Text Data: A Comparative Advantage Approach

Jie Ji; Tony Y. T. Chan; Qiangfu Zhao

doi:10.2197/ipsjjip.18.242

Abstract

Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents within each cluster share common concepts. To make the clusters easy to analyze, it is convenient to represent these concepts by key terms. However, when terms are used as features, the text data are represented in a very high-dimensional vector space, and the computational cost is high. Text data are also highly sparse, and not all weights in the cluster centers are important for classification. Based on this observation, we propose a comparative advantage-based clustering algorithm that identifies the relative strengths of the clusters and then preserves and enlarges those strengths. Because the vectors are represented by term frequency, the clustering results are more comprehensible than those obtained with dimensionality reduction methods. Experimental results show that the proposed algorithm retains the characteristics of the k-means algorithm at a much lower computational cost. Moreover, we found that the proposed method has a higher chance of obtaining better results.
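The sparsity observation in the abstract can be illustrated with a minimal sketch. It is not the proposed clustering algorithm, which the abstract does not specify. It only shows, on a toy corpus invented for illustration, how term-frequency vectors become high-dimensional and mostly zero. The function names `term_frequency_vectors` and `sparsity` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def term_frequency_vectors(docs):
    """Return (vocabulary, vectors), one sparse term-frequency Counter per document.

    Tokenization is a plain lowercase whitespace split (an assumption for this sketch).
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*vectors))
    return vocabulary, vectors


def sparsity(vocabulary, vectors):
    """Fraction of zero entries in the implied documents-by-terms matrix."""
    total = len(vocabulary) * len(vectors)
    if total == 0:
        return 0.0
    nonzero = sum(len(vector) for vector in vectors)
    return 1.0 - nonzero / total


# Toy corpus, invented for illustration only.
docs = [
    "clustering groups similar documents",
    "sparse vectors store few nonzero weights",
    "documents become sparse term vectors",
]
vocabulary, vectors = term_frequency_vectors(docs)
print(len(vocabulary), round(sparsity(vocabulary, vectors), 3))
```

Each document stores only its own nonzero term counts, so memory and distance computations can scale with the number of terms actually present rather than with the vocabulary size.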
