Abstract
Document clustering is the process of partitioning a
set of unlabelled documents into categories, or clusters.
For the clustering results to support analysis of the documents,
all documents in a cluster are expected to share some
concept. This shared concept is often represented by the cluster centroid.
K-means is a well-known algorithm for unsupervised clustering.
It partitions the document set so as to minimize the mean
squared error (MSE) between the documents and their cluster
centroids. Intuitively, however, the centroid may not represent a
concept clearly, because it is just the average of all documents in
the same cluster. To represent a cluster more clearly, each cluster
should have a small set of representative key terms. Although many document
clustering methods have been proposed in the literature, few of
them deal with the key terms explicitly. In this study, we propose
a new method for classifying the documents based on the concept
of comparative advantage, and a new clustering algorithm for
extracting important key terms. Experimental results show that
the proposed method yields better results, in the sense
that the sets of representative terms of different clusters
overlap less.
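
The motivation above can be illustrated with a minimal sketch. It is not the proposed method: the toy term-count vectors, the vocabulary, the helper names (centroid, top_terms, overlap), and the use of Jaccard overlap as the overlap measure are all assumptions made for illustration. The sketch computes each centroid as the plain average of its documents, reads off the highest-weighted terms, and measures how much the resulting term sets overlap.

```python
# Illustrative sketch only: the data, the helper names, and the Jaccard
# overlap measure are assumptions, not the method proposed in this paper.

def centroid(vectors):
    """Component-wise mean of equal-length term-weight vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def top_terms(center, vocabulary, k):
    """Return the k terms with the largest weight in a centroid."""
    ranked = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
    return {vocabulary[i] for i in ranked[:k]}

def overlap(terms_a, terms_b):
    """Jaccard overlap of two term sets (0 = disjoint, 1 = identical)."""
    return len(terms_a & terms_b) / len(terms_a | terms_b)

vocabulary = ["data", "model", "gene", "cell", "market"]

# Two hypothetical clusters of term-count vectors over the vocabulary.
cluster_a = [[4, 2, 0, 0, 1], [2, 3, 0, 1, 0]]
cluster_b = [[2, 0, 3, 1, 0], [3, 0, 3, 2, 0]]

center_a = centroid(cluster_a)
center_b = centroid(cluster_b)

terms_a = top_terms(center_a, vocabulary, 2)
terms_b = top_terms(center_b, vocabulary, 2)

# A term that is frequent everywhere ("data") ranks highly in both
# averaged centroids, so the two term sets are not disjoint.
shared = overlap(terms_a, terms_b)
print(sorted(terms_a), sorted(terms_b), round(shared, 3))
```

In this toy example the common term "data" appears among the top terms of both centroids, which is the kind of overlap between representative term sets that the abstract identifies as undesirable.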