2020 Volume E103.D Issue 4 Pages 748-758
The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it is worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based distribution and entropy-based distribution as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions in guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performances of the deviation-based, entropy-based, and hybrid approaches are analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases the clustering accuracy by 10%.