Improving Seeded k-Means Clustering with Deviation- and Entropy-Based Term Weightings

Uraiwan BUATOOM; Waree KONGPRAWECHNON; Thanaruk THEERAMUNKONG

doi:10.1587/transinf.2019IIP0017

Abstract

The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it remains worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based and entropy-based distributions as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases clustering accuracy by 10%.
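As a rough illustration of the setup described above, the following is a minimal sketch of seeded k-means over term-weighted document vectors. The abstract does not give the paper's deviation-based or entropy-based formulas, so the weight used here (one minus the normalized entropy of a term's distribution over seed classes) is only a stand-in assumption, not the authors' scheme. All function names and parameters (`entropy_weights`, `seeded_kmeans`, `n_iter`) are hypothetical, and the sketch assumes at least two classes, each with at least one labeled seed document.

```python
import math


def entropy_weights(seed_docs, seed_labels, n_classes):
    """Illustrative stand-in weight per term (assumption, not the paper's formula):
    1 - H(class | term) / log(n_classes), estimated from labeled seed documents.
    Assumes n_classes >= 2."""
    n_terms = len(seed_docs[0])
    weights = []
    for t in range(n_terms):
        # Total count of term t within each seed class.
        per_class = [0.0] * n_classes
        for doc, label in zip(seed_docs, seed_labels):
            per_class[label] += doc[t]
        total = sum(per_class)
        if total == 0:
            weights.append(0.0)  # term never seen in seeds: no class evidence
            continue
        h = -sum((c / total) * math.log(c / total) for c in per_class if c > 0)
        weights.append(1.0 - h / math.log(n_classes))
    return weights


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(docs, seed_docs, seed_labels, n_classes, n_iter=20):
    """Seeded k-means sketch: seeds fix the initial centroids, then the usual
    assign/update loop runs on the weighted vectors of `docs`."""
    w = entropy_weights(seed_docs, seed_labels, n_classes)

    def scale(d):
        return [x * wt for x, wt in zip(d, w)]

    x = [scale(d) for d in docs]
    seeds = [scale(d) for d in seed_docs]
    # Initial centroid of each cluster = mean of that class's seed documents.
    centroids = [
        mean([s for s, lab in zip(seeds, seed_labels) if lab == c])
        for c in range(n_classes)
    ]
    assign = None
    for _ in range(n_iter):
        new_assign = [
            min(range(n_classes), key=lambda c: sq_dist(v, centroids[c]))
            for v in x
        ]
        if new_assign == assign:
            break  # assignments stable: converged
        assign = new_assign
        for c in range(n_classes):
            members = [v for v, a in zip(x, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    return assign
```

In this sketch a term that occurs equally often in every seed class receives a weight near zero and so contributes almost nothing to the distance computation, while a term concentrated in a single class keeps full weight. A deviation-based or hybrid weighting could be swapped in by replacing `entropy_weights` alone.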
