IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Intelligent Information and Communication Technology and its Applications to Creative Activity Support
Improving Seeded k-Means Clustering with Deviation- and Entropy-Based Term Weightings
Uraiwan BUATOOM, Waree KONGPRAWECHNON, Thanaruk THEERAMUNKONG
JOURNAL FREE ACCESS

2020 Volume E103.D Issue 4 Pages 748-758

Abstract

The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. Recent works have tried to use class-related distributions to enhance discrimination ability, but it is still worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based distribution and entropy-based distribution as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results showed that the deviation-based distribution outperformed the entropy-based distribution, and that a suitable combination of these distributions increased the clustering accuracy by 10%.
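
For readers unfamiliar with the clustering method named in the abstract, the following is a minimal sketch of the general seeded k-means idea: a few labeled seed items set the initial centroids, and ordinary k-means iterations follow. It is not the authors' implementation. The function name `seeded_kmeans`, the Euclidean distance, and the toy vectors are assumptions made for illustration; in the paper's setting the vectors would be term-weighted document vectors.

```python
from math import dist


def seeded_kmeans(points, seeds, max_iter=100):
    """Illustrative seeded k-means (hypothetical helper, not from the paper).

    points: list of equal-length numeric tuples (e.g. weighted term vectors).
    seeds:  dict mapping a cluster label to a list of indices into `points`
            whose cluster membership is known in advance.
    Returns (labels, centroids), where labels[i] is the cluster label of points[i].
    """
    def mean(vectors):
        n = len(vectors)
        return tuple(sum(col) / n for col in zip(*vectors))

    cluster_ids = sorted(seeds)
    # Seeding step: each initial centroid is the mean of that cluster's seeds,
    # instead of a randomly chosen point.
    centroids = [mean([points[i] for i in seeds[c]]) for c in cluster_ids]

    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        # Euclidean distance is an assumption of this sketch.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = mean(members)

    return [cluster_ids[a] for a in assignment], centroids


# Toy usage: two well-separated groups, one seed item per cluster.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_seeds = {"a": [0], "b": [3]}
toy_labels, toy_centroids = seeded_kmeans(toy_points, toy_seeds)
```

In this sketch the seeds influence only the initialization; after the first iteration a seed item may be reassigned like any other point.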

© 2020 The Institute of Electronics, Information and Communication Engineers