An Idea of a Rough Set Theory Based Document Classification System

Masaki KUREMATSU

doi:10.3156/jsoft.32.4_778

Abstract

Document classification is a well-known task in natural language processing. In this paper, I propose a document classification system based on Rough Set Theory. First, the proposed system builds a decision table by combining each document's label with terms selected by document frequency and reduction. Next, it extracts decision rules from the upper approximation and from the lower approximation separately. It then matches an unlabeled document against both sets of decision rules and assigns the label with the largest sum of rule weights. I use SI (Satisfaction Index), CI (Coverage Index) and Lift as rule weights. To evaluate this approach, I implemented a prototype system and, working with experts, used it to classify labeled patent publications written in Japanese. The system extracted some rules that an expert judged useful, and its accuracy was higher than that of always selecting the modal label. However, only 25% of the rules were judged useful, and the accuracy and the Kappa statistic were not high enough for practical use. Nor do the results show that this approach outperforms a Naive Bayes classifier. In future work, I will improve this approach based on an analysis of this evaluation.
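
The pipeline described above can be illustrated with a minimal sketch. It is not the prototype evaluated in the paper. The toy decision table, the terms, the labels, and the function names `approximations`, `rule_weights` and `classify` are all hypothetical. The abstract does not give formulas for the weights, so the sketch assumes common definitions: SI as the fraction of rule-matching documents that carry the rule's label, CI as the fraction of the label's documents that the rule matches, and Lift as SI divided by the label's overall frequency. Term selection by document frequency, reduction, and rule induction are not shown; the rule conditions are written by hand.

```python
from collections import defaultdict

# Toy decision table: (terms present in the document, label).
# Terms and labels are invented for illustration only.
TABLE = [
    ({"battery", "electrode"}, "A"),
    ({"battery", "electrode"}, "A"),
    ({"battery", "motor"}, "A"),
    ({"battery", "motor"}, "B"),
    ({"motor", "gear"}, "B"),
    ({"gear"}, "B"),
]

def approximations(table, label):
    """Return (lower, upper) approximations of `label` as sets of row indices."""
    # Rows with identical term sets are indiscernible from each other.
    classes = defaultdict(list)
    for i, (terms, _) in enumerate(table):
        classes[frozenset(terms)].append(i)
    lower, upper = set(), set()
    for rows in classes.values():
        labels = {table[i][1] for i in rows}
        if label in labels:
            upper.update(rows)          # class possibly belongs to the label
            if labels == {label}:
                lower.update(rows)      # class certainly belongs to the label
    return lower, upper

def rule_weights(table, condition, label):
    """Weights for the rule 'condition terms present -> label' (assumed definitions)."""
    matched = [lab for terms, lab in table if condition <= terms]
    n_label = sum(1 for _, lab in table if lab == label)
    if not matched or n_label == 0:
        return None
    hits = matched.count(label)
    si = hits / len(matched)            # assumed: share of matches with this label
    ci = hits / n_label                 # assumed: share of the label that is covered
    lift = si / (n_label / len(table))  # assumed: SI relative to the label's prior
    return {"SI": si, "CI": ci, "Lift": lift}

def classify(rules, doc_terms, weight="SI"):
    """Sum the chosen weight over matching rules and return the best label."""
    scores = defaultdict(float)
    for condition, label, weights in rules:
        if condition <= doc_terms:
            scores[label] += weights[weight]
    return max(scores, key=scores.get) if scores else None

# Hand-written rule conditions; a real system would induce them from the table.
RULES = [
    (frozenset(cond), label, rule_weights(TABLE, frozenset(cond), label))
    for cond, label in [({"battery"}, "A"), ({"motor"}, "B"), ({"gear"}, "B")]
]

print(approximations(TABLE, "A"))
print(classify(RULES, {"battery", "motor"}, weight="SI"))
```

In this toy table, the two documents containing `battery` and `motor` carry different labels, so they fall in the upper approximation of label `A` but not in its lower approximation. Passing `weight="CI"` or `weight="Lift"` to `classify` switches which index is summed.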
