Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Paper
Extracting String Features with Adaptation for Text Classification
Toru Onoe, Katsuhiro Hirata, Masayuki Okabe, Kyoji Umemura

2010 Volume 17 Issue 1 Pages 1_77-1_97

Abstract

Feature selection for text classification is a procedure that selects words or strings in order to improve classification performance. This operation is especially important when we use substrings as features because the number of substrings in a given data set is usually quite large.
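To make the scale of the problem concrete, the following minimal Python sketch (not taken from the paper; the sample text, length cap, and function name are illustrative assumptions) enumerates every substring of one short sentence and shows how quickly the candidate feature set grows.

    # Illustrative sketch: enumerate all substrings of a document up to a
    # length cap; even a single short sentence yields hundreds of candidates.
    def all_substrings(text, max_len=10):
        """Yield every substring of `text` with length 1..max_len."""
        n = len(text)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                yield text[i:j]

    doc = "feature selection for text classification"
    print(len(set(all_substrings(doc))))  # several hundred distinct substrings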
In this paper, we focus on substring feature selection and describe a method that uses a statistical score called “adaptation” as the selection measure. Adaptation rests on the assumption that strings appearing more than twice in a document have a high probability of being keywords; we expect this property to make such strings effective features for text classification. We compared our method with a state-of-the-art method proposed by Zhang et al. that builds a substring feature set by removing redundant substrings with similar statistical distributions. We evaluate the classification results with the F-measure, the harmonic mean of precision and recall. An experiment on news classification demonstrated that our method outperformed Zhang’s by 3.74% on average (improving Zhang’s result from 79.65% to 83.39%). In addition, an experiment on spam classification demonstrated that our method outperformed Zhang’s by 2.93% (improving Zhang’s result from 90.23% to 93.15%). We verified that the differences in both experiments are statistically significant.
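As a hedged illustration of the adaptation idea (the exact statistic and thresholds used in the paper may differ; the function name and toy corpus below are assumptions for the example), the score can be pictured as the chance that a substring occurs again in a document given that it has occurred there at least once.

    # Sketch of an adaptation-style score: among documents that contain the
    # substring at all, what fraction contain it at least twice?
    def adaptation_score(substring, documents):
        at_least_once = sum(1 for d in documents if d.count(substring) >= 1)
        at_least_twice = sum(1 for d in documents if d.count(substring) >= 2)
        return at_least_twice / at_least_once if at_least_once else 0.0

    docs = ["spam spam and yet more spam", "an ordinary message", "spam again, spam"]
    print(adaptation_score("spam", docs))     # 1.0: "spam" repeats whenever it appears
    print(adaptation_score("message", docs))  # 0.0: "message" never repeats

Under this view, substrings with a high score would be kept as features; the paper's actual selection procedure is given in the full text.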
An experiment on news classification shows that our method is 0.49% worse on average than a method using words as features (although the difference is not statistically significant). In addition, an experiment on spam classification demonstrated that our method outperformed the word-based method by 1.04% (improving its result from 92.11% to 93.15%), and we verified that this difference is statistically significant.
Zhang’s method tends to extract substrings that are so short that it is difficult to recognize the original phrases from which they were extracted. This degrades classification performance because such a substring can be part of many different words, some or most of which are unrelated to the original phrase. Our method, on the other hand, avoids this pitfall because it selects substrings containing a limited number of original words. Selecting substrings in this manner is the key advantage of our method.
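The following toy example (the word list and substrings are purely illustrative, not data from the experiments) shows why very short substrings make weak features: they match many unrelated words, whereas a slightly longer substring pins down its source word.

    # Illustrative check: how many distinct words contain a given substring?
    vocabulary = ["cat", "category", "catalog", "attention", "locate", "spam"]
    for sub in ["at", "catal"]:
        matches = [w for w in vocabulary if sub in w]
        print(sub, "->", matches)
    # "at" matches five unrelated words; "catal" identifies only "catalog".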

© 2010 The Association for Natural Language Processing