Abstract
This paper proposes a spam filtering method that utilizes active
learning and feature identification. Identifying effective
features is a very important step in spam filtering because spam
mail contains many meaningless words that differ only slightly
from one another. Such words increase the computational cost and
degrade performance in the filtering process. Thus, identifying effective
and ineffective features is a promising approach in spam
filtering. However, traditional feature selection methods calculate the
score of features based on some amount of labeled training data. This
assumption does not hold in spam filtering. The spam
filtering process starts with no or few labeled data and gradually
accumulates labeled data through user feedback. We propose a
method that identifies effective features during this active learning
process in spam filtering, based on the naive Bayes approach.
Experimental results show that our method outperforms a traditional method
that performs no feature identification.