Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
Original Paper
Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
Takashi Uemura, Daisuke Ikeda, Takuya Kida, Hiroki Arimura
Journal, free access

2011, Volume 26, Issue 1, pp. 297-306

Abstract

In this paper, we study content-based detection of spam that is generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an idealized measure, like Kolmogorov complexity, we substitute an estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word-salad spam, which is believed to be difficult to detect automatically.
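The abstract does not spell out the estimator, so the following is only a rough illustration of the general idea, not the authors' algorithm. It assumes a maximal-overlap-style estimate: a document is covered by its longest substrings that occur in the rest of the collection, and each block's probability is conditioned on its overlap with the previous block. Substring counts come from naive scans, so this sketch is far from linear time, and the function names, the leave-one-out setup, and the `1/total` smoothing for unseen characters are all assumptions made for the example.

```python
import math


def count_occurrences(pattern: str, text: str) -> int:
    """Count possibly overlapping occurrences of pattern in text (naive scan)."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def log_probability(doc: str, background: str) -> float:
    """Maximal-overlap-style estimate of log P(doc) from counts in background."""
    total = len(background) + 1  # assumed smoothing: keeps unseen characters finite
    log_p, prev_end = 0.0, 0
    for i in range(len(doc)):
        # Find the longest substring starting at i that occurs in background.
        end = i
        while end < len(doc) and doc[i:end + 1] in background:
            end += 1
        if end == i:
            # Character never seen in background: charge probability 1/total.
            log_p -= math.log(total)
            prev_end = i + 1
        elif end > prev_end:
            # New maximal block: condition on its overlap with the previous block.
            overlap = doc[i:prev_end]  # empty string when the blocks do not overlap
            denom = count_occurrences(overlap, background) if overlap else total
            log_p += math.log(count_occurrences(doc[i:end], background) / denom)
            prev_end = end
    return log_p


def complexity_scores(docs: list[str]) -> list[float]:
    """Leave-one-out negative log-probability per character for each document."""
    scores = []
    for k, doc in enumerate(docs):
        background = "\n".join(d for j, d in enumerate(docs) if j != k)
        scores.append(-log_probability(doc, background) / max(len(doc), 1))
    return scores


if __name__ == "__main__":
    docs = [
        "buy cheap pills online now zebra",
        "buy cheap pills online now quartz",
        "buy cheap pills online now walrus",
        "meeting moved to friday at noon",
    ]
    for doc, score in zip(docs, complexity_scores(docs)):
        print(f"{score:6.3f}  {doc}")
```

In this toy collection the three perturbed copies receive lower per-character scores than the unrelated message, which is the signal an unsupervised detector of this kind would threshold on. A practical implementation would replace the naive counting with a suffix tree or suffix array to approach the linear-time behavior the abstract describes.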

© 2011 JSAI (The Japanese Society for Artificial Intelligence)