Abstract
This paper proposes and evaluates a method for extracting personal web pages from a large collection of unclassified web pages. The method can serve as a content filter for reputation searches. To extract personal pages, it focuses on four kinds of text features that appear on personal pages. It measures these features quantitatively for each page, divides the pages into several groups by applying k-means clustering to the measurements, and then identifies the groups that consist of personal web pages. We evaluated the search performance of the method by measuring precision. Experimental results show that the average performance of the method is 2.1 times higher than that of a keyword-based search engine.
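
The clustering step described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the four feature scores per page are invented placeholder values, the seeding from the first k pages is a simplification chosen for reproducibility, and the rule for picking the personal-page group (the cluster with the highest summed centroid) is an assumption made only for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded deterministically from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each page to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical scores for four text features per page (values invented).
pages = [
    (0.9, 0.8, 0.7, 0.9),
    (0.1, 0.2, 0.1, 0.0),
    (0.8, 0.9, 0.8, 0.7),
    (0.2, 0.1, 0.0, 0.1),
    (0.7, 0.7, 0.9, 0.8),
    (0.0, 0.1, 0.2, 0.2),
]

labels, centroids = kmeans(pages, k=2)

# Assumed selection rule (not from the paper): the cluster whose centroid has
# the highest summed feature score is treated as the personal-page group.
personal_cluster = max(range(len(centroids)), key=lambda c: sum(centroids[c]))
personal_pages = [i for i, lab in enumerate(labels) if lab == personal_cluster]
```

In this toy data the pages with uniformly high feature scores fall into one cluster and the rest into the other. In practice the number of groups and the rule for deciding which groups contain personal pages would have to follow the procedure defined in the body of the paper.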