We propose a method for extracting informative DOM nodes from a Web page as a preprocessing step for Web content mining. The proposed method, LM, uses the layout data of DOM nodes generated by a generic Web browser; its learning set consists of hundreds of Web pages together with annotations of their informative DOM nodes. The method does not require large-scale crawling of the whole Web site to which the target Web page belongs. We design LM so that it uses the information in the learning set more efficiently than the existing method that uses the same learning set. In experiments, we evaluate combinations of an informative DOM node extraction method (the proposed method or an existing method) with existing noise elimination methods: Heur, which removes advertisements and link lists using heuristics, and CE, which removes DOM nodes that also appear in other Web pages of the same Web site as the target Web page. Experimental results show that 1) LM outperforms the other methods for extracting informative DOM nodes, and 2) the LM-based combination (LM, {CE(10), Heur}) (precision: 0.755, recall: 0.826, F-measure: 0.746) outperforms the other combination methods.
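The idea behind CE-style noise elimination can be illustrated with a minimal sketch. The abstract does not state how CE decides that two DOM nodes are the same, so the node representation below (a DOM path paired with its text) and the function name `eliminate_common_nodes` are assumptions made only for illustration, not the authors' implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical node representation: (DOM path, normalized text).
# The actual matching criterion used by CE is not given in the abstract.
Node = Tuple[str, str]


def eliminate_common_nodes(
    target: List[Node], site_pages: Iterable[List[Node]]
) -> List[Node]:
    """Keep only the target nodes that occur in none of the other site pages."""
    seen = set()
    for page in site_pages:
        seen.update(page)
    # Preserve the original document order of the surviving nodes.
    return [node for node in target if node not in seen]


target_page = [
    ("/html/body/div[1]", "Site navigation"),
    ("/html/body/div[2]", "Article text about DOM extraction"),
    ("/html/body/div[3]", "Copyright notice"),
]
other_pages = [
    [
        ("/html/body/div[1]", "Site navigation"),
        ("/html/body/div[2]", "Another article"),
        ("/html/body/div[3]", "Copyright notice"),
    ],
]

kept = eliminate_common_nodes(target_page, other_pages)
print(kept)
```

In this toy input, the navigation and copyright nodes recur on the other page and are dropped, while the article node is unique to the target page and is kept.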