Transactions of the Japanese Society for Artificial Intelligence (人工知能学会論文誌)
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
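Original Paper
An Extraction Method for the Informative DOM Nodes of a Web Page Using Layout Information
Masanobu Tsuruta (鶴田 雅信), Shigeru Masuyama (増山 繁)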
Journal, free access

2010, Volume 25, Issue 6, pp. 742-756

Abstract

We propose a method for extracting informative DOM nodes from a Web page as preprocessing for Web content mining. The proposed method, LM, uses the layout data of DOM nodes generated by a generic Web browser, together with a learning set that consists of hundreds of Web pages and annotations of their informative DOM nodes. Our method does not require large-scale crawling of the whole Web site to which the target Web page belongs. We design LM so that it uses the information in the learning set more efficiently than the existing method that uses the same learning set. In experiments, we evaluate combinations of an informative DOM node extraction method (either the proposed method or an existing one) with existing noise elimination methods: Heur, which removes advertisements and link lists using heuristics, and CE, which removes DOM nodes that also appear in other Web pages of the same Web site as the target Web page. The experimental results show that 1) LM outperforms the other methods for extracting informative DOM nodes, and 2) the LM-based combination (LM, {CE(10), Heur}) (precision: 0.755, recall: 0.826, F-measure: 0.746) outperforms the other combinations.
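The reported F-measure (0.746) is lower than both the reported precision (0.755) and recall (0.826). That cannot happen when F is computed from a single precision/recall pair, but it can when each measure is computed per page and then averaged. The abstract does not say how the figures were aggregated, so the following is only a minimal sketch of that one possible reading. The node IDs, the two-page data set, and the function names `prf` and `macro_average` are hypothetical and are not taken from the paper.

```python
from statistics import mean


def prf(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, F-measure) for one page.

    `predicted` and `gold` are sets of DOM node identifiers
    (hypothetical IDs; the paper's node representation is not given).
    """
    tp = len(predicted & gold)  # nodes correctly extracted
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f


def macro_average(pages: list) -> tuple:
    """Average per-page precision, recall, and F-measure over all pages.

    Assumption: per-page (macro) averaging. The abstract does not
    confirm that the reported scores were aggregated this way.
    """
    scores = [prf(predicted, gold) for predicted, gold in pages]
    return (
        mean(s[0] for s in scores),
        mean(s[1] for s in scores),
        mean(s[2] for s in scores),
    )


# Two made-up pages: one over-extracts, the other under-extracts.
pages = [
    ({"n1", "n2", "n3", "n4"}, {"n1"}),  # precision 0.25, recall 1.0
    ({"m1"}, {"m1", "m2", "m3", "m4"}),  # precision 1.0, recall 0.25
]
macro_p, macro_r, macro_f = macro_average(pages)
print(macro_p, macro_r, macro_f)
```

In this toy data set the averaged precision and recall are both 0.625, while the averaged F-measure is 0.4, because the harmonic mean penalizes the imbalance on each individual page before averaging.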

© 2010 JSAI (The Japanese Society for Artificial Intelligence)