Web Content Extraction Based on Multiple Strategies

GAO Yan; GU Shiwen; TAN Liqiu

Volume 20 Issue 4

Aug. 2007

GAO Yan, GU Shiwen, TAN Liqiu. Web Content Extraction Based on Multiple Strategies[J]. Journal of Southwest Jiaotong University, 2007, 20(4): 473-477.

Received Date: 14 Jun 2006
Publish Date: 25 Aug 2007

Abstract

To filter the noise in web pages, a new multi-strategy algorithm for extracting web page content is proposed. The algorithm builds a block tree of the page with an improved VIPS (vision-based page segmentation) algorithm and controls the granularity of different areas of the tree by defining the permitted degree of coherence and the maximum depth of the block tree. "Topic" or "topic-relevant" blocks are then identified among the leaves of the block tree from the blocks’ content information and structure information. Finally, the main content of the page is extracted by merging the contents of these blocks. Experiments on web pages from three sites indicate that the proposed algorithm is effective for extracting the content of web pages of any type.
Keywords

- VIPS (vision-based page segmentation)
- degree of coherence
- maximum depth
- content information
- structure information
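
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the `Block` class, the function names, the use of text length as "content information" and link ratio as "structure information", and all threshold values are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical visual block; not the paper's data structure."""
    text: str = ""          # visible text owned directly by this block
    link_chars: int = 0     # characters of that text inside hyperlinks
    doc: float = 1.0        # degree of coherence, assumed to lie in [0, 1]
    children: List["Block"] = field(default_factory=list)


def leaves(block: Block, pdoc: float, max_depth: int, depth: int = 0) -> List[Block]:
    """Return the leaves of the block tree. A block is split further only
    while it is less coherent than the permitted degree of coherence (pdoc)
    and the maximum depth has not been reached."""
    if not block.children or block.doc >= pdoc or depth >= max_depth:
        return [block]
    found: List[Block] = []
    for child in block.children:
        found.extend(leaves(child, pdoc, max_depth, depth + 1))
    return found


def gather(block: Block) -> Tuple[str, int]:
    """Collect the text and link-character count of a block and all of its
    descendants, so that an unsplit block still carries its full content."""
    parts = [block.text] if block.text else []
    links = block.link_chars
    for child in block.children:
        child_text, child_links = gather(child)
        if child_text:
            parts.append(child_text)
        links += child_links
    return " ".join(parts), links


def is_topic(block: Block, min_chars: int = 20, max_link_ratio: float = 0.3) -> bool:
    """Assumed heuristic: a topic block has enough text (content information)
    and a low share of link text (structure information)."""
    text, links = gather(block)
    if len(text) < min_chars or not text:
        return False
    return links / len(text) <= max_link_ratio


def extract_content(root: Block, pdoc: float = 0.6, max_depth: int = 5) -> str:
    """Merge the text of all leaf blocks classified as topic blocks."""
    kept = [gather(b)[0] for b in leaves(root, pdoc, max_depth) if is_topic(b)]
    return "\n".join(kept)


if __name__ == "__main__":
    page = Block(doc=0.2, children=[
        Block(text="Home News Sports Contact", link_chars=24, doc=0.9),
        Block(text="The main article text of the page goes here.", doc=0.8),
    ])
    print(extract_content(page))
```

In this sketch, raising `pdoc` or `max_depth` yields finer-grained leaves, while lowering them keeps larger regions of the page together as single blocks.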