• ISSN 0258-2724
  • CN 51-1277/U
  • EI Compendex
  • Scopus
  • Indexed by Core Journals of China, Chinese S&T Journal Citation Reports
  • Chinese Science Citation Database
Volume 20 Issue 4
Aug.  2007
GAO Yan, GU Shiwen, TAN Liqiu. Web Content Extraction Based on Multiple Strategies[J]. Journal of Southwest Jiaotong University, 2007, 20(4): 473-477.

Web Content Extraction Based on Multiple Strategies

  • Received Date: 14 Jun 2006
  • Publish Date: 25 Aug 2007
  • To filter the noise in a web page, a new multi-strategy algorithm for extracting the content of a web page is proposed. The algorithm builds the block tree of a web page with an improved VIPS (vision-based page segmentation) algorithm and controls the granularity in different areas of the tree by defining the permitted degree of coherence and the maximum depth of the block tree. "Topic" or "topic-relevant" blocks are then identified among the leaves of the block tree from the blocks' content information and structure information. Finally, the main content of the web page is extracted by merging the contents of these blocks. Experiments on the web pages of three sites indicate that the proposed algorithm is effective for extracting the content of any type of web page.
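
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` fields, the link-density test in `is_topic`, and all thresholds are hypothetical stand-ins, since the abstract does not state how the improved VIPS algorithm scores coherence or how topic blocks are recognized. The sketch only shows the general idea of stopping the subdivision of a block once its degree of coherence reaches the permitted value or the maximum depth is hit, then filtering and merging the resulting leaves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A node of the block tree; every field here is an illustrative assumption."""
    text: str = ""
    link_chars: int = 0      # characters of this block's own text inside hyperlinks
    doc: float = 1.0         # degree of coherence of the block, in [0, 1]
    children: List["Block"] = field(default_factory=list)


def leaves(block, pdoc, max_depth, depth=0):
    """Return the leaf blocks, splitting only while DoC < pdoc and depth < max_depth."""
    if block.doc >= pdoc or depth >= max_depth or not block.children:
        return [block]
    result = []
    for child in block.children:
        result.extend(leaves(child, pdoc, max_depth, depth + 1))
    return result


def full_text(block):
    """Text of a block together with the text of all its descendants."""
    parts = [block.text] + [full_text(child) for child in block.children]
    return " ".join(part for part in parts if part)


def total_link_chars(block):
    """Hyperlink characters in a block and all its descendants."""
    return block.link_chars + sum(total_link_chars(c) for c in block.children)


def is_topic(block, min_chars=40, max_link_ratio=0.3):
    """Hypothetical content/structure test: enough text, few hyperlink characters."""
    text = full_text(block)
    if len(text) < min_chars:
        return False
    return total_link_chars(block) / len(text) <= max_link_ratio


def extract_content(root, pdoc=0.7, max_depth=5):
    """Merge the text of all leaf blocks judged to be topic blocks."""
    kept = [full_text(b) for b in leaves(root, pdoc, max_depth) if is_topic(b)]
    return "\n".join(kept)
```

In this sketch, lowering `max_depth` or `pdoc` yields coarser leaves, so a navigation bar and an article may end up in the same block, while higher values split the page more finely before the topic test is applied.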

     

  • loading
  • 加载中
