
[Happy Paper Sharing, Issue 4] Web Content Extraction Based on the Web Page DOM Tree


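Recommended post

Day two of paper reading: six papers in two days. Looks like you might not be seeing me tomorrow, haha.

Today's recommended paper: "Boilerplate Detection using Shallow Text Features"

Authors: Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl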

 

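Summary: By analyzing shallow text features, the paper builds a language model for extracting the main content from the text of a web page. It includes a good deal of analysis of web page structure and is well worth a read. The boilerplate detection itself also performs well, reaching quite high levels of both accuracy and precision.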

 

Key points:

1. In the field of Quantitative Linguistics, it is generally assumed that the text creation process can be modeled as urn trials at the level of various linguistic units such as phoneme, word, sentence, text segment, etc. and for several shallow features such as frequency, length, repeat rate, polysemy, and polytextuality.

2. Through our systematical analysis, we found that removing the words from the short text class alone already is a good strategy for cleaning boilerplate and that using a combination of multiple shallow text features achieves an almost perfect accuracy. To a large extent the detection of boilerplate text does not require any inter-document knowledge (frequency of text blocks, common page layout, etc.) nor any training at the token level.

3. The textual content on the Web can apparently be grouped into two classes, long text (most likely the actual content) and short text (most likely navigational boilerplate text) respectively.
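
To make points 2 and 3 a bit more concrete, here is a minimal Python sketch of the "short text vs. long text" idea. It is not the authors' implementation: the set of block-level tags, the helper names `split_blocks` and `classify_blocks`, and the `min_words=10` threshold are all assumptions picked for illustration, and the paper combines several shallow features rather than word count alone.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative choice, not the paper's list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol", "br"}
# Tags whose text should never be counted as page text.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collects the text found between block-level tags into separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, normalize whitespace, and keep non-empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html):
    """Split an HTML string into a list of text blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def classify_blocks(blocks, min_words=10):
    """Label each block 'content' (long text) or 'boilerplate' (short text).

    min_words is a hypothetical threshold, not a value taken from the paper.
    """
    return [
        (block, "content" if len(block.split()) >= min_words else "boilerplate")
        for block in blocks
    ]
```

Calling `classify_blocks(split_blocks(html))` on a page returns (text, label) pairs. Short navigation links end up labeled as boilerplate and long paragraphs as content, and no cross-document statistics or token-level training are needed, which is the point the authors make in item 2.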

 
