AlGoRiThM Posted February 7, 2022
Day two of paper reading: six papers in two days. Looks like you might not see me tomorrow, haha.
Today's recommended paper: "Boilerplate Detection using Shallow Text Features"
Authors: Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl
Summary: By analyzing shallow text features, the paper builds a linguistic model for extracting the main content from the text of a web page. It includes a good deal of analysis of web page structure and is well worth reading. The boilerplate detection itself also performs well, reaching a high level in both accuracy and precision.
Key points:
1. In the field of Quantitative Linguistics, it is generally assumed that the text creation process can be modeled as urn trials at the level of various linguistic units such as phoneme, word, sentence, text segment, etc. and for several shallow features such as frequency, length, repeat rate, polysemy, and polytextuality.
2. Through our systematical analysis, we found that removing the words from the short text class alone already is a good strategy for cleaning boilerplate and that using a combination of multiple shallow text features achieves an almost perfect accuracy. To a large extent the detection of boilerplate text does not require any inter-document knowledge (frequency of text blocks, common page layout, etc.) nor any training at the token level.
3. The textual content on the Web can apparently be grouped into two classes, long text (most likely the actual content) and short text (most likely navigational boilerplate text) respectively.
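To make the long text versus short text idea concrete, here is a minimal Python sketch of classifying text blocks using two shallow features, word count and link density. This is only an illustration, not the authors' implementation: the `Block` class, the function names, and both thresholds (`min_words=10`, `max_link_density=0.33`) are assumptions chosen for the example, and it assumes the page has already been split into text blocks with the number of linked words counted.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block taken from a page (hypothetical structure for this sketch)."""
    text: str
    linked_words: int = 0  # number of words that sit inside link (anchor) text


def num_words(block: Block) -> int:
    # Shallow feature 1: whitespace-separated word count.
    return len(block.text.split())


def link_density(block: Block) -> float:
    # Shallow feature 2: fraction of the block's words that are link text.
    n = num_words(block)
    return block.linked_words / n if n else 0.0


def is_content(block: Block, min_words: int = 10,
               max_link_density: float = 0.33) -> bool:
    # Both thresholds are illustrative assumptions, not values from the paper.
    # Long blocks with few links are treated as likely main content;
    # short or link-heavy blocks are treated as likely boilerplate.
    return (num_words(block) >= min_words
            and link_density(block) <= max_link_density)


blocks = [
    Block("Home About Contact Login", linked_words=4),
    Block("Boilerplate detection separates the main article text from navigation, "
          "footers and other template content that surrounds it on a web page."),
]
content = [b.text for b in blocks if is_content(b)]
```

In this sketch the four-word navigation block is dropped and the longer paragraph is kept, which matches point 3 above: short text is most likely navigation, long text is most likely the actual content.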
AlGoRiThM (Author) Posted February 11, 2022
On 2022/2/10 at 4:10 PM, mamamama said: I'd guess nobody reads the papers shared on ss for anything other than entertainment.
Isn't this fun, though?