AlGoRiThM

[Pure Land] SS Self-Purchase Group (【净土】SS自购团)
  • Content count

    1,045

All content posted by AlGoRiThM

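  1. Go exercise! That's where all the secrets to good sleep are hidden!
  2. I never played adult games with my classmates, but I did share adult novels with them XD
  3. Honestly, I don't mind. As long as nobody posts something like "in the manga, blah blah blah" that spoils the ending, I'm fine with anything.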
  4. Paper reading, day two: six papers in two days, so you might not see me tomorrow, haha. Today's paper recommendation: "Boilerplate Detection using Shallow Text Features". Authors: Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl. Summary: by analyzing text features, the paper builds a language model for extracting the main content from a page's body text. It contains a lot of analysis of web page structure and is worth reading. The boilerplate detection also performs well, reaching quite high accuracy and precision. Key points: 1. In the field of Quantitative Linguistics, it is generally assumed that the text creation process can be modeled as urn trials at the level of various linguistic units such as phoneme, word, sentence, text segment, etc. and for several shallow features such as frequency, length, repeat rate, polysemy, and polytextuality. 2. Through our systematical analysis, we found that removing the words from the short text class alone already is a good strategy for cleaning boilerplate and that using a combination of multiple shallow text features achieves an almost perfect accuracy. To a large extent the detection of boilerplate text does not require any inter-document knowledge (frequency of text blocks, common page layout, etc.) nor any training at the token level. 3. The textual content on the Web can apparently be grouped into two classes, long text (most likely the actual content) and short text (most likely navigational boilerplate text) respectively.
  5. Wait, is MBTI not well regarded? I only took the MBTI test yesterday and found that I've become even more withdrawn, lol
  6. I was going to post my notes to Notion, but on second thought, let's suffer together. These are papers on NLP preprocessing: content extraction based on the DOM tree. I plan to post one thread a day for the next few days; if I miss a day, please scold me right away, thanks XD. Let's encourage each other, and good luck with the research~ Today's paper: Content Extraction Using Diverse Feature Sets (2013). Why I recommend it: it applies machine learning to features such as the tags in a web page in order to extract the main content of a site. Highlights: We use the method in [4] to compute the F1-scores, where each word in the document is distinct even if two words are lexically the same. To demonstrate the versatility of the learning approach, we train only on the 2012 Train set and make predictions on the rest of the data. In general, combining features does improve model performance, even if the individual model performance is poor. Model performance decreases on the newer 2012 data when compared to the older data sets. Individually, the IC features give a small performance improvement over the baseline, and not surprisingly perform poorly on the older data when CSS was less popular. The low individual performance of the IC features may be attributable to the fact that we accumulate tokens in each block, but meaningful tokens may appear outside the block at higher levels in the DOM. The small train/test differences suggest we may be slightly overfitting.
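  7. If the blood has been drained and it has been soaked in water, then it's fine. Blood contains a lot of bacteria, so it can't be kept overnight.
  8. Nice work, I'll go take a look! Wait a minute, this thing is organic chemistry?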
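  9. Paper title: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Field: NLP / Machine Learning / Data Mining. Why I recommend it: data mining is an important step in the data collection process for data analysis and machine learning. What makes this paper interesting is that it parses website structure and arrives at a model that works across most websites. I haven't finished reading it yet, but the figure in the post shows that it performs quite well~ If anyone knows a better model, please recommend it so I can borrow it for what to say at next week's group meeting, haha. The real reason: my advisor told me to read it, but I can't be the only one suffering, can I~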
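  10. No rush, no rush, never any rush. I have no pressing need to get married anyway, or even to date. Being single suits me fine.
  11. Welcome to SS University~ Hope you have a wonderful day~
  12. A memorial to the year we've lost, sob sob. Time really passes too fast.
  13. Speaking of online games, the first one I ever played was 新飞飞 (Xin Feifei)~ even earlier than 摩尔庄园 (Mole Manor). I have to say its degree of freedom beats most games on the market today. Hence the "Chinese games are regressing" theory.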
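
The short-text heuristic quoted in post 4 and the word-level F1 evaluation mentioned in post 6 can be sketched together in Python. This is only a rough illustration, not the method from either paper: the 10-word threshold, the regex tokenizer, and the function names `keep_long_blocks` and `token_f1` are all assumptions made for the example, and the F1 here uses a simple multiset overlap of tokens as a stand-in for the papers' own evaluation procedure.

```python
import re
from collections import Counter

# Hypothetical threshold: blocks with fewer words than this are treated as
# "short text" (likely navigational boilerplate). The value 10 is an
# illustrative assumption, not a number taken from the paper.
MIN_WORDS = 10


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def keep_long_blocks(blocks, min_words=MIN_WORDS):
    """Keep blocks with at least min_words words; drop the short ones."""
    return [b for b in blocks if len(tokenize(b)) >= min_words]


def token_f1(predicted_blocks, gold_blocks):
    """F1 over token multisets: repeated words count once per occurrence."""
    pred = Counter(t for b in predicted_blocks for t in tokenize(b))
    gold = Counter(t for b in gold_blocks for t in tokenize(b))
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy page: three navigation-like blocks around one content-like block.
    page = [
        "Home",
        "About us",
        "The quick brown fox jumps over the lazy dog near the quiet river bank today",
        "Login",
    ]
    gold = [page[2]]
    print("F1 without filtering:", token_f1(page, gold))
    print("F1 after dropping short blocks:", token_f1(keep_long_blocks(page), gold))
```

Word count alone is the crudest of the shallow features the paper discusses; a fuller version would combine several features per block, as the quoted passage suggests.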