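I was going to post these notes to Notion, but on second thought, let's all suffer through this together instead.
A paper on NLP preprocessing: content extraction based on the DOM tree.
I plan to post one of these a day for the next few days. If I miss a day, please scold me right away, thanks XD
Let's keep each other going. Good luck with your research~
Today's paper: Content Extraction Using Diverse Feature Sets (2013)
Why I recommend it: it applies machine learning to features such as the tags in a web page in order to extract the page's main content.
Highlights: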
We use the method in [4] to compute the F1-scores, where each word in the document is distinct even if two words are lexically the same. To demonstrate the versatility of the learning approach, we train only on the 2012 Train set and make predictions on the rest of the data. In general, combining features does improve model performance, even if the individual model performance is poor. Model performance decreases on the newer 2012 data when compared to the older data sets. Individually, the IC features give a small performance improvement over the baseline, and not surprisingly perform poorly on the older data when CSS was less popular. The low individual performance of the IC features may be attributable to the fact that we accumulate tokens in each block, but meaningful tokens may appear outside the block at higher levels in the DOM. The small train/test differences suggest we may be slightly overfitting.
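To make the "each word is distinct" part of the F1 description concrete, here is a small sketch of how I read it. This is my own toy illustration, not the exact procedure from [4]: I assume every token is identified by its position in the document, so the same word appearing in the navigation bar and in the article body counts as two different items. The function names, the toy tokens, and the gold/predicted sets are all made up for the example.

```python
from collections import Counter


def f1_over_positions(gold_positions, predicted_positions):
    """F1 where each token is identified by its position in the document."""
    gold = set(gold_positions)
    predicted = set(predicted_positions)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_over_words(gold_words, predicted_words):
    """Contrast case: F1 that only compares the words themselves (bag of words)."""
    overlap = sum((Counter(gold_words) & Counter(predicted_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy page: "the" occurs once in the navigation and once in the article.
tokens = ["home", "the", "news", "the", "cat", "sat"]
gold = {3, 4, 5}       # article body: "the cat sat"
predicted = {1, 4, 5}  # the extractor picked the navigation "the" by mistake

position_score = f1_over_positions(gold, predicted)
word_score = f1_over_words(
    [tokens[i] for i in sorted(gold)],
    [tokens[i] for i in sorted(predicted)],
)
print(position_score, word_score)
```

In this toy case the word-only comparison cannot see the mistake, because the extracted words look identical to the gold words, while the position-based score is penalised for picking the wrong "the". That seems to be the point of treating lexically identical words as distinct.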