Cumulative Citation Recommendation

Keeping knowledge bases such as Wikipedia up to date is a difficult task in the information age. Every day, thousands of news articles, blog posts, and opinion pieces are published on the Internet. If even a small fraction of these documents contains new information that requires a knowledge base to be updated, an army of constantly vigilant volunteers is needed to track this stream and apply updates as they become necessary. As more and more information is generated on the Internet, ever more volunteers are needed to keep track of it all. Automated systems that assist volunteers in integrating new information into knowledge bases would therefore be greatly beneficial.

Cumulative Citation Recommendation (CCR) is the task of assisting knowledge base editors by automatically recommending edits to entity profiles in knowledge bases, given a stream of documents. In this thesis we implement a CCR system that allows us to evaluate different learning-to-rank (LTR) based approaches to CCR. Specifically, we compare entity-dependent and entity-independent approaches, as well as approaches that use Gradient Boosted Trees and Random Forests as the ranking algorithm. We also evaluate how different features affect the system. Our best approach, which combines Gradient Boosted Trees with an entity-dependent model, achieves an F1 measure of 0.5 on the 2014 TREC KBA track, which would place it second among the participants of that track. Our evaluation of different LTR-based approaches reveals which approaches are most effective for CCR.
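To make the reported F1 measure concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed for the documents recommended for a single entity. It assumes plain F1, the harmonic mean of precision and recall. The function name `f1_score` and the document identifiers are hypothetical, and the official TREC KBA scorer may differ in detail (for example in how scores are averaged over entities or confidence cutoffs).

```python
def f1_score(recommended, relevant):
    """Return (precision, recall, F1) for one entity.

    recommended: document ids the system proposes as citations (hypothetical).
    relevant:    document ids judged citation-worthy by annotators (hypothetical).
    """
    recommended, relevant = set(recommended), set(relevant)
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        # No correct recommendation: all three scores are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(recommended)
    recall = true_positives / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not taken from the thesis.
recommended = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6"]
precision, recall, f1 = f1_score(recommended, relevant)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

In this toy case half of the recommended documents are relevant and half of the relevant documents are recommended, which gives an F1 of 0.5.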

Publisher

NTNU