Authoritative K-Means for Clustering of Web Search Results

Clustering is increasingly applied to hyperlinked documents, especially web search results. Although most commercial web search engines use ranking algorithms to sort the matched results and raise the most relevant pages to the top, the result set is still so large that most results, including some pages that surfers are genuinely interested in, are discarded. Clustering web search results separates unrelated pages and groups similar pages on the same topic together, which helps surfers locate the pages they want much faster. Many features of web pages have been studied for use in clustering, such as content information including the title, snippet, and anchor text. Hyperlinks are another primary feature of web pages, and some content-link coupled clustering methods have been studied. We propose an authoritative K-Means clustering method that combines content, in-links, out-links, and PageRank. In this project, we adjust the construction of the in-link and out-link vectors and introduce a new PageRank vector with two patterns: one is a single-value representation of PageRank and the other is an 11-dimensional vector. We study how these two types of PageRank representation differ in clustering, and compare the clusterings obtained from different web page representations, such as content-based and content-link coupled representations. The effect of the different elements of a web page is also studied in our project. We apply the authoritative clustering to web search results retrieved from the Google search engine. Three experiments are conducted, and different evaluation metrics are adopted to analyze the results.
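
As a rough illustration of the combined representation described above, the following Python sketch concatenates content, in-link, out-link, and PageRank features for each page and clusters them with plain K-Means. It is a minimal sketch under stated assumptions, not the method used in the thesis: the toy pages, the equal block weights, the reading of the 11-dimensional pattern as a one-hot encoding of an integer PageRank score from 0 to 10, the farthest-first seeding, and every function name are hypothetical choices made for the example.

```python
import math


def pagerank_scalar(pr, max_pr=10):
    """Single-value pattern: scale an integer PageRank score into [0, 1]."""
    return [pr / max_pr]


def pagerank_one_hot(pr, max_pr=10):
    """11-dimensional pattern, ASSUMED here to be one-hot over scores 0..10."""
    vec = [0.0] * (max_pr + 1)
    vec[pr] = 1.0
    return vec


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(page, pr_encoder, weights=(1.0, 1.0, 1.0, 1.0)):
    """Concatenate weighted feature blocks; equal weights are an assumption."""
    w_content, w_in, w_out, w_pr = weights
    vec = [w_content * x for x in l2_normalize(page["content"])]
    vec += [w_in * x for x in l2_normalize(page["in_links"])]
    vec += [w_out * x for x in l2_normalize(page["out_links"])]
    vec += [w_pr * x for x in pr_encoder(page["pagerank"])]
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    # Deterministic farthest-first seeding (an illustrative choice, not from the source).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each page joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy pages: term counts, link indicators over three
# reference pages, and an integer PageRank score.
pages = [
    {"content": [2, 1, 0], "in_links": [1, 1, 0], "out_links": [1, 0, 0], "pagerank": 6},
    {"content": [1, 2, 0], "in_links": [1, 0, 0], "out_links": [1, 1, 0], "pagerank": 5},
    {"content": [0, 0, 3], "in_links": [0, 0, 1], "out_links": [0, 0, 1], "pagerank": 2},
    {"content": [0, 1, 2], "in_links": [0, 1, 1], "out_links": [0, 1, 1], "pagerank": 2},
]

labels_scalar = k_means([combine(p, pagerank_scalar) for p in pages], k=2)
labels_one_hot = k_means([combine(p, pagerank_one_hot) for p in pages], k=2)
print(labels_scalar, labels_one_hot)
```

On this toy data both PageRank patterns group the first two pages together and the last two together. In the one-hot pattern, any two pages with different scores are equally far apart in the PageRank block, whereas the scalar pattern keeps nearby scores close, which is one way the two patterns could lead to different clusterings on real data.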

Publisher

Institutt for datateknikk og informasjonsvitenskap