Path-based Graph Indexing for Keyword Search on RDF data - Discovering Concepts Through Community Detection
MetadataVis full innførsel
RDF graphs are structured data that is designed to be used by computers for reasoning. However, such data may be of interest to people as well. Therefore has keyword search on RDF graphs been introduced. The current approaches to keyword search, on both RDF graphs and other graphs, typically finds solutions that connects keyword matching nodes with minimal cost. Minimal costs could, for instance, be the shortest paths between the nodes. Minimal solutions, however, may not always be the most informative answers to a query. Returning parts of the graph that constitute concepts corresponding to how humans conceptualize the world, could give answers that better satisfies the information need. This thesis has explored if such concepts can be discovered automatically in an RDF graph. To this end, we have employed algorithms for finding closely connected groups of nodes in a graph. These groups of nodes are called communities, and finding them is called community detection. A state of the art review examining approaches to keyword search on graphs in general, and RDF graphs in particular, was carried out. Additionally, we performed a review on current algorithms for finding overlapping communities in a graph. Based on this review, we did a preliminary study where the most relevant of these algorithms were applied to some RDF graphs. Through the study, we found some weaknesses in the direct application on RDF graphs, with regard to our stated objective of finding concepts. This was related to the special nature of RDF graphs, compared to other graphs. In particular, we found inconsistencies in the discovered communities with regard to class information, and the meaning that is found in predicate edges. These inconsistencies prevented us from describing the found communities as logical concepts. To eliminate these weaknesses, we developed a novel community detection algorithm for finding concepts in an RDF graph. This new algorithm defines communities in RDF graphs as a set of paths, not as a collection of nodes. We run our community detection algorithm on a sample of the nodes in the graph to find the paths included in communities, and then aggregate the results. The paths that are most frequently included in a community are kept, and used to build concepts. Our algorithm ensures that the concepts are both consistent internally in a community, and consistent across the dataset. The feasibility of our approach were shown through experiments. Furthermore, the usefulness of the approach is argued through a proof-of-concept search solution that uses the concepts found by the community detection algorithm. We find that the concepts discovered by our algorithm can enhance the answers to queries, when compared to corresponding minimal solutions.