The use of graph databases in file retrieval
Abstract
Files and the unique terms they contain can be modeled as a graph in which the vertices are files and terms and the edges describe containment. Can a graph database be used for search and retrieval of local files? What problems arise, and which optimizations are possible? How does such a method compare to today's file retrieval methods?
This project approaches the problem as a potentially commercializable software application. The intent is to create an environment in which graph-based file retrieval algorithms can easily be created, explored, tested and put into production. To that end, a highly modifiable Ruby-based client-server file retrieval application was built using Titan Aurelius Graph Database and rexpro. The server side consists of a Ruby on Rails back end with a rexpro connection to the graph database, and it can manage connections from several clients. The client side allows users to index their files in the graph database on the server and to run search queries for strings. Algorithms written in Groovy for Titan Aurelius can easily be implemented and tested on the server.
Although the application is well suited to testing graph database file retrieval algorithms, only one algorithm was designed, implemented and tested, owing to the time constraints of the project. It was run on the indexed files of one of the project members, using a handful of subjectively chosen search terms. It was a relatively simple algorithm that did not exploit the full potential of a graph-based file retrieval solution. The test was done to get an initial feel for the precision and recall of the algorithm and to compare it with OSX Spotlight, which is the most highly developed local file retrieval service. The framework proved simple enough for running and testing algorithms. Because little test-driven development was involved, some uncertainty remains about what results the tested algorithm actually produced.
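The graph model described at the start of this abstract, with files and terms as vertices and containment as edges, can be illustrated with a minimal in-memory sketch in Ruby. This is not the project's Titan schema or its Groovy algorithm: the class name `FileTermGraph`, the methods `index` and `search`, the tokenization rule and the sample file paths are all hypothetical and chosen only for illustration.

```ruby
require "set"

# Hypothetical in-memory stand-in for the file/term graph.
# It is NOT the Titan schema used by the application.
class FileTermGraph
  def initialize
    # Containment edges stored in both directions:
    # file vertex -> term vertices, and term vertex -> file vertices.
    @terms_of = Hash.new { |hash, key| hash[key] = Set.new }
    @files_of = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Add a file vertex and one containment edge per unique term in its text.
  # Tokenization (lowercase alphanumeric runs) is an assumption.
  def index(path, text)
    text.downcase.scan(/[a-z0-9]+/).uniq.each do |term|
      @terms_of[path] << term
      @files_of[term] << path
    end
  end

  # Follow the containment edges from a term vertex to the files containing it.
  def search(term)
    key = term.downcase
    @files_of.key?(key) ? @files_of[key].to_a.sort : []
  end
end

graph = FileTermGraph.new
graph.index("notes/graphs.txt", "Graph databases store vertices and edges")
graph.index("notes/search.txt", "Search engines rank files; graph search follows edges")

p graph.search("graph")    # both files contain the term
p graph.search("vertices") # only notes/graphs.txt contains the term
p graph.search("missing")  # no term vertex, so no files
```

A single-term lookup like this is a one-hop traversal; a graph database becomes more interesting when queries follow several hops, for example from a term to its files and on to the other terms those files share.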
The one algorithm that was designed and tested was compared against OSX Spotlight. It showed significantly lower average precision and recall than OSX Spotlight. Several reasons for this were identifiable; for instance, file types that were very unlikely to be a match were not filtered out. In a few cases, the application performed better than OSX Spotlight. It is too soon to determine whether a graph-based file retrieval solution can compete with today's solutions. The approach does, however, achieve some precision and recall, and it has the potential to be improved significantly from its current state.
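For reference, precision and recall for a single query can be computed as in the following Ruby sketch. It uses the standard set-based definitions; the function name `precision_recall` and the file names are hypothetical, and the sketch does not reproduce the project's evaluation procedure or its averaging over search terms.

```ruby
require "set"

# Precision: share of retrieved files that are relevant.
# Recall: share of relevant files that were retrieved.
# Both are defined here as 0.0 when their denominator is empty (an assumption).
def precision_recall(retrieved, relevant)
  retrieved_set = retrieved.to_set
  relevant_set = relevant.to_set
  hits = (retrieved_set & relevant_set).size

  precision = retrieved_set.empty? ? 0.0 : hits.to_f / retrieved_set.size
  recall = relevant_set.empty? ? 0.0 : hits.to_f / relevant_set.size
  { precision: precision, recall: recall }
end

# Hypothetical query result and relevance judgments, for illustration only.
retrieved = ["a.txt", "b.txt", "c.log", "d.txt"]
relevant = ["a.txt", "b.txt", "e.txt"]

result = precision_recall(retrieved, relevant)
puts "precision: #{result[:precision]}"
puts "recall: #{result[:recall]}"
```

In this made-up example two of the four retrieved files are relevant, giving a precision of 0.5, and two of the three relevant files are retrieved, giving a recall of 2/3. Filtering out unlikely file types, such as the `c.log` entry here, would raise precision without affecting recall.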