Graph-based Natural Language Processing: Graph edit distance applied to the task of detecting plagiarism

The focus of this thesis is the exploration of graph-based similarity, in the context of natural language processing. The work is motivated by a need for richer representations of text. A graph edit distance algorithm was implemented, that calculates the difference between graphs. Sentences were represented by means of dependency graphs, which consist of words connected by dependencies. A dependency graph captures the syntactic structure of a sentence. The graph-based similarity approach was applied to the problem of detecting plagiarism, and was compared against state of the art systems. The key advantages of graph-based textual representations are mainly word order indifference and the ability to capture similarity between words, based on the sentence structure. The approach was compared against contributions made to the PAN plagiarism detection challenge at the CLEF 2011 conference, and would have achieved a 5th place out of 10 contestants. The evaluation results suggest that the approach can be applicable to the task of detecting plagiarism, but require some fine tuning on input parameters. The evaluation results demonstrated that dependency graphs are best represented by directed edges. The graph edit distance algorithm scored best with a combination of node and edge label matching. Different edit weights were applied, which increased performance. Keywords: Graph Edit Distance, Natural Language Processing, Dependency Graphs, Plagiarism Detection

Utgiver

Institutt for datateknikk og informasjonsvitenskap