A Framework for Ontology Based Semantic Search

2018

Publicly-accessible open transport data is provided by the public sector in an effort

to create new opportunities, stimulate innovation and enable new solutions that

benefit society. The number of available datasets is, however, limited. This

is partially due to the necessary, but labor intensive, preparation process of each

dataset. The datasets need to be annotated with descriptions that explain their purpose

and content. The search and retrieval functionality of current publishing platforms

is limited to classical keyword-based search, which is much more restricted

than the search technology used for finding information on the World Wide Web.

This is due to the fact that information in most cases cannot be retrieved directly

from the data itself, but depends on the dataset descriptions. Open datasets are encoded

in a rich variety of formats, which makes it difficult to reuse them directly in

software applications. This study investigates how a transport domain knowledge

model, namely an ontology of the transport domain, can enable data to be identified

in terms of its meaning in a given context, i.e. semantics, and not by keywords

and tags alone. The study further investigates how semantic technology can be

applied to improve discoverability and reuse of datasets. This was done by initially

developing a prototype framework for ontology-based semantic classification. The

framework works as a test bed that allows for different algorithms to be tested and

compared against different ontologies. The framework also includes the development

of an online search engine that is used to measure the efficiency of the data

discovery method. This study further includes a conceptual design for a software

system that allows transport-related software applications to utilize datasets from

heterogeneous sources. The study finds that automated classification based on natural

language processing of dataset descriptions is possible and shows promising

results. This approach appears to improve the search and retrieval functionality of

limited datasets; however, it is currently sensitive to the quality of the description

text and needs to be developed further.
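
To make the idea of classifying datasets by meaning rather than by keywords concrete, the following is a minimal sketch and not the framework developed in the study. It assumes a hypothetical toy ontology (the concept names and labels in ONTOLOGY are invented for illustration) and uses plain token matching in place of the natural language processing applied in the study. Datasets are tagged with ontology concepts based on their description text, and a query retrieves every dataset that shares a concept with it.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept name -> set of labels (assumption, not from the study).
ONTOLOGY = {
    "PublicTransport": {"bus", "tram", "timetable", "route", "stop"},
    "RoadTraffic": {"traffic", "road", "congestion", "vehicle"},
    "Parking": {"parking", "garage", "space"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(description, ontology=ONTOLOGY):
    """Return (concept, score) pairs, where score counts label occurrences in the text."""
    counts = Counter(tokenize(description))
    scores = {}
    for concept, labels in ontology.items():
        hits = sum(counts[label] for label in labels)
        if hits:
            scores[concept] = hits
    # Highest score first; ties broken alphabetically by concept name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def search(query, datasets, ontology=ONTOLOGY):
    """Return names of datasets that share at least one ontology concept with the query."""
    wanted = {concept for concept, _ in classify(query, ontology)}
    return [
        name
        for name, description in datasets.items()
        if wanted & {concept for concept, _ in classify(description, ontology)}
    ]
```

In this sketch, a query such as "tram stop" would retrieve a dataset described as "Bus timetable for all routes" even though the two share no keyword, because both map to the same concept. The exact-match tokenizer also misses plural forms such as "routes", which illustrates on a small scale why such an approach is sensitive to the quality of the description text.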

NTNU