Tracking the Lineage of Arbitrary Processing Sequences

Valeur, Håvar

Master thesis

Open

348134_FULLTEXT01.pdf (817.0Kb)

Permanent link

http://hdl.handle.net/11250/251028

Publication date

2005

Collections

Institutt for datateknologi og informatikk

Abstract

Data is worthless without knowing what it represents, and metadata is needed to manage large data sets efficiently. As computing power becomes cheaper and more data is derived, metadata becomes more important than ever. Today researchers are setting up more experimental scientific workflows than before. As a result, many of the steps that used to lead up to the implementation are skipped. Those steps usually included documenting the work, which is not a central part of the more experimental approach. Since documenting is no longer a natural part of the scientific workflow, and the workflow might change a lot through its lifetime, many data products lack documentation. Since the way scientists work has changed, we feel the way they document their work needs to change. Currently there is no metadata system that retrieves metadata directly from the scientific process without requiring the researcher to change their code or otherwise manually set up the system to handle the workflow. This thesis suggests ways to automate metadata retrieval and shows how two of these techniques can be implemented. Automatic lineage and metadata retrieval will help researchers document the process a data product has gone through. My implementation shows how to retrieve lineage and metadata by instrumenting Interactive Data Language scripts, and how to retrieve lineage from shell scripts by looking at the system calls made by the executable. The implementation discussed in this paper is intended to be a client for the Earth System Science Server, a metadata system for earth science data.

Publisher

Institutt for datateknikk og informasjonsvitenskap