Deidentification of Electronic Patient Records: A Lexicon-based Approximation

Olafsen, Stian

dc.contributor.advisor	Nytrø, Øystein	nb_NO
dc.contributor.author	Olafsen, Stian	nb_NO
dc.date.accessioned	2014-12-19T13:31:18Z
dc.date.available	2014-12-19T13:31:18Z
dc.date.created	2010-09-02	nb_NO
dc.date.issued	2008	nb_NO
dc.identifier	347084	nb_NO
dc.identifier	ntnudaim:3583	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/250299
dc.description.abstract	In 2004, a lexicon-based deidentification tool was developed at The Norwegian EHR Research Centre (NSEP). The tool was never properly tested because suitable data material was not available. In 2007, an annotated data set consisting of genuine encounter notes from Norwegian primary health care was created, with features highly appropriate for deidentification performance analysis. This project builds on those two works, with the added vision of taking deidentification one step further. The key questions were: which types of data could be found in the data set, and how did the lexicon tool handle them? Which changes or additions should be implemented to enhance overall performance? To answer these questions we analyzed the lexicon-based deidentification tool, the sensitive data, and the deidentified output from different test runs. To interpret and quantify the results, we used true/false positives/negatives together with precision, recall and F-measure, which are standard metrics in the deidentification field. Our tool performed with an overall F-measure of 66 %. The annotated data set was found to consist of 5 292 instances of personal health information (PHI), distributed over eighteen different categories. When deidentifying with respect to individual PHI categories, large variations in performance were found, with the best ones reaching recall values of up to 91 %. We found that our lexicon-based deidentification tool could not compete with the results presented by comparison projects. In its defence, however, our deidentification tool had to handle a wider variety of PHI categories than the other tools, many of which it was not constructed to handle at all. Unless PHI ambiguity issues are handled more gracefully and the local context is interpreted, we found that our pure lexicon-based approach would not be sufficient for handling all types of PHI.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.subject	ntnudaim	no_NO
dc.subject	MIT informatikk	no_NO
dc.subject	Kunstig intelligens og læring	no_NO
dc.title	Deidentification of Electronic Patient Records: A Lexicon-based Approximation	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	97	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO
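
The abstract reports results as precision, recall and F-measure derived from true/false positive and negative counts. As a minimal sketch of how these metrics relate, assuming the balanced F-measure (F1, the harmonic mean of precision and recall; the abstract does not state which weighting was used), they can be computed as below. The function name and the counts are hypothetical and are not taken from the thesis.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    tp: PHI instances correctly removed (true positives)
    fp: non-PHI tokens wrongly removed (false positives)
    fn: PHI instances that were missed (false negatives)
    Assumption: balanced F-measure (F1); a ratio with a zero
    denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only, not results from the thesis.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

True negatives do not enter any of the three formulas, which is one reason these metrics suit deidentification, where most tokens in a note are not PHI.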

Associated file(s)

Filename: 347084_FULLTEXT01.pdf
Size: 1.390Mb
Format: PDF

Locked

Filename: 347084_COVER01.pdf
Size: 46.47Kb
Format: PDF

Locked

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6544]