Deidentification of Electronic Patient Records: A Lexicon-based Approximation
In 2004, a lexicon-based deidentification tool was developed at the Norwegian EHR Research Centre (NSEP). The tool was never properly tested due to a lack of suitable, available data material. In 2007, an annotated data set consisting of genuine encounter notes from Norwegian primary health care was created, with features highly appropriate for deidentification performance analysis. This project built on those two works, together with the vision of taking deidentification one step further.

The central questions were: which types of data could be found in the data set, and how did the lexicon tool handle them? Which changes or additions should be implemented to enhance the overall performance? To answer these questions we analyzed the lexicon-based deidentification tool, the sensitive data, and the deidentified output from different test runs. To interpret and quantify the results, we used true/false positives/negatives in addition to precision, recall and F-measure, which are standard metrics in the deidentification field.

Our tool performed with an overall F-measure of 66 %. The annotated data set was found to consist of 5 292 instances of personal health information (PHI), distributed over eighteen different categories. When deidentifying with respect to individual PHI categories, large variations in performance were found, with the best categories reaching recall values up to 91 %. We found that our lexicon-based deidentification tool could not compete with the results reported by comparable projects. In its defence, however, our tool had to handle a wider variety of PHI categories than the other tools, many of which it was not constructed to handle at all. Unless PHI ambiguity issues are handled more gracefully, and the local context is interpreted, we found that our purely lexicon-based approach would not be sufficient for handling all types of PHI.
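The evaluation metrics mentioned above follow standard definitions: precision is the fraction of flagged tokens that are true PHI, recall is the fraction of true PHI that was flagged, and the F-measure is their harmonic mean. A minimal sketch, using illustrative counts (not the thesis's actual tallies):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive/negative counts.

    tp: PHI instances correctly removed (true positives)
    fp: non-PHI tokens wrongly removed (false positives)
    fn: PHI instances missed by the tool (false negatives)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical counts chosen only to illustrate an F-measure near 66 %:
p, r, f = precision_recall_f1(tp=60, fp=25, fn=35)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Note that F1 weights precision and recall equally; for deidentification, recall is often the more critical quantity, since a missed PHI instance is a privacy breach while a false positive only costs some clinical information.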