Text Mining in Health Records: Classification of Text to Facilitate Information Flow and Data Overview
Abstract
This project consists of two parts. In the first part we apply techniques from the field of text mining to classify sentences in encounter notes of the electronic health record (EHR) into classes of {it subjective}, {it objective} and {it plan} character. This is a simplification of the {it SOAP} standard, and is applied due to the way GPs structure the encounter notes. Structuring the information in a subjective, objective, and plan way, may enhance future information flow between the EHR and the personal health record (PHR). In the second part of the project we seek to use apply the most adequate to classify encounter notes from patient histories of patients suffering from diabetes. We believe that the distribution of sentences of a subjective, objective, and plan character changes according to different phases of diseases. In our work we experiment with several preprocessing techniques, classifiers, and amounts of data. Of the classifiers considered, we find that Complement Naive Bayes (CNB) produces the best result, both when the preprocessing of the data has taken place and not. On the raw dataset, CNB yields an accuracy of 81.03%, while on the preprocessed dataset, CNB yields an accuracy of 81.95%. The Support Vector Machines (SVM) classifier algorithm yields results comparable to the results obtained by use of CNB, while the J48 classifier algorithm performs poorer. Concerning preprocessing techniques, we find that use of techniques reducing the dimensionality of the datasets improves the results for smaller attribute sets, but worsens the result for larger attribute sets. The trend is opposite for preprocessing techniques that expand the set of attributes. However, finding the ratio between the size of the dataset and the number of attributes, where the preprocessing techniques improve the result, is difficult. Hence, preprocessing techniques are not applied in the second part of the project. From the result of the classification of the patient histories we have extracted graphs that show how the sentence class distribution after the first diagnosis of diabetes is set. Although no empiric research is carried out, we believe that such graphs may, through further research, facilitate the recognition of points of interest in the patient history. From the same results we also create graphs that show the average distribution of sentences of subjective, objective, and plan character for 429 patients after the first diagnosis of diabetes is set. From these graphs we find evidence that there is an overrepresentation of subjective sentences in encounter notes where the diagnosis of diabetes is first set. However, we believe that similar experiments for several diseases, may uncover patterns or trends concerning the diseases in focus.