Using Ensemble Methods to Improve the Performance of Prediction - Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study
Abstract
The main focus of this thesis is to evaluate the use of ensemble methods to improve the performance of prediction when applied to a myocardial infarction (MI) data set from the HUNT study. The data set comes from a prospective case-control study with a 10-year follow-up, where MI (fatal and non-fatal) was used as the primary endpoint. The subjects of this study were 200 healthy individuals aged 60-79 from the HUNT2 study. The cases (n_1 = 100) experienced an MI within the 10-year observation period, whereas the controls (n_2 = 100) remained healthy during the follow-up. Several risk factors for experiencing an MI have been identified in recent years and are used in risk prediction models. The most popular prediction model is the Framingham score. However, about 15-20% of patients experiencing MI did not score high on any of the traditional risk factors.

Recent studies have shown that microRNAs, which are small non-coding RNA molecules, have a large potential as diagnostic biomarkers for cardiovascular disease. It is thus interesting to investigate whether microRNAs also have potential as predictive biomarkers for future instances of MI.

Logistic regression and tree-based methods are commonly used to predict a binary outcome when predictor variables are observed. In recent years we have seen increased popularity of ensemble methods. One such method is bagging (bootstrap aggregation). Bagging is performed by resampling a data set many times, fitting a prediction model to each resampled data set and then combining the predictions from these models into a single prediction. In this thesis we examine how bagging can be applied to classification trees and to logistic regression. We also investigate the closely related ensemble methods random forests and random GLMs, which include only a subset of predictors in each step of the model fitting. The predictive performance of six different statistical models (a pruned tree, logistic GLM, bagged tree, bagged GLM, random forest and random GLM) is evaluated through a simulation study, where we use the area under the curve (AUC) score and the Brier score for assessing prediction accuracy.

We then fit our six models to the HUNT data set in order to determine which predictors are relevant for predicting a future MI event. The conclusion is that ensemble methods increase the predictive performance, in particular when applied to classification trees. The best predictive power was obtained by fitting a random GLM. Further, we have seen that microRNAs are highly relevant for predicting MI, and that the predictors BMI, serum triglycerides and non-fasting serum glucose, which are not included in the Framingham risk score, are of high importance.
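
As a rough illustration of the bagging procedure and the two accuracy measures named above, the following sketch may be helpful. It is not the implementation used in the thesis: the base learner is a one-split tree (a "stump") standing in for a full classification tree, the data are simulated, the AUC is assumed to be the area under the ROC curve computed through its rank formulation, and all function names and parameter values (`fit_stump`, `bagged_predict`, `n_bootstrap=50`, the simulation model) are hypothetical choices made for the example.

```python
import random


def fit_stump(X, y):
    """Fit a one-split tree: choose the feature and threshold that minimise
    the squared error of the class-1 proportions in the two leaves."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # skip splits that leave one leaf empty
            p_left = sum(left) / len(left)
            p_right = sum(right) / len(right)
            sse = (sum((yi - p_left) ** 2 for yi in left)
                   + sum((yi - p_right) ** 2 for yi in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, p_left, p_right)
    if best is None:
        # No valid split exists: predict the overall class-1 proportion.
        p = sum(y) / len(y)
        return (0, float("inf"), p, p)
    return best[1:]


def predict_stump(stump, row):
    """Return the estimated probability of class 1 for one observation."""
    j, t, p_left, p_right = stump
    return p_left if row[j] <= t else p_right


def bagged_predict(X_train, y_train, X_new, n_bootstrap=50, seed=0):
    """Bagging: fit one stump per bootstrap resample of the training data
    and average the predicted probabilities over all resamples."""
    rng = random.Random(seed)
    n = len(X_train)
    totals = [0.0] * len(X_new)
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([X_train[i] for i in idx],
                          [y_train[i] for i in idx])
        for k, row in enumerate(X_new):
            totals[k] += predict_stump(stump, row)
    return [total / n_bootstrap for total in totals]


def brier_score(y, p):
    """Mean squared difference between outcomes and predicted probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def auc(y, p):
    """Area under the ROC curve: the proportion of (case, control) pairs in
    which the case has the higher prediction, counting ties as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Simulated data (an assumption of this sketch): two predictors, of which
# only the first is related to the binary outcome.
_rng = random.Random(1)


def simulate(n):
    X = [[_rng.gauss(0, 1), _rng.gauss(0, 1)] for _ in range(n)]
    y = [1 if row[0] + 0.5 * _rng.gauss(0, 1) > 0 else 0 for row in X]
    return X, y


X_train, y_train = simulate(100)
X_test, y_test = simulate(100)
p_hat = bagged_predict(X_train, y_train, X_test)
print("Brier score:", brier_score(y_test, p_hat))
print("AUC:", auc(y_test, p_hat))
```

A lower Brier score and a higher AUC indicate better predictions. Restricting each resample to a random subset of the predictors before fitting would move this sketch in the direction of the random forest and random GLM ideas mentioned above, although the exact schemes used in the thesis are not reproduced here.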