dc.description.abstract | The main focus of this thesis is to evaluate whether ensemble methods improve
predictive performance when applied to a myocardial infarction (MI) data set from the
HUNT study. The data set comes from a prospective case-control study with a 10-year
follow-up (fatal and non-fatal), where MI was used as the primary endpoint. The subjects
of this study were 200 healthy individuals aged 60-79 from the HUNT2 study. The cases
(n₁ = 100) experienced an MI within the 10-year observation period, whereas the controls
(n₂ = 100) remained healthy during the follow-up. Several risk factors for experiencing an MI have been identified in recent years and are used in risk prediction models. The most popular prediction model is the Framingham score. However, about 15-20% of patients
experiencing MI did not score high on any of the traditional risk factors.
Recent studies have shown that microRNAs, which are small non-coding RNA molecules,
have great potential as diagnostic biomarkers for cardiovascular disease. It is therefore of interest to investigate whether microRNAs also have potential as biomarkers for predicting future instances of MI.
Logistic regression and tree-based methods are commonly used to predict a binary
outcome from observed predictor variables. In recent years we have seen increased
popularity of ensemble methods. One such method is bagging (bootstrap aggregation). Bagging
is performed by resampling a data set many times with replacement, fitting a prediction model to each
resampled data set and then combining the predictions of these models into a single
prediction. In this thesis we examine how bagging can be applied to classification trees and
to logistic regression. We also investigate the closely related ensemble methods random
forests and random GLMs, which include only a subset of predictors in each step of the
model fitting. The predictive performance of six different statistical models (a pruned tree, logistic GLM, bagged tree, bagged GLM, random forest and random GLM) is evaluated
through a simulation study, where we use the area under the curve (AUC) and the Brier
score to assess prediction accuracy.
We then fit our six models to the HUNT data set in order to draw conclusions about
which predictors are relevant for predicting a future MI event. The conclusion is that
ensemble methods increase the predictive performance, in particular when applied to
classification trees. The best predictive power was obtained by fitting a random GLM. Further, we have seen that microRNAs are highly relevant for predicting MI, and that the predictors BMI, serum triglycerides and non-fasting serum glucose, which are not included in the Framingham risk score, are of high importance. | |
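
The bagging procedure described in the abstract (resample, fit a model to each resample, combine the predictions) and the Brier score can be illustrated with a minimal sketch. This is not the thesis's implementation: the data are synthetic, a one-split tree (a "stump") stands in for a full classification tree, and all function names (`fit_stump`, `bagged_fit`, `bagged_predict`, `brier_score`) are hypothetical and chosen only for illustration.

```python
import random

def fit_stump(X, y):
    """Fit a one-split tree: pick the (feature, threshold) pair that
    minimises squared error; each leaf predicts its mean outcome."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for t in sorted(set(row[j] for row in X)):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            pl, pr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - pl) ** 2 for v in left)
                   + sum((v - pr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, pl, pr)
    if best is None:  # no valid split: fall back to a constant predictor
        m = sum(y) / n
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, pl, pr = stump
    return pl if row[j] <= t else pr

def bagged_fit(X, y, n_models=50, seed=0):
    """Fit one stump per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagged_predict(models, row):
    """Combine the ensemble by averaging the predicted probabilities."""
    return sum(predict_stump(m, row) for m in models) / len(models)

def brier_score(probs, y):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for p, t in zip(probs, y)) / len(y)

# Synthetic data (an assumption, not the HUNT data): the first predictor
# carries signal, the second is pure noise.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if row[0] + rng.gauss(0, 0.3) > 0 else 0 for row in X]

models = bagged_fit(X, y)
probs = [bagged_predict(models, row) for row in X]
score = brier_score(probs, y)  # in-sample, so optimistic
print(f"in-sample Brier score: {score:.3f}")
```

The score printed here is computed on the training data and is therefore optimistic; an honest assessment would use held-out or simulated test data, as in the simulation study the abstract describes. A lower Brier score indicates better probabilistic predictions, with 0.25 corresponding to always predicting a probability of 0.5.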