Using Ensemble Methods to Improve the Performance of Prediction - Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study

Debik, Julia Barbara

Master thesis

Open

12039_FULLTEXT.pdf (1.401Mb)

12039_COVER.pdf (1.556Mb)

Permanent link

http://hdl.handle.net/11250/2448348

Publication date

2017

Collections

Department of Mathematical Sciences [2473]

Abstract

The main focus of this thesis is to evaluate the use of ensemble methods to improve predictive performance when applied to a myocardial infarction (MI) data set from the HUNT study. The data set comes from a prospective case-control study with a 10-year follow-up, where MI (fatal and non-fatal) was used as the primary endpoint. The subjects of this study were 200 healthy individuals aged 60-79 from the HUNT2 study. The cases (n₁ = 100) experienced an MI within the 10-year observation period, whereas the controls (n₂ = 100) remained healthy during the follow-up. Several risk factors for experiencing an MI have been identified in recent years and are used in risk prediction models. The most popular prediction model is the Framingham score. However, about 15-20% of patients experiencing MI did not score high on any of the traditional risk factors.

Recent studies have shown that microRNAs, which are small non-coding RNA molecules, have large potential as diagnostic biomarkers for cardiovascular disease. It is thus interesting to investigate whether microRNAs also have potential as predictive biomarkers for future instances of MI.

Logistic regression and tree-based methods are commonly used to predict a binary outcome from observed predictor variables. In recent years, ensemble methods have become increasingly popular. One such method is bagging (bootstrap aggregation). Bagging is performed by resampling a data set many times, fitting a prediction model to each resampled data set, and then combining the predictions of these models into a single prediction. In this thesis we examine how bagging can be applied to classification trees and to logistic regression. We also investigate the closely related ensemble methods random forests and random GLMs, which include only a subset of predictors in each step of the model fitting. The predictive performance of six statistical models (a pruned tree, logistic GLM, bagged tree, bagged GLM, random forest and random GLM) is evaluated through a simulation study, where we use the area under the ROC curve (AUC) and the Brier score to assess prediction accuracy.
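The bagging procedure and the two accuracy measures described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the data are synthetic, the function names (fit_logistic, bagged_predict, brier_score, auc_score), the number of bootstrap samples and the small ridge term are all assumptions made for the illustration, and only NumPy is used.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-3):
    """Fit a logistic GLM by Newton-Raphson; the small ridge term
    (an assumption of this sketch) keeps the Hessian invertible."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        w = p * (1.0 - p)
        grad = Xb.T @ (y - p) - ridge * beta
        hess = (Xb * w[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_proba(beta, X):
    """Predicted probability of the positive class."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ beta))

def bagged_predict(X_train, y_train, X_test, n_boot=50, seed=0):
    """Bagging: fit one model per bootstrap resample, average the predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((n_boot, len(X_test)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        beta = fit_logistic(X_train[idx], y_train[idx])
        preds[b] = predict_proba(beta, X_test)
    return preds.mean(axis=0)

def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((p - y) ** 2))

def auc_score(y, p):
    """AUC as the share of (case, control) pairs where the case scores
    higher; ties count one half."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Synthetic data (hypothetical): 200 subjects, 5 predictors, 2 of them noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_beta = np.array([1.5, -1.5, 1.0, 0.0, 0.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

p_bag = bagged_predict(X_train, y_train, X_test)
print("AUC:", auc_score(y_test, p_bag))
print("Brier score:", brier_score(y_test, p_bag))
```

A higher AUC and a lower Brier score both indicate better predictions. Drawing a random subset of the predictor columns inside the bootstrap loop would move this sketch in the direction of the random GLM idea, though the details of that method are not specified here.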

We then fit our six models to the HUNT data set to draw conclusions about which predictors are relevant for predicting a future MI event. The conclusion is that ensemble methods increase predictive performance, in particular when applied to classification trees. The best predictive performance was obtained by fitting a random GLM. Further, we found that microRNAs are highly relevant for predicting MI, and that the predictors BMI, serum triglycerides and non-fasting serum glucose, which are not included in the Framingham risk score, are of high importance.

Publisher

NTNU