Using Ensemble Methods to Improve the Performance of Prediction - Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study

Debik, Julia Barbara

dc.contributor.advisor	Langaas, Mette
dc.contributor.author	Debik, Julia Barbara
dc.date.accessioned	2017-07-10T14:00:57Z
dc.date.available	2017-07-10T14:00:57Z
dc.date.created	2017-06-12
dc.date.issued	2017
dc.identifier	ntnudaim:12039
dc.identifier.uri	http://hdl.handle.net/11250/2448348
dc.description.abstract	The main focus of this thesis is to evaluate the use of ensemble methods to improve predictive performance when applied to a myocardial infarction (MI) data set from the HUNT study. The data set comes from a prospective case-control study with a 10-year follow-up, in which MI (fatal or non-fatal) was the primary endpoint. The subjects of this study were 200 healthy individuals aged 60-79 from the HUNT2 study. The cases (n_1 = 100) experienced an MI within the 10-year observation period, whereas the controls (n_2 = 100) remained healthy during the follow-up. Several risk factors for MI have been identified in recent years and are used in risk prediction models. The most popular prediction model is the Framingham score. However, about 15-20% of patients experiencing an MI did not score high on any of the traditional risk factors. Recent studies have shown that microRNAs, which are small non-coding RNA molecules, have large potential as diagnostic biomarkers for cardiovascular disease. It is thus interesting to investigate whether microRNAs also have potential as predictive biomarkers for future instances of MI. Logistic regression and tree-based methods are commonly used to predict a binary outcome from observed predictor variables. In recent years, ensemble methods have become increasingly popular. One such method is bagging (bootstrap aggregation). Bagging is performed by resampling a data set many times, fitting a prediction model to each resampled data set, and then combining the predictions of these models into a single prediction. In this thesis we examine how bagging can be applied to classification trees and to logistic regression. We also investigate the closely related ensemble methods random forests and random GLMs, which include only a subset of the predictors in each step of the model fitting.
The predictive performance of six statistical models (a pruned tree, a logistic GLM, a bagged tree, a bagged GLM, a random forest and a random GLM) is evaluated through a simulation study, where we use the area under the curve (AUC) score and the Brier score to assess prediction accuracy. We then fit the six models to the HUNT data set to determine which predictors are relevant for predicting a future MI event. The conclusion is that ensemble methods increase predictive performance, in particular when applied to classification trees. The best predictive power was obtained by fitting a random GLM. Further, we find that microRNAs are highly relevant for predicting MI, and that the predictors BMI, serum triglycerides and non-fasting serum glucose, which are not included in the Framingham risk score, are of high importance.
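The bagging procedure and the two accuracy scores named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the base learner here is a depth-one classification tree (a stump) chosen for brevity, whereas the thesis bags full classification trees and logistic GLMs, and all function names (fit_stump, bagging_fit, bagging_predict, brier_score, auc) are hypothetical. The AUC is computed as the fraction of case-control pairs ranked correctly, which is one standard way to obtain it.

```python
import random

def fit_stump(X, y):
    """Fit a depth-one tree: choose the (feature, threshold) split
    with the smallest squared error of the two leaf means."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            pl = sum(left) / len(left)
            pr = sum(right) / len(right)
            sse = (sum((v - pl) ** 2 for v in left)
                   + sum((v - pr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, pl, pr)
    if best is None:
        # No valid split (all rows identical): predict the overall mean.
        p_all = sum(y) / n
        return (0, float("inf"), p_all, p_all)
    return best[1:]

def predict_stump(stump, row):
    """Return the estimated probability of class 1 for one row."""
    j, t, pl, pr = stump
    return pl if row[j] <= t else pr

def bagging_fit(X, y, n_models=50, seed=0):
    """Fit one stump per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Combine the models by averaging their predicted probabilities."""
    return sum(predict_stump(m, row) for m in models) / len(models)

def brier_score(y, p):
    """Mean squared difference between outcome and predicted probability."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Fraction of (case, control) pairs where the case gets the higher
    predicted probability; ties count as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

A lower Brier score and a higher AUC both indicate better predictions. Averaging probabilities is only one way to combine the resampled models; a majority vote over predicted classes is a common alternative.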
dc.language	eng
dc.publisher	NTNU
dc.subject	Physics and mathematics, Industrial mathematics
dc.title	Using Ensemble Methods to Improve the Performance of Prediction - Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study
dc.type	Master thesis

Associated file(s)

Filename: 12039_FULLTEXT.pdf
Size: 1.401 MB
Format: PDF

Filename: 12039_COVER.pdf
Size: 1.556 MB
Format: PDF

This item appears in the following collection(s)

Institutt for matematiske fag [2527]
