
dc.contributor.advisor: Langaas, Mette
dc.contributor.author: Debik, Julia Barbara
dc.date.accessioned: 2017-07-10T14:00:57Z
dc.date.available: 2017-07-10T14:00:57Z
dc.date.created: 2017-06-12
dc.date.issued: 2017
dc.identifier: ntnudaim:12039
dc.identifier.uri: http://hdl.handle.net/11250/2448348
dc.description.abstract: The main focus of this thesis is to evaluate the use of ensemble methods to improve predictive performance when applied to a myocardial infarction (MI) data set from the HUNT study. The data set comes from a prospective case-control study with a 10-year follow-up, in which MI (fatal or non-fatal) was the primary endpoint. The subjects of this study were 200 healthy individuals aged 60-79 from the HUNT2 study. The cases (n1 = 100) experienced an MI within the 10-year observation period, whereas the controls (n2 = 100) remained healthy during the follow-up. Several risk factors for experiencing an MI have been identified in recent years and are used in risk prediction models. The most popular prediction model is the Framingham score. However, about 15-20% of patients experiencing MI did not score high on any of the traditional risk factors. Recent studies have shown that microRNAs, which are small non-coding RNA molecules, have large potential as diagnostic biomarkers for cardiovascular disease. It is thus interesting to investigate whether microRNAs also have potential as predictive biomarkers for future instances of MI. Logistic regression and tree-based methods are commonly used to predict a binary outcome from observed predictor variables. In recent years, ensemble methods have become increasingly popular. One such method is bagging (bootstrap aggregation). Bagging is performed by resampling a data set many times, fitting a prediction model to each resampled data set and then combining the fitted models into a single new prediction. In this thesis we examine how bagging can be applied to classification trees and to logistic regression. We also investigate the closely related ensemble methods random forests and random GLMs, which include only a subset of predictors in each step of the model fitting.
The predictive performance of six statistical models (a pruned tree, logistic GLM, bagged tree, bagged GLM, random forest and random GLM) is evaluated through a simulation study, where we use the area under the curve (AUC) score and the Brier score to assess prediction accuracy. We then fit the six models to the HUNT data set in order to draw conclusions about which predictors are relevant for predicting a future MI event. The conclusion is that ensemble methods increase the predictive performance, in particular when applied to classification trees. The best predictive power was obtained by fitting a random GLM. Further, we have seen that microRNAs are highly relevant for predicting MI, and that the predictors BMI, serum triglycerides and non-fasting serum glucose, which are not included in the Framingham risk score, are of high importance.
dc.language: eng
dc.publisher: NTNU
dc.subject: Physics and Mathematics, Industrial Mathematics
dc.title: Using Ensemble Methods to Improve the Performance of Prediction - Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study
dc.type: Master thesis
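
The bagging procedure and the two accuracy measures named in the abstract (Brier score and area under the curve) can be illustrated with a minimal sketch. This is not the thesis code: the base learner is a depth-1 classification tree (a stump) chosen for brevity, the data are synthetic, and all function names (`fit_stump`, `bagged_predictor`, `brier_score`, `auc`, `make_data`) are hypothetical names introduced here for illustration only.

```python
import random

def fit_stump(X, y):
    """Fit a depth-1 classification tree (assumed base learner, not the thesis model).

    Tries every observed value of every feature as a threshold and keeps the
    split with the lowest squared-error loss on the training sample.
    """
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            # Each leaf predicts the proportion of class 1 among its samples
            p_left = sum(left) / len(left) if left else 0.5
            p_right = sum(right) / len(right) if right else 0.5
            loss = (sum((v - p_left) ** 2 for v in left)
                    + sum((v - p_right) ** 2 for v in right))
            if best is None or loss < best[0]:
                best = (loss, j, t, p_left, p_right)
    _, j, t, p_left, p_right = best
    return lambda row: p_left if row[j] <= t else p_right

def bagged_predictor(X, y, n_models=25, seed=0):
    """Bagging: fit one model per bootstrap resample and average the predictions."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: sum(m(row) for m in models) / len(models)

def brier_score(y, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - v) ** 2 for v, p in zip(y, probs)) / len(y)

def auc(y, probs):
    """Probability that a random case is ranked above a random control (ties count 1/2)."""
    pos = [p for v, p in zip(y, probs) if v == 1]
    neg = [p for v, p in zip(y, probs) if v == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def make_data(n, seed):
    """Synthetic binary-outcome data (purely illustrative, unrelated to HUNT)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1), rng.gauss(0, 1)]
        X.append(row)
        y.append(1 if row[0] + row[1] + rng.gauss(0, 0.5) > 0 else 0)
    return X, y

X_train, y_train = make_data(100, seed=1)
X_test, y_test = make_data(200, seed=2)
model = bagged_predictor(X_train, y_train)
probs = [model(row) for row in X_test]
print("Brier score:", brier_score(y_test, probs))
print("AUC:", auc(y_test, probs))
```

A lower Brier score and a higher AUC both indicate better predictions. Replacing `fit_stump` with a logistic regression fit would give a bagged GLM in the same framework; restricting each fit to a random subset of predictors would move the sketch toward the random forest and random GLM ideas mentioned in the abstract.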

