Improving Direct Response Modelling through Hyperparameter Optimization of Extreme Gradient Boosting and Random Forests

Schau-Hansen, Hennie

dc.contributor.advisor	Tyssedal, John Sølve
dc.contributor.advisor	Frisvold, Marianne Røe
dc.contributor.author	Schau-Hansen, Hennie
dc.date.accessioned	2022-09-09T17:19:53Z
dc.date.available	2022-09-09T17:19:53Z
dc.date.issued	2022
dc.identifier	no.ntnu:inspera:104646180:36247551
dc.identifier.uri	https://hdl.handle.net/11250/3016986
dc.description.abstract	Response modelling is used in direct marketing to rank customers by their probability of response. This is done to increase the response rate of a campaign and thereby increase revenue. This thesis focuses on a call campaign offering refinancing, carried out by SpareBank1 and directed at customers who qualify for refinancing of consumer loans and credit cards. The main objective of this thesis is to build and optimize models that can predict which customers will respond to the campaign. This is a binary classification task on an imbalanced dataset, where the response is either “yes” or “no”. The dataset is provided by SpareBank1 and contains historical data from previous call campaigns collected from March 2020 to July 2021. In addition, it is important to gain an understanding of what type of customer accepts such a refinancing offer. Extreme Gradient Boosting (XGBoost) and Random Forests were the two machine learning algorithms used to build the predictive models in this thesis. XGBoost was chosen because it is efficient and often outperforms other methods, while Random Forests was chosen because it is a robust and well-established method. The models were evaluated and optimized with emphasis on balanced accuracy and sensitivity, that is, the models' ability to classify the customers who accept the offer. To improve the classification, the hyperparameters of the two methods were optimized. The optimization was first carried out with a screening experiment using Design of Experiments (DoE), followed by further optimization through Response Surface Methodology (RSM). DoE can identify which hyperparameters are most significant and in what configuration. In addition, the hyperparameters were optimized with Bayesian optimization. Combinations of Bayesian optimization, DoE, and RSM were also tested to check the effect of screening and of using a central composite design (CCD) as initial values.
Finally, feature importance before and after optimization was examined. For both methods, the screening experiment identified the most influential hyperparameters as those that directly affected the weighting of the classes. These hyperparameters were chosen for further optimization using RSM. RSM optimized the hyperparameter values and improved the balanced accuracy. The classification of the important positive class was also improved, which led to an increase in sensitivity. Bayesian optimization also improved the classification by increasing the sensitivity and the balanced accuracy. A more stable optimization was achieved with Bayesian optimization in combination with RSM. Models optimized with Bayesian optimization achieved the highest balanced accuracy values. Compared to benchmark results, this led to an improvement of 19% and 16% in the balanced accuracy for XGBoost and Random Forests, respectively. However, the differences between the optimized models were not large. Feature importance before and after optimization did not show large differences, and the variable INTEREST EARNING LENDING AMT was important in the prediction.
dc.description.abstract	Response modelling can be applied in direct marketing to rank customers by the likelihood of response. This is done to increase the response rate of a campaign, and thus increase revenue. This thesis focuses on a call campaign with an offer to refinance, conducted by SpareBank1 and directed at customers eligible for refinancing of consumer loans and credit cards. The main objective of this thesis is to build and optimize models that can predict which customers will respond to the campaign. This is a binary classification task on an imbalanced dataset, where the response is either “yes” or “no”. The dataset is provided by SpareBank1 and contains historical data from previous call campaigns collected from March 2020 to July 2021. Furthermore, it is essential to understand what type of customer accepts the offer to refinance. Extreme Gradient Boosting (XGBoost) and Random Forests were the two machine learning algorithms used to build the predictive models in this thesis. XGBoost was chosen because it is effective and often outperforms other methods, while Random Forests was chosen because it is a well-established method that has been proven to be robust. The models were evaluated and optimized with emphasis on the balanced accuracy and the sensitivity, the latter being the model’s ability to correctly classify the positive responders. To improve the classification, the hyperparameters of the two methods were tuned. First, the tuning was performed with a screening experiment using Design of Experiments (DoE), followed by further optimization using Response Surface Methodology (RSM). DoE can identify which hyperparameters are most significant and in what configuration. Second, the hyperparameters were optimized using Bayesian optimization. Combinations of Bayesian optimization, DoE, and RSM were also tested to check the effects of screening and of applying a central composite design (CCD) as an initial grid. Lastly, feature importance before and after tuning was investigated.
For both methods, the screening experiment identified the most influential hyperparameters as those directly affecting the class weights. These hyperparameters were chosen for further optimization using RSM. RSM successfully optimized the hyperparameter values and improved the balanced accuracy. More importantly, the classification of the important positive class improved, leading to an increase in the sensitivity. Bayesian optimization likewise improved the classification by increasing the balanced accuracy and the sensitivity. A more stable optimization was achieved with Bayesian optimization in combination with RSM. The highest balanced accuracy scores were obtained from models tuned with Bayesian optimization. Compared to benchmark results, this led to an improvement of 19% and 16% in the balanced accuracy for XGBoost and Random Forests, respectively. However, the differences between the optimized models were not large. Tuning did not significantly affect the calculated feature importance, and the variable INTEREST EARNING LENDING AMT proved to be important in the prediction.
dc.language	eng
dc.publisher	NTNU
dc.title	Improving Direct Response Modelling through Hyperparameter Optimization of Extreme Gradient Boosting and Random Forests
dc.type	Master thesis
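
The abstract evaluates the models by balanced accuracy and sensitivity because the dataset is imbalanced. The following is a minimal sketch of those two metrics using their standard definitions (sensitivity = TP / (TP + FN), balanced accuracy = mean of sensitivity and specificity). It is not taken from the thesis: the function names and the toy labels are illustrative assumptions, with 1 standing for a “yes” response and 0 for “no”.

```python
# Illustrative sketch only: standard metric definitions applied to made-up labels.
# None of the names or data below come from the thesis.

def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Share of all customers classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    total = tp + fn + tn + fp
    return (tp + tn) / total if total else 0.0


def sensitivity(y_true, y_pred):
    """Share of actual responders that the model flags as responders."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred):
    """Share of actual non-responders that the model flags as non-responders."""
    _, _, tn, fp = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity, so both classes count equally."""
    return 0.5 * (sensitivity(y_true, y_pred) + specificity(y_true, y_pred))


# Hypothetical toy data: 2 responders among 10 customers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0] * 10                          # never predicts a responder
some_yes = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]     # finds one of the two responders

for name, y_pred in [("always_no", always_no), ("some_yes", some_yes)]:
    print(name,
          "accuracy:", accuracy(y_true, y_pred),
          "sensitivity:", sensitivity(y_true, y_pred),
          "balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

In this toy case both predictions have the same plain accuracy, yet the one that never predicts “yes” has zero sensitivity and a balanced accuracy of only 0.5. That illustrates why these two metrics are a natural choice for an imbalanced response-modelling task.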

Associated file(s)

Filename: no.ntnu:inspera:104646180:3624 ...
Size: 11.28 MB
Format: PDF

This item appears in the following collection(s)

Institutt for matematiske fag [2364]