Sales prediction in online banking
This master's thesis explores how machine learning methods can be applied to predict which customers are likely to purchase a credit card in Sparebank 1 SMN. The sales prediction problem has many similarities with customer churn prediction. We examine the current literature on both problems within the banking domain and adapt several techniques to our project. The experiment follows an exploratory, result-driven approach with the primary goal of answering three research questions. We develop two machine learning models, one from event logs of interactions with the bank's online services and one from customers' personal attributes. We define two pipelines, one for each dataset, and in both we evaluate multiple classification algorithms. The first pipeline is exploratory in nature, as little research has examined how sequential event data in the form of customer timelines can be used to train a classification model. The second pipeline is based on a traditional static customer attributes dataset of the kind commonly seen in state-of-the-art research. We apply various preprocessing and data aggregation techniques to optimise the datasets for further analysis. By applying sampling and feature selection techniques, we measure their effect on model performance, in terms of how well the models identify likely credit card purchasers while limiting the number of incorrectly predicted purchasers. After finalising each pipeline, we examine whether a combination of the models produces better results than either model in isolation. Finally, we attempt to uncover customer segments that are likely to produce high-confidence predictions. Our main findings show that the Random Forest algorithm achieves the highest performance for both datasets. The customer event timelines produced a higher-performing model than the static customer attributes in terms of identifying likely credit card purchasers.
The combination of the two models identifies slightly fewer purchasers than either model in isolation, but greatly reduces the number of incorrectly predicted purchasers. Furthermore, by using sampling techniques to balance the proportion of purchasers to non-purchasers in the datasets, we are able to control the models' ratio between correctly and incorrectly identified purchasers.
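
The trade-off described for the combined model can be illustrated with a small sketch. The abstract does not state how the two models are combined, so the rule below (flag a customer only when both models predict a purchase) is an assumption made for illustration. The labels and predictions are toy values, not thesis data, and the names `precision_recall`, `combine_and`, `pred_events`, and `pred_static` are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means purchaser."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def combine_and(pred_a, pred_b):
    """Assumed combination rule: predict a purchase only if both models agree."""
    return [int(a == 1 and b == 1) for a, b in zip(pred_a, pred_b)]


# Toy data for illustration only: 4 purchasers, 6 non-purchasers.
y_true      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_events = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical event-timeline model
pred_static = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]  # hypothetical static-attributes model

combined = combine_and(pred_events, pred_static)

for name, pred in [("events", pred_events),
                   ("static", pred_static),
                   ("combined", combined)]:
    precision, recall = precision_recall(y_true, pred)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case each individual model finds three of the four purchasers with two false positives, while the combined rule finds two purchasers with one false positive: recall drops and precision rises, which is the direction of the effect reported above.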