Statistical Machine Learning on Covid-19 Time Series using Econometrics

Brataas, Eivind Hagemann

dc.contributor.advisor	Taraldsen, Gunnar
dc.contributor.advisor	Voigt, André
dc.contributor.author	Brataas, Eivind Hagemann
dc.date.accessioned	2022-09-22T17:20:14Z
dc.date.available	2022-09-22T17:20:14Z
dc.date.issued	2022
dc.identifier	no.ntnu:inspera:104766761:38566190
dc.identifier.uri	https://hdl.handle.net/11250/3020774
dc.description.abstract	In this thesis, three time series models are compared on their ability to predict future new cases of Covid-19. The first is a machine learning model of the CNN-LSTM type. When it was published in 2021, it was the best model for forecasting on the global data set. Some months earlier, Taraldsen published two so-called "toy models" for predicting new cases in Norway. Both are SARIMA models, but one assumes Gaussian white noise, while the other assumes that the noise variance is conditional on the past, modeled with GARCH noise. Both models delivered good results and come with prediction intervals. This is not the case for the CNN-LSTM model. They are also much less computationally demanding, since they have only three and four parameters. The CNN-LSTM model, on the other hand, has more than 300,000 parameters to fit. None of the models has been tested on data sets other than those used in their respective articles. This thesis compares the models' ability to predict new cases on both the global and the Norwegian data sets. Since the articles were published, much more data has become available. This makes it possible to compare the models on other parts of the data sets than those originally used. In addition, experiments with reduced data are carried out on all of these parts. Since the machine learning model has no way of computing a prediction interval, an attempt is made to estimate the uncertainty of its forecast using a parametric bootstrap. Although the CNN-LSTM model often forecasts well, it is generally not better than the two SARIMA models. The thesis concludes that the SARIMA model with GARCH noise should be used at the beginning of the data sets, while the SARIMA model with Gaussian white noise should be used otherwise. The main explanatory factor is the varying variance of the two time series.
dc.description.abstract	This thesis compares three models for forecasting daily new cases of Covid-19. The first is a machine learning model of the CNN-LSTM type, which was the state-of-the-art model for this task on the global daily new cases data set during the summer of 2021. Some months earlier, Taraldsen published two so-called toy models for forecasting the daily new cases in Norway. Both are SARIMA models, but one assumes Gaussian white noise, while the other assumes that the noise is conditionally heteroscedastic and models it with a GARCH model. Both gave accurate predictions and come with a prediction interval, unlike the CNN-LSTM model. They are also much less computationally intensive, with only three and four parameters, respectively, whereas the CNN-LSTM model must fit more than 300,000 parameters. None of the models had yet been applied to Covid-19 data sets other than those used in their respective articles. This thesis compares the performance of the three models on the global time series data as well as the Norwegian data. Much more data has become available since the articles were published. This makes it possible to compare the models on other partitions of both data sets, and to experiment with reduced sample sizes across all these partitions. Finally, a parametric bootstrap experiment was conducted to estimate the uncertainty in the forecast from the CNN-LSTM model. While the CNN-LSTM model achieved some accurate forecasts, the results of this thesis suggest that the SARIMA model with GARCH noise may be the model of choice for the earliest parts of the data sets, while the SARIMA model with Gaussian white noise would be the best choice on the rest of the data sets, where its predictions are more accurate and have about the same spread. These results are mostly explained by the varying heteroscedasticity of the two time series.
dc.language	eng
dc.publisher	NTNU
dc.title	Statistical Machine Learning on Covid-19 Time Series using Econometrics
dc.type	Master thesis

Files in this item

Name: no.ntnu:inspera:104766761:3856 ...
Size: 14.75Mb
Format: PDF

Name: no.ntnu:inspera:104766761:3856 ...
Size: 53.13Mb
Format: application/zip

This item appears in the following Collection(s)

Institutt for matematiske fag [2451]