Sentiment Analysis of Customer Emails Using BERT

Langli, Karoline Lillevestre

dc.contributor.advisor	Gulla, Jon Atle
dc.contributor.advisor	Kille, Benjamin
dc.contributor.author	Langli, Karoline Lillevestre
dc.date.accessioned	2023-10-18T17:20:32Z
dc.date.available	2023-10-18T17:20:32Z
dc.date.issued	2023
dc.identifier	no.ntnu:inspera:142737689:35330989
dc.identifier.uri	https://hdl.handle.net/11250/3097373
dc.description.abstract	In recent years, language models have become very popular, and they are currently used to solve various tasks in natural language processing (NLP). Many companies store large amounts of unstructured text that is still not processed automatically. This thesis therefore investigates whether language models can be used to automatically process emails from customers sent to Sparebank 1 SMN. The thesis uses the four BERT models NB-BERT, NorBERT, mBERT, and DistilmBERT to perform sentiment analysis of Norwegian text. The BERT models were also compared with smaller models that used TF-IDF followed by SVM, logistic regression, or KNN. Two datasets were used: the publicly available NoReC dataset, consisting of reviews, and a dataset of emails sent by customers to Sparebank 1 SMN. Active learning was carried out to classify the emails. This worked to some extent, but there is still room for improvement. The models were evaluated using the F1-score and precision obtained on the negative class, and by inspecting the generated confusion matrices. NB-BERT was the best model on the NoReC dataset, while NorBERT performed best on the emails. The model that used TF-IDF achieved a higher score than mBERT and DistilmBERT on the NoReC dataset, and than mBERT, DistilmBERT, and NB-BERT on the emails. In addition, it was shown that prediction is much faster with the model that used TF-IDF than with BERT. The code written for this master's thesis is available on GitHub.
dc.description.abstract	In recent years, language models have become popular, and they are currently used to solve various natural language processing (NLP) tasks. Many companies store large quantities of unstructured text that is still not processed automatically. This thesis therefore examines whether language models can automatically process customer emails sent to Sparebank 1 SMN. It uses four BERT models, namely NB-BERT, NorBERT, mBERT, and DistilmBERT, to perform sentiment analysis on Norwegian text. The BERT models were also compared to baseline models using TF-IDF combined with SVM, logistic regression, or k-nearest neighbors (KNN). Two datasets were used: the publicly available NoReC dataset, which contains reviews, and a dataset of customer emails provided by Sparebank 1 SMN. Active learning was performed on the emails to create sentiment labels. This worked to some extent, but there is still room for improvement. The models were evaluated using the F1-score and precision of the negative class and by examining the resulting confusion matrices. NB-BERT achieved the highest score on the NoReC dataset, and NorBERT performed best on the emails. The baseline outperformed mBERT and DistilmBERT on the NoReC dataset, and mBERT, DistilmBERT, and NB-BERT on the email dataset. Additionally, prediction was shown to be notably slower with the BERT models than with the baseline. The code written during this thesis is available on GitHub.
dc.language	eng
dc.publisher	NTNU
dc.title	Sentiment Analysis of Customer Emails Using BERT
dc.type	Master thesis
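
The abstract states that models were evaluated with the F1-score and precision of the negative class, together with confusion matrices. As a minimal sketch of what those quantities mean (not the thesis code), the Python below computes them from toy labels. The three-class label set, the function names `confusion_matrix` and `class_precision_recall_f1`, and the example data are all assumptions made for illustration; none of them come from the thesis.

```python
from collections import Counter

# Assumed label set for illustration only; the record does not list the classes.
LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def class_precision_recall_f1(y_true, y_pred, target):
    """Precision, recall and F1 for a single class (here: the negative class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for this sketch (not results from the thesis).
y_true = ["negative", "negative", "negative", "neutral",
          "neutral", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "neutral", "negative",
          "neutral", "positive", "positive", "neutral"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
neg_precision, neg_recall, neg_f1 = class_precision_recall_f1(
    y_true, y_pred, "negative"
)

for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print(f"negative precision={neg_precision:.3f} F1={neg_f1:.3f}")
```

Scoring only the negative class, rather than averaging over all classes, puts the focus on how reliably a model flags dissatisfied customers, which appears to be the motivation behind the metric choice described in the abstract.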

Files in this item

Name: no.ntnu:inspera:142737689:3533 ...
Size: 9.251Mb
Format: PDF

This item appears in the following Collection(s)

Institutt for datateknologi og informatikk [6569]
