Exploring pretrained word embeddings for multi-class text classification in Norwegian

Bjørndal, Sigrid Lofthus

dc.contributor.advisor	Gulla, Jon Atle
dc.contributor.advisor	Özgöbek, Özlem
dc.contributor.advisor	Marco, Cristina
dc.contributor.author	Bjørndal, Sigrid Lofthus
dc.date.accessioned	2019-10-31T15:17:07Z
dc.date.issued	2019
dc.identifier	no.ntnu:inspera:36079153:25765070
dc.identifier.uri	http://hdl.handle.net/11250/2625828
dc.description.abstract	Several pretrained word embeddings have recently been published, and a new development in the field is the use of contextualized word embeddings. This thesis explores a pretrained, multilingual version of the BERT model, in addition to pretrained word2vec embeddings for Norwegian. BERT embeddings are combined with a feed-forward neural network (FFNN), and word2vec embeddings are combined with an FFNN model and an LSTM model. Naïve Bayes is used as a baseline model. The task on which the embeddings are evaluated is hierarchical text classification of short Norwegian texts, consisting of messages from a bank's customer support chat. In addition, the multilingual aspect of BERT is tested by training an FFNN model exclusively on Norwegian data and then testing it on corresponding texts in English and Finnish. The main results are that BERT embeddings are slightly better for this task than word2vec embeddings, and that the performance of the latter depends on model choice and the dimensionality of the embeddings. In the multilingual test, BERT shows some weak transfer ability when tested on English data, but almost none when tested on Finnish.
dc.description.abstract	Several pretrained word embeddings have recently been published, and a recent development in the field of NLP is the use of contextualized word embeddings. This thesis explores the use of a pretrained, multilingual version of the BERT model, as well as pretrained word2vec embeddings for Norwegian. The BERT embeddings are combined with a simple feed-forward neural network (FFNN), and the word2vec embeddings with both an FFNN and an LSTM model. Naïve Bayes is used as a baseline model. The task on which they are evaluated is hierarchical text classification of short Norwegian texts, specifically messages from the customer support chat of a Nordic bank. Additionally, the multilingual aspect of BERT is tested by training an FFNN model exclusively on Norwegian data and subsequently testing the model on similar English and Finnish texts. The main findings are that the BERT embeddings perform slightly better than the word2vec embeddings for the task, and that the performance of the latter is highly dependent on model choice and the dimensionality of the embeddings. BERT was also able to correctly classify some English examples, but made almost no correct predictions on Finnish examples.
dc.language	eng
dc.publisher	NTNU
dc.title	Exploring pretrained word embeddings for multi-class text classification in Norwegian
dc.type	Master thesis

Files in this item

Name: no.ntnu:inspera:2507502.pdf
Size: 7.804Mb
Format: PDF

This item appears in the following Collection(s)

Institutt for datateknologi og informatikk [6778]