Modeling the Interpretability of an End-to-End Automatic Speech Recognition System Adapted to Norwegian Speech

Lunde, Solveig Reppen

dc.contributor.advisor	Salvi, Giampiero
dc.contributor.advisor	Ortiz, Pablo
dc.contributor.author	Lunde, Solveig Reppen
dc.date.accessioned	2022-09-09T17:19:22Z
dc.date.available	2022-09-09T17:19:22Z
dc.date.issued	2022
dc.identifier	no.ntnu:inspera:104140281:37119015
dc.identifier.uri	https://hdl.handle.net/11250/3016970
dc.description.abstract	The purpose of this work was to model the interpretability of an automatic speech recognition system trained on Norwegian speech. The system is an end-to-end deep neural network that takes speech data as input and is trained to output orthographic text. Through phoneme and grapheme classification, we analyzed the phonetic and orthographic representation in the different layers of the deep network. Our results show that the phonetic representation improves through the lowest layers before degrading in the top layers of the system. The orthographic representation improves gradually the higher up in the system we go, with a considerable improvement in the last layer. This indicates that the top layers are more focused on graphemes, which is plausible since the system is trained to output orthographic text. Our results indicate that the system implicitly learns a phonetic representation, even though this is not learned directly through the training method, which focuses on the orthographic representation. To analyze the system's functionality further, we examined whether the information stored in the model is dialect-dependent by testing the system on spontaneous speech from twelve different dialect groups. The dialects from the southeastern areas of Norway gave the lowest error rates, while the dialects from Western and Central Norway gave the highest error rates. The results appear to be clearly related to dialectal variation, as the dialects with the lowest error rates are those that share the most features with Bokmål, while those with the highest error rates are those that differ most from Bokmål. This was verified by analyzing certain function words and verbs that are used frequently in Norwegian and that vary considerably between dialects. The results show that the system achieves a low error rate when the words are pronounced similarly to the written Bokmål form, but has considerable difficulty recognizing the word for dialect variants that deviate from the Bokmål form.
This is reasonable since the system is fine-tuned on speech data from read-aloud Bokmål text. To improve speech recognition systems for Norwegian dialects, we recommend that the training data consist of more spontaneous speech containing more dialect expressions, and that the language models used for decoding be trained on a more balanced amount of Nynorsk and Bokmål text.
dc.description.abstract	This project aims to model the interpretability of an automatic speech recognition (ASR) system adapted to Norwegian speech. The ASR system is an end-to-end (E2E) deep neural network that takes speech signals as input and is trained to output orthographic text. By performing phoneme and grapheme classification, we investigated how phonetic and orthographic information is encoded in the model layers. Our results show that the phonetic representation improves through the lower layers but degrades in the higher layers. The orthographic representation improves gradually through the higher layers, with a considerable improvement in the last layer. This indicates that the last layers are more geared toward graphemes, which is reasonable since the model is trained to output orthographic text. Our results indicate that the model learns a phonetic representation implicitly, even though nothing in the training method forces it to do so. To analyze the model's interpretability further, we investigated whether the information encoded in the model is dialect-dependent by testing the ASR model on spontaneous speech from twelve different dialect groups. The dialects from the southeastern parts of Norway achieved the lowest error rates (and thus the best results), while the dialects from the central and western parts of Norway gave the highest error rates. The results appear to correspond clearly with dialectal variation: the best-recognized dialects are those closest to Bokmål, while the worst-recognized are those most distinct from Bokmål. This was verified by analyzing the recognition of some function words and verbs that have distinctive dialectal forms and occur frequently in Norwegian.
The results show that the model achieves a low error rate when the words are pronounced similarly to the Bokmål form, while it struggles to transcribe dialectal forms that deviate clearly from the Bokmål form. We find this reasonable since the ASR model is fine-tuned on read-aloud Bokmål text. To improve ASR systems for Norwegian dialects, we propose including more training data from spontaneous speech, which contains more dialect-specific words, and using a more balanced amount of Nynorsk and Bokmål data to train the language model.
dc.language	eng
dc.publisher	NTNU
dc.title	Modeling the Interpretability of an End-to-End Automatic Speech Recognition System Adapted to Norwegian Speech
dc.type	Master thesis

Files in this item

Name: no.ntnu:inspera:104140281:3711 ...
Size: 12.33Mb
Format: PDF

This item appears in the following Collection(s)

Institutt for elektroniske systemer [2289]
