Show simple item record

dc.contributor.author: Sabzi Shahrebabaki, Abdolreza
dc.contributor.author: Imran, Ali Shariq
dc.contributor.author: Olfati, Negar
dc.contributor.author: Svendsen, Torbjørn Karl
dc.date.accessioned: 2019-11-29T09:27:07Z
dc.date.available: 2019-11-29T09:27:07Z
dc.date.created: 2019-05-02T13:29:29Z
dc.date.issued: 2019
dc.identifier.citation: Circuits, systems, and signal processing. 2019, 34 (1130), 1-20. [nb_NO]
dc.identifier.issn: 0278-081X
dc.identifier.uri: http://hdl.handle.net/11250/2630983
dc.description.abstract: This paper provides a comprehensive analysis of the effect of speaking rate on frame classification accuracy. Different speaking rates may affect the performance of an automatic speech recognition system, yielding poor recognition accuracy. A model trained on a normal speaking rate is better able to recognize speech at a normal pace but fails to achieve similar performance when tested on slow or fast speaking rates. Our recent study has shown that a drop of almost ten percentage points in classification accuracy is observed when a deep feed-forward network is trained on the normal speaking rate and evaluated on slow and fast speaking rates. In this paper, we extend our work to convolutional neural networks (CNN) to see whether this model can reduce the accuracy gap between different speaking rates. Filter bank energies (FBE) and Mel frequency cepstral coefficients are evaluated on multiple configurations of the CNN, where the networks are trained on the normal speaking rate and evaluated on slow and fast speaking rates. The results are compared to those obtained by a deep neural network. A breakdown of phoneme-level classification results and of the confusion between vowels and consonants is also presented. The experiments show that the CNN architecture, when used with FBE features, performs better on both slow and fast speaking rates. An improvement of nearly 2% for fast and 3% for slow speaking rates is observed. [nb_NO]
dc.language.iso: eng [nb_NO]
dc.publisher: Springer Verlag [nb_NO]
dc.title: A Comparative Study of Deep Learning Techniques on Frame-Level Speech Data Classification [nb_NO]
dc.type: Journal article [nb_NO]
dc.type: Peer reviewed [nb_NO]
dc.description.version: acceptedVersion [nb_NO]
dc.source.pagenumber: 1-20 [nb_NO]
dc.source.volume: 34 [nb_NO]
dc.source.journal: Circuits, systems, and signal processing [nb_NO]
dc.source.issue: 1130 [nb_NO]
dc.identifier.doi: 10.1007/s00034-019-01130-0
dc.identifier.cristin: 1695153
dc.description.localcode: This is a post-peer-review, pre-copyedit version of the article. Locked until 3.5.2020 due to copyright restrictions. [nb_NO]
cristin.unitcode: 194,63,35,0
cristin.unitcode: 194,63,1,0
cristin.unitname: Institutt for elektroniske systemer
cristin.unitname: IE fakultetsadministrasjon
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1
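The abstract evaluates filter bank energy (FBE) features as frame-level input to the classifiers. The record does not give the paper's exact analysis settings, so the following is only a minimal NumPy sketch of log-FBE extraction, assuming a common configuration (16 kHz audio, 25 ms Hamming-windowed frames, 10 ms hop, 512-point FFT, 40 triangular mel filters); the function names and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_fbe(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=40):
    # Slice the signal into overlapping frames: 400 samples = 25 ms,
    # 160-sample hop = 10 ms at 16 kHz.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Apply the mel filterbank; the floor avoids log(0) on silent frames.
    fbe = power @ mel_filterbank(n_filters, n_fft, sr).T
    return np.log(np.maximum(fbe, 1e-10))
```

For one second of 16 kHz audio this yields a (98, 40) matrix of log filter bank energies, one 40-dimensional feature vector per frame, which is the kind of frame-level input a CNN or DNN classifier would consume.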


Associated file(s)


This item appears in the following collection(s)
