Show simple item record

dc.contributor.advisor: Svendsen, Torbjørn
dc.contributor.advisor: Johnsen, Magne H.
dc.contributor.advisor: Siniscalchi, Sabato Marco
dc.contributor.author: Shahrebabaki, Abdolreza Sabzi
dc.date.accessioned: 2022-06-22T08:25:06Z
dc.date.available: 2022-06-22T08:25:06Z
dc.date.issued: 2022
dc.identifier.isbn: 978-82-326-6324-8
dc.identifier.issn: 2703-8084
dc.identifier.uri: https://hdl.handle.net/11250/2999962
dc.description.abstract: Over the past decades, advances in neural networks have improved the performance of a vast range of speech processing applications, including the articulatory inversion problem, which is concerned with estimating the vocal tract shape, in the form of articulator positions, from the uttered speech. In spite of these advances, articulatory inversion still needs to improve before it can be further utilized in other speech applications as a complementary source of information. Articulatory measurements have been employed in various applications such as speech synthesis, computer-aided pronunciation training and automatic speech recognition. Measuring articulator movements requires complex procedures and systems, which makes it impossible to perform measurements outside the laboratory. Only a few databases, each containing a limited number of speakers, offer synchronously recorded articulator movements and uttered speech. This thesis explores the articulatory inversion problem in scenarios where there are mismatches between training data and test data. These mismatches include speaker mismatches within a database or across databases, mismatches in speaking rate, and mismatches in the environment, where noisy data are synthetically created by adding various noises. The first part of the thesis focuses on incorporating linguistic information, such as forced-aligned phonemic features, attribute features based on manner and place of articulation, and their combination with the acoustic features. Furthermore, new architectures are developed based on acoustic landmark theory, which states that abrupt changes in the speech spectrum are the result of changes in the articulators' configuration. Transfer learning of articulatory information based on phonemic features is then utilized to generate articulatory trajectories for the TIMIT database.
Phone recognition experiments provide evidence of the effectiveness of the proposed transfer learning approach. Furthermore, a novel architecture is proposed to estimate articulatory trajectories directly from the time-domain speech signal by utilizing 1D convolutional filters. The 1D convolutional layers extract features, and decimation operators match the sampling rate of the acoustic signal to that of the articulatory measurements. The data-driven features extracted by the 1D convolutional layers are better able to capture and compensate for the variability caused by mismatches in speaking rate. The second part of the thesis focuses on articulatory inversion performance in noisy conditions, evaluated on synthetically produced noisy acoustic data. Deep-neural-network-based speech enhancement applied prior to an articulatory inversion system trained on clean data slightly outperforms an articulatory inversion system trained on multi-condition noisy data. We propose a joint network that performs both speech enhancement and articulatory inversion. The articulatory inversion part of the joint model outperforms the model trained on multi-condition noisy data in the low signal-to-noise-ratio range, namely 0, 5 and 10 dB. The estimated articulatory data are further used to train a word recognition system on clean acoustic and articulatory features for the WSJ dataset. In the noisy condition, the word error rate of the recognition system trained on both acoustic and articulatory data is significantly lower than that of the model trained only on the clean acoustic data.
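The rate-matching idea behind the raw-waveform architecture can be illustrated with a minimal NumPy sketch: a cascade of 1D filtering and decimation stages reduces a 16 kHz speech signal to the (roughly 100 Hz) frame rate of articulatory measurements. The filter shapes and decimation factors below are illustrative assumptions, not the thesis's actual configuration, and a simple moving-average kernel stands in for the learned convolutional filters.

```python
import numpy as np

def conv1d(x, kernel):
    """Filter the signal with one (here fixed, in practice learned) 1D kernel."""
    return np.convolve(x, kernel, mode="same")

def decimate(x, factor):
    """Keep every `factor`-th sample, reducing the sampling rate by `factor`."""
    return x[::factor]

fs_audio = 16_000                    # raw speech sampling rate (Hz)
fs_artic = 100                       # articulatory measurement rate (Hz), assumed
# Total rate reduction 16000 / 100 = 160, factored into three stages (4 * 4 * 10).
stages = [(np.ones(8) / 8, 4), (np.ones(8) / 8, 4), (np.ones(8) / 8, 10)]

x = np.random.randn(fs_audio)        # one second of stand-in "speech"
for kernel, factor in stages:
    x = decimate(conv1d(x, kernel), factor)

# After all stages, one second of audio yields fs_artic output samples,
# so the network's output aligns frame-by-frame with the articulatory data.
assert len(x) == fs_artic
```

Factoring the decimation across several stages, rather than decimating by 160 at once, lets each convolutional layer operate at a progressively lower rate, which is the usual design choice in such cascades.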
dc.language.iso: eng
dc.publisher: NTNU
dc.relation.ispartofseries: Doctoral theses at NTNU;2022:198
dc.relation.haspart: Paper A: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Imran, Ali Shariq; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion. Interspeech 2019, pp. 3775-3779. https://doi.org/10.21437/Interspeech.2019-2526
dc.relation.haspart: Paper B: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals. Interspeech 2020, pp. 2882-2886. https://doi.org/10.21437/Interspeech.2020-1140
dc.relation.haspart: Paper C: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn. Transfer learning of articulatory information through phone information. Interspeech 2020, pp. 2877-2881. https://doi.org/10.21437/Interspeech.2020-1139
dc.relation.haspart: Paper D: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. A DNN Based Speech Enhancement Approach to Noise Robust Acoustic-to-Articulatory Inversion. In: Proceedings 2021 IEEE International Symposium on Circuits and Systems. https://doi.org/10.1109/ISCAS51556.2021.9401290
dc.relation.haspart: Paper E: Sabzi Shahrebabaki, Abdolreza; Salvi, Giampiero; Svendsen, Torbjørn Karl; Siniscalchi, Sabato Marco. Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 2021, Volume 30. CC BY
dc.relation.haspart: Paper F: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. Interspeech 2021. https://doi.org/10.21437/Interspeech.2021-1429
dc.title: Articulatory Inversion for Speech Technology Applications
dc.type: Doctoral thesis


Associated file(s)


This item appears in the following collection(s)
