Show simple item record

dc.contributor.advisor: Svendsen, Torbjørn
dc.contributor.advisor: Johnsen, Magne H.
dc.contributor.advisor: Siniscalchi, Sabato Marco
dc.contributor.author: Shahrebabaki, Abdolreza Sabzi
dc.date.accessioned: 2022-06-22T08:25:06Z
dc.date.available: 2022-06-22T08:25:06Z
dc.date.issued: 2022
dc.identifier.isbn: 978-82-326-6324-8
dc.identifier.issn: 2703-8084
dc.identifier.uri: https://hdl.handle.net/11250/2999962
dc.description.abstract: Over the past decades, advances in neural networks have improved the performance of a vast range of speech processing applications, including the articulatory inversion problem, which is concerned with estimating the vocal tract shape, in the form of articulator positions, from the uttered speech. In spite of these advances, articulatory inversion still needs to improve before it can be further utilized in other speech applications as a complementary source of information. Articulatory measurements have been employed in various applications such as speech synthesis, computer-aided pronunciation training and automatic speech recognition. Measuring articulator movements requires complex procedures and systems, which makes it impossible to perform measurements outside the laboratory. Only a few databases, each containing a limited number of speakers, offer synchronously recorded articulator movements and uttered speech. This thesis explores the articulatory inversion problem in scenarios where there are mismatches between training data and test data. These mismatches include speaker mismatches within a database or across databases, mismatches in speaking rate, and mismatches in the environment, where noisy data are synthetically created by adding various noises. The first part of the thesis focuses on incorporating linguistic information, such as forced-aligned phonemic features, attribute features based on manner and place of articulation, and their combination with the acoustic features. Furthermore, new architectures are developed based on acoustic landmark theory, which states that abrupt changes in the speech spectrum are the result of changes in the articulators' configuration. Transfer learning of articulatory information based on phonemic features is then utilized to generate articulatory trajectories for the TIMIT database.
Phone recognition experiments provide evidence of the effectiveness of the proposed transfer learning approach. Furthermore, a novel architecture is proposed to estimate articulatory trajectories directly from the time-domain speech signal by utilizing 1D convolutional filters. The 1D convolutional layers extract features, and decimation operators match the sampling rate of the acoustic signal to that of the articulatory measurements. The data-driven features extracted by the 1D convolutional layers are better able to capture and compensate for the variability caused by mismatches in speaking rate. The second part of the thesis focuses on articulatory inversion performance in noisy conditions, evaluated on synthetically produced noisy acoustic data. Deep-neural-network-based speech enhancement applied prior to an articulatory inversion system trained on clean data slightly outperforms an articulatory inversion system trained on multi-condition noisy data. We propose a joint network that performs both speech enhancement and articulatory inversion. The articulatory inversion part of the joint model outperforms the model trained on multi-condition noisy data in the low signal-to-noise-ratio range, namely 0, 5 and 10 dB. The estimated articulatory data are further used to train a word recognition system on clean acoustic and articulatory features for the WSJ dataset. In the noisy condition, the word error rate of the recognition system trained on both acoustic and articulatory data is significantly lower than that of the model trained only on the clean acoustic data.
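The rate-matching idea behind the raw-waveform architecture can be illustrated with a minimal NumPy sketch: a cascade of 1D filtering and decimation stages reduces a 16 kHz speech signal to the (roughly 100 Hz) frame rate of articulatory measurements. The filter shapes and decimation factors below are illustrative assumptions, not the thesis's actual configuration, and a simple moving-average kernel stands in for the learned convolutional filters.

```python
import numpy as np

def conv1d(x, kernel):
    """Filter the signal with one (here fixed, in practice learned) 1D kernel."""
    return np.convolve(x, kernel, mode="same")

def decimate(x, factor):
    """Keep every `factor`-th sample, reducing the sampling rate by `factor`."""
    return x[::factor]

fs_audio = 16_000                    # raw speech sampling rate (Hz)
fs_artic = 100                       # articulatory measurement rate (Hz), assumed
# Total rate reduction 16000 / 100 = 160, factored into three stages (4 * 4 * 10).
stages = [(np.ones(8) / 8, 4), (np.ones(8) / 8, 4), (np.ones(8) / 8, 10)]

x = np.random.randn(fs_audio)        # one second of stand-in "speech"
for kernel, factor in stages:
    x = decimate(conv1d(x, kernel), factor)

# After all stages, one second of audio yields fs_artic output samples,
# so the network's output aligns frame-by-frame with the articulatory data.
assert len(x) == fs_artic
```

Factoring the decimation across several stages, rather than decimating by 160 at once, lets each convolutional layer operate at a progressively lower rate, which is the usual design choice in such cascades.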
dc.language.iso: eng
dc.publisher: NTNU
dc.relation.ispartofseries: Doctoral theses at NTNU;2022:198
dc.relation.haspart: Paper A: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Imran, Ali Shariq; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion. Interspeech 2019, pp. 3775-3779. https://doi.org/10.21437/Interspeech.2019-2526
dc.relation.haspart: Paper B: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals. Interspeech 2020, pp. 2882-2886. https://doi.org/10.21437/Interspeech.2020-1140
dc.relation.haspart: Paper C: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn. Transfer learning of articulatory information through phone information. Interspeech 2020, pp. 2877-2881. https://doi.org/10.21437/Interspeech.2020-1139
dc.relation.haspart: Paper D: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. A DNN Based Speech Enhancement Approach to Noise Robust Acoustic-to-Articulatory Inversion. In: Proceedings 2021 IEEE International Symposium on Circuits and Systems. https://doi.org/10.1109/ISCAS51556.2021.9401290
dc.relation.haspart: Paper E: Sabzi Shahrebabaki, Abdolreza; Salvi, Giampiero; Svendsen, Torbjørn Karl; Siniscalchi, Sabato Marco. Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 2021, Volume 30. CC BY
dc.relation.haspart: Paper F: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. Interspeech 2021. https://doi.org/10.21437/Interspeech.2021-1429
dc.title: Articulatory Inversion for Speech Technology Applications
dc.type: Doctoral thesis


Associated file(s)


This item appears in the following collection(s)
