  • Fakultet for informasjonsteknologi og elektroteknikk (IE)
  • Institutt for elektroniske systemer

Articulatory Inversion for Speech Technology Applications

Shahrebabaki, Abdolreza Sabzi
Doctoral thesis
URI
https://hdl.handle.net/11250/2999962
Date
2022
Collections
  • Institutt for elektroniske systemer [1836]
Abstract
Over the past decades, advances in neural networks have improved the performance of a wide range of speech processing applications, including the articulatory inversion problem, which is concerned with estimating the vocal tract shape, in the form of articulator positions, from the uttered speech. Despite these improvements, articulatory inversion still needs further progress before it can be used in other speech applications as a complementary source of information. Articulatory measurements have been employed in various applications such as speech synthesis, computer-aided pronunciation training, and automatic speech recognition. Measuring articulator movements requires complex procedures and equipment, which makes it impractical to perform measurements outside of laboratories. Consequently, only a few databases, each with a limited number of speakers, contain synchronously recorded articulator movements and uttered speech.

This thesis explores the articulatory inversion problem in scenarios where there are mismatches between training data and test data. These mismatches include speaker mismatches within a database or across databases, mismatches in speaking rate, and mismatches in the acoustic environment, where noisy data are synthetically created by adding various types of noise.

The first part of the thesis focuses on incorporating linguistic information, such as forced-aligned phonemic features, attribute features based on manner and place of articulation, and their combinations with the acoustic features. Furthermore, new architectures are developed based on acoustic landmark theory, which states that abrupt changes in the speech spectrum result from changes in the articulators' configuration. Transfer learning of articulatory information based on phonemic features is then used to generate articulatory trajectories for the TIMIT database, and phone recognition experiments provide evidence of the effectiveness of the proposed transfer learning approach. Finally, a novel architecture is proposed to estimate articulatory trajectories directly from the time-domain speech signal using 1D convolutional filters: the 1D convolutional layers extract features, while decimation operators match the sampling rate of the acoustic signal to that of the articulatory measurements. The data-driven features extracted by the 1D convolutional layers are better able to capture and compensate for the variability caused by mismatches in speaking rate.
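The filtering-and-decimation idea can be sketched as stacked strided 1D convolutions that step a 16 kHz waveform down to a 100 Hz trajectory rate (total decimation factor 4 × 4 × 10 = 160). This is a minimal illustrative sketch only: the channel counts, kernel widths, strides, and random weights below are assumptions, not the architecture or parameters used in the thesis.

```python
import numpy as np

def conv1d(x, w, stride, pad):
    """Strided 1D convolution. x: (in_ch, T), w: (out_ch, in_ch, k)."""
    in_ch, T = x.shape
    out_ch, _, k = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T_out = (T + 2 * pad - k) // stride + 1
    y = np.empty((out_ch, T_out))
    for t in range(T_out):
        seg = xp[:, t * stride : t * stride + k]          # (in_ch, k) window
        y[:, t] = np.tensordot(w, seg, axes=([1, 2], [0, 1]))
    return y

rng = np.random.default_rng(0)
wave = rng.standard_normal((1, 16000))                    # 1 s of 16 kHz audio

# Each strided layer both filters and decimates the signal.
h = np.maximum(conv1d(wave, rng.standard_normal((8, 1, 9)), stride=4, pad=4), 0)   # 16 kHz -> 4 kHz
h = np.maximum(conv1d(h, rng.standard_normal((8, 8, 9)), stride=4, pad=4), 0)      # 4 kHz -> 1 kHz
traj = conv1d(h, rng.standard_normal((12, 8, 11)), stride=10, pad=5)               # 1 kHz -> 100 Hz

print(traj.shape)  # (12, 100): 12 articulator channels at 100 Hz
```

One second of audio (16 000 samples) thus maps to 100 output frames, matching a 100 Hz articulatory measurement rate without any separate resampling step.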

The second part of the thesis focuses on articulatory inversion performance in noisy conditions, evaluated on synthetically produced noisy acoustic data. Applying speech enhancement based on deep neural networks before an articulatory inversion system trained on clean data slightly outperforms an articulatory inversion system trained on multi-condition noisy data. We therefore propose a joint network that performs both speech enhancement and articulatory inversion. The articulatory inversion part of the joint model outperforms the model trained on multi-condition noisy data in the low signal-to-noise-ratio range, namely 0, 5 and 10 dB. The estimated articulatory data are further used to train a word recognition system on clean acoustic and articulatory features for the WSJ dataset. In the noisy condition, the word error rate of the recognition system trained on both acoustic and articulatory data is significantly lower than that of the model trained only on the clean acoustic data.
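The joint optimization described above can be sketched as two chained models trained under a single objective: the enhancement front end maps noisy speech toward clean speech, its output feeds the inversion back end, and the combined loss sums both errors. The weighted-sum form, the weight `lam`, and the identity stand-in networks below are illustrative assumptions, not the thesis's exact objective.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared error between two equally shaped arrays."""
    return float(np.mean((pred - target) ** 2))

def joint_loss(noisy, clean, articulatory, enhance, invert, lam=0.5):
    """Combined objective for a chained enhancement + inversion model.

    enhance: noisy speech features -> enhanced speech features
    invert:  enhanced features -> articulatory trajectories
    Both errors contribute, so gradients from the inversion task also
    shape the enhancement front end.
    """
    enhanced = enhance(noisy)             # speech-enhancement front end
    estimated = invert(enhanced)          # articulatory-inversion back end
    return mse(enhanced, clean) + lam * mse(estimated, articulatory)

# Toy check with identity "networks" and matching targets: loss is 0.0.
x = np.ones((10, 4))
assert joint_loss(x, x, x, enhance=lambda s: s, invert=lambda s: s) == 0.0
```

The design point is that the inversion error back-propagates through the enhancement network, so the front end learns to preserve articulation-relevant detail rather than optimizing enhancement quality in isolation.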
Has parts
Paper A: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Imran, Ali Shariq; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion. Interspeech 2019, pp. 3775-3779. https://doi.org/10.21437/Interspeech.2019-2526

Paper B: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals. Interspeech 2020, pp. 2882-2886. https://doi.org/10.21437/Interspeech.2020-1140

Paper C: Sabzi Shahrebabaki, Abdolreza; Olfati, Negar; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. Transfer learning of articulatory information through phone information. Interspeech 2020, pp. 2877-2881. https://doi.org/10.21437/Interspeech.2020-1139

Paper D: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Salvi, Giampiero; Svendsen, Torbjørn Karl. A DNN Based Speech Enhancement Approach to Noise Robust Acoustic-to-Articulatory Inversion. In: Proceedings of the 2021 IEEE International Symposium on Circuits and Systems. https://doi.org/10.1109/ISCAS51556.2021.9401290

Paper E: Sabzi Shahrebabaki, Abdolreza; Salvi, Giampiero; Svendsen, Torbjørn Karl; Siniscalchi, Sabato Marco. Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 2021, Vol. 30. (CC BY)

Paper F: Sabzi Shahrebabaki, Abdolreza; Siniscalchi, Sabato Marco; Svendsen, Torbjørn Karl. Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. Interspeech 2021. https://doi.org/10.21437/Interspeech.2021-1429
Publisher
NTNU
Series
Doctoral theses at NTNU;2022:198

Contact Us | Send Feedback

Privacy policy
DSpace software copyright © 2002-2019 DuraSpace

Service from Unit
 

 
