Nearest Neighbor Frame Classification for Articulatory Speech Recognition

Næss, Arild Brandrud

Doctoral thesis

Open

Naess, Arild_PhD.pdf (Locked)

thesis_artASR_ABN.pdf (9.022Mb)

Permanent link

http://hdl.handle.net/11250/2371344

Publication date

2015

Collections

Institutt for elektroniske systemer (Department of Electronic Systems)

Abstract

The paradigm of phone-based hidden Markov models has dominated automatic speech recognition since the early 1980s, and continuous refinement of this approach, combined with the exponential increase in computational power, has led to impressive improvements in the performance of such systems over the past 30 years. Of late, however, these gains have seemed to level off, and there is a growing interest in exploring alternative paradigms. This thesis concerns itself with two of these newer approaches: articulatory speech recognition and exemplar-based methods.

Articulatory speech recognition considers speech not as a sequence of phones, but as an interplay between our articulators: the lips, the tongue, the glottis and the velum. This explicit modeling of the pronunciation process in the statistical framework of the speech recognizer allows for a better model of the pronunciation variation that occurs, particularly in spontaneous speech. "Exemplar-based methods" is a common name for all ways of using the training data directly rather than fitting a global statistical model to it. Most of these methods are based on finding nearest neighbors among the observation vectors.

The main focus of this thesis is on the frame classification of articulatory features by nearest neighbors, and on using this classification to produce input feature vectors for two transcription systems. We consider nearest neighbor-based frame-level classification of a multi-valued set of articulatory features (AFs) inspired by the vocal tract variables of articulatory phonology. This entails that, for each frame of the audio signal, we try to determine the value of each of our eight AFs at the corresponding point in time. Partly for comparison purposes, we do a frame classification of phones in the same way. We explore a variety of linear and nonlinear transformations of the observation vectors, and use the k nearest neighbors in the resulting vector space to do the classification.
Our best results compare favorably to a multilayer perceptron (MLP) baseline. Based on our k-nearest neighbor (k-NN) frame classification, we construct posterior-like feature vectors, which we incorporate into two systems for automatic transcription. The first of these is a conditional random field (CRF) for forced transcription of our set of AFs. The performance of our k-NN-based features in the CRF system is better than that of MLP-based features for most of the AFs, and on par with it for the rest of them. The second transcription system is a standard tandem hidden Markov model for phone recognition, where the k-NN-based features do not do as well as the MLP-based ones. Nevertheless, we argue that the flexibility and transparency of k-NN classification make it a very promising approach for articulatory speech recognition.
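
The idea of k-NN frame classification with posterior-like output can be illustrated with a small sketch. This is not the thesis's implementation. It assumes plain Euclidean distance on untransformed vectors, and it assumes that a "posterior-like" vector is simply the fraction of the k nearest training frames carrying each value of one articulatory feature. The function name `knn_posterior`, the value labels `v0`, `v1`, `v2` and the two-dimensional toy frames are all hypothetical.

```python
from collections import Counter
import math


def knn_posterior(train, query, classes, k=5):
    """Return a posterior-like dict for one frame.

    train   -- list of (vector, label) pairs (labelled training frames)
    query   -- the observation vector of the frame to classify
    classes -- all possible values of the feature being classified
    Each output entry is the fraction of the k nearest training frames
    (Euclidean distance) that carry the corresponding label.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: counts[c] / k for c in classes}


# Hypothetical toy data: three values of a single feature, each
# represented by five training frames clustered around a center.
classes = ["v0", "v1", "v2"]
centers = {"v0": (0.0, 0.0), "v1": (5.0, 5.0), "v2": (0.0, 5.0)}
offsets = [(-0.2, 0.0), (0.2, 0.0), (0.0, -0.2), (0.0, 0.2), (0.0, 0.0)]
train = [
    ((cx + dx, cy + dy), label)
    for label, (cx, cy) in centers.items()
    for dx, dy in offsets
]

# A frame close to one cluster and a frame between two clusters.
clear_frame = knn_posterior(train, (0.1, 0.1), classes, k=5)
mixed_frame = knn_posterior(train, (0.0, 2.6), classes, k=5)

# Hard classification: pick the value with the largest share of neighbors.
decision = max(mixed_frame, key=mixed_frame.get)
```

In a setup like the one described, a function of this kind would be run once per frame and per feature, and the resulting vectors, rather than only the hard decisions, would be passed on as input features to a downstream transcription system.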

Publisher

NTNU

Series

Doctoral thesis at NTNU;2015:24