
Nearest Neighbor Frame Classification for Articulatory Speech Recognition

Næss, Arild Brandrud
Doctoral thesis
Files
Naess, Arild_PhD.pdf (Locked)
thesis_artASR_ABN.pdf (9.022Mb)
URI
http://hdl.handle.net/11250/2371344
Date
2015
Collections
  • Institutt for elektroniske systemer [1865]
Abstract
The paradigm of phone-based hidden Markov models has dominated automatic speech recognition since the early 1980s, and continuous improvements of this approach combined with the exponential increase in computational power have led to impressive improvements in the performance of such systems in the past 30 years. Of late, however, these gains have seemed to level off, and there is a growing interest in exploring alternative paradigms. This thesis concerns itself with two of these newer approaches: articulatory speech recognition and exemplar-based methods. Articulatory speech recognition considers speech not as a sequence of phones, but as an interplay between our articulators—the lips, the tongue, the glottis and the velum. This explicit modeling of the pronunciation process in the statistical framework of the speech recognizer allows for a better model of the pronunciation variation that occurs, particularly in spontaneous speech. Exemplar-based methods is a common name for all ways of using the training data directly rather than fitting a global statistical model to it. Most of these methods are based on finding nearest neighbors among the observation vectors.

The main focus of this thesis is on the frame classification of articulatory features by nearest neighbors, and on using this classification to produce input feature vectors for two transcription systems. We consider nearest neighbor-based frame-level classification of a multi-valued set of articulatory features (AFs) inspired by the vocal tract variables of articulatory phonology. This entails that, for each frame of the audio signal, we try to determine the value of each of our eight AFs at the corresponding point in time. Partly for comparison purposes, we do a frame classification of phones in the same way. We explore a variety of linear and nonlinear transformations of the observation vectors, and use the k nearest neighbors in the resulting vector space to do the classification.

Our best results compare favorably to a multilayer perceptron (MLP) baseline. Based on our k-nearest neighbor (k-NN) frame classification, we make posterior-like feature vectors, which we incorporate into two systems for automatic transcription. The first of these is a conditional random field (CRF) for forced transcription of our set of AFs. The performance of our k-NN-based features in the CRF system is better than that of MLP-based features for most of the AFs, and on par with it for the rest of them. The second transcription system is a standard tandem hidden Markov model for phone recognition, where the k-NN-based features do not do as well as the MLP-based ones. Nevertheless, we argue that the flexibility and transparency of k-NN classification make it a very promising approach for articulatory speech recognition.
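The core idea described in the abstract — classifying each audio frame by a vote among its k nearest training frames, and turning the vote fractions into posterior-like feature vectors — can be sketched in a few lines. This is only an illustrative sketch, not the thesis's pipeline: the function name `knn_frame_posteriors` is invented here, plain Euclidean distance stands in for the learned transformations of the observation vectors, and the toy 2-D "frames" replace real acoustic features.

```python
import numpy as np

def knn_frame_posteriors(train_X, train_y, test_X, k=5, n_classes=None):
    """For each test frame, return a posterior-like vector: the fraction of
    its k nearest training frames (Euclidean distance) carrying each label.
    Illustrative sketch only; names and distance metric are assumptions."""
    if n_classes is None:
        n_classes = int(train_y.max()) + 1
    # Pairwise squared Euclidean distances between test and training frames.
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest frames
    votes = train_y[nn]                     # labels of those neighbors
    post = np.zeros((test_X.shape[0], n_classes))
    for c in range(n_classes):
        post[:, c] = (votes == c).mean(axis=1)  # vote fraction per class
    return post                             # each row sums to 1

# Toy example: 2-D "frames", two classes.
train_X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0., 0.5], [5., 5.5]])
post = knn_frame_posteriors(train_X, train_y, test_X, k=3)
labels = post.argmax(axis=1)  # hard frame classification by majority vote
```

The posterior-like rows of `post` are the kind of soft classification output that, per the abstract, can be fed as feature vectors into a downstream transcription system (a CRF or a tandem HMM) in place of MLP posteriors.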
Publisher
NTNU
Series
Doctoral thesis at NTNU;2015:24

Contact Us | Send Feedback

Privacy policy
DSpace software copyright © 2002-2019 DuraSpace

Service from Unit