Design of Detectors for Automatic Speech Recognition

Martínez del Hoyo Canterla, Alfonso

Doctoral thesis

Permanent link

http://hdl.handle.net/11250/2370409

Publication date

2012

Collections

Institutt for elektroniske systemer

Abstract

This thesis presents methods and results for optimizing subword detectors in continuous speech. Speech detectors are useful in areas such as detection-based ASR, pronunciation training, phonetic analysis, and word spotting. First, we propose a structure suitable for subword detection. This structure is based on the standard HMM framework, but in each detector the MFCC feature extractor and the models are trained for the specific detection problem. Our experiments on the TIMIT database validate the effectiveness of this structure for detection of phones and articulatory features.

Second, two discriminative training techniques are proposed for detector training. The first one is a modification of Minimum Classification Error (MCE) training. The second one, Minimum Detection Error (MDE) training, is an adaptation of Minimum Phone Error training to the detection problem. Both methods are used to train the HMMs and the filterbanks in the detectors, either separately or jointly. MDE has the advantage that any detection performance criterion can be optimized directly. F-score and class accuracy optimization experiments show that MDE training is superior to the MCE-based method.
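As an illustration of the kind of detection performance criterion mentioned above, the following minimal sketch computes precision, recall, and the F-score from detection counts. It is not the thesis's MDE objective: the function name `detection_scores`, the count-based interface, and the `beta` parameter are assumptions made for this example only.

```python
def detection_scores(hits, false_alarms, misses, beta=1.0):
    """Return (precision, recall, F-score) for one detector.

    hits         -- target events that were correctly detected
    false_alarms -- detections with no matching target event
    misses       -- target events that were not detected
    beta         -- weight of recall relative to precision (1.0 gives F1)
    All names here are hypothetical and chosen for illustration.
    """
    detected = hits + false_alarms
    actual = hits + misses
    precision = hits / detected if detected else 0.0
    recall = hits / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1.0 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Made-up counts, used only to exercise the function.
    print(detection_scores(hits=80, false_alarms=20, misses=20))
```

A criterion of this form depends on hard counts, so a gradient-based training procedure would typically need a smoothed version of it; how the thesis handles this is not stated in the abstract.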

The optimized filterbanks reflect some acoustical properties of the detection classes. Moreover, some changes are consistent across classes with similar acoustical properties. In addition, MDE training of the filterbanks results in filters that differ significantly from those of the standard filterbank. In fact, some filters extract information from different critical bands.
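For reference, the standard filterbank used in MFCC extraction is commonly built from triangular filters spaced uniformly on the mel scale. The sketch below shows that baseline only, not the optimized filterbanks of the thesis. The choice of 24 filters is an assumption for illustration, the 0 to 8000 Hz range corresponds to the 16 kHz sampling rate of TIMIT, and all function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mel (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(num_filters, f_low, f_high):
    """Return num_filters + 2 edge frequencies in Hz, uniform on the mel scale.

    Filter k (0-based) spans edges[k] .. edges[k + 2] and peaks at edges[k + 1].
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(num_filters + 2)]


def triangular_weight(freq_hz, left, center, right):
    """Weight of a triangular filter at freq_hz (1.0 at center, 0.0 at the edges)."""
    if left < freq_hz <= center:
        return (freq_hz - left) / (center - left)
    if center < freq_hz < right:
        return (right - freq_hz) / (right - center)
    return 0.0


if __name__ == "__main__":
    edges = mel_filter_edges(num_filters=24, f_low=0.0, f_high=8000.0)
    # Print the centre frequency of each filter.
    for k in range(24):
        print(k, round(edges[k + 1], 1))
```

In a trainable variant, quantities such as the edge frequencies or the filter weights would become parameters updated by the discriminative criterion; which parameterization the thesis uses is not specified in the abstract.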

Finally, we propose a detection-based automatic speech recognition system. Detectors are built with the proposed HMM-based detection structure and trained discriminatively. The linguistic merger is based on an MLP/Viterbi decoder.

Publisher

NTNU

Series

Doctoral Theses at NTNU, 1503-8181; 2012:36