
Hybrid attention convolution acoustic model based on variational autoencoder for speech recognition

Tang, Haoyu
Master thesis
no.ntnu:inspera:2527312.pdf (11.07Mb)
URI
http://hdl.handle.net/11250/2624676
Date
2019
Collections
  • Institutt for elektroniske systemer [2498]
Abstract
 
 
In communication systems, smart homes, and other speech-based applications, Automatic Speech Recognition (ASR) plays a crucial role, and its inputs are usually features extracted from a raw speech signal. A well-designed feature extractor can improve accuracy and reduce computational complexity.

Feature extraction is a basic processing unit not only in ASR but also in Text-to-Speech (TTS), speaker conversion, and other conversion systems. Moreover, analysing the extracted features together with the labelled phone in each frame can also be used to improve TTS.

The goal of the project is to build a new neural-network-based acoustic model for an ASR system. Compared with mainstream acoustic models, the most significant difference is that an autoencoder is introduced, which turns the training of the acoustic model from supervised into semi-supervised learning.
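
As a rough illustration of how introducing an autoencoder makes the training objective semi-supervised, the sketch below combines a supervised phone-classification loss with an unsupervised reconstruction loss; the PyTorch framing, layer sizes, and weighting factor are assumptions for illustration, not the configuration used in the thesis.

    import torch
    import torch.nn as nn

    class AutoencoderAcousticModel(nn.Module):
        """Hypothetical sketch: an encoder shared by a decoder (unsupervised
        reconstruction) and a phone classifier (supervised)."""
        def __init__(self, n_feats=40, n_hidden=256, n_phones=48):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_feats, n_hidden), nn.ReLU())
            self.decoder = nn.Linear(n_hidden, n_feats)      # reconstruction branch
            self.classifier = nn.Linear(n_hidden, n_phones)  # classification branch

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), self.classifier(z)

    def semi_supervised_loss(model, feats, phone_labels, alpha=0.5):
        """Cross-entropy on labelled frames plus reconstruction error on the same
        frames; alpha is an assumed fixed weight (see the dynamic weighting below)."""
        recon, logits = model(feats)
        loss_sup = nn.functional.cross_entropy(logits, phone_labels)
        loss_rec = nn.functional.mse_loss(recon, feats)
        return loss_sup + alpha * loss_rec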

Common feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and Warped Minimum Variance Distortionless Response Cepstral Coefficients (WMVDRCC) are based on standard digital signal processing, while neural-network-based feature extraction is uncommon.
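
For context, a DSP-based extractor such as MFCC is typically only a few lines with an off-the-shelf library; the sketch below uses librosa, and the file name, sample rate, and frame settings are illustrative choices rather than values from the thesis.

    import librosa

    # Standard DSP pipeline behind MFCC: STFT -> mel filterbank -> log -> DCT.
    signal, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input file
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
    print(mfcc.shape)                                        # (13, n_frames)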

Moreover, the Deep Neural Network (DNN) has proved to be an efficient approach to ASR when used as the acoustic model in an ASR system. The acoustic model can also be built with a Convolutional Neural Network (CNN) or a Long Short-Term Memory recurrent neural network (LSTM).
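
A minimal sketch of such a neural acoustic model is given below, stacking a small CNN front end on a bidirectional LSTM; the layer sizes and depth are illustrative and do not reproduce the VGG-BLSTM configuration used later in the thesis.

    import torch
    import torch.nn as nn

    class CnnBlstmAcousticModel(nn.Module):
        """Spectral feature frames -> conv layers -> BLSTM -> per-frame phone logits."""
        def __init__(self, n_feats=40, n_phones=48):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.blstm = nn.LSTM(32 * n_feats, 256, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * 256, n_phones)

        def forward(self, x):                     # x: (batch, time, n_feats)
            h = self.conv(x.unsqueeze(1))         # (batch, 32, time, n_feats)
            h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * n_feats)
            h, _ = self.blstm(h)
            return self.out(h)                    # per-frame phone logits

    model = CnnBlstmAcousticModel()
    logits = model(torch.randn(2, 100, 40))       # e.g. 2 utterances of 100 frames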

In this master thesis, a hybrid attention mechanism is brought into the acoustic model: time-frequency attention weighting in the autoencoder, the front stage of the acoustic model, and channel attention weighting in the phone classifier, the post stage of the acoustic model.
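
The abstract does not spell out the exact form of these attention blocks; as one plausible form, the channel attention weighting could resemble a squeeze-and-excitation block that rescales each convolutional channel, as sketched below (all details here are assumptions).

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze-and-excitation-style channel attention: global-average-pool each
        channel, predict a weight per channel, and rescale the feature map."""
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):                         # x: (batch, channels, time, freq)
            w = self.fc(x.mean(dim=(2, 3)))           # squeeze: one value per channel
            return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: reweight channels

    attn = ChannelAttention(32)
    y = attn(torch.randn(2, 32, 100, 40))             # same shape, channels reweighted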

To accelerate later development of this semi-supervised approach in the VGG-BLSTM structure, a regularization method is developed for dynamically weighting the combined loss. In this method, a regularization term is added to the loss; it not only dynamically controls the weighting between the two losses but also pushes the weight toward a preset value so that the loss decreases smoothly and steadily.
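
One way to read this is that the combination weight is itself a trainable parameter whose deviation from the preset value is penalised; the sketch below follows that reading, with the preset target and penalty strength chosen arbitrarily rather than taken from the thesis.

    import torch
    import torch.nn as nn

    class DynamicLossWeighting(nn.Module):
        """Learnable weight for combining two losses, with a regularization term
        that pulls the weight back toward a preset value (hypothetical reading)."""
        def __init__(self, target=0.5, penalty=1.0):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(target))  # learnable mixing weight
            self.target = target
            self.penalty = penalty

        def forward(self, loss_sup, loss_rec):
            combined = self.alpha * loss_sup + (1.0 - self.alpha) * loss_rec
            reg = self.penalty * (self.alpha - self.target) ** 2  # push toward preset
            return combined + reg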
 
Publisher
NTNU
