|dc.description.abstract||In communication systems, smart homes and other speech-based applications, Automatic Speech Recognition (ASR) plays a crucial role, and its inputs are usually features extracted from a raw speech signal. A well-designed feature extractor can improve accuracy and reduce computational complexity.
Feature extraction is a basic processing unit not only in ASR but also in Text-to-Speech (TTS), speaker conversion and other conversion systems. Meanwhile, analyzing the extracted features together with the labeled phone in each frame can also help improve TTS.
The goal of this project is to build a new neural-network-based acoustic model for an ASR system. Compared with mainstream acoustic models, the most significant difference is the introduction of an autoencoder, which turns the training of the acoustic model from supervised into semi-supervised learning.
Common feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) and Warped Minimum Variance Distortionless Response Cepstral Coefficients (WMVDRCC) are based on standard digital signal processing, while neural-network-based feature extraction is uncommon.
Moreover, the Deep Neural Network (DNN) has proved to be efficient for ASR when used as the acoustic model of an ASR system. The acoustic model can also be built with a Convolutional Neural Network (CNN) or a Long Short-Term Memory (LSTM) recurrent neural network.
In this master thesis, a hybrid attention mechanism is introduced into the acoustic model: time-frequency attention weighting in the autoencoder, the front stage of the acoustic model, and channel attention weighting in the phone classifier, the post stage of the acoustic model.
To accelerate the later semi-supervised development of the VGG-BLSTM structure, a regularization method is developed for dynamically weighting the combined loss. In this method, a regularization term is added to the loss, which not only dynamically controls the weighting between the two losses but also pushes the weight towards a preset value, yielding a smooth and steady loss decrease.||
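
The dynamic loss weighting with a regularization term described in the abstract could, under one plausible reading, take a form like the following (a hypothetical sketch; the symbols $\alpha$, $\alpha_0$ and $\lambda$ are illustrative and not taken from the thesis):

```latex
\mathcal{L} \;=\; \alpha \, \mathcal{L}_{\text{rec}} \;+\; (1 - \alpha) \, \mathcal{L}_{\text{cls}} \;+\; \lambda \, (\alpha - \alpha_0)^2
```

Here $\mathcal{L}_{\text{rec}}$ would be the autoencoder reconstruction loss, $\mathcal{L}_{\text{cls}}$ the phone-classification loss, $\alpha$ a learnable combination weight, $\alpha_0$ the preset target value, and $\lambda$ the strength with which the regularization term pulls $\alpha$ towards $\alpha_0$.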