
Hybrid attention convolution acoustic model based on variational autoencoder for speech recognition

Tang, Haoyu
Master thesis
no.ntnu:inspera:2527312.pdf (11.07Mb)
URI
http://hdl.handle.net/11250/2624676
Date
2019
Collections
  • Institutt for elektroniske systemer [2498]
Abstract
 
 
In communication systems, smart homes, and other speech-based applications, Automatic Speech Recognition (ASR) plays a crucial role, and its inputs are usually features extracted from a raw speech signal. A well-designed feature extractor can improve accuracy and reduce computational complexity.

Feature extraction is a basic processing unit not only in ASR but also in Text-to-Speech (TTS), speaker conversion, and other conversion systems. Moreover, analysing the extracted features together with the labelled phone in each frame can also be used to improve TTS.

The goal of the project is to build a new neural-network-based acoustic model for an ASR system. Compared with mainstream acoustic models, the most significant difference is that an autoencoder is introduced, which turns the training of the acoustic model from supervised into semi-supervised learning.
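
As a rough illustration of how introducing an autoencoder makes the training objective semi-supervised, the sketch below combines a supervised phone-classification loss with an unsupervised reconstruction loss; the PyTorch framing, layer sizes, and weighting factor are assumptions for illustration, not the configuration used in the thesis.

    import torch
    import torch.nn as nn

    class AutoencoderAcousticModel(nn.Module):
        """Hypothetical sketch: an encoder shared by a decoder (unsupervised
        reconstruction) and a phone classifier (supervised)."""
        def __init__(self, n_feats=40, n_hidden=256, n_phones=48):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_feats, n_hidden), nn.ReLU())
            self.decoder = nn.Linear(n_hidden, n_feats)      # reconstruction branch
            self.classifier = nn.Linear(n_hidden, n_phones)  # classification branch

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), self.classifier(z)

    def semi_supervised_loss(model, feats, phone_labels, alpha=0.5):
        """Cross-entropy on labelled frames plus reconstruction error on the same
        frames; alpha is an assumed fixed weight (see the dynamic weighting below)."""
        recon, logits = model(feats)
        loss_sup = nn.functional.cross_entropy(logits, phone_labels)
        loss_rec = nn.functional.mse_loss(recon, feats)
        return loss_sup + alpha * loss_rec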

Common feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and Warped Minimum Variance Distortionless Response Cepstral Coefficients (WMVDRCC) are based on standard digital signal processing, while neural-network-based feature extraction is uncommon.
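
For context, a DSP-based extractor such as MFCC is typically only a few lines with an off-the-shelf library; the sketch below uses librosa, and the file name, sample rate, and frame settings are illustrative choices rather than values from the thesis.

    import librosa

    # Standard DSP pipeline behind MFCC: STFT -> mel filterbank -> log -> DCT.
    signal, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input file
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
    print(mfcc.shape)                                        # (13, n_frames)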

Moreover, the Deep Neural Network (DNN) has proved to be an efficient approach to ASR when used as the acoustic model in an ASR system. The acoustic model can also be built with a Convolutional Neural Network (CNN) or a Long Short-Term Memory recurrent neural network (LSTM).
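
A minimal sketch of such a neural acoustic model is given below, stacking a small CNN front end on a bidirectional LSTM; the layer sizes and depth are illustrative and do not reproduce the VGG-BLSTM configuration used later in the thesis.

    import torch
    import torch.nn as nn

    class CnnBlstmAcousticModel(nn.Module):
        """Spectral feature frames -> conv layers -> BLSTM -> per-frame phone logits."""
        def __init__(self, n_feats=40, n_phones=48):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.blstm = nn.LSTM(32 * n_feats, 256, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * 256, n_phones)

        def forward(self, x):                     # x: (batch, time, n_feats)
            h = self.conv(x.unsqueeze(1))         # (batch, 32, time, n_feats)
            h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * n_feats)
            h, _ = self.blstm(h)
            return self.out(h)                    # per-frame phone logits

    model = CnnBlstmAcousticModel()
    logits = model(torch.randn(2, 100, 40))       # e.g. 2 utterances of 100 frames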

In this master thesis, a hybrid attention mechanism is brought into the acoustic model: time-frequency attention weighting in the autoencoder, the front stage of the acoustic model, and channel attention weighting in the phone classifier, the post stage of the acoustic model.
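
The abstract does not spell out the exact form of these attention blocks; as one plausible form, the channel attention weighting could resemble a squeeze-and-excitation block that rescales each convolutional channel, as sketched below (all details here are assumptions).

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze-and-excitation-style channel attention: global-average-pool each
        channel, predict a weight per channel, and rescale the feature map."""
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):                         # x: (batch, channels, time, freq)
            w = self.fc(x.mean(dim=(2, 3)))           # squeeze: one value per channel
            return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: reweight channels

    attn = ChannelAttention(32)
    y = attn(torch.randn(2, 32, 100, 40))             # same shape, channels reweighted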

To accelerate later development of this semi-supervised approach in the VGG-BLSTM structure, a regularization method is developed for dynamically weighting the combined loss. In this method, a regularization term is added to the loss; it not only dynamically controls the weighting between the two losses but also pushes the weight toward a preset value so that the loss decreases smoothly and steadily.
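
One way to read this is that the combination weight is itself a trainable parameter whose deviation from the preset value is penalised; the sketch below follows that reading, with the preset target and penalty strength chosen arbitrarily rather than taken from the thesis.

    import torch
    import torch.nn as nn

    class DynamicLossWeighting(nn.Module):
        """Learnable weight for combining two losses, with a regularization term
        that pulls the weight back toward a preset value (hypothetical reading)."""
        def __init__(self, target=0.5, penalty=1.0):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(target))  # learnable mixing weight
            self.target = target
            self.penalty = penalty

        def forward(self, loss_sup, loss_rec):
            combined = self.alpha * loss_sup + (1.0 - self.alpha) * loss_rec
            reg = self.penalty * (self.alpha - self.target) ** 2  # push toward preset
            return combined + reg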
 
Publisher
NTNU
