dc.contributor.advisor    Johnsen, Magne Hallstein
dc.contributor.advisor    Birkenes, Øystein
dc.contributor.author     Melve, Olav Klungsøyr
dc.date.accessioned       2017-06-10T14:00:31Z
dc.date.available         2017-06-10T14:00:31Z
dc.date.created           2016-06-10
dc.date.issued            2016
dc.identifier             ntnudaim:14453
dc.identifier.uri         http://hdl.handle.net/11250/2445657
dc.description.abstract   This master thesis describes the implementation and evaluation of a promising approach to speech enhancement based on deep neural networks. A baseline system was implemented and trained on noisy data synthesized by combining speech from the TIMIT database with white Gaussian noise and recorded background noise signals from the Aurora2 database. Several techniques for improving the system, some proposed in other papers and some original, were implemented and evaluated. The quality of the enhanced speech was assessed by comparison with the reference clean speech using the mean square error (MSE) in the short-time log-magnitude spectrum, segmental signal-to-noise ratio (SNR) estimates on the waveforms, and an ITU-T standard method called Perceptual Evaluation of Speech Quality (PESQ). Of the implemented techniques, using dropout during training was shown in a small experiment to give better results in terms of MSE, but worse or no better results in terms of PESQ. Another technique, global variance equalization, had the opposite effect, negatively affecting the MSE but significantly improving the PESQ results. An experiment replacing the sigmoid activation functions of the deep neural network with the increasingly popular rectified linear units indicated that the latter setup could achieve equal or better performance without greedy layer-wise pretraining.

In addition to comparing variations of the deep neural network, the speech enhancement system was compared to a standard signal processing method called OM-LSA. The best-performing deep neural network based system achieved a superior PESQ score but, in some cases, worse segmental SNR than OM-LSA. Two hybrid systems combining OM-LSA with the DNN system were proposed, and one showed a significant improvement in MSE over the DNN alone on the test set with unseen noise. In terms of PESQ, however, the DNN alone gave better results than both hybrid systems. Testing the DNN system's performance on different sound classes gave further insight into what the method handles well and less well. An attempt to understand the parameters of the trained DNN led to a new interpretation of the system as one that identifies speech features in noise and chooses from a set of basis vectors how to best estimate the clean speech log-magnitude spectrum. This was considered an interesting perspective even if it might not be the correct interpretation.

Some limited subjective evaluation was performed by the student by listening to files from the test set enhanced by the system. This revealed that the system performs very well in certain cases, also for unseen noise, but produces distorted speech of low perceived quality in others. This was especially true for files dominated by noise mismatched to the training data.
dc.language               eng
dc.publisher              NTNU
dc.subject                Elektronikk, Signalbehandling
dc.title                  Speech Enhancement with Deep Neural Networks
dc.type                   Master thesis
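
The abstract above evaluates enhanced speech with MSE in the short-time log-magnitude spectrum and with segmental SNR on the waveforms. The sketch below is an illustrative example of how such objective measures are commonly computed; it is not taken from the thesis, and the frame length, hop size, and per-frame SNR clamping range are assumptions, not the thesis settings.

```python
# Illustrative sketch (not from the thesis): log-spectral MSE and segmental SNR
# between a clean reference waveform and an enhanced waveform, using NumPy.
import numpy as np


def log_magnitude_mse(clean, enhanced, frame_len=512, hop=256, eps=1e-8):
    """MSE between short-time log-magnitude spectra of two equal-length signals."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(clean) - frame_len) // hop
    errors = []
    for i in range(n_frames):
        start = i * hop
        c = np.fft.rfft(window * clean[start:start + frame_len])
        e = np.fft.rfft(window * enhanced[start:start + frame_len])
        diff = np.log(np.abs(c) + eps) - np.log(np.abs(e) + eps)
        errors.append(np.mean(diff ** 2))
    return float(np.mean(errors))


def segmental_snr(clean, enhanced, frame_len=512, hop=256,
                  snr_min=-10.0, snr_max=35.0, eps=1e-8):
    """Frame-wise SNR in dB, clamped per frame and averaged over frames."""
    n_frames = 1 + (len(clean) - frame_len) // hop
    snrs = []
    for i in range(n_frames):
        start = i * hop
        c = clean[start:start + frame_len]
        residual = c - enhanced[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(c ** 2) + eps) / (np.sum(residual ** 2) + eps))
        snrs.append(np.clip(snr, snr_min, snr_max))
    return float(np.mean(snrs))
```

Both measures are computed frame by frame: a lower log-spectral MSE and a higher segmental SNR indicate enhanced speech closer to the clean reference, which matches how the abstract uses them to compare the DNN variants and OM-LSA.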