Speech Enhancement with Deep Neural Networks
Abstract
This master's thesis describes the implementation and evaluation of a promising approach to speech enhancement based on deep neural networks. A baseline system was implemented and trained using noisy data synthesized by combining speech from the TIMIT database with white Gaussian noise and recorded background noise signals from the Aurora2 database. Several techniques for improving the system, some proposed in other papers and some original, were implemented and evaluated. The quality of the enhanced speech was assessed by comparison with the reference clean speech using mean square error (MSE) in the short-time log-magnitude spectrum, segmental signal-to-noise ratio (SNR) estimates on the waveforms, and an ITU-T standard method called Perceptual Evaluation of Speech Quality (PESQ).

Of the implemented techniques, using dropout during training was shown in a small experiment to give better results for the MSE, but worse or no better results in terms of PESQ. Another technique, called global variance equalization, had the opposite effect, negatively affecting MSE but significantly improving the PESQ results. An experiment replacing the sigmoid activation functions of the deep neural network with the increasingly popular rectified linear units indicated that the latter setup could achieve as good or better performance without using greedy layer-wise pretraining.

In addition to comparing variations of the deep neural network, the speech enhancement system was compared to a standard signal processing method called OM-LSA. The deep neural network (DNN) based system giving the best performance resulted in superior PESQ scores but, in some cases, worse segmental SNR than what was achieved with OM-LSA. Two hybrid systems combining OM-LSA with the DNN system were proposed, and one showed a significant improvement in MSE for the test set with unseen noise over the DNN alone.
In terms of PESQ, however, using the DNN alone gave better results than both hybrid systems.

Testing the DNN system's performance for different sound classes gave some more insight into what the method is good, and less good, at. An attempt to understand the parameters of the trained DNN led to a new interpretation of the system as one identifying speech features in noise and choosing from a set of basis vectors how best to estimate the clean speech log-magnitude spectrum. This was considered an interesting perspective even if it might not be the correct interpretation.

Some limited subjective evaluation was performed by the student by listening to files from the test set enhanced by the system. This revealed that the system performs very well in certain cases, also for unseen noise, but results in distorted speech of low perceived quality in others. This was especially true for files dominated by noise that was mismatched to the training data.
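
As an illustration of the objective measures named above, the following is a minimal sketch, not the implementation used in the thesis. It mixes a clean signal with white Gaussian noise at a chosen SNR and then computes the MSE in the short-time log-magnitude spectrum and the segmental SNR. The function names, the frame length of 512 samples, the hop of 256 samples, the Hann window, and the clipping of per-frame SNR to the range -10 dB to 35 dB are all assumptions made for the example. A sine wave stands in for a speech signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested global SNR (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames; trailing samples are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def log_magnitude_spectrum(x, frame_len=512, hop=256, eps=1e-10):
    """Short-time log-magnitude spectrum using a Hann window (assumed settings)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)

def log_spectral_mse(clean, estimate):
    """Mean square error between the two short-time log-magnitude spectra."""
    diff = log_magnitude_spectrum(clean) - log_magnitude_spectrum(estimate)
    return float(np.mean(diff ** 2))

def segmental_snr(clean, estimate, frame_len=512, hop=256,
                  lo=-10.0, hi=35.0, eps=1e-10):
    """Mean of per-frame SNR in dB; clipping limits are an assumed convention."""
    c = frame_signal(clean, frame_len, hop)
    e = frame_signal(clean - estimate, frame_len, hop)
    snr = 10.0 * np.log10((np.sum(c ** 2, axis=1) + eps) /
                          (np.sum(e ** 2, axis=1) + eps))
    return float(np.mean(np.clip(snr, lo, hi)))

# Example: a 440 Hz tone (stand-in for speech) in white Gaussian noise at 10 dB SNR.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = mix_at_snr(clean, rng.standard_normal(clean.shape), snr_db=10.0)

print("log-spectral MSE:", log_spectral_mse(clean, noisy))
print("segmental SNR (dB):", segmental_snr(clean, noisy))
```

Both measures require the clean reference signal, which is why they can only be used on synthesized test data. PESQ is not sketched here, since it is defined by the ITU-T standard rather than by a short formula.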