Speech Enhancement with Deep Neural Networks

Melve, Olav Klungsøyr

Master thesis

Permanent link

http://hdl.handle.net/11250/2445657

Publication date

2016

Collections

Institutt for elektroniske systemer

Abstract

This master's thesis describes the implementation and evaluation of a promising approach to speech enhancement based on deep neural networks (DNNs). A baseline system was implemented and trained using noisy data synthesized by combining speech from the TIMIT database with white Gaussian noise and recorded background noise signals from the Aurora2 database. Several techniques for improving the system, some proposed in other papers and some original, were implemented and evaluated. The quality of the enhanced speech was assessed by comparison with the reference clean speech using mean square error (MSE) in the short-time log-magnitude spectrum, segmental signal-to-noise ratio (SNR) estimates on the waveforms, and an ITU-T standard method called Perceptual Evaluation of Speech Quality (PESQ).
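
The data synthesis and the segmental SNR measure mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the frame length of 256 samples, and the clamping of per-frame SNRs to the range -10 to 35 dB are assumptions chosen for illustration (clamping is a common convention, but the abstract does not state the settings used), and a synthetic sine wave stands in for TIMIT speech. Only the Python standard library is used.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("both signals must have non-zero power")
    # Gain that makes 10*log10(p_speech / (gain^2 * p_noise)) equal snr_db.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def segmental_snr(clean, estimate, frame_len=256, floor_db=-10.0, ceil_db=35.0):
    """Average of per-frame SNRs in dB, each clamped to [floor_db, ceil_db].

    The frame length and clamping limits are assumed values, not taken
    from the thesis.
    """
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = estimate[start:start + frame_len]
        signal = sum(x * x for x in c)
        error = sum((x - y) ** 2 for x, y in zip(c, e))
        if signal == 0.0:
            continue  # skip silent reference frames
        snr = ceil_db if error == 0.0 else 10.0 * math.log10(signal / error)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)


# Synthetic stand-ins: a 440 Hz tone as "speech" and white Gaussian noise.
random.seed(0)
speech = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(4096)]
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(round(segmental_snr(speech, noisy), 2))
```

Passing an enhanced waveform instead of `noisy` as the second argument gives the segmental SNR of the enhancement result against the clean reference.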

Of the implemented techniques, using dropout during training was shown in a small experiment to improve the MSE but to give worse or no better results in terms of PESQ. Another technique, global variance equalization, had the opposite effect: it degraded the MSE but significantly improved the PESQ results. An experiment replacing the sigmoid activation functions of the deep neural network with the increasingly popular rectified linear units indicated that the latter setup could achieve as good or better performance without using greedy layer-wise pretraining.
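
To make the idea of global variance equalization concrete, here is a minimal sketch of one simple variant. It assumes a single, dimension-independent scaling factor, the square root of the ratio between the variance of the clean reference features and the variance of the estimated features, applied around the global mean. The abstract does not say which variant the thesis uses, so the function names and this formulation are illustrative assumptions only.

```python
import math


def global_variance(frames):
    """Variance over all values in all frames (one number for the whole set)."""
    values = [v for frame in frames for v in frame]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def equalize_global_variance(estimated, reference_variance):
    """Stretch the estimated features so their global variance matches the reference.

    Assumed variant: one scaling factor for all feature dimensions.
    """
    values = [v for frame in estimated for v in frame]
    mean = sum(values) / len(values)
    est_variance = global_variance(estimated)
    if est_variance == 0.0:
        raise ValueError("estimated features have zero variance")
    beta = math.sqrt(reference_variance / est_variance)
    return [[mean + beta * (v - mean) for v in frame] for frame in estimated]


# Toy log-spectral features: two frames with two frequency bins each.
estimated = [[1.0, 2.0], [3.0, 4.0]]
equalized = equalize_global_variance(estimated, reference_variance=5.0)
print(equalized)  # [[-0.5, 1.5], [3.5, 5.5]]
```

Stretching the estimates away from their mean counteracts over-smoothed network outputs, which is consistent with the reported trade-off: the MSE can get worse while the perceptual score improves.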

In addition to comparing variations of the deep neural network, the speech enhancement system was compared to a standard signal processing method called OM-LSA. The best-performing DNN-based system achieved a superior PESQ score but, in some cases, a worse segmental SNR than OM-LSA. Two hybrid systems combining OM-LSA with the DNN system were proposed, and one showed a significant improvement in MSE over the DNN alone on the test set with unseen noise. In terms of PESQ, however, using the DNN alone gave better results than both hybrid systems.

Testing the DNN system's performance for different sound classes gave more insight into what the method is good, and less good, at. An attempt to understand the parameters of the trained DNN led to a new interpretation of the system as one identifying speech features in noise and choosing from a set of basis vectors how to best estimate the clean speech log-magnitude spectrum. This was considered an interesting perspective even if it might not be the correct interpretation.
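
One possible way to read the basis-vector remark, assuming the network ends in a linear output layer (the abstract does not state this), is that the output is the bias plus a sum of the columns of the output weight matrix, each weighted by one hidden activation. The sketch below only demonstrates that algebraic identity with made-up numbers; it is not the analysis carried out in the thesis, and all names and values are hypothetical.

```python
def linear_output(weights, bias, hidden):
    """Usual matrix-vector form y = W h + b, with W given as a list of rows."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def basis_combination(weights, bias, hidden):
    """Same output, built as the bias plus activation-weighted columns of W."""
    out = list(bias)
    for j, activation in enumerate(hidden):
        basis_vector = [row[j] for row in weights]  # column j of W
        out = [o + activation * c for o, c in zip(out, basis_vector)]
    return out


# Hypothetical layer: 3 output bins, 2 hidden units.
weights = [[0.5, -1.0], [2.0, 0.0], [1.5, 3.0]]
bias = [0.1, 0.2, 0.3]
hidden = [0.8, 0.25]  # activations of the last hidden layer

y_matrix = linear_output(weights, bias, hidden)
y_basis = basis_combination(weights, bias, hidden)
print(all(abs(a - b) < 1e-12 for a, b in zip(y_matrix, y_basis)))
```

Under this reading, the hidden layers decide how strongly each column contributes, and the columns themselves act as spectral building blocks for the clean-speech estimate.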

Some limited subjective evaluation was performed by the student, who listened to files from the test set enhanced by the system. This revealed that the system performs very well in certain cases, including for unseen noise, but results in distorted speech of low perceived quality in others. This was especially true for files dominated by noise that was mismatched to the training data.

Publisher

NTNU