dc.description.abstract | This master's thesis describes the implementation and evaluation of a promising approach
to speech enhancement based on deep neural networks. A baseline system was implemented
and trained using noisy data synthesized by combining speech from the TIMIT
database with white Gaussian noise and recorded background noise signals from the Aurora2
database. Several techniques for improving the system, some proposed in other papers
and some original, were implemented and evaluated. The quality of the enhanced
speech has been assessed by comparison with the reference clean speech using mean
square error (MSE) in the short-time log-magnitude spectrum, segmental signal-to-noise
ratio (SNR) estimates on the waveforms, and an ITU-T standard method called Perceptual
Evaluation of Speech Quality (PESQ).
Of the implemented techniques, using dropout during training was shown in a small
experiment to give better results for the MSE, but worse or no better results in terms
of PESQ. Another technique called global variance equalization had the opposite effect,
negatively affecting MSE, but significantly improving the PESQ results. An experiment
replacing the sigmoid activation functions of the deep neural network with the increasingly
popular rectified linear units indicated that the latter setup could achieve as good or better
performance without using greedy layer-wise pretraining.
In addition to comparing variations of the deep neural network (DNN), the speech enhancement
system was compared to a standard signal processing method called OM-LSA. The
best-performing DNN-based system achieved superior PESQ scores
but, in some cases, worse segmental SNR than what was achieved with OM-LSA.
Two hybrid systems combining OM-LSA with the DNN system were proposed, and one
showed a significant improvement in MSE on the test set with unseen noise over the DNN
alone. In terms of PESQ, however, using the DNN alone gave better results than both
hybrid systems.
Testing the DNN system's performance on different sound classes gave further
insight into where the method performs well and where it does not. An attempt to understand the
parameters of the trained DNN led to a new interpretation of the system as one identifying
speech features in noise and choosing from a set of basis vectors how to best estimate
the clean speech log-magnitude spectrum. This was considered an interesting perspective
even if it might not be the correct interpretation.
Some limited subjective evaluation was performed by the student by listening to files
from the test set enhanced by the system. This revealed that the system performs very
well in certain cases, even for unseen noise, but results in distorted speech of low perceived
quality in others. This was especially true for files dominated by noise that was
mismatched to the training data. | |
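
The abstract names segmental SNR on the waveforms as one of its objective measures. As a rough illustration only, the sketch below shows one common way such a measure is computed: the SNR is evaluated per frame and then averaged. The function name, the frame length of 256 samples, the non-overlapping framing, and the clamping range of -10 dB to 35 dB are assumptions chosen for this example; they are not taken from the thesis, which may use different settings.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Average per-frame SNR in dB between a clean reference and an enhanced signal.

    All parameter defaults are illustrative assumptions, not values from the thesis.
    """
    n = min(len(clean), len(enhanced))
    frame_snrs = []
    # Non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, n - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, est))
        if signal_energy == 0.0:
            continue  # silent reference frame: SNR is undefined, so skip it
        if error_energy == 0.0:
            snr_db = max_db  # perfect reconstruction: use the upper clamp
        else:
            snr_db = 10.0 * math.log10(signal_energy / error_energy)
        # Clamp each frame so a few extreme frames do not dominate the mean.
        frame_snrs.append(min(max(snr_db, min_db), max_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    clean = [math.sin(2 * math.pi * 5 * i / 256) for i in range(1024)]
    attenuated = [0.9 * x for x in clean]  # error is 10% of the signal amplitude
    print(round(segmental_snr(clean, attenuated), 3))
```

Because the measure is averaged over frames rather than computed once over the whole file, low-energy speech segments count as much as loud ones, which is one reason it can disagree with a perceptual measure such as PESQ.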