Robust Speech Recognition in the Presence of Additive Noise
Abstract
It is well known that additive noise can cause a significant decrease in performance for an automatic speech recognition (ASR) system. For an ASR system to maintain an acceptable level of performance in noisy conditions, measures must be taken to make it robust. Since prior information about the noise is usually not available, this information typically has to be obtained from the observed noisy utterance that is to be recognized.
Model compensation is one way of achieving robustness to noise. One of the main problems in model compensation is how to approximate the non-linear relationship between speech, noise, and noisy speech in the log-spectral domain. To investigate the effect of approximation accuracy, a comparative study of two existing methods and one new method for approximating this relationship is presented. The study shows that, although the methods differ in accuracy on a one-dimensional example, the recognition results on Aurora2 are nearly identical in practice.
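The non-linearity in question can be written compactly if speech, noise, and noisy speech are assumed additive in the power-spectral domain and the cross (phase) term is neglected, which is the usual starting point for such approximations (e.g. vector Taylor series or log-normal assumptions). With x, n, and y denoting the log-spectral speech, noise, and noisy speech, respectively:

```latex
% Interaction of speech x, noise n, and noisy speech y in the
% log-spectral domain, assuming additive power spectra and a
% neglected cross term:
\[
  y \;=\; \log\!\left(e^{x} + e^{n}\right)
    \;=\; x + \log\!\left(1 + e^{\,n - x}\right)
\]
```

It is this mapping that the compensation methods compared in the study must approximate when combining Gaussian speech and noise models.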
Due to several factors, the noisy-speech parameter estimates obtained when performing model compensation will normally be uncertain, which limits the attainable performance. We propose a new model compensation approach in which a robust decision rule is combined with traditional parallel model combination (PMC) to compensate for this uncertainty. Experiments show that, compared to PMC alone, the proposed approach effectively increases performance at low signal-to-noise ratios (SNRs) for most noise types.
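For orientation, the sketch below shows how standard PMC with the common log-normal approximation combines a diagonal-covariance speech Gaussian and noise Gaussian in the log-spectral domain; the function name and per-dimension treatment are illustrative assumptions, and the uncertainty-compensating decision rule proposed here is not shown.

```python
import numpy as np

def pmc_lognormal(mu_x, var_x, mu_n, var_n):
    """Combine a log-domain speech Gaussian (mu_x, var_x) and noise
    Gaussian (mu_n, var_n), both diagonal, into a noisy-speech Gaussian
    using the log-normal approximation."""
    # Map each log-domain Gaussian to the mean/variance of the
    # corresponding log-normal distribution in the linear domain.
    m_x = np.exp(mu_x + var_x / 2.0)
    v_x = (np.exp(var_x) - 1.0) * m_x**2
    m_n = np.exp(mu_n + var_n / 2.0)
    v_n = (np.exp(var_n) - 1.0) * m_n**2

    # Speech and noise are additive (and assumed independent) in the
    # linear domain, so their means and variances simply add.
    m_y = m_x + m_n
    v_y = v_x + v_n

    # Map the combined statistics back to the log domain, again
    # assuming the sum is approximately log-normal.
    var_y = np.log(v_y / m_y**2 + 1.0)
    mu_y = np.log(m_y) - var_y / 2.0
    return mu_y, var_y
```

The quality of this log-normal assumption, versus other approximations of the same mapping, is exactly what the comparative study above examines.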
Another way of improving ASR performance in noisy conditions is to apply a feature enhancement algorithm prior to recognition. Many existing feature enhancement techniques rely on probabilistic models of speech and noise, so their performance depends on the quality of these models. Traditionally, the probabilistic models have been trained using maximum likelihood estimation. This dissertation investigates an alternative estimation method for the prior speech models, namely Bayesian learning. It is shown that, within the chosen experimental setup, Bayesian learning can be used for model selection, and that the recognition performance is comparable to that obtained with maximum likelihood in most cases.
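As a hedged illustration of how Bayesian learning can act as model selection, the sketch below fits a variational Bayesian Gaussian mixture to toy data with deliberately too many components; the sparse Dirichlet prior on the mixture weights drives superfluous components towards zero weight. The use of scikit-learn's BayesianGaussianMixture is an assumption of this example, not the estimator used in the dissertation.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy stand-in for clean-speech feature vectors, drawn from two Gaussians.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-3.0, 1.0, (500, 2)),
                  rng.normal(+3.0, 1.0, (500, 2))])

# Start with 10 components; variational inference with a small weight
# concentration prior prunes the ones the data does not support.
gmm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior=1e-3,
                              max_iter=500, random_state=0).fit(data)

effective = int(np.sum(gmm.weights_ > 1e-2))
print(f"effective number of components: {effective}")  # typically 2
```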
A good probabilistic model for the noise can be difficult to obtain, since it usually has to be estimated directly from the utterance at hand. In order to improve the quality of the noise model used by the feature enhancement algorithm, we investigate the use of voice activity detection (VAD) to obtain information about the noise. An advantage of the proposed VAD approach is that it works in the same domain as the speech recognizer. Experiments show that the VAD approach on average obtains a 10.8% error rate reduction compared to simply using a speech-free segment from the beginning of the utterance for noise modeling.
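To make the idea concrete, the following minimal sketch estimates a diagonal-Gaussian noise model from the frames of a noisy utterance that a simple VAD labels as speech-free, working directly on the same log-mel features the recognizer would use. The energy-threshold VAD and the names used here are illustrative assumptions, not the detector developed in the dissertation.

```python
import numpy as np

def estimate_noise_model(log_mel, energy_margin=3.0):
    """Estimate a diagonal-Gaussian noise model (mean, variance) from
    VAD-selected noise-only frames.

    log_mel: array of shape (num_frames, num_bands) with log-mel features.
    """
    # Crude energy-based VAD: frames whose total log-mel energy lies
    # close to the lowest observed frame energy are treated as noise-only.
    frame_energy = log_mel.sum(axis=1)
    threshold = frame_energy.min() + energy_margin
    noise_frames = log_mel[frame_energy <= threshold]

    # Fall back to the first few frames if the VAD found too little,
    # mirroring the baseline of using an initial speech-free segment.
    if len(noise_frames) < 10:
        noise_frames = log_mel[:10]

    mu_n = noise_frames.mean(axis=0)
    var_n = noise_frames.var(axis=0) + 1e-6  # small floor for stability
    return mu_n, var_n
```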