Classification of noise relative to speech

The implementation of the source localization in MATLAB showed satisfactory results: the system was able to determine the directions from which sound sources were coming. However, the system only found the strongest sound sources, without taking into account what type of sound each one was. It was therefore necessary to implement sound classification to distinguish between speech and non-speech.

The methods used in this thesis were Mel-frequency cepstral coefficients for feature extraction and Gaussian mixture models for classification. Two systems were developed for classification, each with a speech model with 512 mixtures and a noise model. One noise model had 2048 mixtures and the other had a single mixture. Performance was measured from the log likelihood output by each model. Performance was poor for inputs consisting of a single 25 ms frame, with an equal error rate above 30% for both systems. The more frames that were used as input, the higher the accuracy was and the more robust the threshold value became. Both systems reached 100% accuracy when the input sequences were up to several seconds long. Silence was classified as noise, which was the desired behavior, as it was undesirable for the system to focus on silence.
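The scoring described above can be sketched as follows. This is a minimal illustration and not the thesis implementation: it assumes diagonal-covariance mixtures, a decision based on the log-likelihood ratio averaged over the frames of a segment, and a threshold of 0. The function names, the two-dimensional feature vectors (stand-ins for MFCC vectors), and the toy models (far smaller than the 512- and 2048-mixture models) are all hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify_segment(frames, speech_gmm, noise_gmm, threshold=0.0):
    """Average the per-frame log-likelihood ratio and compare it to a threshold."""
    llr = sum(
        gmm_log_likelihood(f, speech_gmm) - gmm_log_likelihood(f, noise_gmm)
        for f in frames
    ) / len(frames)
    return ("speech" if llr > threshold else "noise"), llr

# Toy models (hypothetical values, for illustration only).
speech_gmm = [(0.5, [1.0, 1.0], [1.0, 1.0]), (0.5, [2.0, 0.0], [1.0, 1.0])]
noise_gmm = [(1.0, [-1.0, -1.0], [1.0, 1.0])]  # single-mixture noise model

label, score = classify_segment([[1.0, 1.0], [1.5, 0.5]], speech_gmm, noise_gmm)
print(label, round(score, 2))
```

Averaging the ratio over more frames reduces the influence of any single ambiguous frame, which is consistent with the observation that longer inputs gave higher accuracy and a more robust threshold.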

Model adaption was expected to improve system performance. When only the speech model was adapted, the system with the 1-mixture noise model improved markedly, with an average decrease of 8.6% in the error rate, reaching a 5.6% error rate at 500 ms inputs. If the noise model had been adapted too, the system was expected to perform even better. However, it was not possible to obtain an adaption matrix for this noise model, as the mean values of the model were too small. The system with the 2048-mixture noise model showed a small improvement when the speech model was adapted, and improved slightly further when both the speech and noise models were adapted. However, the improvement was not as large as expected.

The difference in performance between the two noise models was not significant enough to choose one over the other. In general, if more data, especially keyboard noise, were obtained for training, adaption and testing, it is likely that the system performance would show clear improvements. There was a lack of adaption data, in particular for the noise model adaption, resulting in an adaption matrix that was not entirely optimal.

The system that reaches a 5.6% error rate at 500 ms inputs works well for offline use, but improvements would be needed if either of the systems were to be used in real time. The results for short input sequences are currently not adequate, as the error rate is too high to provide good accuracy and thereby a good user experience.

Publisher

NTNU