Speech recognition of children's voices

Today's speech recognition systems are based on adult speech corpora. When applied to children's speech, a recognition system does not perform as well as it does on adult speech. This is due to large differences between the characteristics of adult speech and child speech. Few databases of children's speech are available, and producing such databases would be expensive and time-consuming. In this master's thesis, a speech recognition system for children is therefore created based on existing adult speech corpora.

In the spring of 2016, a speech recognition system for children was created at NTNU as part of a master's thesis. The system was implemented with the speech toolkit Hidden Markov Model Toolkit (HTK), and it used training techniques such as Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT). The speech recognition system performed well on children's speech corpora, with a word error rate (WER) of 11.7%. HTK is out of date, and the goal of this master's thesis is to replace HTK with a newer toolkit, Kaldi.

Kaldi is built differently from HTK, and the recognition system created in this work is therefore independent of the previously implemented HTK system. Similar training methods (VTLN and SAT) are used, and a language model and grammar file are created for recognition. The evaluation methods used in the two systems are the same, and the speech recognition system implemented in this work achieves a word error rate of 36.1% with VTLN and SAT applied in both training and recognition. The difference in word error rate between the two systems is 24.4 percentage points, which is too high. The expected result was about the same as the 11.7% of the HTK-based system, or even better. A possible source of error is the generated language model and its grammar file. There are many ways to create a language model, and the word weighting could have been generated incorrectly, resulting in a poor word error rate.
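As an illustration of the metric used for comparison, the sketch below computes a word error rate as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is the standard definition of WER, not the scoring code used in the thesis; the function name `word_error_rate` and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not the thesis's scoring tool.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Gap between the two reported systems, in percentage points.
wer_gap_points = round(36.1 - 11.7, 1)
```

With the figures reported above, the gap between the two systems is 36.1 - 11.7 = 24.4 percentage points, which corresponds to the Kaldi system making roughly three times as many word errors as the HTK system.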

For further testing of a speech recognition system for children in Kaldi, it would be wise to replace the TIMIT corpus with another adult database. TIMIT works best for phonetic training and contains many complex sentences. A new adult database should support training and recognition at the word level, in order to simplify the system.

Publisher

NTNU