Implementation of a System for Automatic Speech Recognition of Child Speech
Most speech recognizers today are trained on adult speech corpora. Child speech differs from adult speech in several respects, and these differences severely degrade the performance of an automatic speech recognition (ASR) system developed for adult speakers when it is used to recognize child speech. Recording sufficiently large child speech corpora is expensive. NTNU has therefore developed a method that trains a child speech recognizer by transforming a database of adult speech so that it corresponds better to child speech. For this purpose, speaker adaptation techniques such as vocal tract length normalization (VTLN) and speaker adaptive training (SAT) are applied.
Child speech recognition (ChildSR) has great potential for use in computer tools for speech and language development. While it may not be able to replace teacher-pupil interaction, it can substantially increase the assistance a child receives. It also has the potential to make computer technology available to populations that have not been able to use computers fully because of physical disabilities or similar barriers. Interactive entertainment and talking toys are examples of other applications.
This work is intended to further develop and replace a system developed by NTNU for automatic speech recognition of child speech. The new system is developed with a focus on cross-platform compatibility, performance and efficiency. The original system was developed by D. R. Sanand at NTNU for the article "Synthetic Speaker Models Using VTLN to Improve the Performance of Children in Mismatched Speaker Conditions for ASR". It was written mainly in Bash and Perl and relied on the Hidden Markov Model Toolkit (HTK). Because the original system was written for research purposes, it contained many redundant modules. The first step was therefore to strip the old system down to its bare necessities. Following that, to ensure cross-platform portability, the system was rewritten in Python. The Python system achieved a word error rate (WER) of 11.67%, the same WER as the original system, which indicates that the Python implementation reproduces the behavior of the original.
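To make the WER metric concrete, the following is a minimal sketch of the conventional definition: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; the abstract does not state how the scoring was implemented in either system.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; not taken from the thesis code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: one substitution ("sat" -> "sit") and one deletion
# (the second "the") against six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER reported as a percentage, as in this abstract, is this ratio multiplied by 100.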
For training, the adult speech corpus TIMIT is used; testing is performed with the child speech corpus CMUKids. To increase ChildSR performance for the Python implementation, experiments were carried out to optimize decoding with HVite when recognizing adapted test data. The choice of word insertion penalty (PEN) and grammar scale factor (SCALE) has a significant impact on recognition performance. Tests were therefore run with 450 different combinations of PEN and SCALE to find the combination that minimized WER. The optimization resulted in WER=7.03%, i.e. a 39.7% relative improvement over the original system. To analyze the software further, run-time measurements were taken on the scripts to determine the duration of the different processes.
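The parameter sweep described above can be sketched as an exhaustive grid search. This is a minimal illustration under stated assumptions, not the thesis code: `evaluate` is a hypothetical callable that would run one decoding pass (in HTK, the word insertion penalty and grammar scale factor are passed to HVite with the `-p` and `-s` options) and return the resulting WER. The toy WER surface and the 30 x 15 grid are invented for the example; the abstract gives only the total of 450 combinations, not the actual value ranges.

```python
from itertools import product


def grid_search(evaluate, penalties, scales):
    """Try every (PEN, SCALE) pair and return (best_wer, best_pen, best_scale)."""
    best = None
    for pen, scale in product(penalties, scales):
        wer = evaluate(pen, scale)
        if best is None or wer < best[0]:
            best = (wer, pen, scale)
    return best


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def toy_wer(pen, scale):
    """Stand-in for a real decoding run; an invented smooth surface with
    its minimum at PEN=-10, SCALE=8."""
    return 7.0 + 0.01 * (pen + 10) ** 2 + 0.02 * (scale - 8) ** 2


# Hypothetical grid: 30 penalties x 15 scale factors = 450 combinations.
penalties = range(-20, 10)
scales = range(1, 16)

best_wer, best_pen, best_scale = grid_search(toy_wer, penalties, scales)
improvement = relative_improvement(11.67, 7.03)
```

With the two WER figures reported in this abstract, `relative_improvement(11.67, 7.03)` evaluates to about 39.76, consistent with the roughly 39.7% quoted above. In a real sweep each call to `evaluate` is a full decoding run, so the 450 evaluations dominate the cost and are independent of one another, which makes them easy to parallelize.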