Machine Learning of Sub-Phonemic Units for Speech Recognition
Abstract
This work explores the performance of a new set of acoustic model units for speech recognition. The acoustic models were built and evaluated from scratch in several steps: feature extraction, acoustic detection and merging, acoustic segmentation of the TIMIT corpus, clustering of the segment representatives, assignment of a label to each cluster, labelling of the segments with their cluster labels, and finally acoustic modeling. In the acoustic modeling phase, two experiments were carried out using standard HMM structures and the HTK toolkit. In the first experiment, the models were trained and evaluated on a version of the TIMIT training data annotated with cluster labels. In the second experiment, the time-aligned version of the transcriptions was used to train the acoustic models. Both experiments were run on four systems, with 128, 256, 512 and 1024 units. Both single-component and mixture probability estimators were tested. In both experiments, the best results were achieved using three-component GMMs for the 128-unit system.
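The clustering and labelling steps listed above can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm was used, so plain k-means is an assumption here, as are the 13-dimensional random vectors standing in for segment representatives, the toy cluster count of 8 (the systems in this work use 128 to 1024 units), and the hypothetical label format `u0000`.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Recompute the assignment so it matches the final centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def label_segments(assignments, prefix="u"):
    """Map each segment's cluster index to a unit label (hypothetical format)."""
    return [f"{prefix}{int(a):04d}" for a in assignments]

# Toy stand-in for segment representatives: 500 segments, 13 features each.
toy_rng = np.random.default_rng(1)
segment_reps = toy_rng.normal(size=(500, 13))

centroids, assign = kmeans(segment_reps, k=8)
labels = label_segments(assign)
```

Each segment ends up with exactly one unit label, so a transcription in terms of cluster labels is simply the sequence of labels of the segments in an utterance.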