## Nonlinear dynamical systems for automatic speech recognition

##### Doctoral thesis

##### Permanent link

http://hdl.handle.net/11250/228280

##### Publication date

2001


##### Abstract

In this thesis, we investigate the possibility of using ideas from nonlinear dynamical systems theory in practical automatic speech recognition (ASR).
The work presented in this thesis is centered on a speech production model called the Chained Dynamical System Model (CDSM), which is motivated by the theoretical limitations of mainstream ASR approaches. The CDSM is essentially a smoothly time-varying, continuous-state nonlinear dynamical system consisting of two dynamical sub-systems coupled as a chain, so that one system controls the parameters of the next. The lower-level dynamical system represents an articulator model whose input and output are the articulator configuration vector and the speech waveform, respectively. The higher-level dynamical system is a model that outputs the articulator configuration vector for a given motor command or gestural-target input.
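The chaining idea can be illustrated as two coupled discrete-time state-space systems, where the upper system's output parameterizes the lower one. The specific dynamics, dimensions, and functions below are hypothetical stand-ins for illustration, not the thesis's actual model:

```python
import numpy as np

def upper_step(state, motor_cmd):
    # Hypothetical upper-level dynamics: the state relaxes toward the
    # gestural target (motor command); the output is the articulator
    # configuration vector.
    state = state + 0.1 * (motor_cmd - state)
    articulator_cfg = np.tanh(state)
    return state, articulator_cfg

def lower_step(state, articulator_cfg):
    # Hypothetical lower-level dynamics: the articulator configuration
    # modulates the parameters of a simple oscillator whose output
    # stands in for a waveform sample.
    freq = 0.2 + 0.1 * articulator_cfg[0]
    state = state + freq
    sample = np.sin(state)
    return state, sample

def run_cdsm(motor_cmds, dim=3):
    # Chain the two systems: the upper system's output controls the
    # parameters of the lower system at every time step.
    upper, lower = np.zeros(dim), 0.0
    waveform = []
    for cmd in motor_cmds:          # one motor command per frame
        upper, cfg = upper_step(upper, cmd)
        lower, sample = lower_step(lower, cfg)
        waveform.append(sample)
    return np.array(waveform)

targets = [np.ones(3) * 0.5] * 50   # a constant gestural target
wave = run_cdsm(targets)
```

Running the chain forward with a fixed gestural target produces a waveform-like signal; recognition, as described next, amounts to running this chain in reverse.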
With the CDSM, the speech recognition problem can be posed as finding the motor command or gestural-target sequence for a given speech waveform, which is nothing other than inverting the CDSM. We propose a solution to this inversion problem based on embedding theory. The resulting architecture, which we call the Inverted CDSM (ICDSM), consists of four main components: a projection function (Fp), a prediction function (F2), a set of zero-mean Gaussian probability distributions (Pr), and a set of control vectors (A). The parameters of the system are learned from a dataset using a global gradient-based training scheme.
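The role of embedding theory here can be illustrated with a delay-coordinate reconstruction: a scalar waveform is embedded into vectors of lagged samples, on which a projection like Fp and a predictor like F2 would then operate. The embedding dimension, lag, and linear predictor below are illustrative assumptions, not the thesis's components:

```python
import numpy as np

def delay_embed(x, dim, lag):
    # Build delay-coordinate vectors [x[t], x[t+lag], ..., x[t+(dim-1)*lag]]:
    # the standard reconstruction of a state space from a scalar signal.
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

# A toy "waveform": a sampled sine wave.
t = np.arange(200)
x = np.sin(0.1 * t)

E = delay_embed(x, dim=4, lag=2)      # reconstructed state vectors

# A linear least-squares one-step predictor standing in for F2:
# predict the next sample from the current delay vector.
X, y = E[:-1], E[1:, 0]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
```

For a pure sinusoid the delay vectors span the signal exactly, so the linear predictor fits with essentially zero error; real speech would of course require the nonlinear, trainable functions the thesis describes.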
A set of experiments, involving a speaker-independent, 39-class, continuous phoneme recognition task on the TIMIT database, is performed to evaluate the capability of the ICDSM in practical ASR. Throughout these experiments, special attention is paid to its ability to cope with co-articulation effects, which is shown theoretically to be good.
We start with experiments that deal with different initialization and training procedures, as well as efficient ways of implementing the function Fp. The system gives its best results, about 64% accuracy, when a recurrent network performs the function of Fp. An accuracy of about 62% can be obtained using a traditional MLP with a lower computational burden.
One weakness of the ICDSM in its current form is that it does not make use of any static information in the state space. To alleviate this weakness, we propose a simple hybrid of the ICDSM and a conventional HMM, in which the state-conditioned likelihoods of the two are combined. This system gives improved results: about 67% accuracy for the recurrent-net Fp and about 65% for the MLP.
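One common way to combine state-conditioned likelihoods from two models is log-linear interpolation, sketched below. The interpolation weight and the toy per-state scores are illustrative assumptions; the thesis's actual combination rule may differ:

```python
def combine_loglik(ll_icdsm, ll_hmm, w=0.5):
    # Log-linear interpolation of the two state-conditioned
    # log-likelihoods; w trades off the dynamic (ICDSM) and the
    # static (HMM) information sources.
    return w * ll_icdsm + (1.0 - w) * ll_hmm

# Toy per-state log-likelihoods for a single frame:
states = ["ah", "iy", "s"]
ll_icdsm = {"ah": -2.0, "iy": -3.0, "s": -6.0}
ll_hmm   = {"ah": -4.0, "iy": -2.0, "s": -5.0}

combined = {q: combine_loglik(ll_icdsm[q], ll_hmm[q]) for q in states}
best = max(combined, key=combined.get)
```

In this toy frame, the ICDSM alone would prefer "ah" and the HMM alone "iy"; the combined score resolves the disagreement using both sources.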
Another series of experiments studies the ICDSM in the context of environment and speaker variation modeling. The techniques considered include enhancing built-in invariance as well as mismatch reduction through adaptation. One key property of the ICDSM, investigated in most of these experiments, is the possibility of transferring variation effects from one functional component to another.
To compensate for speaker variations, a multiple-mixture-component architecture for the function F2 is proposed. Such a system, realized with 3 mixture components, gives about 65% recognition accuracy, an improvement of about 3% over the single-mixture system.
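A multiple-component predictor of this kind can be sketched as a small mixture of regressors whose outputs are blended by softmax gating weights, so different components can specialize to different speaker characteristics. The component form, the gate, and all dimensions below are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

class MixturePredictor:
    # A hypothetical 3-component mixture standing in for F2: each
    # component is a linear one-step predictor, and a softmax gate
    # decides how much each component contributes for a given state.
    def __init__(self, n_components, dim):
        self.W = rng.normal(size=(n_components, dim))   # component weights
        self.G = rng.normal(size=(n_components, dim))   # gating weights

    def __call__(self, x):
        logits = self.G @ x
        gate = np.exp(logits - logits.max())            # stable softmax
        gate /= gate.sum()
        components = self.W @ x                         # one prediction each
        return float(gate @ components)                 # gated combination

f2 = MixturePredictor(n_components=3, dim=4)
y = f2(np.ones(4))
```

In a trained system the gate and components would be learned jointly with the rest of the ICDSM rather than drawn at random as here.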
A simple 're-training on adaptation data' scheme is tested on a gender adaptation task. This reveals that adaptation of the function Fp is more effective than adaptation of F2 and/or A, resulting in an accuracy improvement of about 2%. In an unsupervised adaptation task, where Fp is left out because of its high complexity, adaptation of F2 and A gives similar improvements.
In another experiment, a normalization procedure for time-scale warping is studied. The procedure is based on a transformation that maps the actual time scale onto the reconstructed trajectory of the state vector itself, exploiting the idea that the trajectory is the same even if the rate at which it is traversed differs. However, the ICDSM architecture realizing this normalization procedure gives only a slight accuracy improvement of about 1%.
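The rate-invariance idea can be illustrated by re-parameterizing a state trajectory by its cumulative arc length instead of time: two traversals of the same curve at different speeds become identical point sequences. This is a generic sketch of the principle, not the thesis's exact transformation:

```python
import numpy as np

def arclength_resample(traj, n_points):
    # Map the trajectory from the time axis onto its own arc length:
    # the resampled points depend only on the curve traced out, not on
    # the rate at which it was traversed.
    deltas = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(deltas)])   # arc length per sample
    s_new = np.linspace(0.0, s[-1], n_points)
    return np.stack([np.interp(s_new, s, traj[:, d])
                     for d in range(traj.shape[1])], axis=1)

# The same circle traversed slowly (400 samples) and quickly (80 samples):
t_slow = np.linspace(0, 2 * np.pi, 400)
t_fast = np.linspace(0, 2 * np.pi, 80)
slow = np.stack([np.cos(t_slow), np.sin(t_slow)], axis=1)
fast = np.stack([np.cos(t_fast), np.sin(t_fast)], axis=1)

a = arclength_resample(slow, 50)
b = arclength_resample(fast, 50)
```

After arc-length resampling, the two traversals agree to within the polygonal approximation error, even though one was sampled five times more densely than the other.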
Since the ICDSM operates directly in the waveform space, it is well suited to attempting noise-robust ASR. In fact, the operation performed by the function Fp can be seen as a generalization of the eigen-based subspace noise removal technique. The same 39-class phoneme recognition task as above, but on the NTIMIT database, is selected to evaluate the noise robustness of the ICDSM. When trained and tested on NTIMIT, the ICDSM outperforms the HMM system used as the reference. This observation remains valid when the systems are trained on TIMIT but tested on NTIMIT. Employing an adaptation scheme similar to the one used in the speaker-variation case leads to further improvements in this situation. However, as before, adaptation of Fp gives higher improvements than adaptation of the other components F2 and/or A.
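The eigen-based subspace technique that Fp generalizes can be sketched as embedding the noisy signal into overlapping delay frames, keeping only the leading singular directions (the signal subspace), and averaging back to a 1-D signal. This SVD-based version is an illustrative textbook variant, with toy parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def subspace_denoise(x, dim, rank):
    # Embed the signal into a matrix of overlapping frames, keep only
    # the top 'rank' singular directions (the signal subspace), and
    # average the overlapping reconstructions back into a 1-D signal.
    n = len(x) - dim + 1
    H = np.stack([x[i : i + n] for i in range(dim)])    # dim x n frame matrix
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Hr = (U[:, :rank] * s[:rank]) @ Vt[:rank]           # rank-reduced version
    y = np.zeros(len(x))
    counts = np.zeros(len(x))
    for i in range(dim):                                 # overlap-average
        y[i : i + n] += Hr[i]
        counts[i : i + n] += 1
    return y / counts

t = np.arange(400)
clean = np.sin(0.07 * t)
noisy = clean + 0.3 * rng.normal(size=t.size)
denoised = subspace_denoise(noisy, dim=20, rank=2)
```

A sinusoid occupies a rank-2 subspace of the frame matrix, so a rank-2 projection removes most of the broadband noise while preserving the signal; Fp, as described above, can be viewed as a learned nonlinear generalization of this projection.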
The main conclusion of the work is that, even though the performance of the ICDSM system in its current form is comparable to that of corresponding systems in the literature, it still lags behind the best systems by a considerable margin of recognition accuracy. There is, however, much room for further development of the ICDSM and of systems based on related ideas.