Voice Conversion using Deep Learning

This thesis implements a voice conversion system that transforms one person's voice into another person's voice. Mel-frequency cepstral coefficients are used as features for one set of tests, and STRAIGHT spectrograms are used as features for another. The system uses an artificial neural network to map the features from one speaker to the other.

The system is first trained on around 300 sentences from 6 speakers who are not used for testing. This builds a speaker-independent stacked autoencoder that serves as pre-training for the complete network. The encoder and decoder parts of the stacked autoencoder are then separated by a shallow artificial neural network mapping layer, which maps features from the source speaker to the target speaker. The mapping layer is trained using only 2 or 70 sentences from each of the two speakers. Finally, the complete network, combining the stacked autoencoder with the shallow mapping network, is trained, also using 2 or 70 sentences.
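As a rough illustration of how the pieces fit together, the following minimal sketch shows only the forward pass of an encoder, a mapping layer and a decoder chained in sequence. The layer sizes, the tanh activation, the random initialisation and the names init_layer, forward and convert are all assumptions made for illustration; they are not taken from the thesis, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature dimension, hidden width, bottleneck width.
FEATURE_DIM, HIDDEN_DIM, CODE_DIM = 40, 64, 16


def init_layer(n_in, n_out):
    """Return (weights, bias) for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def forward(x, layers, linear_output=False):
    """Apply each layer in turn; tanh everywhere except an optional linear last layer."""
    for i, (weights, bias) in enumerate(layers):
        x = x @ weights + bias
        is_last = i == len(layers) - 1
        if not (linear_output and is_last):
            x = np.tanh(x)
    return x


# Encoder and decoder halves of the (here untrained) stacked autoencoder.
encoder = [init_layer(FEATURE_DIM, HIDDEN_DIM), init_layer(HIDDEN_DIM, CODE_DIM)]
decoder = [init_layer(CODE_DIM, HIDDEN_DIM), init_layer(HIDDEN_DIM, FEATURE_DIM)]

# Shallow mapping network placed between encoder and decoder (assumed single layer).
mapping = [init_layer(CODE_DIM, CODE_DIM)]


def convert(source_frames):
    """Map source-speaker feature frames to target-speaker feature frames."""
    code = forward(source_frames, encoder)           # source features -> code
    mapped = forward(code, mapping)                  # source code -> target code
    return forward(mapped, decoder, linear_output=True)  # code -> target features


frames = rng.normal(size=(5, FEATURE_DIM))  # 5 synthetic feature frames
converted = convert(frames)
print(converted.shape)
```

In the scheme described above, the encoder and decoder weights would come from pre-training on the speaker-independent data, the mapping layer would be trained on the parallel source and target sentences, and all three parts would then be fine-tuned together.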

The performance of the individual autoencoders, the complete stacked autoencoder and the complete network was evaluated using mel cepstral distortion.
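For reference, mel cepstral distortion is commonly computed per frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) and then averaged over frames. The sketch below follows that common definition and assumes that the frames are already time-aligned and that the 0th (energy) coefficient is excluded; the abstract does not state which variant the thesis uses, and the function name is hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(reference, converted):
    """Mean mel cepstral distortion in dB between two aligned sequences.

    Both inputs have shape (frames, coefficients). Column 0 is assumed to be
    the energy coefficient and is left out of the distance.
    """
    reference = np.asarray(reference, dtype=float)
    converted = np.asarray(converted, dtype=float)
    if reference.shape != converted.shape:
        raise ValueError("sequences must be time-aligned and equally shaped")

    diff = reference[:, 1:] - converted[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


# Tiny synthetic example: one frame, differing by 1.0 in a single coefficient.
target = np.zeros((1, 3))
output = np.array([[5.0, 1.0, 0.0]])
print(mel_cepstral_distortion(target, output))
```

A lower value means the converted features are closer to the target speaker's features; identical sequences give 0 dB.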

The complete network, with all components combined, was unable to train properly.

Publisher

NTNU