Classifying motion picture audio
- Institutt for design 
ENGLISH: Classifying the audio track of a motion picture into traditional audio classes is a challenging task because of the amount of mixed content. Speech often has background music or environmental sounds, and music often has background environmental sounds or speech. Traditional methods for separating clear audio classes have limited performance on mixed audio content, so new methods and tools for automatic classification of this type of audio are needed. This project investigates combinations of low-level descriptors (LLDs), dimensionality reduction by principal component analysis (PCA) and classification by k-nearest neighbors (KNN). A feature set consisting of Audio Power (AP), Audio Wave Form (AWF), Root-Mean-Square (RMS), Short-Time Energy (STE), Low Short-Time Energy Ratio, Zero-Crossing Rate (ZCR) and High Zero-Crossing Rate Ratio (HZCRR) in the time domain, and Audio Spectrum Centroid (ASC), Fundamental Frequency (FuF), Mel-Frequency Cepstral Coefficients (MFCC) and Spectrum Flux (SF) in the frequency domain, is extracted on 30 ms windows and integrated over a 1.2-second frame to yield a 23-dimensional feature vector. The most suitable combination for separating speech from background music was AP, ASC, AWF, STE, RMS, SF, the fourth MFCC, AP (min scalar), ZCR (min scalar) and STE (min scalar); the majority of these descriptors are from the time domain. The most suitable combination of LLDs for separating speech with background environmental sounds from clear environmental sounds was found to be the 1st, 4th, 5th, 6th, 8th and 9th Mel-Frequency Cepstral Coefficients, all of which are in the frequency domain. The best results were achieved when the PCA returned 3 dimensions and the KNN classified the samples based on the 4 closest neighbors. Tests with different mixtures of speech and music, run to find the boundary where speech with background music is no longer categorized as speech, showed that the music signal had to be at least 8 dB below the speech signal for the mixture to be classified as speech.
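The windowing scheme described above (descriptors computed on 30 ms windows, then integrated over a 1.2-second frame) can be illustrated with a minimal sketch for three of the time-domain descriptors. This is not the thesis implementation: the function names, the non-overlapping windows, the mean-square definition of STE and the use of the mean alongside the minimum as frame-level scalars are all assumptions made for illustration.

```python
import math

def split_windows(signal, size):
    """Split a signal into consecutive non-overlapping windows of `size` samples.
    Non-overlapping windows are an assumption; the abstract does not state the hop size."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def short_time_energy(window):
    """Mean squared amplitude of one window (one common STE definition, assumed here)."""
    return sum(x * x for x in window) / len(window)

def rms(window):
    """Root-mean-square amplitude of one window."""
    return math.sqrt(short_time_energy(window))

def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(window) - 1)

def frame_features(frame, sample_rate, window_seconds=0.030):
    """Integrate per-window descriptors over one frame (e.g. 1.2 s of samples).
    Returns the mean and minimum of each descriptor over the frame's windows."""
    size = int(round(window_seconds * sample_rate))
    windows = split_windows(frame, size)
    per_window = {
        "ste": [short_time_energy(w) for w in windows],
        "rms": [rms(w) for w in windows],
        "zcr": [zero_crossing_rate(w) for w in windows],
    }
    features = {"n_windows": len(windows)}
    for name, values in per_window.items():
        features[name + "_mean"] = sum(values) / len(values)
        features[name + "_min"] = min(values)
    return features

# Usage: a 1.2 s frame of a 100 Hz sine sampled at 8 kHz (synthetic test signal).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(int(1.2 * sample_rate))]
features = frame_features(frame, sample_rate)
```

At 8 kHz a 1.2-second frame holds 40 windows of 30 ms each, so every frame-level scalar here summarizes 40 per-window values.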
Some of the low-level descriptors that traditionally perform well when separating clear classes performed poorly in the experiments with mixed classes. ZCR and FuF in particular failed to separate speech with background music from clear music. A final experiment classified the audio track of the motion picture ‘Groundhog Day’. The first 80 percent of the movie was used to train the KNN, and the remaining 20 percent was classified. After post-processing, 76.9 percent was correctly classified. A table of contents was then built from this classification.
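The final experiment can be sketched roughly as follows: a chronological 80/20 split, a KNN vote over the 4 closest neighbors, and a smoothing pass over the predicted labels. This is a simplified sketch, not the thesis code. All function names are hypothetical, Euclidean distance is assumed, and the sliding majority vote only stands in for the post-processing step, which the abstract does not describe.

```python
import math
from collections import Counter

def chronological_split(features, labels, train_fraction=0.8):
    """Use the first part of the track for training and the remainder for testing."""
    cut = int(len(features) * train_fraction)
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])

def knn_predict(train_x, train_y, x, k=4):
    """Label x by majority vote among its k nearest training samples.
    Euclidean distance is an assumption; ties go to the label seen first
    among the nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def smooth_labels(labels, radius=2):
    """Hypothetical post-processing: replace each label with the majority label
    in a window of up to `radius` frames on either side."""
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - radius), min(len(labels), i + radius + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed

# Usage with made-up 2-D feature vectors (for example, frames after PCA).
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2),
            (4.8, 5.0), (0.0, 0.3), (5.1, 5.2), (0.1, 0.1), (5.0, 5.0)]
labels = ["speech", "music"] * 5
(train_x, train_y), (test_x, test_y) = chronological_split(features, labels)
predicted = [knn_predict(train_x, train_y, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
```

Splitting chronologically rather than at random keeps the test frames strictly later in the film than the training frames, which matches the setup described above.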