Classifying motion picture audio
Master thesis
Permanent link: http://hdl.handle.net/11250/144070
Date of issue: 2007
Collection: Institutt for design

Abstract (English):
Classifying the audio track of a motion picture into traditional audio classes is a
challenging task because of the amount of mixed content: speech often has background
music or environmental sounds, and music often has background environmental sounds
or speech. Traditional methods for separating clear audio classes have limited
performance on mixed audio content. New methods and tools for automatic
classification of this type of audio are therefore needed. This project investigates
combinations of low-level descriptors, dimensionality reduction by PCA, and classification
by KNN. A feature set consisting of Audio Power (AP), Audio Wave Form (AWF),
Root-Mean-Square (RMS), Short Time Energy (STE), Low Short-Time Energy Ratio,
Zero-Crossing Rate (ZCR) and High Zero-Crossing Rate Ratio (HZCRR) in the time domain,
and Audio Spectrum Centroid (ASC), Fundamental Frequency (FuF), Mel-Frequency Cepstral
Coefficients (MFCC) and Spectrum Flux (SF) in the frequency domain, is extracted on
30 ms windows and integrated over a 1.2 second frame to yield a 23-dimensional feature
vector. The most suitable combination for separating speech from background music
was AP, ASC, AWF, STE, RMS, SF, the fourth MFCC, AP (min scalar), ZCR (min scalar) and
STE (min scalar); the majority of these descriptors are from the time domain.
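The window-and-frame extraction described above can be sketched as follows. The 30 ms window and 1.2 second frame lengths come from the abstract; the sample rate, the restriction to three descriptors (ZCR, RMS, STE), and the use of mean plus minimum as frame-level integrators are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def frame_features(signal, sr=16000, win_s=0.030, frame_s=1.2):
    """Window-level ZCR, RMS and STE, integrated (mean and min) per frame."""
    win = int(sr * win_s)             # samples per 30 ms analysis window
    per_frame = int(frame_s / win_s)  # 40 windows per 1.2 s frame

    # split the signal into non-overlapping 30 ms windows
    n_win = len(signal) // win
    windows = signal[:n_win * win].reshape(n_win, win)

    # per-window low-level descriptors
    zcr = np.mean(np.abs(np.diff(np.sign(windows), axis=1)) > 0, axis=1)
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    ste = np.sum(windows ** 2, axis=1)

    feats = []
    for i in range(0, n_win - per_frame + 1, per_frame):
        sl = slice(i, i + per_frame)
        # frame mean plus "min scalar"-style descriptors
        feats.append([zcr[sl].mean(), rms[sl].mean(), ste[sl].mean(),
                      zcr[sl].min(), ste[sl].min()])
    return np.array(feats)
```

For example, a 3 second signal at 16 kHz yields two 1.2 second frames, each described by a five-element feature vector; the thesis combines eleven descriptor families the same way to reach 23 dimensions.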
The most suitable combination of LLDs for separating speech with background environmental
sounds from clear environmental sounds was found to be the 1st, 4th, 5th, 6th, 8th and
9th Mel-Frequency cepstral coefficients, all of which are in the frequency domain. Best results
were achieved when the PCA returned 3 dimensions, and when the KNN classified the
samples based on the 4 closest neighbors. The results from testing different mixtures of
speech and music, to find the boundary where speech with background music is no longer
categorized as speech, showed that the music signal had to be at least 8 dB below the
speech signal to be classified as speech. Some of the low-level descriptors that traditionally
perform well when separating clear classes performed poorly in the experiments
with mixed classes. In particular, ZCR and FuF failed to separate speech with background
music from clear music. A final experiment classified the audio track of the motion
picture 'Groundhog Day'. The first 80 percent of the movie was used for training the KNN,
and the remaining 20 percent was classified. After post-processing, 76.9 percent of the
samples were correctly classified. A table of contents was then generated from this classification.
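The PCA-to-3-dimensions reduction and 4-nearest-neighbor vote described above can be sketched with NumPy alone. The component count (3) and k (4) follow the abstract; the synthetic 23-dimensional two-class data, the SVD-based PCA, and the tie-breaking rule are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def pca_fit(X, n_components=3):
    """Fit PCA: SVD of the centered data gives the principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, comps):
    """Project features onto the retained principal components."""
    return (X - mean) @ comps.T

def knn_predict(train_X, train_y, test_X, k=4):
    """Classify each sample by majority vote among its k nearest training samples."""
    preds = []
    for x in test_X:
        dist = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(dist)[:k]]
        # majority vote; ties resolve to the lowest label id
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# synthetic 23-dimensional features for two well-separated classes (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 23)),
               rng.normal(3.0, 1.0, (50, 23))])
y = np.repeat([0, 1], 50)

mean, comps = pca_fit(X, n_components=3)
Z = pca_transform(X, mean, comps)
pred = knn_predict(Z, y, Z, k=4)
```

Note that classifying the training set itself, as done here for brevity, lets each query count itself as its own nearest neighbor; the thesis instead holds out 20 percent of the material for testing.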