Classifying motion picture audio
Master thesis
Permanent link: http://hdl.handle.net/11250/144070
Date of issue: 2007
Collection: Institutt for design

Abstract (English):
Classifying the audio track of a motion picture into traditional audio classes is a
challenging task because of the amount of mixed content: speech often has background
music or environmental sounds, and music often has background environmental sounds
or speech. Traditional methods for separating clear audio classes have limited
performance on mixed audio content. New methods and tools for automatic
classification of this type of audio are therefore needed. This project investigates
combinations of low-level descriptors, dimensionality reduction by PCA, and classification
by KNN. A feature set consisting of Audio Power (AP), Audio Wave Form (AWF),
Root-Mean-Square (RMS), Short Time Energy (STE), Low Short-Time Energy Ratio,
Zero-Crossing Rate (ZCR) and High Zero-Crossing Rate Ratio (HZCRR) in the time domain,
and Audio Spectrum Centroid (ASC), Fundamental Frequency (FuF), Mel-Frequency Cepstral
Coefficients (MFCC) and Spectrum Flux (SF) in the frequency domain, is extracted on
30 ms windows and integrated over a 1.2 second frame to yield a 23-dimensional feature
vector. The most suitable combination for separating speech from background music
was AP, ASC, AWF, STE, RMS, SF, the fourth MFCC, AP (min scalar), ZCR (min scalar) and
STE (min scalar); the majority of these descriptors are from the time domain.
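The window-and-frame extraction described above can be sketched as follows. The 30 ms window and 1.2 second frame lengths come from the abstract; the sample rate, the restriction to three descriptors (ZCR, RMS, STE), and the use of mean plus minimum as frame-level integrators are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def frame_features(signal, sr=16000, win_s=0.030, frame_s=1.2):
    """Window-level ZCR, RMS and STE, integrated (mean and min) per frame."""
    win = int(sr * win_s)             # samples per 30 ms analysis window
    per_frame = int(frame_s / win_s)  # 40 windows per 1.2 s frame

    # split the signal into non-overlapping 30 ms windows
    n_win = len(signal) // win
    windows = signal[:n_win * win].reshape(n_win, win)

    # per-window low-level descriptors
    zcr = np.mean(np.abs(np.diff(np.sign(windows), axis=1)) > 0, axis=1)
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    ste = np.sum(windows ** 2, axis=1)

    feats = []
    for i in range(0, n_win - per_frame + 1, per_frame):
        sl = slice(i, i + per_frame)
        # frame mean plus "min scalar"-style descriptors
        feats.append([zcr[sl].mean(), rms[sl].mean(), ste[sl].mean(),
                      zcr[sl].min(), ste[sl].min()])
    return np.array(feats)
```

For example, a 3 second signal at 16 kHz yields two 1.2 second frames, each described by a five-element feature vector; the thesis combines eleven descriptor families the same way to reach 23 dimensions.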
The most suitable combination of LLDs for separating speech with background environmental
sounds from clear environmental sounds was found to be the 1st, 4th, 5th, 6th, 8th and
9th Mel-Frequency cepstral coefficients, all of which are in the frequency domain. Best results
were achieved when the PCA returned 3 dimensions, and when the KNN classified the
samples based on the 4 closest neighbors. The results from testing different mixtures of
speech and music, to find the boundary where speech with background music is no longer
categorized as speech, showed that the music signal had to be at least 8 dB below the
speech signal to be classified as speech. Some of the low-level descriptors that traditionally
perform well when separating clear classes performed poorly in the experiments
with mixed classes. In particular, ZCR and FuF failed to separate speech with background
music from clear music. A final experiment classified the audio track of the motion
picture 'Groundhog Day'. The first 80 percent of the movie was used for training the KNN,
and the remaining 20 percent was classified. After post-processing, 76.9 percent of the
samples were correctly classified. A table of contents was then generated from this classification.
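The PCA-to-3-dimensions reduction and 4-nearest-neighbor vote described above can be sketched with NumPy alone. The component count (3) and k (4) follow the abstract; the synthetic 23-dimensional two-class data, the SVD-based PCA, and the tie-breaking rule are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def pca_fit(X, n_components=3):
    """Fit PCA: SVD of the centered data gives the principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, comps):
    """Project features onto the retained principal components."""
    return (X - mean) @ comps.T

def knn_predict(train_X, train_y, test_X, k=4):
    """Classify each sample by majority vote among its k nearest training samples."""
    preds = []
    for x in test_X:
        dist = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(dist)[:k]]
        # majority vote; ties resolve to the lowest label id
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# synthetic 23-dimensional features for two well-separated classes (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 23)),
               rng.normal(3.0, 1.0, (50, 23))])
y = np.repeat([0, 1], 50)

mean, comps = pca_fit(X, n_components=3)
Z = pca_transform(X, mean, comps)
pred = knn_predict(Z, y, Z, k=4)
```

Note that classifying the training set itself, as done here for brevity, lets each query count itself as its own nearest neighbor; the thesis instead holds out 20 percent of the material for testing.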