Indexing of Audio Databases: Event Log of Broadcast News
The amount of non-textual media on the Internet is increasing, which creates a greater need to be able to search this type of media. The goal of this thesis is to enable information search in audio databases by means of their soundtracks. To learn the content of an audio file, one wants a system that can automatically extract the necessary information. The first step in building such a system is to record what happens at which time in an event log. This thesis treats the beginning of that process. The experiments dealt with detection of pauses lasting longer than 1 second and detection of speaker changes. The corpus used in the experiments consists of radio news broadcasts from the Norwegian Broadcasting Corporation (NRK). Each broadcast had a transcription, which was used as a reference when evaluating the results. Another corpus, the HUB-4 1997 evaluation data, was used for comparative tests.

A lot of work on indexing of audio databases has already been conducted. Because corpora differ, the same methods may give varying results. In this thesis, common segmentation methods were used, with the parameters adapted to give the best possible results on the given corpus. For the pause detection, model-based segmentation was used: a Gaussian mixture model was implemented for each of the two events, sound and long pause. For the speaker segmentation, experiments with different metric-based segmentation techniques were performed. The Bayesian information criterion (BIC) and a modified version of this criterion were tested with different options and parameter values. A false alarm compensation based on the symmetric Kullback-Leibler distance was implemented in an attempt to reduce the number of false change points. The pause detection was not successful.
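As an illustration of the metric-based speaker segmentation described above, the following is a minimal sketch of the standard BIC change test between one full-covariance Gaussian for a window and two Gaussians split at a candidate point. It is not the thesis implementation: the modified BIC variant and the Kullback-Leibler false alarm compensation are not reproduced, and the function names, the `penalty_weight` parameter and the `margin` parameter are assumptions made for this example.

```python
import numpy as np


def _logdet_ml_cov(frames):
    """Log-determinant of the maximum-likelihood full covariance of the frames."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    _sign, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(frames, split, penalty_weight=1.0):
    """Delta-BIC for a hypothesised change at frame index `split`.

    `frames` is an (N, d) array of feature vectors (for example static MFCCs).
    A positive value favours modelling the window as two segments.
    """
    frames = np.asarray(frames, dtype=float)
    n, dim = frames.shape
    n_left, n_right = split, n - split
    # Log-likelihood gain of two Gaussians over a single Gaussian.
    gain = 0.5 * (n * _logdet_ml_cov(frames)
                  - n_left * _logdet_ml_cov(frames[:split])
                  - n_right * _logdet_ml_cov(frames[split:]))
    # Extra parameters of the second model: one mean and one full covariance.
    n_params = dim + dim * (dim + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty


def best_change_point(frames, margin=20, penalty_weight=1.0):
    """Return (split, score) for the candidate with the highest Delta-BIC.

    `margin` (an assumption of this sketch) keeps enough frames on each
    side of the split to estimate a covariance matrix.
    """
    frames = np.asarray(frames, dtype=float)
    candidates = range(margin, len(frames) - margin)
    scores = [delta_bic(frames, s, penalty_weight) for s in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

A change point would be accepted only when the returned score is positive. Raising `penalty_weight` makes the test more conservative, which trades false alarms for false rejections.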
With the manual transcription as reference, an F-score of 38.1 % was obtained when the settings were adjusted to give about the same number of false alarms and false rejections. However, further investigation showed that the transcription had flaws with respect to the labeling of pauses. An evaluation of the wrongly inserted pauses showed that most of these segments actually contained silence or noise. The number of missed pauses was nevertheless unknown, so it was not possible to obtain a reliable F-score. An attempt was made to label all pauses in the HUB-4 1997 data. With the modified transcription, an F-score of 81.7 % was obtained. It is possible that unlabeled pauses still exist in the transcription, as the labeling was performed only by looking at the audio signal. The classification experiments made it clear that using 1st and 2nd order delta coefficients in the feature vectors gave an improvement over using static MFCCs alone. An F-score of 98.8 % was obtained from these experiments, which implies that the models are good when the segment boundaries are known. To get trustworthy results from the recognition task, the transcription must be reviewed.

When using the modified version of BIC and false alarm compensation for speaker change detection, an F-score of 77.1 % was obtained. The average mismatch between correctly detected change points and the reference transcription was 339 milliseconds. For comparison with other work, an F-score of 72.8 % was obtained with the HUB-4 1997 data, where Ajmera et al. (2002) obtained an F-score of 67 % with the same data. Full covariance matrices gave an improvement over diagonal covariance matrices, and static MFCCs as feature vectors gave better results than MFCCs including delta coefficients. Including pitch as an additional feature did not improve the results.
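The evaluation measures reported above (false alarms, false rejections, F-score and average mismatch) can be sketched as follows. This is a hedged illustration, not the scoring procedure used in the thesis: the `tolerance` window, the greedy nearest-neighbour matching and the function name `score_change_points` are assumptions made for this example.

```python
def score_change_points(detected, reference, tolerance=1.0):
    """Score detected change points (seconds) against reference change points.

    A detection counts as correct if it lies within `tolerance` seconds of a
    reference point that has not already been matched (greedy matching,
    an assumption of this sketch).
    """
    unmatched = sorted(reference)
    mismatches = []
    for t in sorted(detected):
        if not unmatched:
            break
        nearest = min(unmatched, key=lambda r: abs(r - t))
        if abs(nearest - t) <= tolerance:
            mismatches.append(abs(nearest - t))
            unmatched.remove(nearest)

    hits = len(mismatches)
    false_alarms = len(detected) - hits        # detections with no reference
    false_rejections = len(reference) - hits   # reference points missed
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    mean_mismatch = sum(mismatches) / hits if hits else None
    return {
        "hits": hits,
        "false_alarms": false_alarms,
        "false_rejections": false_rejections,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "mean_mismatch": mean_mismatch,
    }
```

When the detector settings are tuned so that false alarms and false rejections are roughly equal, precision and recall are roughly equal as well, and the F-score is close to both.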