Coarse-to-Fine Speech Retrieval Using Framewise Phoneme Probabilities
Abstract
As archives of digital audio and video grow, so does the need to find specific information within them, which calls for a highly efficient method of searching recorded media. The metadata that currently tag audio material, such as title, date of recording, subject, or person, are not sufficient for accurate and rapid retrieval of specifically requested information. The field of media retrieval has historically received relatively little attention, but interest has increased lately, and new techniques for content-based access to archives of digital audio and video are now evolving and attracting considerable attention from the research community. Recently, a novel technique for speech retrieval was presented. It consists of a method for representing speech as a sequence of framewise phoneme probabilities and a new method for searching speech. The search method uses the framewise phoneme probabilities to determine the segment of speech that most closely matches a spoken query. This thesis first examines ways to improve the retrieval performance of the proposed dynamic programming algorithm. The proposed dynamic programming algorithm finds 65% of the wanted hits among the top 10 results on our test set of 1,132 speech files. The thesis then addresses ways of increasing the speed of the search. The proposed method gives somewhat promising results, reducing the response time by 11% without affecting retrieval effectiveness.
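To make the idea of searching over framewise phoneme probabilities concrete, the following is a minimal sketch and not the algorithm evaluated in this thesis. It assumes that each frame is a probability vector over phonemes, that the local distance between two frames is the negative log of their inner product, and that matching is done with standard subsequence dynamic time warping. The names `frame_distance` and `best_matching_segment` are hypothetical and chosen only for illustration.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Assumed local distance: -log of the inner product of two probability vectors."""
    return -math.log(sum(a * b for a, b in zip(p, q)) + eps)


def best_matching_segment(query, document):
    """Subsequence DTW sketch (an assumption, not the thesis's exact method).

    query, document: non-empty lists of frames, each frame a list of phoneme
    probabilities. Returns (start_frame, end_frame, total_cost), where the
    frame indices refer to the document and end_frame is inclusive.
    """
    n, m = len(query), len(document)
    # First query frame may start at any document frame at no extra cost.
    prev_cost = [frame_distance(query[0], document[j]) for j in range(m)]
    prev_start = list(range(m))
    for i in range(1, n):
        cost = [0.0] * m
        start = [0] * m
        for j in range(m):
            # Predecessors: (i-1, j), (i-1, j-1), and (i, j-1).
            best, best_start = prev_cost[j], prev_start[j]
            if j > 0:
                if prev_cost[j - 1] < best:
                    best, best_start = prev_cost[j - 1], prev_start[j - 1]
                if cost[j - 1] < best:
                    best, best_start = cost[j - 1], start[j - 1]
            cost[j] = best + frame_distance(query[i], document[j])
            start[j] = best_start
        prev_cost, prev_start = cost, start
    # The match may end at any document frame; pick the cheapest.
    end = min(range(m), key=lambda j: prev_cost[j])
    return prev_start[end], end, prev_cost[end]
```

In this sketch the accumulated cost is not normalized by path length, so shorter alignments are favored; a practical system would likely normalize the cost and rank the best segments across all files in the archive.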