Modeling and Confidence in a System for Automatic Classification of Birdsong
Abstract
It turns out that using a two-state HMM model structure with GMM-based state distributions improves the system performance compared to using a single GMM as the model for each bird species. Hence, it is reasonable to say that birdsong contains temporal information which HMMs handle better than GMMs. On average, 82.31% of the birdsong in the test sets is classified correctly using HMMs, while GMMs achieve an average correctness of 80.1% on the same test sets. However, further improvement of the system performance depends on a bigger database, so that the models can be trained with more example data covering all sorts of recording surroundings and more of the variation in birdsong that exists within a bird species.
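As a minimal sketch of this comparison, assuming MFCC-like feature matrices per recording and the hmmlearn and scikit-learn libraries (model sizes, mixture counts and the names train_segments and test_segment are illustrative, not the exact configuration of the thesis):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM
from sklearn.mixture import GaussianMixture

def train_species_models(train_segments):
    """train_segments: list of (n_frames, n_coeffs) feature matrices for one species."""
    X = np.vstack(train_segments)
    lengths = [len(s) for s in train_segments]

    # Two-state HMM with a GMM in each state: captures temporal structure in the song.
    hmm = GMMHMM(n_components=2, n_mix=8, covariance_type="diag", n_iter=20)
    hmm.fit(X, lengths)

    # Baseline: a single GMM per species, ignoring frame order entirely.
    gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=20)
    gmm.fit(X)
    return hmm, gmm

def segment_scores(test_segment, hmm, gmm):
    """Total log-likelihood of one test segment under both model structures."""
    return hmm.score(test_segment), gmm.score_samples(test_segment).sum()
```

Classification then amounts to scoring a segment against every species' models and picking the highest-scoring class.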
The penalty value set in the decoder, which adds a "penalty" when jumping from one bird class to another, should ideally be set such that the number of insertions (extra segments) and deletions (missing segments) in the recognition output becomes equal. However, it turns out that these insertions and deletions mainly concern pause segments (silence and other sounds), and therefore they do not affect the birdsong classification results noticeably. One can also argue that it is more important not to lose information through deletions than to avoid the confusion introduced by extra segments from insertions.
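A minimal sketch of how such a penalty can enter the decoder, assuming frame-level log-likelihoods per class are already available (the array names and the Viterbi-style search below are illustrative, not the decoder used in the thesis):

```python
import numpy as np

def decode_with_penalty(frame_loglik, penalty):
    """frame_loglik: (n_frames, n_classes) log-likelihoods; penalty: cost for switching class.

    Penalising class changes between frames discourages insertions (extra segments);
    a smaller penalty allows more switches and fewer deletions.
    """
    n_frames, n_classes = frame_loglik.shape
    score = frame_loglik[0].copy()
    backptr = np.zeros((n_frames, n_classes), dtype=int)
    trans = np.full((n_classes, n_classes), -penalty)
    np.fill_diagonal(trans, 0.0)                    # staying in the same class is free
    for t in range(1, n_frames):
        cand = score[:, None] + trans               # (from_class, to_class)
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + frame_loglik[t]
    # Trace back the best class sequence.
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```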
When it comes to the frame length and corresponding window length used in the short-time stationary frequency analysis, it turns out that the frame length should be at least 20 ms. The results were roughly the same with frame lengths of 20 ms and 25 ms and corresponding window lengths of 30 ms and 40 ms, respectively. However, such "long" frame lengths lead to less example data per model distribution, which in turn leads to weaker models; hence, "long" frame lengths require more training data. The number of coefficients extracted from the acoustic signal does not change the resulting system performance noticeably. 15 coefficients turned out to give the best performance. It is reasonable to think that fewer than 12 coefficients are insufficient and that more than 19 coefficients are unnecessary.
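For reference, a sketch of how these numbers translate into analysis parameters, assuming MFCC-style coefficients computed with librosa (the exact front end of the thesis is not restated here, and the function name is hypothetical):

```python
import librosa

def extract_features(wav_path, frame_ms=20, window_ms=30, n_coeffs=15):
    """Short-time analysis with the frame/window lengths discussed above."""
    y, sr = librosa.load(wav_path, sr=None)
    hop_length = int(sr * frame_ms / 1000)        # frame length (frame shift)
    win_length = int(sr * window_ms / 1000)       # analysis window length
    n_fft = 1 << (win_length - 1).bit_length()    # next power of two >= window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs,
                                n_fft=n_fft, hop_length=hop_length,
                                win_length=win_length)
    return mfcc.T                                  # (n_frames, n_coeffs)
```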
The different bird species contribute differently to the total error. 9 out of 21 bird species are recognized correctly 100% of the time, while a few bird species are recognized correctly only 50% of the time or less. The reason could be that the songs of some bird species are very similar, i.e. have similar frequency content, making it difficult to distinguish between them. An insufficient amount or poor quality of the example data for these bird species could also be a possible reason. It turns out that the results achieved on the five different test sets used for testing the system vary a lot. One of the test sets gives an error of 12.4%, while another gives an error of 23.8%. Hence, it is reasonable to believe that the test data applied in this thesis is not fully representative of the input data the system will face later on, and conclusions about the system performance based on this test data should not be trusted blindly. The gap between the results on the different test sets could also imply that the models are trained with an inadequate amount of example data, i.e. that the database used in this thesis is too small. The gap between the performance of the system on the training data and on the test data implies that the system does not generalize well, which supports the suspicion of a too small database.
The out-of-class detector implemented to deal with unknown birdsong turns out to be a good idea. Setting thresholds on the log-likelihood score of each recognized segment for the different bird classes makes it possible to classify 34% of the unknown birdsong in the test set as unknown. However, such an out-of-class detector requires a bigger database than the one available today. In this thesis, the thresholds for the log-likelihood scores are set by inspecting the scores achieved on the training set and the test set, which is not optimal. With a bigger database, a held-out data set could be set aside for finding these thresholds. The bigger the database, the more likely it is that the thresholds are set from a data set that is representative of later input data. With a bigger database the out-of-class detector will work better, which also improves the total classification system. Alternatively, a separate model covering all unknown bird classes could be designed, but such a model is difficult to design because of its expected complexity and the lack of acoustic recordings, both from known and unknown bird species.
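A minimal sketch of such a detector, assuming a per-class threshold on the average per-frame log-likelihood of a recognized segment (the threshold normalization and the names below are assumptions for illustration):

```python
UNKNOWN = "unknown"

def detect_out_of_class(best_class, segment_loglik, n_frames, thresholds):
    """Reject a recognized segment as out-of-class if its score is too low.

    best_class:     bird class proposed by the decoder
    segment_loglik: total log-likelihood of the segment under that class
    thresholds:     dict mapping bird class -> threshold on the average
                    per-frame log-likelihood
    """
    avg_score = segment_loglik / max(n_frames, 1)
    if avg_score < thresholds[best_class]:
        return UNKNOWN          # treat as an unknown bird species
    return best_class
```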
This thesis also finds that setting the thresholds used in the out-of-class detector is difficult. It is a compromise between allowing false accepts and false rejects. With the database used in this thesis, it turns out that the system is very sensitive to false rejects. This is due to the way the out-of-class detector is tested in this thesis, where only a small fraction of the birdsong files in the test set belongs to unknown bird classes. Hence, the thresholds must be set such that no false rejects are allowed, while getting rid of as many false accepts as possible at the same time. A system that knows only a few of the bird species existing in nature is likely to face a lot of unknown birdsong; it is therefore important to set the thresholds high enough to avoid many false accepts in that case. On the other hand, a system that knows most of the existing bird species is unlikely to face unknown birdsong; hence, it is important that the thresholds are set low enough that false rejects are avoided.
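The compromise can be made explicit by sweeping a candidate threshold over scored segments with known labels, as in this sketch (the data layout is an assumption, not the evaluation procedure of the thesis):

```python
def false_accept_reject_rates(scored_segments, threshold):
    """scored_segments: list of (avg_score, is_known) pairs for one bird class.

    A false accept is an unknown segment that is kept (score >= threshold);
    a false reject is a known segment that is rejected (score < threshold).
    """
    unknown = [s for s, known in scored_segments if not known]
    known = [s for s, known in scored_segments if known]
    fa_rate = sum(s >= threshold for s in unknown) / max(len(unknown), 1)
    fr_rate = sum(s < threshold for s in known) / max(len(known), 1)
    return fa_rate, fr_rate
```

Raising the threshold trades false accepts for false rejects, which is exactly the balance discussed above.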