Using neural networks for the detection of cracking sounds during coffee roasting - Data collection, annotation, development and evaluation
Abstract
During the work on this thesis, 9 hours and 55 minutes of coffee roasting audio was recorded using several different modern cellphones. These recordings, together with an additional 3 hours and 12 minutes of previously collected coffee roasting recordings, were annotated, for a total of 13 hours and 7 minutes. All frames containing first crack sounds were labeled, and the actual time of the first crack event was indicated for each recording. A large amount of sound not attributed to first crack was also collected.
Noise in the recordings was reduced using a percussive separation algorithm. Log-mel spectrograms with 128 log-mel band energies were used as input features to the neural network classifier. Several neural network architectures were tested, and the best-performing one across several evaluation metrics was a deep neural network consisting of three hidden layers of 800 neurons each. The resulting classifier had an accuracy of 0.847, an F1 score of 0.735, an MCC of 0.6539, and an AUC of 0.886.
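For reference, the four reported metrics can be computed from frame-level predictions as in the minimal sketch below. It uses the standard definitions of accuracy, F1, the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC). The function names `binary_metrics` and `roc_auc` are illustrative and not taken from the thesis, and the pairwise AUC computation is chosen for readability, not speed.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1 and MCC from the counts of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    # MCC uses all four cells, so it stays informative on imbalanced data.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def roc_auc(labels, scores):
    """AUC as the probability that a random positive frame scores higher
    than a random negative frame (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Accuracy, F1 and MCC depend on the classification threshold used to turn classifier outputs into labels, whereas AUC is computed from the raw scores and is threshold-independent.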
Two postprocessing approaches were tried. The first was based on time-weighting of eligible sequences of first crack sounds, where sequences occurring earlier in time were given a higher weight than sequences occurring later. The results showed that the opposite weighting, where earlier sequences were weighted lower than later ones, performed best, with a mean time difference between the predicted and actual time of the first crack event of 0.93 seconds. This contradiction of the initial assumption is believed to be due to a high number of false positives. The second approach always used the earliest first crack event candidate identified as its prediction. Adding further non-target data and adjusting the classification threshold were both needed to obtain acceptable performance with the second approach. Its optimal parameters were hard to define, as trade-offs had to be made between the mean time difference, the distribution of the time differences, and the failure to find any first crack event in some recordings. The results varied widely with the amount of non-target data added and with the classification threshold. With one setting, the second approach was able to predict the time of the first crack event within 1 second of the actual event; with another setting, it was able to find at least one first crack event candidate in all recordings.
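The two postprocessing ideas can be illustrated with the rough sketch below. The abstract does not state how an eligible sequence is defined or how the weights are computed, so everything here is an assumption made for illustration: a sequence is taken to be a run of at least `min_len` consecutive frames whose classifier output reaches `threshold`, the weight is linear in the position of the sequence within the recording, and the score is sequence length times weight. All function and parameter names are hypothetical.

```python
def find_sequences(probs, threshold=0.5, min_len=3):
    """Return (start, end) frame index pairs, end exclusive, for runs of at
    least min_len consecutive frames with probability >= threshold."""
    sequences = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                sequences.append((start, i))
            start = None
    # Close a run that extends to the end of the recording.
    if start is not None and len(probs) - start >= min_len:
        sequences.append((start, len(probs)))
    return sequences


def first_candidate(probs, frame_seconds, threshold=0.5, min_len=3):
    """Second approach: predict the start time of the earliest sequence.
    Returns None when no candidate is found in the recording."""
    sequences = find_sequences(probs, threshold, min_len)
    if not sequences:
        return None
    return sequences[0][0] * frame_seconds


def weighted_candidate(probs, frame_seconds, threshold=0.5, min_len=3,
                       favor_later=True):
    """First approach: score each sequence by length times a time weight
    (an assumed linear weight) and predict the start of the best one."""
    sequences = find_sequences(probs, threshold, min_len)
    if not sequences:
        return None
    n_frames = len(probs)

    def score(seq):
        start, end = seq
        position = start / n_frames  # 0 at the start of the recording
        weight = position if favor_later else 1.0 - position
        return (end - start) * weight

    best = max(sequences, key=score)
    return best[0] * frame_seconds
```

In this sketch, raising `threshold` suppresses false positives but can leave a recording without any candidate, in which case `None` is returned. This mirrors the trade-off described above between the time difference and the failure to find a first crack event in some recordings.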