BERT- and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: A comparative study
Journal article, Peer reviewed
Accepted version
Permanent lenke
https://hdl.handle.net/11250/3140218Utgivelsesdato
2023Metadata
Vis full innførselSamlinger
- Institutt for IKT og realfag [594]
- Publikasjoner fra CRIStin - NTNU [38046]
Sammendrag
Context: The correct prediction of long-lived bugs could help maintenance teams to build their plan and to fix more bugs that often adversely affect software quality and disturb the user experience across versions in Free/Libre Open-Source Software (FLOSS). Machine Learning and Text Mining methods have been applied to solve many real-world prediction problems, including bug report handling.
Objective: Our research aims to compare the accuracy of ML classifiers on long-lived bug prediction in FLOSS using Bidirectional Encoder Representations from Transformers (BERT)- and Term Frequency - Inverse Document Frequency (TF-IDF)-based feature extraction. Besides that, we aim to investigate BERT variants on the same task.
Method: We collected bug reports from six popular FLOSS and used the Machine Learning classifiers to predict long-lived bugs. Furthermore, we compare different feature extractors, based on BERT and TF-IDF methods, in long-lived bug prediction.
Results: We found that long-lived bug prediction using BERT-based feature extraction systematically outperformed the TF-IDF. The SVM and Random Forest outperformed other classifiers in almost all datasets using BERT. Furthermore, smaller BERT architectures show themselves as competitive.
Conclusion: Our results demonstrated a promising avenue to predict long-lived bugs based on BERT contextual embedding features and fine-tuning procedures.