Show simple item record

dc.contributor.author: Shaikh, Sarang
dc.contributor.author: Daudpota, Sher Muhammad
dc.contributor.author: Imran, Ali Shariq
dc.contributor.author: Kastrati, Zenun
dc.date.accessioned: 2021-03-08T09:50:29Z
dc.date.available: 2021-03-08T09:50:29Z
dc.date.created: 2021-01-19T12:16:16Z
dc.date.issued: 2021
dc.identifier.citation: Applied Sciences. 2021, 11 (2), 869. [en_US]
dc.identifier.issn: 2076-3417
dc.identifier.uri: https://hdl.handle.net/11250/2732089
dc.description.abstract: Data imbalance is a frequently occurring problem in classification tasks where the number of samples in one category exceeds the number in others. Quite often, the minority-class data are of great importance, representing concepts of interest, and are challenging to obtain in real-life scenarios and applications. Imagine a customer dataset for bank loans: the majority of instances belong to the non-defaulter class, and only a small number of customers are labeled as defaulters; however, in such highly imbalanced datasets, classification accuracy on the defaulter label matters more than on the non-defaulter label. A lack of sufficient data samples across all class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which must retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using generated text. [en_US]
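The abstract describes balancing a text dataset by generating synthetic minority-class samples with GPT-2. The following is a minimal illustrative sketch of that idea using the Hugging Face transformers library, not the authors' actual pipeline: the prompt, sampling parameters, and use of the off-the-shelf gpt2 checkpoint are assumptions, and the grammatical-validation step mentioned in the abstract is omitted.

```python
# Illustrative sketch only: one plausible way to oversample a minority text
# class with GPT-2, per the idea in the abstract. Model choice, prompts, and
# parameters are assumptions; the paper's grammatical validation is omitted.
from transformers import pipeline

# Off-the-shelf GPT-2 text-generation pipeline; the paper conditions
# generation on the minority class, which is not reproduced here.
generator = pipeline("text-generation", model="gpt2")

def augment_minority_class(seed_texts, per_seed=3, max_length=60):
    """Generate synthetic minority-class samples from real seed texts."""
    synthetic = []
    for seed in seed_texts:
        outputs = generator(
            seed,
            max_length=max_length,
            num_return_sequences=per_seed,
            do_sample=True,  # sampling yields varied continuations
            top_k=50,
        )
        synthetic.extend(o["generated_text"] for o in outputs)
    return synthetic

# Example: oversampling a tiny "defaulter" class from the abstract's
# bank-loan scenario (hypothetical seed text).
seeds = ["The customer missed three consecutive loan repayments and"]
print(augment_minority_class(seeds, per_seed=2))
```

The generated texts would then be appended to the minority class before training the classifier, bringing the class distribution closer to balance.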
dc.language.iso: eng [en_US]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/deed.no [*]
dc.title: Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models [en_US]
dc.type: Peer reviewed [en_US]
dc.type: Journal article [en_US]
dc.description.version: publishedVersion [en_US]
dc.source.volume: 11 [en_US]
dc.source.journal: Applied Sciences [en_US]
dc.source.issue: 2 [en_US]
dc.identifier.doi: 10.3390/app11020869
dc.identifier.cristin: 1874228
dc.description.localcode: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. [en_US]
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1

