SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning

Memon, Abdul Ghafoor; Imran, Ali Shariq; Daudpota, Sher Muhammad; Kastrati, Zenun; Shaikh, Sarang; Batra, Rakhi

dc.contributor.author	Memon, Abdul Ghafoor
dc.contributor.author	Imran, Ali Shariq
dc.contributor.author	Daudpota, Sher Muhammad
dc.contributor.author	Kastrati, Zenun
dc.contributor.author	Shaikh, Sarang
dc.contributor.author	Batra, Rakhi
dc.date.accessioned	2023-11-08T08:33:38Z
dc.date.available	2023-11-08T08:33:38Z
dc.date.created	2023-08-31T08:14:24Z
dc.date.issued	2023
dc.identifier.issn	1932-6203
dc.identifier.uri	https://hdl.handle.net/11250/3101277
dc.description.abstract	Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language; it has recently gained popularity online and is attracting growing attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting meaningful studies. In this paper, we address this challenge by collecting the first large-scale Urdu tweet dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manually labeling such a large number of tweets would be tedious, error-prone, and practically infeasible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. The approach uses emoticons within the tweets, together with SentiWordNet, to categorize the extracted tweets into positive, negative, and neutral classes. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral, and their labels are highly correlated with each other. This is largely attributed to the fact that these tools do not consider emoticons when assigning polarity.	en_US
dc.language.iso	eng	en_US
dc.publisher	Public Library of Science	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.title	SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning	en_US
dc.title.alternative	SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.source.volume	18	en_US
dc.source.journal	PLOS ONE	en_US
dc.source.issue	8	en_US
dc.identifier.doi	10.1371/journal.pone.0290779
dc.identifier.cristin	2171188
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	1
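
The abstract describes labeling tweets automatically from emoticons together with SentiWordNet scores. The following is a minimal sketch of that general idea only, not the authors' implementation: the emoji sets, the toy lexicon, the function name `weak_label`, and the rule of summing the two signals and thresholding at zero are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical emoji sets; the record does not list the emoticons actually used.
POSITIVE_EMOJI = {"😊", "😍", "👍"}
NEGATIVE_EMOJI = {"😢", "😡", "👎"}


def weak_label(tweet: str, lexicon: Dict[str, float]) -> str:
    """Assign 'positive', 'negative', or 'neutral' to a tweet.

    `lexicon` maps a word to a polarity score (a stand-in for
    SentiWordNet-style scores; assumed, not taken from the paper).
    """
    score = 0.0
    cleaned = []
    for ch in tweet:
        if ch in POSITIVE_EMOJI:
            score += 1.0          # each positive emoji adds one vote
            cleaned.append(" ")   # drop the emoji so adjacent words still match
        elif ch in NEGATIVE_EMOJI:
            score -= 1.0          # each negative emoji subtracts one vote
            cleaned.append(" ")
        else:
            cleaned.append(ch)

    # Add the lexicon score of every whitespace-separated token.
    for token in "".join(cleaned).split():
        score += lexicon.get(token, 0.0)

    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    toy_lexicon = {"اچھا": 0.5, "برا": -0.5}  # toy entries: "good", "bad"
    print(weak_label("یہ اچھا ہے 😊", toy_lexicon))
```

A tweet with no emoji and no lexicon hits falls through to "neutral", which is the case where the emoticon signal matters most for separating the classes.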

Associated file(s)

Filename: journal.pone.0290779.pdf
Size: 1.592Mb
Format: PDF

This item appears in the following collection(s)

Department of Computer Science [6559]
Department of Information Security and Communication Technology [2525]
Publications from CRIStin - NTNU [37317]

Unless otherwise stated, this item is licensed under Attribution 4.0 International