
dc.contributor.author | Memon, Abdul Ghafoor
dc.contributor.author | Imran, Ali Shariq
dc.contributor.author | Daudpota, Sher Muhammad
dc.contributor.author | Kastrati, Zenun
dc.contributor.author | Shaikh, Sarang
dc.contributor.author | Batra, Rakhi
dc.date.accessioned | 2023-11-08T08:33:38Z
dc.date.available | 2023-11-08T08:33:38Z
dc.date.created | 2023-08-31T08:14:24Z
dc.date.issued | 2023
dc.identifier.issn | 1932-6203
dc.identifier.uri | https://hdl.handle.net/11250/3101277
dc.description.abstract | Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embedding. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of a staggering number of 1,140,821 tweets in the Urdu language. Obviously, manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized to propose a weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, the VADER and TextBlob put most tweets as neutral and show a high correlation between the two. This is largely attributed to the fact that these models do not consider emoticons for assigning polarity. | en_US
dc.language.iso | eng | en_US
dc.publisher | Public Library of Science | en_US
dc.rights | Attribution 4.0 International | *
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/deed.no | *
dc.title | SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning | en_US
dc.title.alternative | SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning | en_US
dc.type | Peer reviewed | en_US
dc.type | Journal article | en_US
dc.description.version | publishedVersion | en_US
dc.source.volume | 18 | en_US
dc.source.journal | PLOS ONE | en_US
dc.source.issue | 8 | en_US
dc.identifier.doi | 10.1371/journal.pone.0290779
dc.identifier.cristin | 2171188
cristin.ispublished | true
cristin.fulltext | original
cristin.qualitycode | 1



Except where otherwise noted, this item is licensed under Attribution 4.0 International