Identifying Sarcasm in English and Norwegian Twitter Messages

Trelease, Hanne Marie

Master thesis

Permanent link

http://hdl.handle.net/11250/2615855

Publication date

2017

Collections

Institutt for datateknologi og informatikk [6828]

Abstract

Twitter is today a very popular microblogging platform with vast amounts of available data. This has created an interest in extracting information from Twitter data, for example through sentiment analysis. Natural Language Processing (NLP) generally interprets the literal meaning of a text, which makes sarcasm a disruptive factor in sentiment analysis and other NLP tasks. The intended meaning of a sarcastic sentence is often the opposite of its literal meaning, which flips the polarity of the sentence. Because of the challenge sarcasm presents, researchers have shown interest in automatic sarcasm detection in social media and Twitter data.

This Master's Thesis introduces a sarcasm detection system for Twitter messages, known as tweets, for Norwegian and English data. The system detects sarcasm by using a supervised machine learning approach, and evaluations of three different machine learning classifiers are presented for the two languages.

The impact of hashtag splitting and emojis on sarcasm detection and on the different feature groups used is also explored.
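As an illustration of what hashtag splitting can look like, the following is a minimal sketch that breaks a hashtag into words at capitalisation and digit boundaries. The function name `split_hashtag` and the boundary rule are assumptions made for this example; the abstract does not say which splitting method the thesis uses, and an all-lowercase hashtag would need a dictionary-based approach instead.

```python
import re

# Illustrative splitter (an assumption, not the thesis's method): break a
# hashtag into tokens at capital-letter and digit boundaries.
# The character classes include the Norwegian letters æ, ø and å.
_TOKEN_RE = re.compile(
    r"[A-ZÆØÅ]+(?![a-zæøå])"   # run of capitals, e.g. an acronym
    r"|[A-ZÆØÅ]?[a-zæøå]+"     # optionally capitalised word
    r"|\d+"                    # run of digits
)


def split_hashtag(tag):
    """Return the word-like parts of a hashtag, without the leading '#'."""
    body = tag.lstrip("#")
    parts = _TOKEN_RE.findall(body)
    # Fall back to the unsplit body if nothing matched (e.g. only symbols).
    return parts or [body]
```

For example, `split_hashtag("#NotFunny")` yields the two tokens `Not` and `Funny`, which a classifier can then treat as ordinary words.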

Datasets of automatically annotated Norwegian and English tweets have been created, taking advantage of the fact that many Twitter users mark their messages as sarcastic by using sarcasm hashtags (e.g., "#sarcasm"). However, not all tweets containing such sarcasm hashtags can be interpreted as sarcastic. To select the sarcasm hashtags with the highest share of tweets judged sarcastic for inclusion in the datasets, a small review of candidate Norwegian and English sarcasm hashtags was carried out.
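The hashtag-based annotation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's pipeline: the hashtag set `SARCASM_HASHTAGS` and the function `label_tweet` are hypothetical, and the hashtags actually selected in the thesis's review are not listed in this abstract.

```python
import re

# Hypothetical label hashtags, for illustration only. The thesis chooses its
# own set through a review that is not reproduced here.
SARCASM_HASHTAGS = {"#sarcasm", "#sarkasme"}

HASHTAG_RE = re.compile(r"#\w+")


def label_tweet(text):
    """Return (cleaned_text, is_sarcastic), using hashtags as noisy labels."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    is_sarcastic = bool(tags & SARCASM_HASHTAGS)

    # Drop the label hashtags so a classifier cannot simply memorise them;
    # all other hashtags are kept as they are.
    def drop_label(match):
        tag = match.group()
        return "" if tag.lower() in SARCASM_HASHTAGS else tag

    cleaned = HASHTAG_RE.sub(drop_label, text)
    # Collapse the whitespace left behind by removed hashtags.
    return " ".join(cleaned.split()), is_sarcastic
```

Labels obtained this way are noisy, since a tweet can carry a sarcasm hashtag without being sarcastic, which is why the choice of hashtags matters.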

The created English corpus is included in a comparison of datasets collected during different years. This comparison shows that a classifier trained on a dataset that includes tweets from several years performs better overall at classifying new, unseen tweets than a classifier trained on a dataset of tweets from one specific year.

From the same comparison, it can also be seen that an English classifier predicting sarcasm in translated Norwegian tweets does not outperform a Norwegian classifier trained on original Norwegian data.

Publisher

NTNU