Characterizing Twitter Data using Sentiment Analysis and Topic Modeling

As the global community becomes increasingly connected, it is becoming more and more common to express thoughts and opinions through social networking websites. Twitter, currently the largest microblogging website in the world, is heavily used for this purpose. Well-known politicians, comedians and other trending persons use this medium to speak their minds through 140-character messages. This makes Twitter one of the platforms with the most influence on the global web community's way of thinking.

This thesis combines topic modeling and sentiment analysis in order to obtain information from tweets. While sentiment analysis seeks to find out what opinions people have, topic modeling tries to find out what they talk about.

Conventional topic modeling schemes, such as Latent Dirichlet Allocation, are known to perform inadequately when applied to tweets, because such short documents yield sparse word co-occurrence data. To alleviate this problem, we apply several pooling techniques that aggregate similar tweets into individual documents. We specifically study the aggregation of tweets sharing authors or hashtags.
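The pooling idea can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the `(author, text)` input shape, the function name `pool_tweets`, the hashtag pattern, and the choice to keep tweets without hashtags as single-tweet documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern: "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def pool_tweets(tweets, scheme="hashtag"):
    """Aggregate (author, text) pairs into pooled documents.

    Returns a dict mapping a pool key to the concatenated text of all
    tweets assigned to that pool.
    """
    pools = defaultdict(list)
    for i, (author, text) in enumerate(tweets):
        if scheme == "author":
            keys = [("author", author)]
        elif scheme == "hashtag":
            # A tweet with several hashtags joins every matching pool.
            keys = [("hashtag", tag.lower()) for tag in HASHTAG.findall(text)]
            if not keys:
                # Assumption: tweets without hashtags stay unpooled.
                keys = [("single", i)]
        else:
            raise ValueError(f"unknown pooling scheme: {scheme!r}")
        for key in keys:
            pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}


# Made-up example tweets, for illustration only.
tweets = [
    ("alice", "Loving the new phone #tech"),
    ("bob", "Battery life is awful #Tech #phones"),
    ("alice", "Rainy day in Trondheim"),
]
by_hashtag = pool_tweets(tweets, scheme="hashtag")
by_author = pool_tweets(tweets, scheme="author")
```

Each pooled document would then be passed to the topic model in place of the individual tweets, so that the model sees longer documents with more word co-occurrences.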

Our Twitter Sentiment Analysis system comprises seven different machine learning classifiers. These aim to predict whether the polarity of a message is neutral, negative or positive. Four machine learning algorithms are considered for sentiment classification in this thesis: Maximum Entropy, Naïve Bayes, Support Vector Machines and Stochastic Gradient Descent. The classifiers were tuned through extensive grid searches over classifier parameters and preprocessing methods in order to achieve optimal classification scores.
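As a rough sketch of what a grid search over preprocessing options and classifier parameters involves, the snippet below exhaustively scores every combination in a small search space. The parameter names (`lowercase`, `strip_urls`, `C`) and the scoring function are hypothetical stand-ins; a real experiment would replace `toy_score` with a cross-validated evaluation of a trained classifier.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: two preprocessing switches and one
# classifier hyperparameter.
param_grid = {
    "lowercase": [True, False],
    "strip_urls": [True, False],
    "C": [0.1, 1.0, 10.0],
}


def toy_score(params):
    # Stand-in for a cross-validated score; the numbers are arbitrary
    # and not taken from the thesis.
    score = 0.60
    score += 0.03 if params["lowercase"] else 0.0
    score += 0.02 if params["strip_urls"] else 0.0
    score -= 0.01 * abs(params["C"] - 1.0)
    return score


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the option counts (here 2 × 2 × 3 = 12 evaluations), which is why extensive searches of this kind are computationally expensive.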

To combine topic modeling with sentiment analysis, a state-of-the-art visualization application, called TweetMoods, was built. TweetMoods simultaneously examines the topics contained in a Twitter corpus retrieved by a search query, and the sentiments expressed in these tweets.

Our topic modeling results show that aggregating similar tweets into individual documents significantly increases topic coherence. For message polarity classification of tweets, the Maximum Entropy classifier outperformed most of the systems previously submitted to the 2015 International Workshop on Semantic Evaluation. This underlines the importance of our extensive grid searches for optimizing the parameters of the classifiers.

Publisher

NTNU