A Deep Learning Ensemble Approach to Gender Identification of Tweet Authors
Abstract
Author profiling is a field within Natural Language Processing, in addition to being a sub-field of the broader research area concerning authorship analysis. It aims to classify personal traits of authors, such as gender and age, based on their writing style. It is of growing importance with applications within fields such as forensics and marketing for identifying characteristics of perpetrators and customers, respectively. The emergence of social media platforms, such as Twitter, has resulted in a major increase in textual user-generated content publicly available for linguistic studies. Additionally, the informal language present in tweets provides linguistic material reflecting people's everyday usage of language.
Though representation learning using deep learning has shown much promise, most of the work within author profiling research in recent years has been based on the combination of expensive manual feature engineering, representations such as Bag of Words, and traditional machine learning methods exemplified by Support Vector Machines and Logistic Regression. In this thesis, we show that better gender-identifying feature representations of English tweets can be learned using deep learning approaches.
We propose three classification systems, focusing on different granularities of text: a character-level Convolutional Bidirectional Long Short-Term Memory (LSTM), a word-level Bidirectional LSTM using Global Vectors (GloVe), and a more traditional document-level system utilizing a feedforward network and Bag of Words of n-grams as first-level representation. Furthermore, we propose using stacking to leverage the individual predictive powers of the sub-models in a combined effort. The experiments reveal that the word-level model outperforms the other sub-models, as well as the baseline models consisting of Logistic Regression, Naïve Bayes and Random Forest. The best performance is achieved by combining the character-level and word-level models, while the document-level model dampens the combined performance.
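The stacking step can be illustrated with a minimal sketch. It assumes that each sub-model outputs a probability for one gender class on held-out tweets, and that a logistic-regression meta-learner combines these outputs. The abstract does not name the meta-learner, so that choice, the function names `fit_meta_learner` and `stacked_predict`, and the toy numbers are illustrative assumptions rather than the thesis implementation or its results.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_learner(sub_model_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner with batch gradient descent.

    sub_model_probs: one row per document; each row holds the probability
        that each sub-model assigned to the positive class.
    labels: gold labels (0 or 1), one per document.
    Returns (weights, bias). Hypothetical helper, not from the thesis.
    """
    n_models = len(sub_model_probs[0])
    n_docs = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(sub_model_probs, labels):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n_docs for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_docs
    return weights, bias


def stacked_predict(weights, bias, row):
    # Combined probability of the positive class for one document.
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


# Toy held-out predictions (made-up values, NOT experimental results).
# Columns: character-level, word-level, document-level sub-model.
held_out_probs = [
    [0.9, 0.8, 0.4],
    [0.8, 0.9, 0.6],
    [0.7, 0.9, 0.5],
    [0.2, 0.1, 0.6],
    [0.3, 0.2, 0.4],
    [0.1, 0.3, 0.5],
]
held_out_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_meta_learner(held_out_probs, held_out_labels)
print(stacked_predict(weights, bias, [0.9, 0.9, 0.5]))
```

Training the meta-learner on held-out predictions, rather than on the data the sub-models were fitted on, keeps it from simply rewarding whichever sub-model overfits most. In this toy data the third column carries no signal, so its learned weight stays small relative to the other two.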