A Deep Learning Ensemble Approach to Gender Identification of Tweet Authors

Gopinathan, Manu; Berg, Per-Christian

dc.contributor.advisor	Gambäck, Björn
dc.contributor.author	Gopinathan, Manu
dc.contributor.author	Berg, Per-Christian
dc.date.accessioned	2017-10-04T14:00:20Z
dc.date.available	2017-10-04T14:00:20Z
dc.date.created	2017-06-11
dc.date.issued	2017
dc.identifier	ntnudaim:17178
dc.identifier.uri	http://hdl.handle.net/11250/2458477
dc.description.abstract	Author profiling is a field within Natural Language Processing, in addition to being a sub-field of the broader research area concerning authorship analysis. It aims to classify personal traits of authors, such as gender and age, based on their writing style. It is of growing importance with applications within fields such as forensics and marketing for identifying characteristics of perpetrators and customers, respectively. The emergence of social media platforms, such as Twitter, has resulted in a major increase in textual user-generated content publicly available for linguistic studies. Additionally, the informal language present in tweets provides linguistic material reflecting people's everyday usage of language. Though representation learning using deep learning has shown much promise, most of the work within author profiling research in recent years has been based on the combination of expensive manual feature engineering, representations such as Bag of Words, and traditional machine learning methods exemplified by Support Vector Machines and Logistic Regression. In this thesis we show that better gender-identifying feature representations of English tweets can be learned using deep learning approaches. We propose three classification systems, focusing on different granularities of text: a character-level Convolutional Bidirectional Long Short-Term Memory (LSTM), a word-level Bidirectional LSTM using Global Vectors (GloVe), and a more traditional document-level system utilizing a feedforward network and Bag of Words of n-grams as first-level representation. Furthermore, we propose using stacking to leverage the individual predictive powers of the sub-models in a combined effort. The experiments reveal that the word-level model outperforms the other sub-models, as well as the baseline models consisting of Logistic Regression, Naïve Bayes and Random Forest. The best performance is achieved by combining the character-level and word-level models, while the document-level model dampens the combined performance.
dc.language	eng
dc.publisher	NTNU
dc.subject	Computer Science, Artificial Intelligence
dc.subject	Computer Science, Interaction Design and Game Technology
dc.title	A Deep Learning Ensemble Approach to Gender Identification of Tweet Authors
dc.type	Master thesis
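
The stacking step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the record does not say which meta-learner was used, so a plain logistic regression is assumed here, and the three sub-model outputs are simulated with invented noise levels instead of coming from real character-level, word-level and document-level networks. All function names are hypothetical. In practice the meta-learner is normally trained on sub-model predictions for examples the sub-models did not see during their own training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def centre(probs):
    # Shift probabilities from [0, 1] to [-0.5, 0.5] so a zero bias is a sensible start.
    return [p - 0.5 for p in probs]


def train_meta_learner(stacked, labels, epochs=500, lr=0.5):
    """Fit a logistic regression on sub-model probabilities (batch gradient descent)."""
    n = len(stacked)
    weights = [0.0] * len(stacked[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for probs, y in zip(stacked, labels):
            x = centre(probs)
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, probs):
    """Return 1 if the weighted combination of sub-model outputs favours class 1."""
    z = sum(w * v for w, v in zip(weights, centre(probs))) + bias
    return 1 if z >= 0 else 0


def fake_sub_model_outputs(rng, label):
    # ASSUMPTION: simulated stand-ins for the three sub-models' P(class 1).
    # The noise levels are invented for illustration, not taken from the thesis.
    noise_levels = (0.25, 0.15, 0.45)  # char-level, word-level, document-level
    base = 0.8 if label == 1 else 0.2
    return [min(1.0, max(0.0, rng.gauss(base, sd))) for sd in noise_levels]


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(600)]
stacked = [fake_sub_model_outputs(rng, y) for y in labels]

# Train the meta-learner on the first 400 examples, evaluate on the remaining 200.
weights, bias = train_meta_learner(stacked[:400], labels[:400])
hits = sum(predict(weights, bias, x) == y for x, y in zip(stacked[400:], labels[400:]))
accuracy = hits / 200
print(f"meta-learner weights: {weights}, held-out accuracy: {accuracy:.3f}")
```

The learned weights show how much the meta-learner relies on each sub-model. Leaving a sub-model out of the ensemble corresponds to dropping its column from the stacked input.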

Associated file(s)

Filename: 17178_FULLTEXT.pdf
Size: 12.73 MB
Format: PDF

Filename: 17178_ATTACHMENT.zip
Size: 88.86 KB
Format: application/zip

Filename: 17178_COVER.pdf
Size: 1.556 MB
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6831]
