Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text

Das, Amitava; Gambäck, Björn

Chapter

Published version

Open

File52-p169.pdf (220.7Kb)

Permanent link

http://hdl.handle.net/11250/2385477

Publication date

2014

Collections

Department of Computer Science [6766]
Publications from CRIStin - NTNU [37994]

Original version

Sangal, Rajeev [Eds.] Proceedings of the 11th International Conference on Natural Language Processing, International Institute of Information Technology, 2014

Abstract

Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context because of phenomena such as utterance-internal code-switching, lexical borrowings, and phonetic typing, all of which imply that language identification in social media has to be carried out at the word level. The paper reports a study on detecting language boundaries at the word level in chat message corpora in mixed English-Bengali and English-Hindi. We introduce a code-mixing index to evaluate the level of blending in the corpora and describe the performance of a system developed to separate multiple languages.

Publisher

International Institute of Information Technology Goa, India

Series

Proceedings of the 11th International Conference on Natural Language Processing;52