Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text
Chapter
Published version
Permanent link
http://hdl.handle.net/11250/2385477
Publication date
2014
Original version
Sangal, Rajeev [Eds.] Proceedings of the 11th International Conference on Natural Language Processing, International Institute of Information Technology, 2014
Abstract
Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance-internal code-switching, lexical borrowings, and phonetic typing, all of which imply that language identification in social media has to be carried out at the word level. The paper reports a study to detect language boundaries at the word level in chat message corpora in mixed English-Bengali and English-Hindi. We introduce a code-mixing index to evaluate the level of blending in the corpora and describe the performance of a system developed to separate multiple languages.