Unsupervised Part of Speech Tagging of Scandinavian Languages
Part of Speech tagging has been of interest to researchers for many decades. The goal of Part of Speech tagging is to enrich text with linguistic tags. These tags can provide valuable syntactic insights into texts, and play a vital role in many Natural Language Processing systems.

Historically, Part of Speech taggers have been trained using supervised methods. These methods require datasets of manually annotated text, which are both time-consuming and expensive to collect. By using unsupervised methods, this overhead cost can be significantly reduced. Therefore, the focus of the thesis has been to implement a Part of Speech tagger using unlabelled data.

This thesis has an emphasis on tagging of Swedish, Danish and Norwegian. Some systems had already been implemented for these languages, but several approaches were left to be explored. The languages share many similarities and challenges, which makes tagging them feasible with a single system.

The implemented taggers were based on a trigram Hidden Markov Model, and trained using the Baum-Welch algorithm. One of the goals was easy comparison and high consistency between the languages. Therefore, the taggers were developed using resources from the Universal Dependencies project. This recent project has created a cross-linguistically consistent tagset as well as corpora and guidelines for numerous languages.

Lexical resources following the Universal Dependencies conventions were developed. These lexica were shown to have high coverage of the corpora. Conversion tables were created to mimic the conversion of the official corpora. However, the available information about this conversion was incomplete. The conversion rules developed therefore have to be improved to achieve better tagging results.

The accuracies achieved by the final taggers were not as good as those of state-of-the-art taggers for the Scandinavian languages. Nevertheless, the results were satisfactory, taking into consideration the unsupervised training scheme.
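
The Baum-Welch training mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: for brevity it uses a first-order (bigram) Hidden Markov Model rather than the trigram model described above, and the function names (`forward_backward`, `baum_welch_step`, `train`), the random initialisation and the toy corpus are all assumptions made for illustration only.

```python
import math
import random

def normalize(row):
    """Scale a row of non-negative weights so it sums to 1 (uniform if all zero)."""
    s = sum(row)
    return [x / s for x in row] if s > 0 else [1.0 / len(row)] * len(row)

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass over one sequence of word indices."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prev * B[j][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    return alpha, beta, scale

def baum_welch_step(seqs, pi, A, B):
    """One EM iteration; returns new parameters and the log-likelihood of the old ones."""
    n, m = len(pi), len(B[0])
    pi_num = [0.0] * n
    A_num = [[0.0] * n for _ in range(n)]
    B_num = [[0.0] * m for _ in range(n)]
    loglik = 0.0
    for obs in seqs:
        alpha, beta, scale = forward_backward(obs, pi, A, B)
        loglik += sum(math.log(s) for s in scale)
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i]  # P(tag i at position t | sequence)
                if t == 0:
                    pi_num[i] += gamma
                B_num[i][obs[t]] += gamma
                if t < len(obs) - 1:
                    for j in range(n):  # expected count of the transition i -> j
                        A_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / scale[t + 1])
    return normalize(pi_num), [normalize(r) for r in A_num], [normalize(r) for r in B_num], loglik

def train(seqs, n_tags, n_words, iterations=10, seed=0):
    """Run Baum-Welch from a seeded random start (an assumed initialisation)."""
    rng = random.Random(seed)
    def row(k):
        return normalize([rng.random() + 0.1 for _ in range(k)])
    pi = row(n_tags)
    A = [row(n_tags) for _ in range(n_tags)]
    B = [row(n_words) for _ in range(n_tags)]
    history = []
    for _ in range(iterations):
        pi, A, B, loglik = baum_welch_step(seqs, pi, A, B)
        history.append(loglik)
    return pi, A, B, history

# Toy unlabelled corpus, invented for illustration.
corpus = [["en", "hund", "sover"], ["en", "katt", "leker"],
          ["hunden", "sover"], ["en", "katt", "sover"]]
vocab = sorted({w for sent in corpus for w in sent})
seqs = [[vocab.index(w) for w in sent] for sent in corpus]
pi, A, B, history = train(seqs, n_tags=3, n_words=len(vocab), iterations=15)
```

The hidden states learned this way are anonymous clusters. Mapping them to Universal Dependencies tags would need additional information such as a lexicon, and how the thesis does this is not specified in the abstract. The `history` list holds the corpus log-likelihood per iteration, which should not decrease under EM.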