Deep Learning for Low-Resource and Morphology-Rich Language Processing
Abstract
Since data-driven Natural Language Processing (NLP) approaches emerged in the late 60s, many practical applications have been proposed, such as automatic sentiment analyzers and named entity recognizers. Today, robust NLP models follow a paradigm of big data, deep neural networks, and parameter engineering to achieve state-of-the-art results. However, multilingual models do not perform as well as English models, because the conditions required to implement such models are not feasible for Low-Resource and Morphology-Rich Languages (LR-MRLs). For example, annotated English corpora have been accumulated since the earliest stages of NLP research, so a large amount of data can now be reused to train neural networks. In contrast, few labeled corpora are available for LR-MRL NLP studies. In addition, while the morphological complexity of English is relatively manageable, models for some LR-MRLs struggle to capture intricate word transformations. There is therefore great value in attempting to improve LR-MRL models. From a social perspective, the first-language speakers of LR-MRLs are diverse and globally scattered. From a technical perspective, LR-MRL models can accelerate the development of language applications, deep learning, linguistics, and universal neuromorphic computing.
In this dissertation, the deep learning solutions proposed for LR-MRLs differ from the solutions for high-resource languages, which simply build deeper structures. Theoretically, the problem of deep learning is essentially one of fitting a dataset: a perfect deep learning model has parameters that neither overfit nor underfit but exactly represent the semantics of the texts. Big data can enhance the generalization power of deep neural networks, but this route is not available for LR-MRLs. This dissertation therefore proposes two optimization frameworks, built on top of deep learning models, to compensate for the deficiencies of LR-MRLs. First, the Network Framework invests additional training cost in the neural networks; its solutions involve pre-training, ensemble learning, cross-language learning, and adversarial training. The Network Framework augments features at extra training cost to mitigate the poor fit on LR-MRL datasets. Second, the Linguistic Framework exploits prior linguistic information about the LR-MRLs, including morphological components, syntactic labels, part-of-speech tags, and the dependency structures of sentences. The Linguistic Framework derives features from the intrinsic grammar of the language to alleviate the poor semantic representation of LR-MRL texts.
The two proposed frameworks are validated on four downstream NLP applications, namely Sentiment Analysis (SA), Neural Machine Translation (NMT), Fake News Detection (FND), and Named Entity Recognition (NER). In all experiments, LR-MRL texts are used as model inputs. Measured by standard evaluation metrics, model performance demonstrates that applying the proposed solutions from the Network Framework and the Linguistic Framework to a neural network generally improves the expressiveness and robustness of LR-MRL models.
The dissertation is organized into two parts. Part I gives an overview of the dissertation and comprises five chapters: introduction, related works, paper summary, discussion, and conclusion. Part II consists of six chapters, one for each research article.