Automated Information Extraction in Natural Language
The field of service automation is progressing rapidly, and increasingly complex tasks are being automated by robots. One area of service automation that has received considerable attention is Natural Language Processing (NLP). In today's digital age, enormous amounts of data are produced, and most of this data is in so-called unstructured form, including text in natural language. Such text holds information that can be very valuable to businesses, but businesses regard it as time-consuming and difficult to analyze. NLP is the computerized approach to analyzing and representing human language, and can be used to automatically extract relevant information from text. Since the statistical revolution in the late 1980s, much of the research in NLP has been based on machine learning. Machine learning enables an NLP system to automatically learn patterns of language from text samples and to recognize these patterns in new, unseen text. It is particularly applicable to language processing, where the rules are often ambiguous and difficult to capture with manual coding. Most modern NLP systems use supervised learning to train the machine learning component. This means that relevant information in text must be manually labeled before the text can be used to train a machine learning model. By selecting which information is relevant in text samples from a specific domain, the model can be customized to achieve the goal in that domain. The aim of this thesis is to provide insight into how a modern NLP system works, its limitations, and its potential applications. The theoretical aspects of using machine learning in NLP are presented with a focus on the task of extracting information from natural language text. A case is introduced with the design of an application for extracting information from emails in the shipping industry.
The machine learning model is trained on emails that have been labeled for names of ships, contract types, and ports and dates for chartering. Two machine learning models are trained in an available NLP system called "Watson Knowledge Studio". The first model is able to recognize and label some of the variables in new emails, but is relatively inaccurate. The results are improved in the second version of the model by following the practices recommended in the discussed theory.

The results of the case confirm that large amounts of labeled data are necessary for robust training of a machine learning model. The results also indicate that the text should have a certain degree of linguistic structure if the grammar rules underlying the NLP system are to be exploited for optimal processing. If the language in the domain does not have sufficient linguistic structure, rule-based methods should be considered in addition to the machine learning method. A hybrid approach may be the best solution in these cases. The demonstrated case shows how the customization of a machine learning model can be based on domain knowledge, rather than coding skills, when the system uses supervised learning to train the model. This means that the system can reach a larger target group, with more people able to take advantage of machine learning in NLP. Based on this, it may be assumed that this method will have a key role in the future of NLP. With the development of support for the Norwegian language, AVO Consulting can potentially benefit from the advantages of using supervised machine learning in NLP.