Improving Document Classification Using Ontologies
Abstract
We live in the age of the internet, in which massive amounts of information are produced by various digital resources on a daily basis. This information is typically stored in unstructured textual formats such as reports, news, e-mails, and blogs, so a proper classification and organization of it is needed. In this regard, automatic classification, particularly ontology-based classification, plays an important role in helping people classify and organize information. An ontology-based classification system is an automatic system that uses an ontology to organize and classify knowledge in a more structured and formal way, thus providing better classification accuracy than a traditional keyword-based classification system.
The performance of an ontology-based document classification system can be affected by several aspects of the classification process, which generally consists of steps such as document collection and preprocessing, document representation, dimensionality reduction, and classification. It is almost impossible to address all of these aspects in a single dissertation, so we chose to work on the aspects that we consider either rarely studied or crucial to an ontology-based classification system.
Document representation is one of the main aspects affecting the performance of ontology-based document classification. The first research aspect we investigated is therefore enriching document representation with semantics using the background knowledge derived from ontologies. This background knowledge is embedded in a document using a matching technique, which maps the terms that occur in a document to the relevant ontology concepts by searching only for the presence of concept labels in that document. Searching only for concept labels limits the ability of the classification system to capture and exploit the full conceptualization of the document, owing to the semantic gap, the lack of in-depth coverage of concepts, and ambiguity. In this thesis, the focus is placed on conceptual document representation, in which a document is associated with a set of concepts not only by looking for occurrences of concept labels, but also by acquiring lexical information linked to the ontology in order to enrich its coverage with new concepts. To this end, an automatic ontology concept enrichment model is developed to enrich ontologies with new concepts and thereby provide broader coverage for document representation. The proposed model explores textual data and relies on the semantic and contextual information of terms occurring in a discourse.
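The label-matching step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy ontology, the concept names, and the function `match_concepts` are hypothetical, and matching is reduced to whole-word search over lowercased text.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept name -> set of labels (illustrative only).
ONTOLOGY_LABELS = {
    "Loan": {"loan", "credit"},
    "Bank": {"bank"},
    "InterestRate": {"interest rate"},
}


def match_concepts(text, ontology_labels):
    """Count, per concept, how often any of its labels occurs in the text."""
    # Lowercase and drop punctuation so multi-word labels match across tokens.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    counts = Counter()
    for concept, labels in ontology_labels.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"
            counts[concept] += len(re.findall(pattern, normalized))
    # Keep only concepts that actually occur in the document.
    return {concept: n for concept, n in counts.items() if n > 0}


doc = "The bank raised the interest rate on every loan. A mortgage is also affected."
print(match_concepts(doc, ONTOLOGY_LABELS))
```

In this sketch the term "mortgage" is not counted because no concept carries that label, which is the coverage limitation the paragraph describes. Adding "mortgage" as a label of the hypothetical concept `Loan` would let the same document contribute a second occurrence to that concept.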
The performance of ontology-based document classification is highly dependent on the relevance of concepts, which is indicated by weights. The weights reflect the discriminative power of concepts with respect to the documents and are typically computed from the frequency of occurrence of concepts in these documents. The second research aspect studied in this work is therefore enhancing the existing concept weighting scheme by introducing the notion of concept importance. Concept importance assesses the contribution of a concept to discriminating between documents, depending on its position in the ontology hierarchy. In addition, we explored ways to evaluate concept importance automatically and developed a Markov-based approach for this purpose. Further, we aggregated concept importance and concept relevance in order to enhance the concept weighting scheme and thus improve the concept vector space representation model.
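One possible way to combine the two quantities is sketched below. The abstract does not specify the Markov model or the aggregation function, so both are assumptions here: importance is approximated by a PageRank-style random walk over a hypothetical ontology graph, relevance by normalized concept frequency, and the two are aggregated by multiplication. The graph, the damping factor, and all function names are illustrative.

```python
# Hypothetical ontology graph: concept -> list of directly related concepts.
ONTOLOGY_GRAPH = {
    "Finance": ["Bank", "Loan"],
    "Bank": ["Finance"],
    "Loan": ["Finance", "Mortgage"],
    "Mortgage": ["Loan"],
}


def concept_importance(graph, damping=0.85, iterations=50):
    """Approximate the stationary distribution of a random walk on the graph.

    Assumption: a PageRank-style Markov chain stands in for the
    Markov-based approach mentioned in the text.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {c: 1.0 / n for c in nodes}
    for _ in range(iterations):
        new_rank = {c: (1.0 - damping) / n for c in nodes}
        for c in nodes:
            # Concepts without outgoing links spread their mass uniformly.
            targets = graph[c] or nodes
            share = damping * rank[c] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def concept_weights(importance, frequencies):
    """Aggregate importance and frequency-based relevance (assumed: product)."""
    total = sum(frequencies.values())
    if total == 0:
        return {}
    return {c: importance[c] * (f / total) for c, f in frequencies.items()}


importance = concept_importance(ONTOLOGY_GRAPH)
weights = concept_weights(importance, {"Loan": 3, "Bank": 1})
print(weights)
```

The resulting dictionary can serve as one document's row in a concept vector space. Multiplication is only one choice of aggregation; a weighted sum would fit the description equally well.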
Lastly, the third research aspect studied in this dissertation is improving classification accuracy by taking advantage of the ontology enrichment model and the enhanced concept weighting scheme developed while studying the first and second research aspects, respectively. We proposed a document classification approach that relies on an ontology whose coverage is widened using the ontology enrichment model SEMCON, and in which concept weights are assessed through the new concept weighting technique composed of concept relevance and concept importance. Extensive experimental results demonstrated a considerable improvement in classification effectiveness.
Has parts
Paper 1: Kastrati, Zenun; Yayilgan, Sule; Imran, Ali Shariq. SEMCON: Semantic and contextual objective metric. In: Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing. pp. 65-68. http://doi.org/10.1109/ICOSC.2015.7050779 © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 2: Kastrati, Zenun; Imran, Ali Shariq; Yildirim, Sule. SEMCON: A semantic and contextual objective metric for enriching domain ontology concepts. International Journal on Semantic Web and Information Systems 2016; Volume 12(2), pp. 1-24. http://doi.org/10.4018/IJSWIS.2016040101
Paper 3: Kastrati, Zenun; Imran, Ali Shariq; Yayilgan, Sule; Dalipi, Fisnik. Analysis of Online Social Networks Posts to Investigate Suspects Using SEMCON. In: Social Computing and Social Media 7th International Conference, SCSM 2015, Held as Part of HCI International 2015, Los Angeles, CA, USA, August 2-7, 2015, Proceedings. Springer 2015. pp. 148-157. http://doi.org/10.1007/978-3-319-20367-6_16
Paper 4: Kastrati, Zenun; Imran, Ali Shariq. Adaptive Concept Vector Space Representation Using Markov Chain Model. In: Knowledge Engineering and Knowledge Management. Springer 2014. pp. 203-208. http://doi.org/10.1007/978-3-319-13704-9_16
Paper 5: Kastrati, Zenun; Imran, Ali Shariq; Yildirim, Sule. An Improved Concept Vector Space Model for Ontology Based Classification. In: SITIS 2015 - The 11th International Conference on Signal Image Technology & Internet System. IEEE 2015. pp. 240-245. http://doi.org/10.1109/SITIS.2015.102 © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 6: Kastrati, Zenun; Yildirim Yayilgan, Sule; Hjesvold, Rune. Automatically Enriching Domain Ontologies for Document Classification. In: Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics, WIMS 2016. Association for Computing Machinery (ACM) 2016. pp. 1-4. © ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published at http://doi.org/10.1145/2912845.2912875
Paper 7: Kastrati, Zenun; Yildirim, Sule. Supervised Ontology-Based Document Classification Model. In: ICCDA '17: Proceedings of the International Conference on Compute and Data Analysis. ACM Publications 2017. pp. 245-251. © ACM, 2017. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published at http://doi.org/10.1145/3093241.3107883
Paper 8: Kastrati, Zenun; Imran, Ali Shariq; Yildirim, Sule. A Hybrid Concept Learning Approach to Ontology Enrichment. In: Innovations, Developments, and Applications of Semantic Web and Information Systems. IGI Global 2018. pp. 85-119. http://doi.org/10.4018/978-1-5225-5042-6.ch004