Using Domain Knowledge in Classifying Industrial Data from the Oil and Gas Sector
Abstract
Finding good features for supervised learning on high-dimensional industrial datasets can be challenging, as the feature set typically consists of hundreds to thousands of features. Specific features might follow protocols or custom coding standards that, unless decoded, are unusable by machine learning algorithms. This is often the case in industrial environments, where domain knowledge is needed to interpret the semantics of the data. The objective of this research is to enable classification of industrial work orders into a predefined set of failure mode codes. The main focus of the study is the effect of incorporating domain knowledge in the preprocessing phase of the supervised learning process. A thorough analysis is conducted to assess multiple supervised learning algorithms, to find fitting evaluation metrics, and to appraise the effect of extracting features from both structured and unstructured fields. Our experiments show that incorporating domain knowledge in the preprocessing phase substantially improves the performance of the classifiers. By utilizing domain knowledge, we were able to increase the performance of the classifiers by approximately 0.07 as measured by Cohen's Kappa, an average relative improvement of 25.2%. An assessment of feature importance in one of the final classifiers showed that the importance of features extracted using domain knowledge summed to 38.97%. This implies that applying domain knowledge during feature extraction is crucial to avoid erroneously pruning important encoded features and to extract more information from the dataset. The best classifier is currently not accurate enough to label work orders with a failure mode code automatically, but it is accurate enough to suggest failure mode codes when an operator submits new work orders.
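Since the reported improvement is expressed in Cohen's Kappa, a minimal sketch of how that metric can be computed for a multi-class labeling task may help in reading the figures above. The function name cohens_kappa and the toy failure mode codes ("LEAK", "VIB", "OVERHEAT") are hypothetical and chosen only for illustration; they are not taken from the study's dataset or implementation.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of items on which the two labelings match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Expected chance agreement, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_e == 1.0:
        # Both labelings use a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical failure mode codes, for illustration only.
y_true = ["LEAK", "LEAK", "VIB", "VIB", "OVERHEAT", "LEAK"]
y_pred = ["LEAK", "VIB", "VIB", "VIB", "OVERHEAT", "LEAK"]

kappa = cohens_kappa(y_true, y_pred)
print(f"Cohen's kappa: {kappa:.3f}")
```

Unlike plain accuracy, kappa discounts the agreement a classifier would reach by chance given the label frequencies, which matters when some failure mode codes are far more common than others.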