Classification of Maintenance Reports - Statistical NLP meets the Oil & Gas Industry
Abstract
Several problematic data characteristics were revealed, such as multilingual reports and significant class imbalances. While no consistent scheme for conducting data preparation was found, several techniques recurred in the most promising experiments. Of the three classifiers tested (Naive Bayes, Support Vector Machines, and Random Forest), Support Vector Machines was the best overall choice, being the only classifier to generalize well beyond the observed data. The various resampling techniques decreased overall performance, which seems to indicate that they mainly introduced additional noise.