Applying hyperparameter optimization and other model adaption methods to tune existing models for microbial pathogens in drinking water supplies
Abstract
The collection and analysis of data on the concentration of pathogenic organisms in raw water sources is critical for optimizing disinfection processes in water treatment plants. However, there are no robust real-time sensors for determining microbial concentrations in raw water sources, and many water treatment plants still rely on laborious, time-consuming and costly traditional laboratory methods. With these methods, the concentration of indicator bacteria in the raw water that is to be treated may not be known until 18-24 hours after the water has been distributed to the population, while that of viruses and parasites may take several days to determine. Consequently, waterborne disease outbreaks associated with water supply systems often occur before remedial actions are taken, because the concentrations of microbial pathogens in the raw water sources are not known beforehand.
To achieve the ultimate goal of protecting public health, early detection of microbial organisms in raw water is necessary for the development of proactive risk management strategies. Besides enhancing microbial detection tools, mathematical models can be employed to predict the concentration of microbial organisms in raw water. For this purpose, several mathematical models, including machine learning algorithms, have been developed to predict the occurrence or concentration of pathogenic organisms in raw water sources. These models use easily analysable physico-chemical parameters of the raw water (e.g. temperature, pH, turbidity, electrical conductivity and colour) as inputs. Machine learning models that have been successfully applied to this task include support vector machines, random forests, extreme learning machines and adaptive neuro-fuzzy inference systems. However, these models are usually built to predict the concentration of microbial pathogens in a single water source and therefore often perform poorly when applied to other water sources.
The overall aim of this work is to apply hyperparameter optimization of machine learning models, combined with various methods of data preprocessing, so that models developed for one plant become more adaptable and more effective at predicting the concentrations of microbial organisms at other plants.
Data used in this work were obtained from Brusdalsvatnet in Ålesund and Maridalsvannet in Oslo. Brusdalsvatnet is the main drinking water source for Ålesund Kommune and surrounding communities, while Maridalsvannet is the main drinking water source for parts of Oslo Kommune. Using hyperopt-sklearn, a library for the programming language Python, hyperparameter-optimized models were trained and the best configuration was selected. Hyperopt-sklearn is based on scikit-learn, a Python package containing a wide variety of tools for machine learning applications, and hyperopt, a tool for hyperparameter optimization in Python. In each experiment, the optimizer evaluates 100 different configurations of hyperparameters and pre-processors. The learners used are limited to those available in the library. Of the machine learning algorithms previously used to predict the concentration of microbial organisms in water, support vector machine and random forest are the only ones available in the tool. However, to see whether any of the other available learners could be successful in predicting microbial organisms, the optimizer was also allowed to choose the learner itself.
Optimized random forest, support vector machine and k-nearest neighbour models, along with default random forest and support vector machine models, were trained to predict the concentration of coliform bacteria and E. coli in the data from Maridalsvannet. The predictions made by these algorithms were inspected visually by plotting them against the observed values. The plots showed that the optimized models clearly perform better than the default ones. This is particularly evident because the concentrations tend to rise in distinct peaks above values that are otherwise close to zero, and the optimized models were better at recognizing these peaks and their magnitudes.
However, the scoring metrics that come with the machine learning package tell a different story. In many cases, these metrics score the default algorithms better than the optimized ones. Since the optimizer uses one such scoring method by default in its internal model selection mechanism, it is reasonable to assume that even better results might be achieved if the scoring method could recognize that predicting the peaks is much more important than accurately predicting small values around zero. Moreover, there are some challenges caused by the different water treatment plants not having the same standard procedures for recording the parameters of their raw water sources.
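The difference between a standard metric and a peak-aware one can be illustrated with a minimal sketch in plain Python. This is an illustration only and was not part of the study: the function names, the peak threshold, the peak weight and the toy numbers are all hypothetical assumptions. The sketch compares an ordinary mean absolute error with a weighted variant in which errors at observed peaks count more than errors near zero.

```python
def mean_absolute_error(observed, predicted):
    """Ordinary mean absolute error: every sample counts equally."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def peak_weighted_mae(observed, predicted, peak_threshold=10.0, peak_weight=10.0):
    """Hypothetical peak-aware error (lower is better).

    Samples whose observed value is at or above `peak_threshold` receive
    weight `peak_weight`; all other samples receive weight 1. Both
    parameters are illustrative assumptions, not values from the study.
    """
    weights = [peak_weight if o >= peak_threshold else 1.0 for o in observed]
    weighted = [w * abs(o - p) for w, o, p in zip(weights, observed, predicted)]
    return sum(weighted) / sum(weights)


# Toy data (made up): low baseline values with a single peak.
observed = [1, 0, 2, 1, 0, 40, 1, 0, 2, 1]
# A model that ignores the peak but tracks the baseline closely.
flat_model = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# A model that captures the peak but is offset on the baseline.
peak_model = [6, 5, 7, 6, 5, 38, 6, 5, 7, 6]

for name, pred in [("flat", flat_model), ("peak", peak_model)]:
    print(name,
          round(mean_absolute_error(observed, pred), 2),
          round(peak_weighted_mae(observed, pred), 2))
```

On this toy data the ordinary error favours the flat model, while the weighted error favours the model that captures the peak, which mirrors the disagreement between the visual inspection and the default metrics described above. A function of this kind could in principle be supplied to an optimizer as its loss, although whether and how that is supported depends on the tool.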
The study has shown that hyperparameter optimization can significantly improve the ability of models to predict the concentration of microbial organisms in raw water sources. It is recommended that these procedures be used in the further development of models for water treatment plants. Ideally, to improve the usability of these models, the water treatment plants should also work towards more standardized procedures. Furthermore, developing a new scoring method tailored to this particular problem might further improve the optimization of these models.