Miniature Machine Learning System
This system provides a configurable neural network class and a simulator for a CSTR with a van de Vusse reaction, which can generate data for training and testing the neural network.
The neural network can be applied to a different process by configuring its input features and outputs. Data must then be fetched from another source, or the van de Vusse simulator can be used as a template for developing your own simulator for that process.
Info
The system is built with Python, uses Conda for package management, and relies on the PyTorch machine learning framework. The simulator is developed with the CasADi framework.
The system as it stands provides basic functionality for performing different tests of various ML techniques and technologies. It would benefit greatly from further development to create more extensive functionality.
Contents
The system consists of:
- config.json - the file in which all desired configurations for the system are made.
- neural_net_class.py - implements the class for the neural net (see the sketch after this list).
- train_model.py - splits the data set, trains the model, and evaluates the trained model on the validation set.
- test_model.py - also splits the data set, as the split data sets aren't stored. Tests the model and visualizes the results.
- make_predictions_batch.py - performs predictions on batches of data. Largely the same functionality as test_model.py, other than operating on a completely separate data set. Provides a frame for when real production data becomes available in a later iteration.
- generate_input_vandevusse.py - generates input data for the simulator in a selection of different ways.
- simulate_vandevusse.py - simulates the CSTR with the van de Vusse reaction subject to the generated input data. The output data is stored for later use by the neural network.
- steady_state_extraction.py - extracts only data points from intervals where a specified feature is in steady state.
- plotting.py - contains different plotting functions that help streamline the visualization process.
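As an illustration, a configurable feed-forward net along the lines of neural_net_class.py can be built directly from the "hidden_layers", "input_cols", and "output_cols" options described under Configurability. The following is a minimal sketch, not the exact class in the repository; the ReLU activation is an assumption:

```python
import torch.nn as nn

class ConfigurableNet(nn.Module):
    """Feed-forward net whose hidden architecture is given by a list of layer sizes."""

    def __init__(self, n_inputs, hidden_layers, n_outputs):
        super().__init__()
        layers = []
        prev_size = n_inputs
        for size in hidden_layers:      # e.g. hidden_layers = [64, 64]
            layers.append(nn.Linear(prev_size, size))
            layers.append(nn.ReLU())    # activation choice is an assumption
            prev_size = size
        layers.append(nn.Linear(prev_size, n_outputs))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```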
Configurability
The system is designed so that all necessary configuration can be done in config.json, without requiring much knowledge of the code base. The options are as follows:
Neural net
- "further_train_existing_net": true/false - decides whether to perform training on an already existing net, or to initialize a new one.
- "prev_net_file": "path/to/file" - the path to the net that should be further trained if above option is set to true.
- "trained_net_storage_file": "path/to/file" - where to store the trained net. Also the net that is loaded when testing the model.
- "data_set": "path/to/file" - path to the data set that should be trained, validated, and tested on.
- "input_cols": ["feature1", "feature2", ....] - the features that make up the input of the neural net.
- "output_cols": ["output1", ...] - the output feature of the neural net
- "all_output_cols": ["all possible output features"] - not used, but kept for the purpose of keeping track of possible outputs.
- "hidden_layers": [size_l1,size_l2, ....] - the number of hidden layers in the neural net and their sizes.
- "training_batch_size": x - training batch size
- "shuffle_training_data": true/false - whether or not to shuffle traiing data prior to training.
- "n_epochs": n - number of epochs for the training procedure.
- "learning_rate": x - learning rate in training procedure.
- "l2_regularization": x - regularization coefficient in training procedure.
Vdv Model - The process in the simulator
- "load_initial_state_values_from_file": true/false - whether or not to load initial values from a file.
- "initial_state_values_file": "path/to/file" - file for initial values if above is true. Also where the last last state values in the simulation are stored. In case one wants to "continue" a simulation.
- "initial_state_values": [x1, x2, x3, ...] - initial values for the states in the process, if not loaded from file.
- "simulation_result_dataset_storage_file":"path/to/file" - where to store the simulation results.
- "n_iterations": n - number of iterations to simulate for
- "simulation":
- "samples_per_hour": t - how many samples per hour in the simulation process. dt = 1/"samples_per_hour". Decides the step size of the integrator in the simulator.
- "options":
- "input_interval_size": n, how many iterations the input should stay constant for before changing,
- "perturbation_style": "single"/"double"/"F_only"/"Qk_only"- decides how the perturbation should ensue. "single" perturbates one of the inputs, chosen randomly, at the end of each interval. "double" changes both inputs. "F_only" changes only the input feed. "Qk_only" changes only the jacket cooling rate.
- "F_min":fl - minimum value for input feed
- "F_max":fu - maximum value for input feed
- "Fin_0":f0 - initial value for input feed
- "max_change_Fin": df - maximum change in input feed at each interval.
- "Qk_min":ql - minimum value for jacket cooling rate
- "Qk_max":qu - maximum value for jacket cooling rate
- "Qk_0":q0 - initial value for jacket cooling rate
- "max_change_Qk": dq - maximum change in jacket cooling rate at each interval
- "storage_file": "path/to/file" - where to store the generated input sequence
- "steady_state_variable": "feature1" - which feature should be in a steady state in the extracted data points.
- "state_change_threshold": x - the threshold change in feature value allowed in steady state
- "steady_period_duration_threshold": t - how many iterations must the feature be in steady state to make up a steady state period.
- "raw_data_file": "path/to/file" - file to fetch data to perform extraction on.
- "steady_state_data_storage_file": "path/to/file" - storage for extracted data.
Predictions
- "prediction_net_file": "path/to/file" - net to use for predictions.
- "prediction_input_data": "path/to/file" - file to data to predict on.
Pipelines
Training pipeline
Invoke the modules in the following order:
generate_input_vandevusse.py -> simulate_vandevusse.py -> steady_state_extraction.py -> train_model.py -> test_model.py
Prediction pipeline
Invoke the modules in the following order:
generate_input_vandevusse.py -> simulate_vandevusse.py -> make_predictions_batch.py
Further work
The system consists of several components, which form ML pipelines. More components could be integrated into the system, such as hyperparameter optimization, more extensive data pre-processing, etc. The system is modular, so integrating extra components should pose few problems.
The steady_state_extraction.py module could also benefit from more extensive functionality, e.g. analysing the dynamics of the simulated process to enable more helpful steady state extraction. This could include investigating the time constant, allowing data to be extracted based on the steady state of the inputs, which are the features we know are always measurable.
It would be interesting to integrate other technologies into this system, such as Docker, in order to create a containerized system that could be deployed somewhere. Kafka is also an option for enabling event-driven functionality.
Automating the pipelines would also be beneficial, and a step in the right direction towards a system managed with proper MLOps practices.