Scaling Machine Learning Methods to Big Data Systems
As the world becomes more digital, an increasing amount of data is generated that could provide industries with valuable insight for improving their products and services for customers and employees alike. To unlock this value, the data is captured and stored, and Machine Learning techniques are applied to extract knowledge from it. With the emergence of the Internet of Things, increasing quantities of data are available in real time. The industry lacks a unified tool for exploiting the relationship between historical and real-time data, and as a consequence, separate systems have been glued together for the two tasks. To address the issues of gluing multiple systems together, in this project we investigate how to integrate existing Machine Learning frameworks, such as WEKA, with existing Big Data Management Systems, such as AsterixDB. The main goal is to enable Machine Learning and Data Mining methods at large scale, both for historical, stored data and for streaming, real-time data. This makes it possible to view historical and real-time data in the context of each other, to continuously improve the Machine Learning models by training on newly arriving data as it accumulates, and to have the algorithms adapt as the data evolves. Experiments with 100 million records show that this implementation scales gracefully for historical data while imposing minimal overhead on AsterixDB when processing streaming data. The solution achieves a processing throughput of up to 10,000 records per second and scales well with the distribution of the cluster.
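The continuous-training pattern the abstract describes can be sketched as follows: instead of retraining on the full history, the model is updated on each record as it arrives from the stream. The sketch below is a minimal, hypothetical stand-in for an updateable learner (WEKA, for instance, offers incrementally trainable classifiers such as NaiveBayesUpdateable); the class and method names here are illustrative and not taken from the project itself.

```python
# Minimal sketch of incremental (online) training on streaming records.
# OnlineGaussianNB is an illustrative stand-in for an updateable classifier,
# not the project's actual implementation.
import math
from collections import defaultdict

class OnlineGaussianNB:
    """Gaussian Naive Bayes trained one record at a time (Welford updates)."""

    def __init__(self):
        self.count = defaultdict(int)  # records seen per class
        self.mean = {}                 # per-class running feature means
        self.m2 = {}                   # per-class sums of squared deviations

    def update(self, x, label):
        """Fold one (features, label) record into the running statistics."""
        n = self.count[label] + 1
        self.count[label] = n
        if label not in self.mean:
            self.mean[label] = list(x)
            self.m2[label] = [0.0] * len(x)
            return
        for i, xi in enumerate(x):
            delta = xi - self.mean[label][i]
            self.mean[label][i] += delta / n
            self.m2[label][i] += delta * (xi - self.mean[label][i])

    def predict(self, x):
        """Return the class with the highest Gaussian log-likelihood."""
        total = sum(self.count.values())
        best, best_score = None, -math.inf
        for label, n in self.count.items():
            score = math.log(n / total)  # log class prior
            for i, xi in enumerate(x):
                var = self.m2[label][i] / n + 1e-9  # smoothed variance
                score -= 0.5 * (math.log(2 * math.pi * var)
                                + (xi - self.mean[label][i]) ** 2 / var)
            if score > best_score:
                best, best_score = label, score
        return best

# Simulate a stream: each record updates the model immediately on arrival,
# so predictions reflect all data accumulated so far.
model = OnlineGaussianNB()
stream = [([1.0, 1.1], "a"), ([0.9, 1.0], "a"),
          ([5.0, 5.2], "b"), ([5.1, 4.9], "b"),
          ([1.1, 0.9], "a"), ([4.8, 5.0], "b")]
for features, label in stream:
    model.update(features, label)

print(model.predict([1.0, 1.0]))  # a record near cluster "a"
print(model.predict([5.0, 5.0]))  # a record near cluster "b"
```

Because each update touches only running statistics, the model can keep learning from a live feed while the same statistics remain queryable against the stored historical data.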