Density-Based Spatial Clustering with Application and Noise with Spark
Abstract
As more devices are connected to the Internet every day, analyzing the growing amount of data they generate becomes increasingly important. The data is growing in size, is unstructured, and tends to arrive at varying rates. In this thesis, we explore frameworks that enable us to process large amounts of data. These frameworks are designed so that more than one machine can be used to process the data. We look at Apache Spark, a framework that can help us design machine learning algorithms to process the increasing amount of data we are facing. We focus on clustering algorithms, in particular DBSCAN. Our implementation of the DBSCAN algorithm is also the basis for a naïve streaming DBSCAN that we have implemented. The results address issues with the implementation and possible improvements to it.