Efficient k-means Using Triangle Inequality on Spark for Cyber Security Analytics
As technology advances and the number of digital sources grows, the volume of data increases every day, and with it the volume of cyber security related data. Traditional security systems such as Intrusion Detection Systems (IDS) cannot handle such a growing amount of data in real time. Cyber security analytics is an alternative to these traditional systems: it uses big data analytics techniques to provide a faster, scalable framework for handling large amounts of cyber security related data in real time. k-means clustering is one of the commonly used clustering algorithms in cyber security analytics. It divides security related data into groups of similar entities, which in turn can help in gaining important insights into known and unknown attack patterns. This lets a security analyst restrict the analysis to the data in specific clusters. To improve performance, k-means can exploit the triangle inequality to skip many point-center distance computations without affecting the clustering results. In this paper, we reformulate a parallel version of Elkan's k-means with triangle inequality (k-meansTI) algorithm, implement it on Apache Spark, and use it to group Web attacks into clusters. The paper also compares the speed of our parallel k-meansTI on Spark with that of the Spark ML k-means clustering algorithm.
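The triangle-inequality pruning mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's parallel k-meansTI and does not use Spark. It implements only one of the tests used in Elkan's algorithm, the center-to-center bound: if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' need not be computed. The function names (`dist`, `assign_with_pruning`, `assign_brute_force`) and the sample data are hypothetical, chosen only for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality proves cannot be strictly closer.

    Returns (labels, skipped), where skipped counts the point-center
    distance computations that were avoided.
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_d = dist(p, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then by the triangle
            # inequality d(p, c_j) >= d(p, c_best), so c_j cannot win.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centers):
    """Reference assignment that computes every point-center distance."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        for p in points
    ]


# Hypothetical sample data: three well-separated groups in 2D.
points = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (10.2, 9.9), (-5.0, 5.0)]
centers = [(0.0, 0.0), (10.0, 10.0), (-5.0, 5.0)]
labels, skipped = assign_with_pruning(points, centers)
```

The pruned assignment returns the same labels as the brute-force version, which is the sense in which the clustering results are unaffected. The full algorithm by Elkan additionally maintains per-point upper and lower bounds across iterations, which this sketch omits.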