Analyzing Digital Evidence Using Parallel k-means with Triangle Inequality on Spark

Analyzing digital evidence has become a big data problem that requires faster methods running on a scalable framework. The standard k-means clustering algorithm is widely used in analyzing digital evidence. However, it is a hill-climbing method, and it becomes slower as the amount of data, its dimensionality, and the number of cluster centers increase. This paper presents a framework for implementing a parallel k-means with triangle inequality (k-meansTI) algorithm on Spark. The algorithm is intended to speed up standard k-means by skipping many point-center distance computations while producing the same clustering results. Our experimental results show that the parallel implementation of k-meansTI on Spark can be faster than Spark ML k-means when a data set is large, high dimensional, and not highly sparse. These results are based on experiments performed on six data sets that vary in the number of features and the number of data instances.
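The pruning idea behind skipping point-center distance computations can be illustrated with a small single-machine sketch. It is not the paper's Spark implementation; the function names (`dist`, `assign_with_ti`, `assign_brute`) and the sample data are assumptions made for illustration. The sketch relies on one consequence of the triangle inequality: if d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best), so center c_j cannot be strictly closer to x and its distance need not be computed.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_ti(points, centers):
    """Assign each point to its nearest center, pruning with the triangle inequality.

    Returns (labels, skipped), where skipped counts the point-center
    distance computations that were avoided.
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_d = dist(p, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
            # so center j cannot be strictly closer and is skipped.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute(points, centers):
    """Reference assignment that computes every point-center distance."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        for p in points
    ]


if __name__ == "__main__":
    # Hypothetical sample data, chosen only to exercise the pruning rule.
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9), (4.0, 4.0)]
    ctrs = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
    labels, skipped = assign_with_ti(pts, ctrs)
    print(labels, skipped)
    print(labels == assign_brute(pts, ctrs))
```

In a distributed setting, a pass like `assign_with_ti` would plausibly run per data partition with the current centers shared across workers, but how the paper organizes that on Spark is not stated in this abstract.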

Publisher

Institute of Electrical and Electronics Engineers (IEEE)