An Optimized Technique for Large-scale Data Clustering

Hasnaa Sayed, Sameh Abd El-Ghany, Samir Elmougy

Abstract

The day-to-day growth of large-scale data makes data processing increasingly challenging. Handling it requires cluster computing systems that can parallelize computations over large volumes of data across a set of nodes. Among these is Apache Spark, a reliable framework for iterative machine learning processes such as clustering. Spark stores and processes data in-memory across the nodes of a cluster, which makes it fast and fault-tolerant. Meanwhile, owing to its simplicity, K-means remains a significant topic for researchers. Large data volumes, however, increase the number of iterations and, thus, the computational complexity. Further, good initial centroids play an important part in boosting the performance of K-means, especially on large data. This paper therefore proposes a new hybrid approach to handle these challenges: it reduces the number of K-means iterations by cutting off the final iterations and initializes the centers with Scalable K-means++. The approach is implemented on the Apache Spark framework. The proposed hybrid approach, called FSS.K-means, reduces clustering time by more than 46% compared with standard K-means while maintaining about 96% of its accuracy.
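The two ideas the abstract combines, better seeding plus an early iteration cut-off, can be sketched in plain NumPy. This is only an illustrative single-machine sketch: the paper uses Scalable K-means++ on Spark, whereas the seeding routine below is ordinary k-means++, and the `max_iter` cap and `tol` threshold are assumed stand-ins for the paper's cutting-off method.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, max_iter=10, tol=1e-4, seed=0):
    """Lloyd's K-means with k-means++ seeding and an early cut-off:
    iteration stops once the centers move less than `tol`, or after
    `max_iter` rounds, whichever comes first."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        # assignment step: nearest center for every point
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # update step: new center = mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:   # cut off the remaining iterations
            break
    return centers, labels
```

On Spark itself, the same combination is available out of the box: `pyspark.ml.clustering.KMeans` accepts `initMode="k-means||"` (Scalable K-means++) together with a `maxIter` cap.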

Keywords: Large-scale Data Clustering, K-means, Apache Spark, Parallel Computing

Published
2019-11-15
How to Cite
Sayed, H., Abd El-Ghany, S., & Elmougy, S. (2019). An Optimized Technique for Large-scale Data Clustering. International Journal of Advanced Science and Technology, 28(15), 22–34. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/1546
Section
Articles