Stroke Prediction using Distributed Machine Learning Based on Apache Spark

Hager Ahmed, Sara F. Abd-el ghany, Eman M. G. Youn, Nahla F. Omran, Abdelmgeid A. Ali

Abstract

Stroke is one of the leading causes of death and one of the primary causes of severe long-term disability worldwide. In this paper, we compare several distributed machine learning algorithms for stroke prediction on the Healthcare Dataset Stroke. The work is implemented on Apache Spark, one of the most popular big data platforms, which includes the MLlib library. MLlib is an API integrated with Spark that provides machine learning algorithms. Four classification algorithms were used to build the stroke prediction model: Decision Tree, Support Vector Machine, Random Forest Classifier, and Logistic Regression. Hyperparameter tuning and cross-validation were applied to each algorithm to improve the results. Accuracy, Precision, Recall, and F1-measure were used to evaluate the models. The results show that the Random Forest Classifier achieved the best accuracy, at 90%.
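
For reference, the four evaluation measures named above can be sketched from the confusion-matrix counts of a binary classifier. This is a minimal illustration in plain Python, not the Spark MLlib evaluation used in the work; the function name `binary_metrics`, the `BinaryMetrics` container, and the choice of the stroke class as the positive label (`positive=1`) are assumptions made for the example. The abstract does not state whether the reported measures were computed per class or averaged, so the sketch covers only the single-positive-class case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class.

    `positive` is the label treated as the stroke class (an assumption of
    this sketch). Ratios with a zero denominator are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)
```

With made-up labels `y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]` and `y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]`, the counts are 2 true positives, 1 false positive, 1 false negative, and 6 true negatives, giving an accuracy of 0.8 and a precision, recall, and F1 of 2/3 each.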

Keywords: Stroke; Stroke Prediction; Machine Learning; Big Data; Apache Spark