Local Similarity-Based Approach for Multivariate Missing Data Imputation

Aditya Dubey, Akhtar Rasool

Abstract

In many application datasets, missing values occur frequently for reasons including sensor failures, communication failures, environmental disruptions, and human error. Missing values should be recovered or estimated during the preprocessing phase of data mining, before the data are used in the analytics phase. Most current techniques use the whole dataset to impute missing values, which increases complexity because irrelevant records are taken into consideration during imputation. To address this issue, this paper proposes a new local-similarity imputation method that estimates missing data using clustering and a top-KNN approach. The method is a clustering-based hybrid: the records are first clustered, and the missing values within each cluster are then filled using the top-KNN approach based on weighted distance. The proposed hybrid process achieves adequate and reasonable imputation outcomes. The results are compared against imputation by mean substitution and Fuzzy C-Means (FCM), and the proposed technique performs better than these other imputation procedures.

Keywords: Clustering, Imputation, KNN, Missing at Random, Multivariate.
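
The hybrid procedure summarized in the abstract (cluster first, then fill each gap from the nearest neighbours inside the same cluster, weighted by distance) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the distance measure, or the weighting formula, so everything below is an assumption made for illustration: plain k-means seeded from the first complete records, a Euclidean distance restricted to the features both records have observed, and inverse-distance weights. The function names (`partial_distance`, `kmeans_partial`, `impute_local`) and the toy data are hypothetical.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over features observed in both records,
    rescaled to the full number of features (assumed measure)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def kmeans_partial(rows, n_clusters, n_iter=20):
    """Plain k-means on partial distances; centroids are seeded from the
    first complete rows (an assumption, chosen to keep the sketch deterministic)."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete record to seed centroids")
    centroids = [list(r) for r in complete[:n_clusters]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign every record to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: partial_distance(r, centroids[c]))
            for r in rows
        ]
        # Recompute each centroid feature from the observed values only.
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            for j in range(len(centroids[c])):
                observed = [r[j] for r in members if r[j] is not None]
                if observed:
                    centroids[c][j] = sum(observed) / len(observed)
    return labels

def impute_local(rows, n_clusters=2, k=2, eps=1e-9):
    """Fill each missing value from the top-k nearest records in the same
    cluster, using inverse-distance weights (assumed weighting)."""
    labels = kmeans_partial(rows, n_clusters)
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: same cluster, feature j observed, finite distance.
            donors = [
                (partial_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and labels[m] == labels[i] and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            top = donors[:k]
            if not top:
                continue  # no usable neighbour; leave the gap as it is
            weights = [1.0 / (dist + eps) for dist, _ in top]
            weighted_sum = sum(w * v for w, (_, v) in zip(weights, top))
            result[i][j] = weighted_sum / sum(weights)
    return result

# Toy data with two well-separated groups; None marks a missing value.
rows = [
    [1.0, 2.0],
    [1.2, 2.2],
    [1.1, None],
    [9.0, 10.0],
    [9.2, 10.4],
    [None, 10.2],
]
filled = impute_local(rows, n_clusters=2, k=2)
print(filled[2], filled[5])
```

On this toy input the two gaps are filled only from records in their own group, which is the point of restricting the neighbour search to a cluster: records from the distant group never enter the weighted average.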