Cluster-wise Oversampling to Improve Logistic Regression Model Performance in Imbalanced Data Sets
Abstract: In an imbalanced dataset with a binary response, the percentages of successes and failures are very different. In many real-world cases, most of the observations are “normal” (i.e., success or 1), with a much smaller fraction of failures (0). For extremely imbalanced data sets, the overall probability of correct classification can be very high while the accuracy metrics for predicting the minority class remain very low. The Synthetic Minority Over-sampling Technique (SMOTE) improves prediction accuracy by creating extra synthetic examples of the minority class. In this paper, we propose a parametric over-sampling method which generates continuous predictors from a multivariate normal distribution for the minority class. It is well known that the joint distribution of the predictors does not influence the fitted logistic regression model, or a multiple linear regression model, and therefore this approach to generating synthetic samples from the minority class is valid. This approach, however, can run into numerical problems in cases where the sample covariance matrix S of the predictors turns out to be indefinite (i.e., some of the eigenvalues of S are negative). For such cases, we use well-conditioned estimates of the sample covariance matrix S for random number generation. Several examples are used to illustrate the proposed method. In each of these examples, an improvement in prediction accuracy is observed.
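The procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy data, the eigenvalue-clipping repair of an indefinite covariance matrix (a simple stand-in for the well-conditioned estimators the paper uses), and the function names are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced data: 500 majority rows (y = 1), 25 minority rows (y = 0).
X_maj = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
X_min = rng.normal(loc=1.5, scale=1.0, size=(25, 3))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.ones(500), np.zeros(25)])

def clip_to_psd(S, eps=1e-6):
    """Clip negative eigenvalues of S so it is positive definite.
    A simple stand-in for a well-conditioned covariance estimator."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, eps, None)
    return (V * w) @ V.T

def oversample_minority(X_min, n_new, rng):
    """Draw synthetic minority rows from a fitted multivariate normal."""
    mu = X_min.mean(axis=0)
    S = np.cov(X_min, rowvar=False)
    S = clip_to_psd(S)  # guard against an indefinite sample covariance
    return rng.multivariate_normal(mu, S, size=n_new)

# Augment the minority class until the two classes are balanced.
X_syn = oversample_minority(X_min, n_new=475, rng=rng)
X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, np.zeros(475)])

# Fit logistic regression on the oversampled data.
model = LogisticRegression().fit(X_aug, y_aug)
minority_recall = (model.predict(X_min) == 0).mean()
```

With only 25 minority rows, the 3x3 sample covariance here is usually well behaved; the clipping step matters when the minority sample is small relative to the number of predictors, which is exactly when an indefinite (or singular) S can arise.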