Machine Learning on Imbalanced Datasets

In real-world scenarios, datasets are often imbalanced: in a dataset meant for supervised learning, some classes contain a very large number of instances compared to the others. Training machine learning algorithms on such data is challenging. We have developed an algorithm that overcomes the problems of widely used oversampling algorithms.
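To illustrate why training on such data is challenging, here is a small hedged example in Python (assuming scikit-learn is available; the 99:1 split is illustrative): a baseline that always predicts the majority class scores high on plain accuracy while completely missing the minority class, which is why metrics such as F1-score and Balanced Accuracy are used in the benchmarks described below.

```python
# Toy illustration: on a 99:1 imbalanced dataset, an "always majority"
# baseline looks accurate but never detects the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score

y_true = np.array([0] * 990 + [1] * 10)   # 990 majority vs. 10 minority labels
y_pred = np.zeros_like(y_true)            # baseline: always predict class 0

print(accuracy_score(y_true, y_pred))                     # 0.99, looks great
print(f1_score(y_true, y_pred, zero_division=0))          # 0.0, minority ignored
print(balanced_accuracy_score(y_true, y_pred))            # 0.5, no better than chance
```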

The Synthetic Minority Oversampling Technique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications of the majority class and affecting the overall balance of the model. We present an approach that overcomes this limitation of SMOTE and its extensions by employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm on 28 publicly available datasets and show an improved approximation of the data manifold for a given class in those datasets. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions, and observed that for imbalanced datasets LoRAS on average generates better predictive Machine Learning (ML) models in terms of F1-score and Balanced Accuracy. For a visual summary of our algorithm, please click on the LINK (pdf).
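The following is a minimal sketch of the LoRAS idea in Python, assuming NumPy and scikit-learn. The function name loras_oversample and the parameters k, n_shadow, n_affine, and sigma are illustrative assumptions, not the authors' published implementation: for each minority point, Gaussian-perturbed copies ("shadowsamples") of its local neighborhood are drawn, and a synthetic point is formed as a random combination of a few shadowsamples with weights summing to 1, so that the sample stays close to the approximated local data manifold rather than on a line segment between two minority points as in SMOTE.

```python
# A hedged sketch of LoRAS-style oversampling; parameter defaults are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def loras_oversample(X_min, n_samples, k=5, n_shadow=40, n_affine=3,
                     sigma=0.005, seed=0):
    """Generate synthetic minority samples via localized random affine
    combinations of Gaussian-perturbed neighborhood points (shadowsamples)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=min(k, len(X_min))).fit(X_min)
    synthetic = []
    for _ in range(n_samples):
        # Pick a random minority point and its local minority neighborhood.
        i = rng.integers(len(X_min))
        _, idx = nn.kneighbors(X_min[i:i + 1])
        neigh = X_min[idx[0]]
        # Draw noisy copies ("shadowsamples") of the neighborhood points.
        shadows = np.repeat(neigh, n_shadow, axis=0)
        shadows += rng.normal(scale=sigma, size=shadows.shape)
        # Combine a few shadowsamples with weights that sum to 1 (here convex
        # weights from a Dirichlet draw), keeping the synthetic point near the
        # approximated local data manifold of the minority class.
        picks = shadows[rng.choice(len(shadows), size=n_affine, replace=False)]
        w = rng.dirichlet(np.ones(n_affine))
        synthetic.append(w @ picks)
    return np.array(synthetic)

# Example usage: X_min holds only the minority-class rows of a feature matrix.
# X_syn = loras_oversample(X_min, n_samples=500)
```

The key design point this sketch tries to capture is that each synthetic sample is a weighted combination of several noisy neighborhood points rather than an interpolation between exactly two minority instances, which is what limits the over-generalization seen with SMOTE.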