Machine Learning on Imbalanced Datasets

In real-world scenarios, datasets are often imbalanced: in supervised learning, the data divides into classes, and some classes contain far more instances than others. Training machine learning algorithms on such data is challenging. We have developed an oversampling algorithm that addresses a fundamental shortcoming of widely used methods.

One of the most prominent directions of research in this field involves balancing the data through oversampling and undersampling before learning from it. The Synthetic Minority Over-sampling Technique (SMOTE) is the pioneer of many other effective oversampling techniques. We address a fundamental problem of SMOTE: it does not oversample uniformly from the entire data manifold and therefore does not produce a sufficiently good approximation of it. The key idea of our algorithm, Localized Randomized Affine Shadowsampling (LoRAS), is to approximate the data manifold of the minority class locally and to draw oversampled points from this local approximation. Testing LoRAS on several publicly available datasets, we show evidence of improved model performance compared to several state-of-the-art oversampling algorithms.
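To make the key idea concrete, the following Python sketch implements LoRAS-style oversampling based only on the description above; the function name and all parameters and defaults (n_shadow, sigma, n_combine, n_per_point) are illustrative assumptions, not taken from the reference implementation. For each minority point, the local neighbourhood is first enriched with Gaussian-perturbed "shadowsamples"; synthetic points are then drawn as random convex (affine, non-negative, sum-to-one) combinations of these shadowsamples, so each new point lies within the locally approximated manifold.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def loras_sketch(X_min, k=5, n_shadow=10, sigma=0.005, n_combine=3,
                 n_per_point=20, seed=None):
    """LoRAS-style oversampling sketch (illustrative, not the reference code).

    X_min : array of shape (n_minority, n_features), minority-class points.
    Returns an array of synthetic minority samples.
    """
    rng = np.random.default_rng(seed)
    # Local neighbourhood of each minority point (includes the point itself).
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nbrs.kneighbors(X_min)
    synthetic = []
    for i in range(len(X_min)):
        hood = X_min[idx[i]]                       # shape (k + 1, n_features)
        # Step 1: "shadowsamples" -- Gaussian-perturbed copies of the
        # neighbourhood points, a local approximation of the data manifold.
        shadows = np.repeat(hood, n_shadow, axis=0)
        shadows = shadows + rng.normal(scale=sigma, size=shadows.shape)
        # Step 2: random convex combinations of a few shadowsamples;
        # non-negative weights summing to one keep each synthetic point
        # inside the convex hull of the local approximation.
        for _ in range(n_per_point):
            pick = rng.choice(len(shadows), size=n_combine, replace=False)
            w = rng.dirichlet(np.ones(n_combine))  # weights sum to 1
            synthetic.append(w @ shadows[pick])
    return np.asarray(synthetic)

A call such as loras_sketch(X_min, seed=0) returns synthetic minority points that can be appended to the training set. For contrast, SMOTE places each synthetic point on the line segment between a minority point and one of its neighbours; the convex combinations above instead cover a small patch of the neighbourhood, which is the local manifold approximation the paragraph refers to.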