Dr. Saptarshi Bej

Stays hungry, stays foolish, seeks to learn!

Research interest

Currently, I am pursuing several research problems at SBI Rostock.

1) Machine learning for finding a needle in a haystack and its relevance in the Systems Medicine context: The promise of personalized medicine is that diagnosis, prognosis and therapeutic decisions become more specific to the individual patient. One example of more personalized diagnostics is combining conventional routine data with multiple omics data. Increasing the types of data or the number of features inherently increases the number of subgroups that represent patient subpopulations relevant to clinical decision-making. From a machine learning perspective, the group we target for characterization and classification will then be much smaller than the rest of the population. If an algorithm sees numerous examples of a “regular” or “usual” case but only a few examples of what we are aiming to classify or predict, the dataset is referred to as an “imbalanced dataset”.

In real-world scenarios, datasets are often imbalanced. That is, datasets meant for supervised learning divide into classes, and some classes contain far more instances than others. Training machine learning algorithms on such data is challenging. We have developed several algorithms that overcome problems of widely used algorithms. We are looking for biological and clinical applications of our methods related to personalized treatment. Furthermore, we have already developed an application of these algorithms to single-cell technology.
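To illustrate why imbalance makes training and evaluation difficult, here is a minimal Python sketch. The labels are made up for illustration (95 majority cases and 5 minority cases, not taken from any study). A trivial classifier that always predicts the majority class reaches high accuracy while detecting none of the minority cases.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive cases that were predicted as positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)

# Hypothetical toy labels: 0 = majority ("usual") class, 1 = minority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * len(y_true)  # a classifier that always predicts the majority class

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)
print(majority_accuracy, minority_recall)
```

Accuracy alone hides the failure on the minority class, which is why a class-specific measure such as recall is reported alongside it.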

Synthetic oversampling based on the SMOTE algorithm has been an important cornerstone in improving imbalanced learning. We addressed the limitations of SMOTE-based oversampling algorithms through the novel idea of convex space learning. In an analytical explanation of the idea, we show that SMOTE-based oversampling algorithms generate synthetic samples with high variance in a minority class data neighborhood. We developed the LoRAS algorithm, which models the convex space of the minority class using multiple convex combinations of shadowsamples in a minority class neighborhood.
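The difference between the two sampling ideas can be sketched in a few lines of Python. This is a simplified illustration, not the published LoRAS implementation: the function names, the noise level `sigma`, the number of shadowsamples per point, and the toy neighborhood are all assumptions. The SMOTE-like step interpolates between two minority points, while the LoRAS-like step first creates noisy copies of the neighborhood points (shadowsamples) and then takes a random convex combination of several of them.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Interpolate between a minority point and one of its minority neighbors."""
    t = rng.random()  # position on the connecting segment, in [0, 1)
    return [p + t * (n - p) for p, n in zip(point, neighbor)]

def make_shadowsamples(neighborhood, per_point, sigma, rng):
    """Create noisy copies of each neighborhood point (sigma is a hypothetical noise level)."""
    return [
        [x + rng.gauss(0.0, sigma) for x in point]
        for point in neighborhood
        for _ in range(per_point)
    ]

def loras_like_sample(shadows, n_affine, rng):
    """Random convex combination of n_affine shadowsamples."""
    chosen = rng.sample(shadows, n_affine)
    raw = [rng.gammavariate(1.0, 1.0) for _ in chosen]
    total = sum(raw)
    weights = [r / total for r in raw]  # non-negative weights that sum to 1
    dim = len(chosen[0])
    return [sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(dim)]

rng = random.Random(0)
# Hypothetical minority class neighborhood in two dimensions.
neighborhood = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shadows = make_shadowsamples(neighborhood, per_point=10, sigma=0.01, rng=rng)
synthetic = [loras_like_sample(shadows, n_affine=3, rng=rng) for _ in range(5)]
smote_point = smote_like_sample(neighborhood[0], neighborhood[1], rng)
```

Because each LoRAS-like sample averages several shadowsamples with weights that sum to one, it always lies inside the convex hull of the shadowsamples, whereas a SMOTE-like sample lies on a single line segment between two points.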

Moreover, to address the classifier dependence of SMOTE-based oversampling algorithms, we proposed the ProWRAS algorithm, an improvement over the previously proposed LoRAS algorithm. By controlling the variance of the synthetic samples and using a proximity-weighted clustering system of the minority class data, ProWRAS improves performance compared to algorithms that generate synthetic samples by modelling high-dimensional convex spaces of the minority class. Most importantly, with a proper choice of oversampling scheme, the performance of ProWRAS is independent of the classifier used. We demonstrate through rigorous benchmarking studies that ProWRAS, with a proper choice of parameters, can adapt to classifier-specific oversampling schemes and thereby perform in a classifier-independent way. ProWRAS has been benchmarked against the leading oversampling algorithms on multiple datasets, where it outperformed the state of the art.

2) Effective patient stratification from epidemiological data: One of our research focuses is the stratification of T2DM populations from epidemiological data. We analyze the National Family Health Survey-4 (NFHS-4) dataset from India, which contains a wide spectrum of features, including medical history, dietary and addiction habits, and socio-economic and lifestyle patterns of 10,125 T2DM patients.

Usually, manifold learning algorithms such as t-SNE or UMAP are used to reduce data to lower dimensions for visualization and thereby find clusters in the data. However, we noticed a fascinating challenge that arises from the diverse feature types typically present in clinical and epidemiological data. We found that even when a dataset contains only a small number of continuous features, they have an overpowering effect when UMAP is used for dimension reduction. We provided a solution in the form of a feature-type-distributed clustering framework that uses different distance measures for different data types.
However, the workflow was specific to the NFHS-4 dataset, and not enough research could be conducted to generalize it to tabular clinical datasets with diverse feature types.
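As a rough illustration of using a different distance measure per feature type, consider the Python sketch below. The specific measures (Euclidean for continuous, Hamming for nominal, a scaled rank difference for ordinal features), the feature values, and the number of ordinal levels are assumptions chosen for illustration, not necessarily the ones used in the NFHS-4 workflow.

```python
import math

def continuous_distance(a, b):
    """Euclidean distance between two vectors of continuous features."""
    return math.dist(a, b)

def nominal_distance(a, b):
    """Hamming distance: fraction of nominal features that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def ordinal_distance(a, b, n_levels):
    """Mean absolute rank difference, scaled to [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * (n_levels - 1))

# Hypothetical records: [age, BMI], [residence, diet], ranks on two 4-level scales.
patient_a = {"continuous": [52.0, 24.5], "nominal": ["rural", "vegetarian"], "ordinal": [1, 0]}
patient_b = {"continuous": [49.0, 28.5], "nominal": ["urban", "vegetarian"], "ordinal": [3, 2]}

distances = {
    "continuous": continuous_distance(patient_a["continuous"], patient_b["continuous"]),
    "nominal": nominal_distance(patient_a["nominal"], patient_b["nominal"]),
    "ordinal": ordinal_distance(patient_a["ordinal"], patient_b["ordinal"], n_levels=4),
}
print(distances)
```

Keeping the three distances separate, rather than mixing all features into one vector, prevents the numerically larger continuous values from dominating the comparison.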

From a methodological perspective, we show that for diverse data types, frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is more effective than the conventional use of the UMAP algorithm. The application of a UMAP-based clustering workflow to this type of dataset is novel in itself. Our clustering paradigm applies UMAP separately to continuous, nominal and ordinal features. For each of these feature categories, we create a lower-dimensional embedding of the dataset. Finally, we integrate the lower-dimensional embeddings and extract clusters from them using DBSCAN, a density-based clustering algorithm. Our findings demonstrate heterogeneity among Indian T2DM patients with regard to sociodemographic and dietary patterns. From our analysis, we conclude that the existence of significant non-obese T2DM subpopulations, characterized by younger age and economic disadvantage, raises the need for different screening criteria for T2DM among rural Indian residents.
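The last two steps of this workflow, integrating the embeddings and extracting density-based clusters, can be sketched as follows. This is a simplified illustration under stated assumptions: the three two-dimensional embeddings are hand-written toy values standing in for UMAP output, integration is assumed to be plain concatenation, and `dbscan` is a minimal textbook version written for this example rather than the library implementation used in the study. The values of `eps` and `min_pts` are likewise hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 marks noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Toy stand-ins for per-feature-type UMAP embeddings of seven patients.
emb_continuous = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0], [9.0, 0.0]]
emb_nominal = [[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [4.0, 4.0], [4.1, 4.0], [4.0, 4.1], [0.0, 8.0]]
emb_ordinal = [[2.0, 0.0], [2.0, 0.1], [2.1, 0.0], [7.0, 3.0], [7.0, 3.1], [7.1, 3.0], [3.0, 3.0]]

# Assumed integration step: concatenate the three embeddings per patient.
integrated = [c + n + o for c, n, o in zip(emb_continuous, emb_nominal, emb_ordinal)]
labels = dbscan(integrated, eps=0.5, min_pts=3)
print(labels)
```

In this toy setup the first three and the next three patients form two dense groups, while the last patient has no close neighbors and is marked as noise.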

3) Relationship extraction from biomedical texts: Natural Language Processing (NLP) has contributed to extracting relationships among biological entities, including genes, their mutations, proteins, diseases, processes, phenotypes, and drugs, for a comprehensive and concise understanding of information in the literature. Self-attention-based models for Relationship Extraction (RE) have played an increasingly important role in NLP. However, self-attention models for RE frame the task as a classification problem, which limits their practical usability.

We have developed a novel approach for RE, referred to as Attention Retrieval Model (ARM), that can resolve the aforementioned limitations of the regular classification approach for RE. ARM learns the linguistic context between two related entities or between an interaction word and a related entity in a text from training data, rather than attempting to classify the text based on predefined annotations.

Our experiments show that ARM provides a flexible framework for modelers to customize their models, with the opportunity to integrate expert knowledge on interaction keywords. ARM can learn from integrated data with diverse entity types and contextual nuances of the language, which facilitates data integration across datasets. Furthermore, unlike its classification-based counterpart, ARM can extract relationships that are unannotated in the training data, analogous to zero-shot learning. ARM provides a unique self-attention-based deep learning framework for RE that can capture directed entity relationships.

4) Graph and Network theory and analysis: Graph theory is one of my passions. I have loved learning about the subject since my Master's degree. I am especially fascinated by Barnette's conjecture (unsolved since 1969). I also like to work on network analysis strategies for protein interaction networks.



Research Projects

Machine Learning on Imbalanced datasets

In real-world scenarios, datasets are often imbalanced. That is, datasets meant for supervised learning divide into classes, and some classes contain far more instances than others. Training machine learning algorithms on such data is challenging. We have developed an algorithm that overcomes problems of widely used algorithms.


iRhythmics: Programming pacemaker cells for in vitro drug testing

The project addresses the generation and establishment of programmed pacemaker cells that enable predictive in vitro drug testing. This may lead to improved treatment of cardiac arrhythmias or accurate identification of potential drug molecules at an early stage of development. Important benefits will arise in verifying the safety of a wide variety of medicines while reducing animal testing.


The TOTO Project: Towards a Theory of Tissue Organisation

 ~ In biology, the exception is the rule. ~

 ~ With our work, we are not really interested in the unique, but in what is general in the unique. ~

With this project, we want to address a biological and a methodological challenge. First, we wish to clarify how the functioning of cells, and the functioning of a tissue relate to each other. Do cells exercise a degree of autonomy, or is their behavior completely determined by the functioning of the tissue? Such questions are important in understanding the emergence and progression of diseases. For example, it remains unclear whether the causative origin of colon cancer is a cell, or a consequence of tissue organization.



GB-XMap: Assessing the risk of gut-brain cross-diseases

Investigating the gut-brain-axis

The gut–brain axis (GBA) provides bidirectional homeostatic communication between the gastrointestinal tract and the central nervous system. The interdisciplinary collaboration will explore a first comprehensive GBA cross-disease map of genetic, expression and regulatory changes associated with ulcerative colitis and schizophrenia.


Academic background

 2018-present Research Assistant and PhD student, SBI, Universität Rostock, Rostock
 2016-2017 Research assistant, Universität Paderborn
 2009-2014 Integrated BS-MS degree (major in Mathematics and specialization in Graph Theory), Indian Institute of Science Education and Research, Kolkata



Selected publications

LoRAS: An oversampling approach for imbalanced datasets

Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O

Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care

Hahn W, Schütte K, Schultz K, Wolkenhauer O, Sedlmayr M, Schuler U, Eichler M, Bej S, Wolfien M

JPM (2022)

Identification and epidemiological characterization of Type-2 Diabetes sub-population using an unsupervised machine learning approach

Bej S, Sarkar J, Biswas S, Mitra P, Chakrabarti P, Wolkenhauer O

Attention retrieval model for entity relation extraction from biological literature

Srivastava P, Bej S, Schultz K, Yordanova K, Wolkenhauer O

Cross-tissue transcriptome-wide association studies identify susceptibility genes shared between schizophrenia and inflammatory bowel disease

Uellendahl-Werth F, Maj C, Borisov O, Wacker EM, Bej S, Wolkenhauer O, Degenhardt F, Ellinghaus D et al.

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Bej S, Galow AM, David R, Wolfien M, Wolkenhauer O

Self-attention based models for the extraction of molecular interactions from biological texts

Srivastava P, Bej S, Yordanova K, Wolkenhauer O

Comprehensive Characterization of Multitissue Expression Landscape, Co-Expression Networks and Positive Selection in Pikeperch

Nguinkal JA, Verleih M, de los Ríos-Pérez L, Brunner RM, Sahm A, Bej S, Rebl A, Goldammer T

Cells 2021

A multi-schematic classifier-independent oversampling approach for imbalanced datasets

Bej S, Schultz K, Srivastava P, Wolfien M, Wolkenhauer O

Hamiltonian cycles in annular decomposable Barnette graphs

Bej S

JDMSC 2020. Full text in arXiv

Protein-coding variants contribute to the risk of atopic dermatitis and skin-specific gene expression

Mucha S, ... Bej S, ..., Wolfien M, ..., Wolkenhauer O, ..., Ellinghaus D

On extension of regular graphs

Banerjee A, Bej S

Coloring sums of extensions of certain graphs

Kok J, Bej S

Factors of edge-chromatic critical graphs: a brief survey and some equivalences

Bej S, Steffen E

Combining uniform manifold approximation with localized affine shadowsampling improves classification of imbalanced datasets

Bej S, Srivastava P, Wolfien M, Wolkenhauer O

2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1-8

Improved imbalanced classification through convex space learning

Saptarshi Bej

Imbalanced datasets for classification problems, characterised by unequal distribution of samples, are abundant in practical scenarios. Oversampling algorithms generate synthetic data to enrich classification performance for such datasets. In this thesis, I discuss two algorithms LoRAS & ProWRAS, improving on the state-of-the-art as shown through rigorous benchmarking on publicly available datasets. A biological application for detection of rare cell-types from single-cell transcriptomics data is also discussed. The thesis also provides a better theoretical understanding behind oversampling.

Defense: 16 Dec. 2021


  • Graph and Network Theory
  • Boolean modelling
  • Python
  • Machine learning
  • Deep Learning
  • RNA seq data analysis


Awards and Distinctions

  • DAAD Prize 2020 for outstanding achievements of international students (Universität Rostock)

Teaching Experience

  • Tutor in the 'Biosystems modelling and simulation' course offered at the University of Rostock from 2019 to 2020. My teaching covers an introduction to machine learning and deep learning and their applicability in biomedical fields.
  • Tutor in the 'Data Science with Python' undergraduate seminar course offered at the University of Rostock from 2020.