Preserving Logical and Functional Dependencies in Synthetic Clinical Datasets

DFG Project Number: 576429337

This project develops dependency-aware methods for generating synthetic clinical tabular data, focusing on preserving both logical and functional relationships among attributes while maintaining data utility and fidelity.

Dependencies among attributes are a fundamental characteristic of clinical tabular data, yet their preservation in synthetic data generation remains largely underexplored. While functional dependencies have received some attention in prior work, logical dependencies lack formal definitions and practical extraction methods.

This project aims to formalize logical dependencies, develop efficient techniques for their extraction, and propose quantitative metrics for their assessment. Furthermore, we evaluate existing synthetic data generation models for their ability to preserve both logical and functional dependencies. By incorporating dependency-aware constraints into data generation frameworks, the project seeks to enable the creation of reliable synthetic data that maintains data utility and privacy while faithfully preserving inter-attribute relationships, with particular relevance to healthcare and data-driven research.
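To make the two kinds of dependency concrete, the following is a minimal sketch of how each might be checked on a small table. It assumes the standard textbook reading of a functional dependency (each value of one attribute determines exactly one value of another) and treats a logical dependency informally as an if-then rule between attribute values; the project's own formalization may differ. The column names, the toy rows, and the helper names `fd_holds` and `rule_violations` are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every value of `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, val) != val:
            return False
    return True


def rule_violations(rows, condition, consequence):
    """Count rows where `condition` holds but `consequence` does not."""
    return sum(1 for r in rows if condition(r) and not consequence(r))


# Toy records (illustrative only, not taken from any dataset in the project)
rows = [
    {"icd_code": "E11", "diagnosis": "type 2 diabetes", "sex": "F", "pregnant": "no"},
    {"icd_code": "E11", "diagnosis": "type 2 diabetes", "sex": "M", "pregnant": "no"},
    {"icd_code": "I10", "diagnosis": "hypertension", "sex": "F", "pregnant": "yes"},
]

code_determines_diagnosis = fd_holds(rows, "icd_code", "diagnosis")
sex_determines_pregnancy = fd_holds(rows, "sex", "pregnant")
male_pregnant_count = rule_violations(
    rows,
    condition=lambda r: r["sex"] == "M",
    consequence=lambda r: r["pregnant"] == "no",
)
```

Running the same checks on a real table and on its synthetic counterpart gives a simple first indication of whether a generator has kept or broken a given dependency.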

Related publications

Dependency-aware synthetic tabular data generation

Chaithra Umesh, Kristian Seegel-Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer

Pattern Recognition, Volume 179, 2026, ISSN 0031-3203, Pages 113819

Synthetic tabular data is increasingly used in privacy-sensitive domains such as healthcare, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are often poorly retained in synthetic datasets, if at all. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for generating synthetic tabular data. We created benchmark datasets with known dependencies to evaluate our proposed HFGF. The framework first generates independent features using a standard generative model, and then reconstructs dependent features using predefined FD and LD rules. Our experiments on four benchmark datasets and three publicly available real-world datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Utility analysis and qualitative dependency visualizations further show that HFGF significantly enhances the structural fidelity and utility of synthetic tabular data.
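The two-stage idea described in the abstract (generate independent features first, then reconstruct dependent ones from rules) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how rules are represented or applied, so everything below is an assumption. In particular, `sample_independent` is a stand-in that samples from empirical marginals where a real pipeline would call a generative model such as CTGAN or TVAE, and the column names, the FD lookup table, and the `pregnant_rule` function are hypothetical.

```python
import random


def sample_independent(real_rows, independent_cols, n, rng):
    """Stand-in generator: draw each independent column from its empirical marginal.

    Assumption: a real pipeline would replace this with a trained generative model.
    """
    marginals = {c: [r[c] for r in real_rows] for c in independent_cols}
    return [{c: rng.choice(marginals[c]) for c in independent_cols} for _ in range(n)]


def learn_fd_map(real_rows, lhs, rhs):
    """Build a lookup table lhs -> rhs from the real data (assumes the FD holds there)."""
    return {r[lhs]: r[rhs] for r in real_rows}


def reconstruct(synthetic_rows, fd_maps, ld_rules, rng):
    """Fill in dependent columns: FDs by lookup, LDs by a rule that picks an allowed value."""
    out = []
    for row in synthetic_rows:
        row = dict(row)  # copy so the input rows are left unchanged
        for (lhs, rhs), mapping in fd_maps.items():
            row[rhs] = mapping[row[lhs]]
        for target, rule in ld_rules.items():
            row[target] = rule(row, rng)
        out.append(row)
    return out


def pregnant_rule(row, rng):
    """Hypothetical LD rule: male records are never pregnant; otherwise either value is allowed."""
    return "no" if row["sex"] == "M" else rng.choice(["yes", "no"])


# Toy "real" table (illustrative only)
real_rows = [
    {"icd_code": "E11", "diagnosis": "type 2 diabetes", "sex": "F", "pregnant": "no"},
    {"icd_code": "E11", "diagnosis": "type 2 diabetes", "sex": "M", "pregnant": "no"},
    {"icd_code": "I10", "diagnosis": "hypertension", "sex": "F", "pregnant": "yes"},
]

rng = random.Random(0)
fd_map = learn_fd_map(real_rows, "icd_code", "diagnosis")
independent = sample_independent(real_rows, ["icd_code", "sex"], n=5, rng=rng)
synthetic = reconstruct(
    independent,
    fd_maps={("icd_code", "diagnosis"): fd_map},
    ld_rules={"pregnant": pregnant_rule},
    rng=rng,
)
```

Because the dependent columns are derived rather than sampled, every synthetic row in this sketch satisfies the FD and the rule by construction, whatever the first-stage generator produces for the independent columns.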