Synthetic Data Generation for Healthcare Machine Learning: A Case Study in Vital Signs and Diagnostic Predictions


Sellappan Palaniappan
Rajasvaran Logeswaran
Kasthuri Subaramaniam
Oras Baker
Bui Ngoc Dung

Abstract

Healthcare machine learning (ML) applications face significant challenges in accessing high-quality training data because of stringent privacy regulations, institutional data silos, and concerns over patient confidentiality. This paper explores synthetic data generation as a viable, privacy-preserving alternative to real patient data for developing ML models in healthcare settings. We present techniques for generating realistic vital signs data, including body temperature, blood pressure, heart rate, respiratory rate, and oxygen saturation, drawn from appropriate statistical distributions. We then demonstrate how the generated synthetic datasets can be used to train diagnostic prediction models, applying them to several prediction tasks: hypertension, fever, chronic obstructive pulmonary disease (COPD), atrial fibrillation, and diabetes mellitus. Experimental results show that ML models trained solely on synthetic data achieved predictive performance comparable to that of models trained on real datasets for conditions with explicit physiological manifestations. In particular, gradient boosting classifiers attained an area under the curve (AUC) of up to 0.89 in predicting hypertension. We also show that augmenting sparse real patient data with synthetic samples preserves model accuracy while reducing reliance on sensitive data. This approach has considerable potential for healthcare organizations seeking to build reliable ML applications without compromising privacy standards such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR).
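The abstract does not give the distributions or parameters used, so the following is only a minimal sketch of what distribution-based vital signs generation might look like. Every name here (`VITAL_PARAMS`, `VITAL_BOUNDS`, `sample_patient`, `generate_dataset`), every mean, standard deviation, and clipping bound, and both labeling rules are illustrative assumptions rather than the authors' method. The hypertension rule uses the 2017 ACC/AHA stage 1 cut-off (systolic of at least 130 mmHg or diastolic of at least 80 mmHg), and the fever rule uses a commonly quoted 38.0 °C threshold.

```python
import random

# ASSUMPTION: illustrative (mean, standard deviation) pairs for adult resting
# vital signs. These are placeholders, not values taken from the paper.
VITAL_PARAMS = {
    "temperature_c": (36.8, 0.5),
    "systolic_mmhg": (120.0, 15.0),
    "diastolic_mmhg": (76.0, 10.0),
    "heart_rate_bpm": (72.0, 10.0),
    "resp_rate_bpm": (16.0, 2.5),
    "spo2_pct": (97.0, 1.5),
}

# ASSUMPTION: plausibility bounds used to clip extreme draws.
VITAL_BOUNDS = {
    "temperature_c": (34.0, 42.0),
    "systolic_mmhg": (70.0, 220.0),
    "diastolic_mmhg": (40.0, 130.0),
    "heart_rate_bpm": (30.0, 200.0),
    "resp_rate_bpm": (6.0, 40.0),
    "spo2_pct": (70.0, 100.0),
}


def sample_patient(rng):
    """Draw one synthetic record from independent clipped normal distributions."""
    record = {}
    for name, (mean, sd) in VITAL_PARAMS.items():
        low, high = VITAL_BOUNDS[name]
        value = rng.gauss(mean, sd)
        # Clip to the plausible range, then round to one decimal place.
        record[name] = round(min(max(value, low), high), 1)
    # Rule-based labels (assumed rules, see the paragraph above the code).
    record["hypertension"] = int(
        record["systolic_mmhg"] >= 130 or record["diastolic_mmhg"] >= 80
    )
    record["fever"] = int(record["temperature_c"] >= 38.0)
    return record


def generate_dataset(n, seed=0):
    """Generate n synthetic records reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in generate_dataset(3, seed=42):
        print(row)
```

Sampling each vital sign independently keeps the sketch short, but it ignores correlations between signs (for example, heart rate and temperature), which a realistic generator would likely need to model. Records produced this way could then be passed to any classifier, such as the gradient boosting models mentioned in the abstract.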

Article Details

How to Cite
Palaniappan, S., Logeswaran, R., Subaramaniam, K., Baker, O., & Dung, B. N. (2026). Synthetic Data Generation for Healthcare Machine Learning: A Case Study in Vital Signs and Diagnostic Predictions. Journal of Informatics and Web Engineering, 5(2), 1–33. https://doi.org/10.33093/jiwe.2026.5.2.1
Section
Regular issue

