Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records


Date

2025

Publisher

Frontiers Media S.A.

Open Access Color

GOLD

Green Open Access

Yes

Publicly Funded

No
Impulse: Average
Influence: Average
Popularity: Average

Abstract

Diabetes significantly affects millions of people worldwide, leading to substantial morbidity, disability, and mortality. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. This study introduces a novel feature engineering approach for predicting four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. To develop the classification models, we use the XGBoost feature selection method and various supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHRs) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were age ranges paired with the chronic diseases recorded during patient visits from the onset of diabetes onward. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These findings suggest that our method outperforms the traditional Bag-of-Features approach, highlighting its effectiveness in enhancing model accuracy and robustness.
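
The abstract does not spell out the exact encoding, so the following is only a minimal sketch of the general idea of pairing age ranges with recorded chronic diseases, contrasted with a plain Bag-of-Features count. The visit records, the 10-year bin width, the `age|diagnosis` key format, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

# Hypothetical visit records: (age at visit, diagnosis label). Not real data.
visits = [
    (52, "hypertension"),
    (57, "hypertension"),
    (58, "hyperlipidemia"),
    (63, "hypertension"),
]

def age_range(age, width=10):
    """Return a label such as '50-59' for the bin containing `age`."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def bag_of_features(visits):
    """Count each diagnosis, ignoring when it occurred."""
    return Counter(dx for _, dx in visits)

def age_range_features(visits, width=10):
    """Count each (age range, diagnosis) pair, keeping coarse timing."""
    return Counter(f"{age_range(age, width)}|{dx}" for age, dx in visits)

bof = bag_of_features(visits)
engineered = age_range_features(visits)
print(dict(bof))
print(dict(engineered))
```

In this toy example the Bag-of-Features view only records that hypertension appeared three times, while the age-range view also records that two of those occurrences fell in the 50-59 bin and one in the 60-69 bin. Either dictionary could then be vectorized and passed to a feature selector and classifier.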

Keywords

Diabetes Complications, Synthetic Electronic Health Records (EHRs), Feature Engineering, Feature Selection, Predictive Modeling, Machine Learning, Risk Prediction, Genetics, QH426-470

WoS Q

Q2

Scopus Q

Q2
OpenCitations Citation Count
1

Source

Frontiers in Genetics

Volume

16

PlumX Metrics
Citations

CrossRef : 1

Scopus : 2

Captures

Mendeley Readers : 9

OpenAlex FWCI
5.7208
