Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records

Date
2025
Publisher
Frontiers Media S.A.
Open Access Color
GOLD
Green Open Access
Yes
Publicly Funded
No
Abstract
Diabetes affects millions of people worldwide, leading to substantial morbidity, disability, and mortality. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. This study introduces a novel feature engineering approach to predict four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. To develop the classification models, we use the XGBoost feature selection method and several supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHRs) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were the patient's age range together with the chronic diseases recorded during patient visits from the onset of diabetes onward. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These findings indicate that our method outperforms the traditional Bag-of-Features approach, highlighting its effectiveness in improving model accuracy and robustness.
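For orientation, the Bag-of-Features baseline that the abstract compares against can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Visit` record, its field names, the 10-year age bucket width, and the example diagnoses are all assumptions made for this sketch, and the study's novel feature engineering approach is not reproduced here.

```python
from collections import namedtuple

# Hypothetical visit record; the field names are illustrative, not from the paper.
Visit = namedtuple("Visit", ["patient_id", "age", "diagnoses"])


def age_range(age, width=10):
    """Map an age in years to a bucket label such as '50-59' (width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def bag_of_features(visits, vocabulary):
    """Return one binary row per patient: 1 if the diagnosis appears in any visit."""
    seen = {}
    for visit in visits:
        seen.setdefault(visit.patient_id, set()).update(visit.diagnoses)
    return {
        pid: [1 if code in codes else 0 for code in vocabulary]
        for pid, codes in seen.items()
    }


# Made-up example visits recorded after the onset of diabetes.
visits = [
    Visit("p1", 52, ["hypertension"]),
    Visit("p1", 54, ["hypertension", "obesity"]),
    Visit("p2", 67, ["hyperlipidemia"]),
]
vocabulary = ["hypertension", "hyperlipidemia", "obesity"]

rows = bag_of_features(visits, vocabulary)
for pid, flags in rows.items():
    first_age = min(v.age for v in visits if v.patient_id == pid)
    print(pid, age_range(first_age), flags)
```

Each patient ends up with an age bucket and a fixed-length vector of presence flags, which is the kind of input a tree-based classifier such as Random Forest or XGBoost could consume. The order and timing of visits are discarded in this representation.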
Keywords
Diabetes Complications, Synthetic Electronic Health Records (EHRs), Feature Engineering, Feature Selection, Predictive Modeling, Machine Learning, Risk Prediction, Genetics, QH426-470
WoS Q
Q2
Scopus Q
Q2

OpenCitations Citation Count
1
Source
Frontiers in Genetics
Volume
16
PlumX Metrics
Citations
CrossRef : 1
Scopus : 2
Captures
Mendeley Readers : 9