Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records

dc.contributor.author Voskergian, Daniel
dc.contributor.author Bakir-Gungor, Burcu
dc.contributor.author Yousef, Malik
dc.date.accessioned 2025-09-25T10:46:12Z
dc.date.available 2025-09-25T10:46:12Z
dc.date.issued 2025
dc.description.abstract Diabetes significantly affects millions of people worldwide, leading to substantial morbidity, disability, and mortality. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. This study introduces a novel feature engineering approach for predicting four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. To develop the classification models, we use the XGBoost feature selection method and several supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHRs) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were the age range together with the chronic diseases recorded during patient visits, starting from the onset of diabetes. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These findings indicate that our method outperforms the traditional Bag-of-Features approach, highlighting its effectiveness in enhancing model accuracy and robustness. en_US
dc.description.sponsorship Al-Quds University, Palestine; Abdullah Gül University Support Foundation (AGUV) en_US
dc.description.sponsorship The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The work of Daniel Voskergian and Malik Yousef has been supported by Al-Quds University, Palestine. The work of Burcu Bakir-Gungor has also been supported by the Abdullah Gül University Support Foundation (AGUV). en_US
dc.identifier.doi 10.3389/fgene.2025.1451290
dc.identifier.issn 1664-8021
dc.identifier.scopus 2-s2.0-105004059343
dc.identifier.uri https://doi.org/10.3389/fgene.2025.1451290
dc.identifier.uri https://hdl.handle.net/20.500.12573/3756
dc.language.iso en en_US
dc.publisher Frontiers Media S.A. en_US
dc.relation.ispartof Frontiers in Genetics en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Diabetes Complications en_US
dc.subject Synthetic Electronic Health Records (EHRs) en_US
dc.subject Feature Engineering en_US
dc.subject Feature Selection en_US
dc.subject Predictive Modeling en_US
dc.subject Machine Learning en_US
dc.subject Risk Prediction en_US
dc.title Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.scopusid 57200259158
gdc.author.scopusid 25932029800
gdc.author.scopusid 14029389000
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department Abdullah Gül University en_US
gdc.description.departmenttemp [Voskergian, Daniel] Al Quds Univ, Comp Engn Dept, Jerusalem, Palestine; [Bakir-Gungor, Burcu] Abdullah Gul Univ, Fac Engn, Dept Comp Engn, Kayseri, Turkiye; [Yousef, Malik] Zefat Acad Coll, Dept Informat Syst, Safed, Israel en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q2
gdc.description.volume 16 en_US
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q2
gdc.identifier.openalex W4409404324
gdc.identifier.pmid 40309033
gdc.identifier.wos WOS:001478713000001
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 1.0
gdc.oaire.influence 2.5093538E-9
gdc.oaire.isgreen true
gdc.oaire.keywords feature engineering
gdc.oaire.keywords feature selection
gdc.oaire.keywords machine learning
gdc.oaire.keywords Genetics
gdc.oaire.keywords diabetes complications
gdc.oaire.keywords QH426-470
gdc.oaire.keywords synthetic electronic health records (EHRs)
gdc.oaire.keywords predictive modeling
gdc.oaire.popularity 3.4909649E-9
gdc.oaire.publicfunded false
gdc.openalex.collaboration International
gdc.openalex.fwci 5.7208
gdc.openalex.normalizedpercentile 0.95
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 1
gdc.plumx.crossrefcites 1
gdc.plumx.facebookshareslikecount 63
gdc.plumx.mendeley 9
gdc.plumx.newscount 1
gdc.plumx.scopuscites 2
gdc.scopus.citedcount 2
gdc.virtual.author Güngör, Burcu
gdc.wos.citedcount 1
relation.isAuthorOfPublication e17be1f8-1c9a-45f2-bf0d-f8b348d2dba0
relation.isAuthorOfPublication.latestForDiscovery e17be1f8-1c9a-45f2-bf0d-f8b348d2dba0
relation.isOrgUnitOfPublication 665d3039-05f8-4a25-9a3c-b9550bffecef
relation.isOrgUnitOfPublication 52f507ab-f278-4a1f-824c-44da2a86bd51
relation.isOrgUnitOfPublication ef13a800-4c99-4124-81e0-3e25b33c0c2b
relation.isOrgUnitOfPublication.latestForDiscovery 665d3039-05f8-4a25-9a3c-b9550bffecef
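
The abstract contrasts the proposed feature engineering with a traditional Bag-of-Features encoding, and states that the model variables combine an age range with the chronic diseases recorded at patient visits. The record does not describe how the features are actually constructed, so the following is only a minimal illustrative sketch under stated assumptions: the toy visit data, the function names (`bag_of_features`, `age_paired_features`, `to_vector`), and the idea of counting (age range, disease) pairs are all hypothetical and are not taken from the article.

```python
from collections import Counter

# Hypothetical toy visits for one patient: (age_range, [chronic diseases]).
# These values are invented for illustration and are not from the study's data.
visits = [
    ("50-54", ["hypertension", "obesity"]),
    ("55-59", ["hypertension"]),
    ("55-59", ["dyslipidemia"]),
]


def bag_of_features(visits):
    """Count how often each disease appears across visits, ignoring age."""
    return Counter(d for _, diseases in visits for d in diseases)


def age_paired_features(visits):
    """Count (age_range, disease) pairs, so the same disease observed
    in different age ranges yields distinct features (an assumed design)."""
    return Counter((age, d) for age, diseases in visits for d in diseases)


def to_vector(counts, vocabulary):
    """Turn a Counter into a fixed-order numeric vector for a classifier."""
    return [counts.get(key, 0) for key in vocabulary]


bof = bag_of_features(visits)
paired = age_paired_features(visits)

# Fixed, sorted vocabularies so every patient maps to the same columns.
bof_vocab = sorted(bof)
paired_vocab = sorted(paired)

bof_vector = to_vector(bof, bof_vocab)
paired_vector = to_vector(paired, paired_vocab)
```

In this sketch the Bag-of-Features vector has one column per disease, while the age-paired vector has one column per (age range, disease) combination. Either vector could then be passed to a feature selector and a classifier such as those named in the abstract.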