Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records
| dc.contributor.author | Voskergian, Daniel | |
| dc.contributor.author | Bakir-Gungor, Burcu | |
| dc.contributor.author | Yousef, Malik | |
| dc.date.accessioned | 2025-09-25T10:46:12Z | |
| dc.date.available | 2025-09-25T10:46:12Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Diabetes significantly affects millions of people worldwide, leading to substantial morbidity, disability, and mortality. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. This study introduces a novel feature engineering approach to predict four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. To develop the classification models, we use the XGBoost feature selection method and several supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHRs) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were age ranges paired with the chronic diseases recorded during patient visits, starting from the onset of diabetes. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These findings suggest that our method outperforms the traditional Bag-of-Features approach, highlighting its effectiveness in enhancing model accuracy and robustness. | en_US |
| dc.description.sponsorship | Al-Quds University, Palestine; Abdullah Gül University Support Foundation (AGUV) | en_US |
| dc.description.sponsorship | The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The work of Daniel Voskergian and Malik Yousef has been supported by Al-Quds University, Palestine. The work of Burcu Bakir-Gungor has also been supported by the Abdullah Gül University Support Foundation (AGUV). | en_US |
| dc.identifier.doi | 10.3389/fgene.2025.1451290 | |
| dc.identifier.issn | 1664-8021 | |
| dc.identifier.scopus | 2-s2.0-105004059343 | |
| dc.identifier.uri | https://doi.org/10.3389/fgene.2025.1451290 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12573/3756 | |
| dc.language.iso | en | en_US |
| dc.publisher | Frontiers Media S.A. | en_US |
| dc.relation.ispartof | Frontiers in Genetics | en_US |
| dc.rights | info:eu-repo/semantics/openAccess | en_US |
| dc.subject | Diabetes Complications | en_US |
| dc.subject | Synthetic Electronic Health Records (EHRs) | en_US |
| dc.subject | Feature Engineering | en_US |
| dc.subject | Feature Selection | en_US |
| dc.subject | Predictive Modeling | en_US |
| dc.subject | Machine Learning | en_US |
| dc.subject | Risk Prediction | en_US |
| dc.title | Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records | en_US |
| dc.type | Article | en_US |
| dspace.entity.type | Publication | |
| gdc.author.scopusid | 57200259158 | |
| gdc.author.scopusid | 25932029800 | |
| gdc.author.scopusid | 14029389000 | |
| gdc.bip.impulseclass | C5 | |
| gdc.bip.influenceclass | C5 | |
| gdc.bip.popularityclass | C5 | |
| gdc.coar.access | open access | |
| gdc.coar.type | text::journal::journal article | |
| gdc.collaboration.industrial | false | |
| gdc.description.department | Abdullah Gül University | en_US |
| gdc.description.departmenttemp | [Voskergian, Daniel] Al Quds Univ, Comp Engn Dept, Jerusalem, Palestine; [Bakir-Gungor, Burcu] Abdullah Gul Univ, Fac Engn, Dept Comp Engn, Kayseri, Turkiye; [Yousef, Malik] Zefat Acad Coll, Dept Informat Syst, Safed, Israel | en_US |
| gdc.description.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | en_US |
| gdc.description.scopusquality | Q2 | |
| gdc.description.volume | 16 | en_US |
| gdc.description.woscitationindex | Science Citation Index Expanded | |
| gdc.description.wosquality | Q2 | |
| gdc.identifier.openalex | W4409404324 | |
| gdc.identifier.pmid | 40309033 | |
| gdc.identifier.wos | WOS:001478713000001 | |
| gdc.index.type | WoS | |
| gdc.index.type | Scopus | |
| gdc.index.type | PubMed | |
| gdc.oaire.accesstype | GOLD | |
| gdc.oaire.diamondjournal | false | |
| gdc.oaire.impulse | 1.0 | |
| gdc.oaire.influence | 2.5093538E-9 | |
| gdc.oaire.isgreen | true | |
| gdc.oaire.keywords | feature engineering | |
| gdc.oaire.keywords | feature selection | |
| gdc.oaire.keywords | machine learning | |
| gdc.oaire.keywords | Genetics | |
| gdc.oaire.keywords | diabetes complications | |
| gdc.oaire.keywords | QH426-470 | |
| gdc.oaire.keywords | synthetic electronic health records (EHRs) | |
| gdc.oaire.keywords | predictive modeling | |
| gdc.oaire.popularity | 3.4909649E-9 | |
| gdc.oaire.publicfunded | false | |
| gdc.openalex.collaboration | International | |
| gdc.openalex.fwci | 5.7208 | |
| gdc.openalex.normalizedpercentile | 0.95 | |
| gdc.openalex.toppercent | TOP 10% | |
| gdc.opencitations.count | 1 | |
| gdc.plumx.crossrefcites | 1 | |
| gdc.plumx.facebookshareslikecount | 63 | |
| gdc.plumx.mendeley | 9 | |
| gdc.plumx.newscount | 1 | |
| gdc.plumx.scopuscites | 2 | |
| gdc.scopus.citedcount | 2 | |
| gdc.virtual.author | Güngör, Burcu | |
| gdc.wos.citedcount | 1 | |
| relation.isAuthorOfPublication | e17be1f8-1c9a-45f2-bf0d-f8b348d2dba0 | |
| relation.isAuthorOfPublication.latestForDiscovery | e17be1f8-1c9a-45f2-bf0d-f8b348d2dba0 | |
| relation.isOrgUnitOfPublication | 665d3039-05f8-4a25-9a3c-b9550bffecef | |
| relation.isOrgUnitOfPublication | 52f507ab-f278-4a1f-824c-44da2a86bd51 | |
| relation.isOrgUnitOfPublication | ef13a800-4c99-4124-81e0-3e25b33c0c2b | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 665d3039-05f8-4a25-9a3c-b9550bffecef |
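
The abstract contrasts a Bag-of-Features baseline with features built from age ranges paired with chronic diseases. This record does not specify the exact construction, so the following is only a minimal sketch of one plausible reading. The function names, the 10-year bucket width, the `|` separator, and the sample patient are all hypothetical and are not taken from the article.

```python
from collections import Counter


def age_range(age, width=10):
    """Return a bucket label such as '50-59'. The 10-year width is an assumption."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def bag_of_features(visits):
    """Baseline: count each disease code across all visits, ignoring age."""
    return Counter(code for _, codes in visits for code in codes)


def age_range_features(visits):
    """Illustrative variant: count (age range, disease code) pairs."""
    return Counter(
        f"{age_range(age)}|{code}" for age, codes in visits for code in codes
    )


# Hypothetical patient: (age at visit, chronic diseases recorded at that visit)
visits = [
    (48, ["hypertension"]),
    (52, ["hypertension", "obesity"]),
    (57, ["obesity"]),
]

print(dict(bag_of_features(visits)))
print(dict(age_range_features(visits)))
```

In this sketch the baseline records only that each disease appears twice, whereas the paired encoding also keeps the age bucket in which each disease was recorded. Either count dictionary could then be vectorized and passed to a feature selection step and a classifier.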
