Browsing by Author "Bakir-Gungor, Burcu"

3Mont: A Multi-Omics Integrative Tool for Breast Cancer Subtype Stratification
(Public Library Science, 2025) Unlu Yazici, Miray; Marron, J. S.; Bakir-Gungor, Burcu; Zou, Fei; Yousef, Malik; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi; 04. Yaşam ve Doğa Bilimleri Fakültesi; 04.01. Biyomühendislik
Breast Cancer (BRCA) is a heterogeneous disease, and it is one of the most prevalent cancer types among women. Developing effective treatment strategies that address diverse types of BRCA is crucial. Notably, among different BRCA molecular sub-types, Hormone Receptor negative (HR-) BRCA cases, especially Basal-like BRCA sub-types, lack estrogen and progesterone hormone receptors and they exhibit a higher tumor growth rate compared to HR+ cases. Improving survival time and predicting prognosis for distinct molecular profiles is of substantial importance. In this study, we propose a novel approach called 3-Multi-Omics Network and Integration Tool (3Mont), which integrates various -omics data by applying a grouping function, detecting pro-groups, and assigning scores to each pro-group using the feature importance scoring (FIS) component. Following that, machine learning (ML) models are constructed based on the prominent pro-groups, which enable the extraction of promising biomarkers for distinguishing BRCA sub-types. Our tool allows users to analyze the collective behavior of features in each pro-group (biological groups) utilizing ML algorithms. In addition, by constructing the pro-groups and equalizing the feature numbers in each pro-group using the FIS component, this process achieves a significant 20% speedup over the 3Mint tool. Contrary to conventional methods, 3Mont generates networks that illustrate the interplay of the prominent biomarkers of different -omics data. Accordingly, exploring the concerted actions of features in pro-groups facilitates understanding the dynamics of the biomarkers within the generated networks and developing effective strategies for better cancer sub-type stratification. The 3Mont tool, along with all supporting materials, can be found at https://github.com/malikyousef/3Mont.git.
Citation - WoS: 5
Citation - Scopus: 4
Active Subnetwork Ga: A Two Stage Genetic Algorithm Approach to Active Subnetwork Search
(Bentham Science Publ Ltd, 2017) Ozisik, Ozan; Bakir-Gungor, Burcu; Diri, Banu; Sezerman, Osman Ugur; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Background: A group of interconnected genes in a protein-protein interaction network that contains most of the disease associated genes is called an active subnetwork. Active subnetwork search is an NP-hard problem. In the last decade, simulated annealing, greedy search, color coding, genetic algorithm, and mathematical programming based methods have been proposed for this problem. Method: In this study, we employed a novel genetic algorithm method for the active subnetwork search problem. We used an active node list chromosome representation, a branch swapping crossover operator, multicombination of branches in crossover, mutation on duplicate individuals, pruning, and a two stage genetic algorithm approach. The proposed method is tested on simulated datasets and the Wellcome Trust Case Control Consortium rheumatoid arthritis genome-wide association study dataset. Our results are compared with the results of a simple genetic algorithm implementation and the results of the simulated annealing method that is proposed by Ideker et al. in their seminal paper. Results and Conclusion: The comparative study demonstrates that our genetic algorithm approach outperforms the simple genetic algorithm implementation in all datasets and simulated annealing in all but one dataset in terms of obtained scores, although our method is slower. Functional enrichment results show that the presented approach can successfully extract high scoring subnetworks in simulated datasets and identify significant rheumatoid arthritis associated subnetworks in the real dataset. This method can be easily used on the datasets of other complex diseases to detect disease-related active subnetworks. Our implementation is freely available at https://www.ce.yildiz.edu.tr/personal/ozanoz/file/6611/ActSubGA.
Aguhyper: A Hyperledger-Based Electronic Health Record Management Framework
(PeerJ Inc, 2024) Dedeturk, Beyhan Adanur; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
The increasing importance of healthcare records, particularly given the emergence of new diseases, emphasizes the need for secure electronic storage and dissemination. With these records dispersed across diverse healthcare entities, their physical maintenance proves to be excessively time-consuming. The prevalent management of electronic healthcare records (EHRs) presents inherent security vulnerabilities, including susceptibility to attacks and potential breaches orchestrated by malicious actors. To tackle these challenges, this article introduces AguHyper, a secure storage and sharing solution for EHRs built on a permissioned blockchain framework. AguHyper utilizes Hyperledger Fabric and the InterPlanetary Distributed File System (IPFS). Hyperledger Fabric establishes the blockchain network, while IPFS manages the off-chain storage of encrypted data, with hash values securely stored within the blockchain. Focusing on security, privacy, scalability, and data integrity, AguHyper's decentralized architecture eliminates single points of failure and ensures transparency for all network participants. The study develops a prototype to address gaps identified in prior research, providing insights into blockchain technology applications in healthcare. Detailed analyses of system architecture, AguHyper's implementation configurations, and performance assessments with diverse datasets are provided. The experimental setup incorporates CouchDB and the Raft consensus mechanism, enabling a thorough comparison of system performance against existing studies in terms of throughput and latency. This contributes significantly to a comprehensive evaluation of the proposed solution and offers a unique perspective on existing literature in the field.
Citation - WoS: 8
Citation - Scopus: 12
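The AguHyper abstract above describes keeping encrypted records off-chain and storing only their hash values on the blockchain. The following is a minimal sketch of that general pattern, not the authors' implementation: a plain dictionary stands in for IPFS, a plain list stands in for the Hyperledger Fabric ledger, the payload is assumed to be encrypted already, and the names `store_record` and `verify_record` are hypothetical.

```python
import hashlib


def store_record(off_chain: dict, ledger: list, record_id: str, ciphertext: bytes) -> str:
    """Keep the (already encrypted) payload off-chain and append only its digest to the ledger."""
    digest = hashlib.sha256(ciphertext).hexdigest()
    # Content-addressed store; an assumption standing in for IPFS.
    off_chain[digest] = ciphertext
    # Append-only list; an assumption standing in for a blockchain transaction.
    ledger.append({"record_id": record_id, "sha256": digest})
    return digest


def verify_record(off_chain: dict, ledger: list, record_id: str) -> bool:
    """Recompute the digest of the stored payload and compare it with the latest ledger entry."""
    for entry in reversed(ledger):
        if entry["record_id"] == record_id:
            payload = off_chain.get(entry["sha256"])
            if payload is None:
                return False
            return hashlib.sha256(payload).hexdigest() == entry["sha256"]
    return False  # no ledger entry for this record
```

Because the ledger holds only the digest, any later change to the off-chain payload makes `verify_record` return `False`, which is the integrity property the abstract attributes to storing hashes on-chain.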
AMP-GSM: Prediction of Antimicrobial Peptides via a Grouping-Scoring Approach
(MDPI, 2023) Soylemez, Ummu Gulsum; Yousef, Malik; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Due to the increasing resistance of bacteria to antibiotics, scientists began seeking new solutions against this problem. One of the most promising solutions in this field is antimicrobial peptides (AMP). To identify antimicrobial peptides, and to aid the design and production of novel antimicrobial peptides, there is a growing interest in the development of computational prediction approaches, in parallel with the studies performing wet-lab experiments. The computational approaches aim to understand what controls antimicrobial activity from the perspective of machine learning, and to uncover the biological properties that define antimicrobial activity. Throughout this study, we aim to develop a novel prediction approach that can identify peptides with high antimicrobial activity against selected target bacteria. Along this line, we propose a novel method called AMP-GSM (antimicrobial peptide-grouping-scoring-modeling). AMP-GSM includes three main components: grouping, scoring, and modeling. The grouping component creates sub-datasets via placing the physicochemical, linguistic, sequence, and structure-based features into different groups. The scoring component gives a score for each group according to its ability to distinguish whether a peptide is antimicrobial or not. As the final part of our method, the model built using the top-ranked groups is evaluated (modeling component). The method was tested for three AMP prediction datasets, and the prediction performance of AMP-GSM was comparatively evaluated with several feature selection methods and several classifiers. When we used 10 features (which are members of the physicochemical group), we obtained the highest area under curve (AUC) value for both the Gram-negative (99%) and Gram-positive (98%) datasets. AMP-GSM investigates the most significant feature groups that improve AMP prediction. 
A number of physico-chemical features from the AMP-GSM's final selection demonstrate how important these variables are in terms of defining peptide characteristics and how they should be taken into account when creating models to predict peptide activity.
Citation - WoS: 50
Citation - Scopus: 62
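The AMP-GSM abstract above outlines three components: grouping features, scoring each group, and modeling with the top-ranked groups. The sketch below illustrates only the grouping and scoring idea under loud assumptions: the scorer is a nearest-centroid classifier evaluated on the same data it was fitted on (a stand-in chosen for brevity, not the scoring used in the paper), and the toy data, group names, and function names are hypothetical.

```python
from statistics import mean


def group_score(rows, labels, columns):
    """Score one feature group by the resubstitution accuracy of a nearest-centroid
    classifier that sees only the columns belonging to that group."""
    classes = sorted(set(labels))
    centroids = {
        c: [mean(r[j] for r, y in zip(rows, labels) if y == c) for j in columns]
        for c in classes
    }

    def predict(row):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            classes,
            key=lambda c: sum((row[j] - m) ** 2 for j, m in zip(columns, centroids[c])),
        )

    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def rank_groups(rows, labels, groups):
    """Return (group name, score) pairs sorted from best to worst."""
    scores = {name: group_score(rows, labels, cols) for name, cols in groups.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: column 0 separates the two classes, column 1 does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.2], [1.0, 0.9]]
labels = [0, 0, 1, 1]
groups = {"physicochemical": [0], "sequence": [1]}  # hypothetical grouping
ranking = rank_groups(rows, labels, groups)
```

In a fuller pipeline the top entries of `ranking` would define the feature subset passed to the final classifier, and the group scores would be estimated with cross-validation rather than on the training data.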
Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data
(MDPI, 2021) Yousef, Malik; Kumar, Abhishek; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.
Citation - WoS: 17
Citation - Scopus: 27
Blockchain for Genomics and Healthcare: A Literature Review, Current Status, Classification and Open Issues
(PeerJ Inc, 2021) Dedeturk, Beyhan Adanur; Soran, Ahmet; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
The tremendous boost in the next generation sequencing technologies and in the "omics" technologies resulted in the generation of hundreds of gigabytes of data per day. Nowadays, via integrating -omics data with other data types, such as imaging and electronic health record (EHR) data, panomics studies attempt to identify novel and potentially actionable biomarkers for personalized medicine applications. In this respect, for the accurate analysis of -omics data and EHR, there is a need to establish secure and robust pipelines that take the ethical aspects into consideration and regulate privacy, ownership, and data sharing issues. These days, blockchain technology has picked up significant attention in diverse fields, including genomics, since it offers a new solution for these problems from a different perspective. Blockchain is an immutable transaction ledger, which offers a secure and distributed system without a central authority. Within the system, each transaction can be expressed with cryptographically signed blocks, and the verification of transactions is performed by the users of the network. In this review, firstly, we aim to highlight the challenges of EHR and genomic data sharing. Secondly, we attempt to answer "Why" or "Why not" the blockchain technology is suitable for genomics and healthcare applications in detail. Thirdly, we elucidate the general blockchain structure based on Ethereum, which is a more suitable technology for genomic data sharing platforms. Fourthly, we review current blockchain-based EHR and genomic data sharing platforms, evaluate the advantages and disadvantages of these applications, and classify these applications using different metrics. Finally, we conclude by discussing the open issues and introducing our suggestion on the topic. 
In summary, to facilitate the diagnosis, monitoring and therapy of diseases with the effective analysis of -omics data with other available data types, through this review, we put forward the possible implications of the blockchain technology to life sciences and healthcare.
Citation - WoS: 4
Blockchain-Based Fog Computing Applications in Healthcare
(IEEE, 2020) Adanur, Beyhan; Bakir-Gungor, Burcu; Soran, Ahmet; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Recently, the use of blockchain technology in the field of healthcare has increased. Although blockchain technology brought several innovations to healthcare, there are still problems waiting to be resolved. In order to provide alternative solutions to these problems, the use of fog computing together with blockchain technology has been proposed. In this study, the applications of blockchain based fog computing technology in healthcare are investigated. The aim of this study is to give readers an idea about the interactive use of blockchain and fog computing in the field of healthcare. For this purpose, firstly, fog computing and blockchain technologies are introduced. Afterwards, the integration of these areas and the advantages and disadvantages of using these technologies in the field of healthcare are discussed, and a new system architecture is proposed.
Breast Cancer Detection Using a New Parallel Hybrid Logistic Regression Model Trained by Particle Swarm Optimization and Clonal Selection Algorithms
(Wiley, 2025) Etcil, Mustafa; Dedeturk, Bilge Kagan; Kolukisa, Burak; Bakir-Gungor, Burcu; Gungor, Vehbi Cagri; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Breast cancer is one of the most widespread kinds of cancer, especially in women, and it has a high mortality rate. With the help of technology, it is possible to develop a computer-aided method for the diagnosis of breast cancer, which is crucial for effective treatment. Recent breast cancer diagnosis studies utilizing numerous machine learning models were efficient and innovative. However, it has been observed that they may have problems such as long training times and low accuracy rates. To this end, in this study, we present a new classifier that utilizes a hybrid of the clonal selection algorithm (CSA) and the particle swarm optimization (PSO) algorithm for the training of the logistic regression (LR) model, which is named CSA-PSO-LR. The proposed method is evaluated using two publicly accessible breast cancer datasets, that is, the Wisconsin Diagnostic Breast Cancer (WDBC) database and the Wisconsin Breast Cancer Database (WBCD), with 10-fold cross-validation and Bayesian hyperparameter optimization techniques. Additionally, a CPU parallelization method is applied, which substantially shortens the training time of the model. The efficacy of the CSA-PSO-LR classifier is compared with state-of-the-art machine learning algorithms and related studies in the literature. Performance analysis indicates that the proposed method achieves 98.75% accuracy and 98.27% F1-score on the WDBC dataset, and 97.94% accuracy and 97.35% F1-score on the WBCD dataset. These results demonstrate the potential of the proposed method as an effective approach for improving breast cancer diagnosis.
Citation - WoS: 1
Citation - Scopus: 2
Classification of Breast Cancer Molecular Subtypes With Grouping-Scoring Approach That Incorporates Disease-Disease Association Information
(IEEE, 2024) Qumsiyeh, Emma; Bakir-Gungor, Burcu; Yousef, Malik; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
This study uses modern sequencing technology and large biological databases to investigate the molecular intricacies of complicated diseases like cancer. Using gene expression databases and biomarkers, the research aims to improve breast cancer molecular subtype identification for better patient outcomes. Using the BRCA LumAB_Her2Basal dataset, this study compares an integrative machine learning-based strategy (GediNET) to traditional feature selection approaches across machine learning classifiers. GediNET excels at uncovering crucial disease-disease connections and potential biomarkers using the Grouping-Scoring-Modeling (GSM) approach, which favors gene groupings above individual genes. Our comparative analysis highlights GediNET's exceptional performance, notably in terms of accuracy and Area Under the Curve metrics, underscoring its effectiveness in uncovering the genetic intricacies of breast cancer. Beyond exceeding traditional approaches, GediNET promises to improve disease classification and biomarker identification by improving the understanding of biological mechanisms. The work shows that GediNET's integrative method can promote bioinformatics research by identifying the most informative genes associated with certain diseases, enabling focused and customized medicine.
Citation - WoS: 11
Citation - Scopus: 10
Clinical and Molecular Evaluation of MEFV Gene Variants in the Turkish Population: A Study by the National Genetics Consortium
(Springer Heidelberg, 2022) Dundar, Munis; Fahrioglu, Umut; Yildiz, Saliha Handan; Bakir-Gungor, Burcu; Temel, Sehime Gulsun; Akin, Haluk; Erdem, Levent; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Familial Mediterranean fever (FMF) is a monogenic autoinflammatory disorder with recurrent fever, abdominal pain, serositis, articular manifestations, erysipelas-like erythema, and renal complications as its main features. Caused by mutations in the MEditerranean FeVer (MEFV) gene, it mainly affects people of Mediterranean descent with a higher incidence in the Turkish, Jewish, Arabic, and Armenian populations. As our understanding of FMF improves, it becomes clearer that we are facing a more complex picture of FMF with respect to its pathogenesis, penetrance, variant type (gain-of-function vs. loss-of-function), and inheritance. In this study, MEFV gene analysis results and clinical findings of 27,504 patients from 35 universities and institutions in Turkey and Northern Cyprus are combined in an effort to provide a better insight into the genotype-phenotype correlation and how a specific variant contributes to certain clinical findings in FMF patients. Our results may help better understand this complex disease and how the genotype may sometimes contribute to phenotype. Unlike many studies in the literature, our study investigated a broader symptomatic spectrum and the relationship between the genotype and phenotype data. In this sense, we aimed to guide all clinicians and academicians who work in this field to better establish a comprehensive data set for the patients. One of the biggest messages of our study is that lack of uniformity in some clinical and demographic data of participants may become an obstacle in approaching FMF patients and understanding this complex disease.
Colorectal Cancer Prediction via Applying Recursive Cluster Elimination With Intra-Cluster Feature Elimination on Metagenomic Pathway Data
(Springer International Publishing AG, 2024) Temiz, Mustafa; Kuzudisli, Cihan; Yousef, Malik; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Advances in next-generation sequencing and in "-omics" technologies enable the characterization of the human gut microbiome. Colorectal cancer (CRC), the third most common cancer worldwide, is caused by genetic mutations, environmental influences, and abnormalities in the gut microbiota. The aim of this study is to identify pathways that influence host metabolism in CRC patients. The CRC-related metagenomic dataset used in this study contains the relative abundance values of 551 pathways calculated for 1262 samples. Here, two different approaches based on feature grouping reduce the number of features by considering relevant features as groups, eliminate irrelevant features, and perform classification. The recursive cluster elimination with intra-cluster feature elimination (RCE-IFE) approach achieves an AUC of 0.72 using an average of 66.2 features on the CRC-associated metagenomics dataset. In these experiments, the P163-PWY: L-lysine fermentation to acetate and butanoate and PWY-6151: S-adenosyl-L-methionine cycle I pathways, which are reported by both approaches, are identified as potential biomarkers associated with CRC. This study contributes to the molecular diagnosis and treatment of colorectal cancer by revealing the pathways associated with CRC. Our results are promising for the study of the gut microbiota and its role in CRC.
Correction: Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records
(Frontiers Media S.A., 2025) Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Citation - WoS: 2
CSA-DE-LR: Enhancing Cardiovascular Disease Diagnosis With a Novel Hybrid Machine Learning Approach
(PeerJ Inc, 2024) Dedeturk, Beyhan Adanur; Dedeturk, Bilge Kagan; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Cardiovascular diseases (CVD) are a leading cause of mortality globally, necessitating the development of efficient diagnostic tools. Machine learning (ML) and metaheuristic algorithms have become prevalent in addressing these challenges, providing promising solutions in medical diagnostics. However, traditional ML approaches often need to be improved in feature selection and optimization, leading to suboptimal performance in complex diagnostic tasks. To overcome these limitations, this study introduces a new hybrid method called CSA-DE-LR, which combines the clonal selection algorithm (CSA) and differential evolution (DE) with logistic regression. This integration is designed to optimize logistic regression weights efficiently for the accurate classification of CVD. The methodology employs three optimization strategies based on the F1 score, the Matthews correlation coefficient (MCC), and the mean absolute error (MAE). Extensive evaluations on benchmark datasets, namely Cleveland and Statlog, reveal that CSA-DE-LR outperforms state-of-the-art ML methods. In addition, generalization is evaluated using the Breast Cancer Wisconsin Original (WBCO) and Breast Cancer Wisconsin Diagnostic (WBCD) datasets. Significantly, the proposed model demonstrates superior efficacy compared to previous research studies in this domain. This study's findings highlight the potential of hybrid machine learning approaches for improving diagnostic accuracy, offering a significant advancement in the fields of medical data analysis and CVD diagnosis.
Citation - WoS: 5
Citation - Scopus: 3
Defect Classification of Composite Materials Using Transfer Learning Methods
(Taylor & Francis Ltd, 2025) Gulsen, Abdulkadir; Kolukisa, Burak; Ozdemir, Ahmet Turan; Bakir-Gungor, Burcu; Gungor, Vehbi Cagri; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Nowadays, composite materials have become prevalent across various sectors, particularly finding usage in large-scale applications such as spaceships, automobiles, and aircraft. The accurate detection of the defects in these materials is crucial, yet traditional methods often rely on human inspection, which is susceptible to errors. Recent advancements in machine learning have enabled defect detection using ultrasonic non-destructive testing methods. This paper introduces a new dataset named UNDT, which is obtained from the scans of 60 different composite materials, generating a total of 1150 images depicting both defective and non-defective areas. Several transfer learning methods are applied on the newly introduced UNDT dataset as well as the publicly available USimgAIST ultrasonic dataset. Comparative performance assessments illustrate the significance of utilising the transfer learning approach for defect classification on ultrasonic inspection images. Furthermore, the research emphasises the substantial benefits of employing these transfer learning methods. Notably, the DenseNet121 and VGG19 models achieve the highest accuracy rates, with 98.8% and 98.6% on the UNDT and USimgAIST datasets, respectively.
Citation - WoS: 4
Citation - Scopus: 6
The Determination of Distinctive Single Nucleotide Polymorphism Sets for the Diagnosis of Behcet's Disease
(IEEE Computer Soc, 2022) Isik, Yunus Emre; Gormez, Yasin; Aydin, Zafer; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Behcet's Disease (BD) is a multi-system inflammatory disorder in which the etiology remains unclear. The most probable hypothesis is that genetic tendency and environmental factors play roles in the development of BD. In order to find the essential reasons, genetic changes on thousands of genes should be analyzed. Besides, there is a need for extra analysis to find out which genetic factor affects the disease. Machine learning approaches have high potential for extracting the knowledge from genomics and selecting the representative Single Nucleotide Polymorphisms (SNPs) as the most effective features for the clinical diagnosis process. In this study, we have attempted to identify representative SNPs using feature selection methods, incorporating biological information and aimed to develop a machine-learning model for diagnosing Behcet's disease. By combining biological information and machine learning classifiers, up to 99.64 percent accuracy of disease prediction is achieved using only 13,611 out of 311,459 SNPs. In addition, we revealed the SNPs that are most distinctive by performing repeated feature selection in cross-validation experiments.
Developing a Label Propagation Approach for Cancer Subtype Classification Problem
(Tubitak Scientific & Technological Research Council Turkey, 2022) Guner, Pinar; Bakir-Gungor, Burcu; Coskun, Mustafa; 02. 04. Bilgisayar Mühendisliği; 01. Abdullah Gül University; 02. Mühendislik Fakültesi
Cancer is a disease in which abnormal cells grow uncontrollably and invade other tissues. Several types of cancer have various subtypes with different clinical and biological implications. Based on these differences, treatment methods need to be customized. The identification of distinct cancer subtypes is an important problem in bioinformatics, since it can guide future precision medicine applications. In order to design targeted treatments, bioinformatics methods attempt to discover common molecular pathology of different cancer subtypes. Along this line, several computational methods have been proposed to discover cancer subtypes or to stratify cancer into informative subtypes. However, existing works do not consider the sparseness of data (genes having low degrees) and result in an ill-conditioned solution. To address this shortcoming, in this paper, we propose an alternative unsupervised method to stratify cancer patients into subtypes using applied numerical algebra techniques. More specifically, we applied a label propagation based approach to stratify somatic mutation profiles of colon, head and neck, uterine, bladder, and breast tumors. We evaluated the performance of our method by comparing it to the baseline methods. Extensive experiments demonstrate that our approach performs well on tumor classification tasks, largely outperforming the state-of-the-art unsupervised and supervised approaches.
Citation - WoS: 21
Citation - Scopus: 28
Discovering Potential Taxonomic Biomarkers of Type 2 Diabetes From Human Gut Microbiota via Different Feature Selection Methods
(Frontiers Media S.A., 2021) Bakir-Gungor, Burcu; Bulut, Osman; Jabeer, Amhar; Nalbantoglu, O. Ufuk; Yousef, Malik; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Human gut microbiota is a complex community of organisms including trillions of bacteria. While these microorganisms are considered as essential regulators of our immune system, some of them can cause several diseases. In recent years, next-generation sequencing technologies accelerated the discovery of human gut microbiota. In this respect, the use of machine learning techniques became popular to analyze disease-associated metagenomics datasets. Type 2 diabetes (T2D) is a chronic disease and affects millions of people around the world. Since the early diagnosis in T2D is important for effective treatment, there is an utmost need to develop a classification technique that can accelerate T2D diagnosis. In this study, using T2D-associated metagenomics data, we aim to develop a classification model to facilitate T2D diagnosis and to discover T2D-associated biomarkers. The sequencing data of T2D patients and healthy individuals were taken from a metagenome-wide association study and categorized into disease states. The sequencing reads were assigned to taxa, and the identified species are used to train and test our model. To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization, Maximum Relevance and Minimum Redundancy, Correlation Based Feature Selection, and select K best approach. To test the performance of the classification based on the features that are selected by different methods, we used random forest classifier with 100-fold Monte Carlo cross-validation. In our experiments, we observed that 15 commonly selected features have a considerable effect in terms of minimizing the microbiota used for the diagnosis of T2D and thus reducing the time and cost. When we perform biological validation of these identified species, we found that some of them are known as related to T2D development mechanisms and we identified additional species as potential biomarkers. 
Additionally, we attempted to find the subgroups of T2D patients using k-means clustering. In summary, this study utilizes several supervised and unsupervised machine learning algorithms to increase the diagnostic accuracy of T2D, investigates potential biomarkers of T2D, and finds out which subset of microbiota is more informative than other taxa by applying state-of-the art feature selection methods.
Citation - Scopus: 1
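The abstract above evaluates its classifiers with 100-fold Monte Carlo cross-validation, that is, repeated random train/test splits. The sketch below shows only the splitting and averaging logic; the 10% test fraction, the seed, and the names `monte_carlo_splits` and `evaluate` are assumptions for illustration, and any classifier (the study uses a random forest) can be plugged in through the hypothetical `fit_predict` callback.

```python
import random


def monte_carlo_splits(n_samples, n_repeats=100, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random sub-sampling validation."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]


def evaluate(fit_predict, X, y, **split_kwargs):
    """Average test accuracy over all Monte Carlo splits.

    fit_predict(X_train, y_train, X_test) must return one predicted label per test row.
    """
    accuracies = []
    for train, test in monte_carlo_splits(len(X), **split_kwargs):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train], [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Unlike k-fold cross-validation, the test sets of different repeats may overlap, so the number of repeats can be chosen independently of the test-set size.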
Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records
(Frontiers Media S.A., 2025) Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Diabetes significantly affects millions of people worldwide, leading to substantial morbidity, disability, and mortality rates. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. In order to predict four different complications of diabetes mellitus, i.e., retinopathy, chronic kidney disease, ischemic heart disease, and amputations, this study introduces a novel feature engineering approach. While developing the classification models, we utilize the XGBoost feature selection method and various supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHR) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were the age range accompanied by chronic diseases that occur during patient visits starting from the onset of diabetes. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These findings imply that the performance of our method surpasses the performance of the traditional Bag-of-Features approach, highlighting the effectiveness of our approach in enhancing model accuracy and robustness.
Enlightening the Molecular Mechanisms of Type 2 Diabetes With a Novel Pathway Clustering and Pathway Subnetwork Approach
(Tubitak Scientific & Technological Research Council Turkey, 2022) Bakir-Gungor, Burcu; Yazici, Miray Unlu; Goy, Gokhan; Temiz, Mustafa; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi; 04. Yaşam ve Doğa Bilimleri Fakültesi; 04.01. Biyomühendislik
Type 2 diabetes mellitus (T2D) constitutes 90% of diabetes cases, and it is a complex multifactorial disease. In the last decade, genome-wide association studies (GWASs) for T2D successfully pinpointed the genetic variants (typically single nucleotide polymorphisms, SNPs) that associate with disease risk. In order to diminish the burden of multiple testing in GWAS, researchers attempted to evaluate the collective effects of interesting variants. In this regard, pathway-based analyses of GWAS became popular to discover novel multigenic functional associations. Still, to reveal the unaccounted-for 85 to 90% of T2D variation, which lies hidden in GWAS datasets, new post-GWAS strategies need to be developed. In this respect, here we reanalyze three GWAS meta-analysis datasets for T2D, using the methodology that we have developed to identify disease-associated pathways by combining nominally significant evidence of genetic association with the known biochemical pathways, protein-protein interaction (PPI) networks, and the functional information of selected SNPs. In this research effort, to enlighten the molecular mechanisms underlying T2D development and progression, we integrated different in silico approaches that proceed in a top-down manner and a bottom-up manner, and presented a comprehensive analysis at the protein subnetwork, pathway, and pathway subnetwork levels. Using the mutual information based on the shared genes, the identified protein subnetworks and the affected pathways of each dataset were compared. While most of the identified pathways recapitulate the pathophysiology of T2D, our results show that incorporating SNP functional properties and PPI networks into GWAS can dissect leading molecular pathways, and that it could offer improvement over traditional enrichment strategies.
Citation - WoS: 36
Citation - Scopus: 57
Ensemble Feature Selection and Classification Methods for Machine Learning-Based Coronary Artery Disease Diagnosis
(Elsevier, 2023) Kolukisa, Burak; Bakir-Gungor, Burcu; 01. Abdullah Gül University; 02. 04. Bilgisayar Mühendisliği; 02. Mühendislik Fakültesi
Coronary artery disease (CAD) is a condition in which the heart is not sufficiently supplied with blood as a result of the accumulation of fatty matter in the arteries. As reported by the World Health Organization, around 32% of the total deaths in the world are caused by CAD, and it is estimated that approximately 23.6 million people will die from this disease in 2030. CAD develops over time, and the disease is difficult to diagnose until a blockage or a heart attack occurs. In order to avoid the side effects and high costs of the current methods, researchers have proposed diagnosing CAD with computer-aided systems, which analyze physical and biochemical values at a lower cost. In this study, for CAD diagnosis, (i) seven different computational feature selection (FS) methods, one domain knowledge-based FS method, and different classification algorithms have been evaluated; (ii) an exhaustive ensemble FS method and a probabilistic ensemble FS method have been proposed. The proposed approach is tested on three publicly available CAD data sets using six different classification algorithms and four different variants of voting algorithms. The performance metrics have been comparatively evaluated with numerous combinations of classifiers and FS methods. The multi-layer perceptron classifier obtained satisfactory results on all three data sets. Performance evaluations show that the proposed approach resulted in 91.78%, 85.55%, and 85.47% accuracy for the Z-Alizadeh Sani, Statlog, and Cleveland data sets, respectively.
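To give a rough sense of how an ensemble of feature selection methods can be combined, the following minimal sketch keeps the features chosen by at least a given number of selectors. This is a simple vote-count scheme shown for illustration only; it is not the exhaustive or probabilistic ensemble FS method proposed in the paper, and the feature names and selector outputs are hypothetical.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` of the selectors.

    `selections` holds one list of feature names per FS method.
    """
    # set() ensures each selector votes at most once per feature.
    counts = Counter(f for selected in selections for f in set(selected))
    return sorted(f for f, c in counts.items() if c >= min_votes)


# Hypothetical outputs of three feature selection methods.
selections = [
    ["age", "chest_pain", "cholesterol"],
    ["age", "chest_pain", "max_heart_rate"],
    ["chest_pain", "cholesterol", "resting_bp"],
]

majority = vote_features(selections, min_votes=2)
unanimous = vote_features(selections, min_votes=3)
```

Raising `min_votes` trades coverage for agreement: a majority vote keeps features that several selectors agree on, while a unanimous vote keeps only those chosen by every selector.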