Browsing by Author "Yousef, Malik"
Now showing 1 - 20 of 51
Article
3Mont: A Multi-Omics Integrative Tool for Breast Cancer Subtype Stratification (Public Library of Science, 2025)
Unlu Yazici, Miray; Marron, J. S.; Bakir-Gungor, Burcu; Zou, Fei; Yousef, Malik
Breast Cancer (BRCA) is a heterogeneous disease, and it is one of the most prevalent cancer types among women. Developing effective treatment strategies that address diverse types of BRCA is crucial. Notably, among different BRCA molecular sub-types, Hormone Receptor negative (HR-) BRCA cases, especially Basal-like BRCA sub-types, lack estrogen and progesterone hormone receptors and exhibit a higher tumor growth rate compared to HR+ cases. Improving survival time and predicting prognosis for distinct molecular profiles is of substantial importance. In this study, we propose a novel approach called 3-Multi-Omics Network and Integration Tool (3Mont), which integrates various -omics data by applying a grouping function, detecting pro-groups, and assigning scores to each pro-group using a Feature Importance Scoring (FIS) component. Following that, machine learning (ML) models are constructed based on the prominent pro-groups, which enable the extraction of promising biomarkers for distinguishing BRCA sub-types. Our tool allows users to analyze the collective behavior of features in each pro-group (biological group) utilizing ML algorithms. In addition, by constructing the pro-groups and equalizing the feature numbers in each pro-group using the FIS component, this process achieves a significant 20% speedup over the 3Mint tool. Contrary to conventional methods, 3Mont generates networks that illustrate the interplay of the prominent biomarkers of different -omics data. Accordingly, exploring the concerted actions of features in pro-groups facilitates understanding the dynamics of the biomarkers within the generated networks and developing effective strategies for better cancer sub-type stratification.
The 3Mont tool, along with all supporting materials, can be found at https://github.com/malikyousef/3Mont.git.

Article
Citation - WoS: 9; Citation - Scopus: 12
AMP-GSM: Prediction of Antimicrobial Peptides via a Grouping-Scoring Approach (MDPI, 2023)
Soylemez, Ummu Gulsum; Yousef, Malik; Bakir-Gungor, Burcu
Due to the increasing resistance of bacteria to antibiotics, scientists began seeking new solutions to this problem. Among the most promising solutions in this field are antimicrobial peptides (AMPs). To identify antimicrobial peptides, and to aid the design and production of novel antimicrobial peptides, there is a growing interest in the development of computational prediction approaches, in parallel with the studies performing wet-lab experiments. The computational approaches aim to understand what controls antimicrobial activity from the perspective of machine learning, and to uncover the biological properties that define antimicrobial activity. Throughout this study, we aim to develop a novel prediction approach that can identify peptides with high antimicrobial activity against selected target bacteria. Along this line, we propose a novel method called AMP-GSM (antimicrobial peptide-grouping-scoring-modeling). AMP-GSM includes three main components: grouping, scoring, and modeling. The grouping component creates sub-datasets by placing the physicochemical, linguistic, sequence, and structure-based features into different groups. The scoring component gives each group a score according to its ability to distinguish whether a peptide is antimicrobial or not. As the final part of our method, the model built using the top-ranked groups is evaluated (modeling component). The method was tested on three AMP prediction datasets, and the prediction performance of AMP-GSM was comparatively evaluated with several feature selection methods and several classifiers.
When we used 10 features (which are members of the physicochemical group), we obtained the highest area under curve (AUC) value for both the Gram-negative (99%) and Gram-positive (98%) datasets. AMP-GSM investigates the most significant feature groups that improve AMP prediction. A number of physicochemical features from AMP-GSM's final selection demonstrate how important these variables are in defining peptide characteristics and how they should be taken into account when creating models to predict peptide activity.

Article
Citation - WoS: 52; Citation - Scopus: 64
Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data (MDPI, 2021)
Yousef, Malik; Kumar, Abhishek; Bakir-Gungor, Burcu
In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This task corresponds to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets.
One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to shed light on disease state dynamics and the mechanisms of disease onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to encourage the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.

Article
Citation - Scopus: 4
CCPred: Global and Population-Specific Colorectal Cancer Prediction and Metagenomic Biomarker Identification at Different Molecular Levels Using Machine Learning Techniques (Elsevier Ltd, 2024)
Bakir-Güngör, Burcu; Temiz, Mustafa; Inal, Yasin; Cicekyurt, Emre; Yousef, Malik
Colorectal cancer (CRC) ranks as the third most common cancer globally and the second leading cause of cancer-related deaths. Recent research highlights the pivotal role of the gut microbiota in CRC development and progression. Understanding the complex interplay between disease development and metagenomic data is essential for CRC diagnosis and treatment. Current computational models employ machine learning to identify metagenomic biomarkers associated with CRC, yet there is a need to improve their accuracy through a holistic biological knowledge perspective. This study aims to evaluate CRC-associated metagenomic data at species, enzyme, and pathway levels by conducting global and population-specific analyses.
These analyses utilize relative abundance values from human gut microbiome sequencing data, and robust classification models are built for disease prediction and biomarker identification. For global CRC prediction and biomarker identification, the features that are identified by SelectKBest (SKB), Information Gain (IG), and Extreme Gradient Boosting (XGBoost) methods are combined. Population-based analysis includes within-population, leave-one-dataset-out (LODO) and cross-population approaches. Four classification algorithms are employed for CRC classification. Random Forest achieved an AUC of 0.83 for species data, 0.78 for enzyme data and 0.76 for pathway data globally. On the global scale, potential taxonomic biomarkers include Ruthenibacterium lactatiformans; enzyme biomarkers include RNA 2′ 3′ cyclic 3′ phosphodiesterase; and pathway biomarkers include the pyruvate fermentation to acetone pathway. This study underscores the potential of machine learning models trained on metagenomic data for improved disease prediction and biomarker discovery. The proposed model and associated files are available at https://github.com/TemizMus/CCPRED. © 2024 Elsevier B.V., All rights reserved.

Conference Object
Citation - WoS: 2; Citation - Scopus: 2
Classification of Breast Cancer Molecular Subtypes With Grouping-Scoring Approach That Incorporates Disease-Disease Association Information (IEEE, 2024)
Qumsiyeh, Emma; Bakir-Gungor, Burcu; Yousef, Malik
This study uses modern sequencing technology and large biological databases to investigate the molecular intricacies of complicated diseases like cancer. Using gene expression databases and biomarkers, the research aims to improve breast cancer molecular subtype identification for better patient outcomes. Using the BRCA LumAB_Her2Basal dataset, this study compares an integrative machine learning-based strategy (GediNET) to traditional feature selection approaches across machine learning classifiers.
GediNET excels at uncovering crucial disease-disease connections and potential biomarkers using the Grouping-Scoring-Modeling (GSM) approach, which favors gene groupings over individual genes. Our comparative analysis highlights GediNET's exceptional performance, notably in terms of accuracy and Area Under the Curve metrics, underscoring its effectiveness in uncovering the genetic intricacies of breast cancer. Beyond outperforming traditional approaches, GediNET promises to improve disease classification and biomarker identification by deepening the understanding of biological mechanisms. The work shows that GediNET's integrative method can promote bioinformatics research by identifying the most informative genes associated with certain diseases, enabling focused and customized medicine.

Conference Object
Colorectal Cancer Prediction via Applying Recursive Cluster Elimination With Intra-Cluster Feature Elimination on Metagenomic Pathway Data (Springer International Publishing AG, 2024)
Temiz, Mustafa; Kuzudisli, Cihan; Yousef, Malik; Bakir-Gungor, Burcu
Advances in next-generation sequencing and in "-omics" technologies enable the characterization of the human gut microbiome. Colorectal cancer (CRC), the third most common cancer worldwide, is caused by genetic mutations, environmental influences, and abnormalities in the gut microbiota. The aim of this study is to identify pathways that influence host metabolism in CRC patients. The CRC-related metagenomic dataset used in this study contains the relative abundance values of 551 pathways calculated for 1262 samples. Here, two different approaches based on feature grouping reduce the number of features by considering relevant features as groups, eliminate irrelevant features, and perform classification. The recursive cluster elimination with intra-cluster feature elimination (RCE-IFE) approach achieves an AUC of 0.72 using an average of 66.2 features on the CRC-associated metagenomics dataset.
In these experiments, the P163-PWY: L-lysine fermentation to acetate and butanoate and PWY-6151: S-adenosyl-L-methionine cycle I pathways are reported by both approaches and are therefore considered potential CRC-associated biomarkers. This study contributes to the molecular diagnosis and treatment of colorectal cancer by revealing the pathways associated with CRC. Our results are promising for the study of the gut microbiota and its role in CRC.

Book Part
Citation - WoS: 19; Citation - Scopus: 26
Computational Prediction of Functional MicroRNA-mRNA Interactions (Humana Press Inc, 2019)
Demirci, Muserref Duygu Sacar; Yousef, Malik; Allmer, Jens
Proteins have a strong influence on the phenotype and their aberrant expression leads to diseases. MicroRNAs (miRNAs) are short RNA sequences which posttranscriptionally regulate protein expression. This regulation is driven by miRNAs acting as recognition sequences for their target mRNAs within a larger regulatory machinery. A miRNA can have many target mRNAs and an mRNA can be targeted by many miRNAs, which makes it difficult to experimentally discover all miRNA-mRNA interactions. Therefore, computational methods have been developed for miRNA detection and miRNA target prediction. An abundance of available computational tools makes selection difficult. Additionally, interactions are not currently the focus of investigation although they define the regulation more accurately than pre-miRNA detection or target prediction could alone. We define an interaction including the miRNA source and the mRNA target. We present computational methods allowing the investigation of these interactions as well as how they can be used to extend regulatory pathways.
Finally, we present a list of points that should be taken into account when investigating miRNA-mRNA interactions. In the future, this may lead to better understanding of functional interactions which may pave the way for disease marker discovery and design of miRNA-based drugs.

Correction
Correction: Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records (Frontiers Media S.A., 2025)
Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik

Article
Citation - WoS: 23; Citation - Scopus: 28
Discovering Potential Taxonomic Biomarkers of Type 2 Diabetes From Human Gut Microbiota via Different Feature Selection Methods (Frontiers Media S.A., 2021)
Bakir-Gungor, Burcu; Bulut, Osman; Jabeer, Amhar; Nalbantoglu, O. Ufuk; Yousef, Malik
Human gut microbiota is a complex community of organisms including trillions of bacteria. While these microorganisms are considered essential regulators of our immune system, some of them can cause several diseases. In recent years, next-generation sequencing technologies accelerated the discovery of human gut microbiota. In this respect, the use of machine learning techniques became popular to analyze disease-associated metagenomics datasets. Type 2 diabetes (T2D) is a chronic disease and affects millions of people around the world. Since early diagnosis of T2D is important for effective treatment, there is an utmost need to develop a classification technique that can accelerate T2D diagnosis. In this study, using T2D-associated metagenomics data, we aim to develop a classification model to facilitate T2D diagnosis and to discover T2D-associated biomarkers. The sequencing data of T2D patients and healthy individuals were taken from a metagenome-wide association study and categorized into disease states. The sequencing reads were assigned to taxa, and the identified species were used to train and test our model.
To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization, Maximum Relevance and Minimum Redundancy, Correlation Based Feature Selection, and the Select K Best approach. To test the performance of the classification based on the features that are selected by different methods, we used a random forest classifier with 100-fold Monte Carlo cross-validation. In our experiments, we observed that 15 commonly selected features have a considerable effect in terms of minimizing the microbiota used for the diagnosis of T2D and thus reducing the time and cost. When we performed biological validation of these identified species, we found that some of them are known to be related to T2D development mechanisms and we identified additional species as potential biomarkers. Additionally, we attempted to find the subgroups of T2D patients using k-means clustering. In summary, this study utilizes several supervised and unsupervised machine learning algorithms to increase the diagnostic accuracy of T2D, investigates potential biomarkers of T2D, and finds out which subset of microbiota is more informative than other taxa by applying state-of-the-art feature selection methods.

Conference Object
The Effect of Different Classifiers on Recursive Cluster Elimination in the Analysis of Transcriptomic Data (Institute of Electrical and Electronics Engineers Inc., 2023)
Bulut, Nurten; Bakir-Güngör, Burcu; Qaqish, Bahjat F.; Yousef, Malik
Gene expression data with limited sample size and a large number of genes are frequently encountered in genetic studies. In such high-dimensional data, identification of genes that distinguish between disease states is a challenging task. Feature selection (FS) is a useful approach in dealing with high dimensionality. Support Vector Machines Recursive Cluster Elimination (SVM-RCE) is a technique for FS in high-dimensional data.
The SVM-RCE approach has been utilized for identification of clusters of genes whose expression levels correlate with pathological state. A key step in SVM-RCE is the use of an SVM classifier to assign an area under the curve (AUC) score to each gene cluster based on its ability to predict class labels. In this study, we investigate the use of alternative classifiers in the cluster-scoring step. Specifically, we compare Support Vector Machines, Random Forest, XGBoost, Naive Bayes, and linear logistic regression. In addition to AUC score performance evaluation, the algorithms are compared in terms of the number of selected genes at different levels of clustering and in terms of the running time.

Conference Object
Citation - Scopus: 2
Effect of Recursive Cluster Elimination With Different Clustering Algorithms Applied to Gene Expression Data (Institute of Electrical and Electronics Engineers Inc., 2023)
Kuzudisli, Cihan; Bakir-Güngör, Burcu; Qaqish, Bahjat F.; Yousef, Malik
Feature selection (FS) is an effective tool in dealing with high dimensionality and reducing computational cost. Support Vector Machines-Recursive Cluster Elimination (SVM-RCE) is one of several algorithms that have been developed for FS in high dimensional data. SVM-RCE involves a clustering step, which originally uses k-means. Using various performance metrics, three alternative algorithms are evaluated in this context: k-medoids, Hierarchical Clustering (HC), and Gaussian Mixture Model (GMM). Comparisons are carried out on five publicly available gene expression datasets. The results show that k-means in SVM-RCE obtains higher classification performance than the other tested algorithms. Additionally, HC shows a similar performance to k-means. Our findings show the superiority of using k-means.
This study can contribute to the development of SVM-RCE with different variations, leading to a decrease in the number of selected genes and an increase in prediction performance.

Article
Citation - WoS: 1; Citation - Scopus: 2
Engineering Novel Features for Diabetes Complication Prediction Using Synthetic Electronic Health Records (Frontiers Media S.A., 2025)
Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik
Diabetes significantly affects millions of people worldwide, leading to substantial morbidity, disability, and mortality rates. Predicting diabetes-related complications from health records is crucial for early prevention and for the development of effective treatment plans. In order to predict four different complications of diabetes mellitus, i.e., retinopathy, chronic kidney disease, ischemic heart disease, and amputations, this study introduces a novel feature engineering approach. While developing the classification models, we utilize the XGBoost feature selection method and various supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. These models were trained on synthetic electronic health records (EHR) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The variables considered in the models were the age range accompanied by chronic diseases that occur during patient visits starting from the onset of diabetes. Throughout the experiments, XGBoost and Random Forest demonstrated the best overall prediction performance. The final models, which are tailored to each complication and trained using our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% using cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%.
These findings imply that the performance of our method surpasses the performance of the traditional Bag-of-Features approach, highlighting the effectiveness of our approach in enhancing model accuracy and robustness.

Conference Object
Enhancing Complex Disease Group Scoring with miRGediNET: A Multi-Algorithm Machine Learning Framework Based on the GSM Approach (IEEE, 2025)
Qumsiyeh, Emma; Bakir-Gungor, Burcu; Yousef, Malik
Integrating biological prior knowledge for disease gene associations has shown significant promise in discovering new biomarkers with potential translational applications. This work investigates the application of a multi-algorithm machine learning framework based on the Grouping-Scoring-Modeling (G-S-M) approach for improving the prediction of complex diseases. The study identifies the primary gene and miRNA interactions in various complex diseases with the help of miRGediNET, which is a machine-learning based tool that integrates data from three biological databases. Traditional methods have only focused on independence between features, whereas the G-S-M method focuses on aggregating genes based on biological interactions, scoring gene groups for a disease, and modeling their predictive capability using advanced machine learning algorithms. In this research paper, seven algorithms, including Support Vector Machine, Decision Tree, and CatBoost, were applied to eight datasets extracted from the GEO database. This framework proved very robust in ranking gene clusters, and thus in predicting critical biomarkers, under 100-fold randomized cross-validation.
The results indicate this approach's high potential for refining disease and supporting research for choosing the best algorithm that can provide biological insights and computational advances.

Conference Object
Enhancing Gene Expression Data Analysis Through SVM-Based Recursive Cluster Elimination and Weighted Center Approaches (Avestia Publishing, 2024)
Yousef, Malik; Bulut, Nurten; Gungor, Burcu Bakir; Qaqish, Bahjat F.
The complexity and high dimensionality of gene expression data pose significant challenges for effective feature selection and accurate classification in bioinformatics. This study introduces two novel algorithms, Support Vector Machine-Recursive Cluster Elimination (SVM-RCE) and its advanced version, SVM-RCE with Center Weights (SVM-RCE-CW), designed to optimize feature selection by leveraging clustering techniques and machine learning models. Both algorithms aim to reduce the feature space, thereby enhancing the interpretability and performance of classification models. We present a comprehensive comparison of these methods against traditional feature selection techniques, demonstrating their efficacy in achieving significant dimensionality reduction while maintaining or improving classification accuracy in several gene expression datasets.

Conference Object
Exploring Microbiome Signatures in Autism Spectrum Disorder via Grouping-Scoring Based Machine Learning (IEEE, 2025)
Temiz, Mustafa; Ersoz, Nur Sebnem; Yousef, Malik; Bakir-Gungor, Burcu
The rapid increase in omic data production has increased the importance of machine learning (ML) methods to analyze these data. In particular, the use of metagenomic data in the diagnosis, prognosis and treatment of diseases is becoming widespread. Autism Spectrum Disorder (ASD) is a neurodevelopmental disease that occurs in early childhood and continues lifelong.
The aim of this study is to increase ML performance, reduce computational costs and achieve successful classification performance using a small number of metagenomic features. In addition, disease prediction is performed and ASD-associated biomarkers are determined using microBiomeGSM on metagenomic data. Classification is performed at three different taxonomic levels (genus, family and order) using the relative abundance values of species. The best performance metric (0.95 AUC) was obtained at the order taxonomic level using an average of 416 features with microBiomeGSM. The identified ASD-related taxonomic species are presented.

Article
G-S a Prior Biological Knowledge-Based Pattern Detection and Enrichment Framework for Multi-Omics Data Integration (MDPI, 2025)
Unlu Yazici, Miray; Bakir-Gungor, Burcu; Yousef, Malik
The rapid advancements in high-throughput technologies have led to a dramatic increase in diverse -omics data types, enabling comprehensive analyses, especially for complex diseases like cancer. Despite the development of multi-omics approaches, the challenges of scaling integration to massive, heterogeneous -omics datasets suggest that novel computational tools need to be designed. In this study, we propose an approach for integrating microRNA (miRNA) and messenger RNA (mRNA) expression data, incorporating prior biological knowledge (PBK). This approach scores and ranks groups of miRNAs and their associated genes using cross-validation iterations. The proposed method incorporates a Pattern detection (P) component to identify molecular motifs unique to each biological group. The analysis also supports the visualization of the groups, enabling the identification of co-occurring groups and their characteristic features across iterations. Furthermore, the groups are scored using an over-representation analysis through a new Enrichment (E) component in each iteration.
The clusters of the groups based on the Enrichment Scores (ESs) are visualized in a heatmap to obtain novel insights into the collective behavior and dependencies of the groups, aiming to understand the molecular mechanisms of complex diseases. The developed G-S-M-E tool not only provides performance metrics and biological scores at the group level but also offers comprehensive insights into intricate multi-omics interactions. In summary, our study emphasizes the importance of mathematical and data science methodologies in elucidating intricate multi-omics integration, yielding a formalized approach that deepens our comprehension of complex diseases.

Article
Citation - WoS: 16; Citation - Scopus: 21
GeNetOntology: Identifying Affected Gene Ontology Terms via Grouping, Scoring, and Modeling of Gene Expression Data Utilizing Biological Knowledge-Based Machine Learning (Frontiers Media S.A., 2023)
Ersoz, Nur Sebnem; Bakir-Gungor, Burcu; Yousef, Malik
Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources.
Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.

Master Thesis
Gruplama Puanlama Modelleme (G-S-M) ve Geleneksel Özellik Seçim Yaklaşımını Kullanarak İnsan Gastrointestinal Kanser Mikrobiyotalarındaki Potansiyel Taksonomik Biyobelirteçlerin Belirlenmesi [Identification of Potential Taxonomic Biomarkers in Human Gastrointestinal Cancer Microbiota Using Grouping-Scoring-Modeling (G-S-M) and Traditional Feature Selection Approaches] (2025)
Çanakcımaksutoğlu, Beyza; Güngör, Burcu; Yousef, Malik
The analysis of microbial abundance values holds potential for cancer prediction. This study aims to identify microbial biomarkers shared among gastrointestinal (GI) cancer patients using both tissue and blood samples, an area that has not previously been examined in parallel. The study analyzed blood and tissue samples, focusing on head and neck, esophageal, stomach, colon, and colorectal cancers. The TCMA dataset was built from The Cancer Genome Atlas, which collects tissue and blood samples from cancer patients, by performing decontamination steps, processing non-human genetic sequences, and determining species-level microorganisms and their abundances. Traditional feature selection algorithms (CMIM, mRMR, FCBF, IG, XGB, and SKB) narrowed the high-dimensional feature space. Classification performance was evaluated using a Random Forest with 100-fold Monte Carlo cross-validation. In addition, the MicrobiomeGSM model, built to reduce feature dimensionality and prediction time through a grouping method, was trained on samples derived from both blood and tissue, and its generalizability was demonstrated. The performance of the traditional feature selection methods and the biological-data-based MicrobiomeGSM models was compared. In the future, the shared biomarker candidates may help physicians assess the likelihood of metastasis, so that treatment paths can be decided accordingly.

Doctoral Thesis
Hastalık Tahmini ve Biyobelirteçlerin Tespiti için Makine Öğrenim Modellerinin Tasarımı ve Geliştirilmesi [Design and Development of Machine Learning Models for Disease Prediction and Biomarker Detection] (Abdullah Gül Üniversitesi, Fen Bilimleri Enstitüsü, 2024)
Temiz, Mustafa; Güngör, Burcu; Yousef, Malik
In medical science, the prediction of diseases and the identification of biomarkers play an important role in the diagnosis and treatment of various health conditions. The recent proliferation of data mining techniques has accelerated the development of disease prediction systems. In particular, machine learning methods are an effective way to analyze medical data and identify patterns to predict the likelihood of disease development. Machine learning methods also help to identify biomarkers.
Recently, the increasing incidence and mortality rates of inflammatory bowel disease, colorectal cancer and type 2 diabetes have drawn researchers' attention to these research areas. The aim of this thesis is to reduce the number of features and improve the prediction performance of machine learning based on complex biological datasets with a large number of disease-related features, as well as to identify potential biomarkers. In this thesis, three different studies are presented. The first study predicts eleven different cancer subgroups using miRNA data and biological domain knowledge and identifies potential biomarkers for these diseases. The second study predicts three different diseases using metagenomic data and biological domain knowledge and identifies potential biomarkers. The third study uses metagenomic data related to colorectal cancer to conduct global and population-based comprehensive experiments with traditional feature selection methods to identify potential biomarkers. This thesis presents a promising avenue for early disease detection, facilitating expedited treatment protocols, improving human survival rates, and potentially alleviating economic burdens within these critical research domains.

Conference Object
Citation - Scopus: 5
Identifying Taxonomic Biomarkers of Colorectal Cancer in Human Intestinal Microbiota Using Multiple Feature Selection Methods (Institute of Electrical and Electronics Engineers Inc., 2022)
Jabeer, Amhar; Kocak, Aysegul; Akkaş, Huseyin; Yenisert, Ferhan; Nalbantoğlu, Özkan Ufuk; Yousef, Malik; Bakir-Güngör, Burcu
A variety of bacterial species called gut microbiota work together to maintain a steady intestinal environment. The gastrointestinal tract contains a tremendous number of different species including archaea, bacteria, fungi, and viruses.
While these organisms are crucial immune system stabilizers, the dysbiosis of the intestinal flora has been related to gastrointestinal disorders including colorectal cancer (CRC), intestinal cancer, irritable bowel syndrome and inflammatory bowel disease. In the last decade, next-generation sequencing (NGS) methods have accelerated the identification of human gut flora. CRC is a deadly condition that has been on the rise in the last century, affecting half a million people each year. Since early CRC diagnosis is critical for an effective treatment, there is an immediate requirement for a classification system that can expedite CRC diagnosis. In this study, by analyzing the available metagenomics data on CRC, we aim to facilitate CRC diagnosis by finding biomarkers linked with CRC and by building a classification model. We have obtained the metagenomic sequencing data of the healthy individuals and CRC patients from a metagenome-wide association analysis and we have classified this data according to the disease stages. Conditional Mutual Information Maximization (CMIM), Fast Correlation Based Filter (FCBF), Extreme Gradient Boosting (XGBoost), Minimum Redundancy Maximum Relevance (mRMR), Information Gain (IG) and Select K Best (SKB) feature selection algorithms were utilized to cope with the complexity of the features. We observed that the SKB, IG, and XGBoost techniques made significant contributions to decreasing the microbiota in use for CRC diagnosis, thereby reducing cost and time. We observed that our Random Forest classifier outperformed AdaBoost, Support Vector Machine, Decision Tree, LogitBoost and stacking ensemble classifiers in terms of CRC classification performance. Our results reiterated some known and some potential microbiome-associated mechanisms in CRC, which could aid the design of new diagnostics based on the microbiome.

