Building a Challenging Medical Dataset for Comparative Evaluation of Classifier Capabilities

dc.contributor.author Bozkurt, Berat
dc.contributor.author Coskun, Kerem
dc.contributor.author Bakal, Gokhan
dc.date.accessioned 2025-09-25T10:42:01Z
dc.date.available 2025-09-25T10:42:01Z
dc.date.issued 2024
dc.description.abstract Since the 2000s, digitalization has been a crucial transformation in our lives. It has also produced large volumes of unstructured textual data to be processed, including articles, clinical records, web pages, and social media posts. Text classification, a critical analysis task, assigns textual entities to their correct categories. Categorizing documents from different domains is straightforward since the instances are unlikely to contain similar contexts. However, document classification within a single domain is more complicated because the documents share the same context. Thus, we aim to classify medical articles about four common cancer types (Leukemia, Non-Hodgkin Lymphoma, Bladder Cancer, and Thyroid Cancer) by constructing machine learning and deep learning models. We used 383,914 medical articles about these cancer types, collected through the PubMed API. To build classification models, we split the dataset into 70% for training, 20% for testing, and 10% for validation. We built widely used machine-learning models (Logistic Regression, XGBoost, CatBoost, and Random Forest classifiers) and modern deep-learning models: convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). To evaluate the models, we averaged their classification performance (precision, recall, F-score) over ten distinct dataset splits. The best-performing deep learning model(s) yielded a superior F1 score of 98%. However, traditional machine learning models also achieved reasonably high F1 scores, with 95% in the worst-performing case. Ultimately, we constructed multiple models to classify articles that compose a hard-to-classify dataset in the medical domain. © 2024 Elsevier B.V., All rights reserved. en_US
dc.identifier.doi 10.1016/j.compbiomed.2024.108721
dc.identifier.issn 1879-0534
dc.identifier.issn 0010-4825
dc.identifier.scopus 2-s2.0-85196297908
dc.identifier.uri https://doi.org/10.1016/j.compbiomed.2024.108721
dc.identifier.uri https://hdl.handle.net/20.500.12573/3398
dc.language.iso en en_US
dc.publisher Elsevier Ltd en_US
dc.relation.ispartof Computers in Biology and Medicine en_US
dc.rights info:eu-repo/semantics/closedAccess en_US
dc.subject Classification en_US
dc.subject Deep Learning en_US
dc.subject Machine Learning en_US
dc.subject Text Mining en_US
dc.subject Convolutional Neural Networks en_US
dc.subject Diseases en_US
dc.subject Information Retrieval Systems en_US
dc.subject Learning Systems en_US
dc.subject Logistic Regression en_US
dc.subject Long Short-Term Memory en_US
dc.subject Statistical Tests en_US
dc.subject Text Processing en_US
dc.subject Websites en_US
dc.subject Clinical Records en_US
dc.subject Comparative Evaluations en_US
dc.subject F1 Scores en_US
dc.subject Learning Models en_US
dc.subject Machine-Learning en_US
dc.subject Medical Data Sets en_US
dc.subject Text-Mining en_US
dc.subject Textual Data en_US
dc.subject Web-Page en_US
dc.subject Classification (Of Information) en_US
dc.subject Article en_US
dc.subject Bladder Cancer en_US
dc.subject Classifier en_US
dc.subject Convolutional Neural Network en_US
dc.subject Leukemia en_US
dc.subject Logistic Regression Analysis en_US
dc.subject Non-Hodgkin Lymphoma en_US
dc.subject Random Forest en_US
dc.subject Short Term Memory en_US
dc.subject Social Media en_US
dc.subject Thyroid Cancer en_US
dc.subject Artificial Neural Network en_US
dc.subject Factual Database en_US
dc.subject Human en_US
dc.subject Neoplasm en_US
dc.subject Databases, Factual en_US
dc.subject Humans en_US
dc.subject Neoplasms en_US
dc.subject Neural Networks, Computer en_US
dc.title Building a Challenging Medical Dataset for Comparative Evaluation of Classifier Capabilities en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.scopusid 59634797900
gdc.author.scopusid 59177226100
gdc.author.scopusid 57074041500
gdc.bip.impulseclass C4
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access metadata only access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department Abdullah Gül University en_US
gdc.description.departmenttemp [Bozkurt] Berat, Department of Computer Engineering, Abdullah Gül Üniversitesi, Kayseri, Turkey; [Coskun] Kerem, Department of Computer Engineering, Abdullah Gül Üniversitesi, Kayseri, Turkey; [Bakal] Gokhan, Department of Computer Engineering, Abdullah Gül Üniversitesi, Kayseri, Turkey en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q1
gdc.description.startpage 108721
gdc.description.volume 178 en_US
gdc.description.wosquality Q1
gdc.identifier.openalex W4399800545
gdc.identifier.pmid 38901188
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.diamondjournal false
gdc.oaire.impulse 6.0
gdc.oaire.influence 2.974093E-9
gdc.oaire.isgreen false
gdc.oaire.keywords Machine Learning
gdc.oaire.keywords Deep Learning
gdc.oaire.keywords Databases, Factual
gdc.oaire.keywords Neoplasms
gdc.oaire.keywords Humans
gdc.oaire.keywords Neural Networks, Computer
gdc.oaire.popularity 7.03532E-9
gdc.oaire.publicfunded false
gdc.openalex.collaboration National
gdc.openalex.fwci 10.07365439
gdc.openalex.normalizedpercentile 0.96
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 0
gdc.plumx.mendeley 14
gdc.plumx.pubmedcites 1
gdc.plumx.scopuscites 5
gdc.scopus.citedcount 6
gdc.virtual.author Bakal, Mehmet Gökhan
relation.isAuthorOfPublication 53ed538c-20d9-45c8-af59-7fa4d1b90cf7
relation.isAuthorOfPublication.latestForDiscovery 53ed538c-20d9-45c8-af59-7fa4d1b90cf7
relation.isOrgUnitOfPublication 665d3039-05f8-4a25-9a3c-b9550bffecef
relation.isOrgUnitOfPublication 52f507ab-f278-4a1f-824c-44da2a86bd51
relation.isOrgUnitOfPublication ef13a800-4c99-4124-81e0-3e25b33c0c2b
relation.isOrgUnitOfPublication.latestForDiscovery 665d3039-05f8-4a25-9a3c-b9550bffecef
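
The evaluation protocol summarized in the abstract (a 70% / 20% / 10% train / test / validation split, with precision, recall, and F-score averaged over ten distinct splits) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The record does not state how the splits were drawn or which averaging scheme was used, so the seeded shuffling, the macro averaging, and the `train_and_predict` callable are all assumptions introduced here.

```python
import random
from statistics import mean

# The four classes named in the abstract.
LABELS = ["Leukemia", "Non-Hodgkin Lymphoma", "Bladder Cancer", "Thyroid Cancer"]


def split_70_20_10(items, seed):
    """Shuffle items and split them 70% train / 20% test / 10% validation.

    Assumption: a seeded random shuffle; the record does not say how
    the ten distinct splits were generated.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


def macro_prf(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1).

    Assumption: macro averaging; the abstract does not name the scheme.
    """
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def average_over_splits(items, train_and_predict, n_splits=10):
    """Average test-set metrics over n_splits differently seeded splits.

    items: list of (text, label) pairs.
    train_and_predict: hypothetical callable taking (train, validation)
    and returning a function that maps a text to a predicted label.
    """
    scores = []
    for seed in range(n_splits):
        train, test, validation = split_70_20_10(items, seed)
        predict = train_and_predict(train, validation)
        y_true = [label for _, label in test]
        y_pred = [predict(text) for text, _ in test]
        scores.append(macro_prf(y_true, y_pred, LABELS))
    # Column-wise mean: (avg precision, avg recall, avg F1).
    return tuple(mean(column) for column in zip(*scores))
```

Any of the model families listed in the abstract could be wrapped as `train_and_predict`; the sketch fixes only the splitting and averaging steps so that different classifiers would be compared on the same ten splits.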
