Building a Challenging Medical Dataset for Comparative Evaluation of Classifier Capabilities

Date

2024

Publisher

Elsevier Ltd

Green Open Access

No

Publicly Funded

No

Impulse

Top 10%

Influence

Average

Popularity

Top 10%

Abstract

Since the 2000s, digitalization has transformed daily life. It has also produced large volumes of unstructured text that must be processed, including articles, clinical records, web pages, and social media posts. Text classification, a core analysis task, assigns each document to its correct category. Categorizing documents from different domains is comparatively straightforward because the instances are unlikely to share context. Classifying documents within a single domain is harder because the documents share the same context. We therefore classify medical articles about four common cancer types (Leukemia, Non-Hodgkin Lymphoma, Bladder Cancer, and Thyroid Cancer) using machine learning and deep learning models. The dataset comprises 383,914 medical articles on these cancer types, collected through the PubMed API. To build the classification models, we split the dataset into 70% for training, 20% for testing, and 10% for validation. We built widely used machine learning models (Logistic Regression, XGBoost, CatBoost, and Random Forest classifiers) and modern deep learning models (convolutional neural networks, CNN; long short-term memory, LSTM; and gated recurrent units, GRU). To evaluate the models, we averaged their classification performance (precision, recall, and F1 score) over ten distinct dataset splits. The best-performing deep learning model(s) achieved an F1 score of 98%. The traditional machine learning models also achieved reasonably high F1 scores, with 95% in the worst-performing case. Overall, we constructed multiple models to classify articles that form a hard-to-classify dataset in the medical domain. © 2024 Elsevier B.V., All rights reserved.
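
The abstract describes the evaluation protocol (a 70/20/10 split, with precision, recall, and F1 averaged over ten splits) but gives no implementation details. The following is a minimal sketch of that protocol, not the authors' code. It assumes random shuffling with a fixed seed per split and macro averaging across classes, neither of which the abstract states. The names `split_70_20_10`, `macro_scores`, `average_over_splits`, and `train_and_predict` are hypothetical.

```python
import random
from statistics import mean


def split_70_20_10(items, seed):
    """Shuffle with a fixed seed, then cut into train/test/validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder, roughly 10%
    return train, test, validation


def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def average_over_splits(items, train_and_predict, labels, n_splits=10):
    """Average macro scores on the test part over n_splits seeded splits.

    items is a list of (text, label) pairs. train_and_predict is a
    hypothetical callback: it receives the train and validation parts
    and returns a function mapping a text to a predicted label.
    """
    runs = []
    for seed in range(n_splits):
        train, test, validation = split_70_20_10(items, seed)
        predict = train_and_predict(train, validation)
        y_true = [label for _, label in test]
        y_pred = [predict(text) for text, _ in test]
        runs.append(macro_scores(y_true, y_pred, labels))
    # Transpose the per-run triples and average each metric separately.
    return tuple(mean(column) for column in zip(*runs))
```

Any of the model families named in the abstract could be plugged in through `train_and_predict`. The validation part is passed to the callback so that a model can use it for tuning or early stopping, while the reported scores come only from the test part.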

Description

Keywords

Classification; Deep Learning; Machine Learning; Text Mining; Convolutional Neural Networks; Diseases; Information Retrieval Systems; Learning Systems; Logistic Regression; Long Short-Term Memory; Statistical Tests; Text Processing; Websites; Clinical Records; Comparative Evaluations; F1 Scores; Learning Models; Machine-Learning; Medical Data Sets; Text-Mining; Textual Data; Web-Page; Classification (Of Information); Article; Bladder Cancer; Classifier; Convolutional Neural Network; Leukemia; Logistic Regression Analysis; Non-Hodgkin Lymphoma; Random Forest; Short Term Memory; Social Media; Thyroid Cancer; Artificial Neural Network; Factual Database; Human; Neoplasm; Databases, Factual; Humans; Neoplasms; Neural Networks, Computer

WoS Q

Q1

Scopus Q

Q1

Source

Computers in Biology and Medicine

Volume

178

Start Page

108721

PlumX Metrics
Citations

Scopus: 5

PubMed: 1

Captures

Mendeley Readers: 14

SCOPUS™ Citations

5

checked on Feb 03, 2026

OpenAlex FWCI
10.07365439

Sustainable Development Goals

3: GOOD HEALTH AND WELL-BEING

4: QUALITY EDUCATION

5: GENDER EQUALITY

8: DECENT WORK AND ECONOMIC GROWTH

10: REDUCED INEQUALITIES

11: SUSTAINABLE CITIES AND COMMUNITIES

17: PARTNERSHIPS FOR THE GOALS