Sample Reduction Strategies for Protein Secondary Structure Prediction

Date

2019

Publisher

MDPI

Open Access Color

GOLD

Green Open Access

Yes

Publicly Funded

No
Abstract

Predicting secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding protein function. As new genes and proteins are discovered, the protein databases and datasets that can be used for training prediction models grow considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM), has been shown to provide state-of-the-art accuracy for protein secondary structure prediction. However, SVM training is inefficient on large datasets because of the quadratic optimization involved. In this paper, two techniques are implemented on the CB513 benchmark to reduce the number of samples in the train set of the SVM. The first method randomly selects a fraction of the data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples with a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers in order to reduce the training time. The number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data.
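The two sample reduction ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stratified_subsample`, `ward_clusters`, `cluster_reduce`), the toy data, and the choice to draw nearest neighbors from within each cluster are all assumptions made for illustration, and the naive clustering loop is only practical for very small inputs.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices keeping roughly `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering using Ward's merge cost.

    At each step, merge the pair of clusters whose union gives the
    smallest increase in within-cluster sum of squares.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([points[i] for i in clusters[a]])
                cb = centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_reduce(points, n_clusters, n_neighbors):
    """Keep the `n_neighbors` samples closest to each cluster center.

    Assumption: neighbors are taken from the cluster's own members.
    """
    kept = set()
    for members in ward_clusters(points, n_clusters):
        center = centroid([points[i] for i in members])
        nearest = sorted(members, key=lambda i: sq_dist(points[i], center))
        kept.update(nearest[:n_neighbors])
    return sorted(kept)


# Toy usage: three secondary structure classes (H, E, C), keep half of each.
labels = ["H"] * 6 + ["E"] * 4 + ["C"] * 10
half = stratified_subsample(labels, fraction=0.5)

# Toy usage: two well-separated groups, keep one representative per cluster.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
reduced = cluster_reduce(points, n_clusters=2, n_neighbors=1)
```

In a setting like the one described, the returned indices would select the rows of the SVM train set to keep, and `n_clusters` and `n_neighbors` would be tuned on validation sets as the abstract states.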

Description

Erbay, Hasan/0000-0002-7555-541X; Atasever, Sema/0000-0002-2295-7917; Sabzekar, Mostafa/0000-0002-6886-1240; Aydin, Zafer/0000-0001-7686-6298

Keywords

Protein Secondary Structure Prediction, Support Vector Machine, Bayesian Network, Stratified Sampling, Hierarchical Clustering, Technology, QH301-705.5, T, Physics, QC1-999, Engineering (General). Civil engineering (General), Chemistry, TA1-2040, Biology (General), QD1-999

Fields of Science

0301 basic medicine, 0303 health sciences, 03 medical and health sciences

WoS Q

Q2

Scopus Q

Q2

Source

Applied Sciences-Basel

Volume

9

Issue

20

Start Page

4429
