Sample Reduction Strategies for Protein Secondary Structure Prediction

dc.contributor.author Atasever, Sema
dc.contributor.author Aydin, Zafer
dc.contributor.author Erbay, Hasan
dc.contributor.author Sabzekar, Mostafa
dc.date.accessioned 2025-09-25T10:56:47Z
dc.date.available 2025-09-25T10:56:47Z
dc.date.issued 2019
dc.description Erbay, Hasan/0000-0002-7555-541X; Atasever, Sema/0000-0002-2295-7917; Sabzekar, Mostafa/0000-0002-6886-1240; Aydin, Zafer/0000-0001-7686-6298 en_US
dc.description.abstract Predicting secondary structure from the protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding protein function. As new genes and proteins are discovered, the protein databases and the datasets that can be used for training prediction models grow considerably. A two-stage hybrid classifier that combines dynamic Bayesian networks with a support vector machine (SVM) has been shown to provide state-of-the-art accuracy for protein secondary structure prediction. However, SVM training is inefficient on large datasets because of the quadratic optimization it involves. In this paper, two techniques are implemented on the CB513 benchmark to reduce the number of samples in the train set of the SVM. The first method randomly selects a fraction of the data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the feature vectors with a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers in order to reduce the training time. The number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data. en_US
dc.description.sponsorship 3501 TUBITAK National Young Researchers Career Award [113E550] en_US
dc.description.sponsorship This work was supported by the 3501 TUBITAK National Young Researchers Career Award [grant number 113E550]. en_US
dc.identifier.doi 10.3390/app9204429
dc.identifier.issn 2076-3417
dc.identifier.scopus 2-s2.0-85074210330
dc.identifier.uri https://doi.org/10.3390/app9204429
dc.identifier.uri https://hdl.handle.net/20.500.12573/4606
dc.language.iso en en_US
dc.publisher MDPI en_US
dc.relation.ispartof Applied Sciences-Basel en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Protein Secondary Structure Prediction en_US
dc.subject Support Vector Machine en_US
dc.subject Bayesian Network en_US
dc.subject Stratified Sampling en_US
dc.subject Hierarchical Clustering en_US
dc.title Sample Reduction Strategies for Protein Secondary Structure Prediction en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id Erbay, Hasan/0000-0002-7555-541X
gdc.author.id Atasever, Sema/0000-0002-2295-7917
gdc.author.id Sabzekar, Mostafa/0000-0002-6886-1240
gdc.author.id Aydin, Zafer/0000-0001-7686-6298
gdc.author.scopusid 57211503467
gdc.author.scopusid 7003852510
gdc.author.scopusid 55900695500
gdc.author.scopusid 35796344600
gdc.author.wosid Sabzekar, Mostafa/AAD-7807-2020
gdc.author.wosid Erbay, Hasan/F-1093-2016
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department Abdullah Gül University en_US
gdc.description.departmenttemp [Atasever, Sema] Nevsehir Haci Bektas Veli Univ, Dept Comp Engn, TR-50300 Nevsehir, Turkey; [Aydin, Zafer] Abdullah Gul Univ, Dept Comp Engn, TR-38080 Kayseri, Turkey; [Erbay, Hasan] Univ Turkish Aeronaut Assoc, Engn Fac, Dept Comp Engn, TR-06790 Ankara, Turkey; [Sabzekar, Mostafa] Birjand Univ Technol, Dept Comp Engn, Birjand 97175569, Iran en_US
gdc.description.issue 20 en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q2
gdc.description.startpage 4429
gdc.description.volume 9 en_US
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q2
gdc.identifier.openalex W2980675127
gdc.identifier.wos WOS:000496269400232
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.downloads 68
gdc.oaire.impulse 3.0
gdc.oaire.influence 2.6916624E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Technology
gdc.oaire.keywords QH301-705.5
gdc.oaire.keywords T
gdc.oaire.keywords Physics
gdc.oaire.keywords QC1-999
gdc.oaire.keywords stratified sampling
gdc.oaire.keywords Engineering (General). Civil engineering (General)
gdc.oaire.keywords Chemistry
gdc.oaire.keywords bayesian network
gdc.oaire.keywords support vector machine
gdc.oaire.keywords protein secondary structure prediction
gdc.oaire.keywords TA1-2040
gdc.oaire.keywords Biology (General)
gdc.oaire.keywords hierarchical clustering
gdc.oaire.keywords QD1-999
gdc.oaire.popularity 4.393328E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0301 basic medicine
gdc.oaire.sciencefields 0303 health sciences
gdc.oaire.sciencefields 03 medical and health sciences
gdc.oaire.views 91
gdc.openalex.collaboration International
gdc.openalex.fwci 0.30915302
gdc.openalex.normalizedpercentile 0.56
gdc.opencitations.count 3
gdc.plumx.crossrefcites 4
gdc.plumx.mendeley 9
gdc.plumx.patentfamcites 1
gdc.plumx.scopuscites 4
gdc.scopus.citedcount 4
gdc.virtual.author Aydın, Zafer
gdc.wos.citedcount 4
relation.isAuthorOfPublication a26c06af-eae3-407c-a21a-128459fa4d2f
relation.isAuthorOfPublication.latestForDiscovery a26c06af-eae3-407c-a21a-128459fa4d2f
relation.isOrgUnitOfPublication 665d3039-05f8-4a25-9a3c-b9550bffecef
relation.isOrgUnitOfPublication 52f507ab-f278-4a1f-824c-44da2a86bd51
relation.isOrgUnitOfPublication ef13a800-4c99-4124-81e0-3e25b33c0c2b
relation.isOrgUnitOfPublication.latestForDiscovery 665d3039-05f8-4a25-9a3c-b9550bffecef
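
Illustrative sketch (not part of the record). The two sample reduction ideas named in the abstract, stratified random selection and replacing the train set with nearest neighbors of hierarchical cluster centers, might be sketched roughly as below. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names (stratified_subsample, ward_clusters, cluster_reduce) are hypothetical, the Ward clustering is a naive cubic-time version suited only to toy data, and neighbors are searched over the whole train set, which the abstract does not specify.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices keeping about `fraction` of each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(vectors, n_clusters):
    """Naive agglomerative clustering using Ward's merge cost.

    Returns a list of clusters, each a list of sample indices.
    """
    if not 1 <= n_clusters <= len(vectors):
        raise ValueError("n_clusters must be between 1 and the number of samples")
    clusters = [[i] for i in range(len(vectors))]
    centers = [list(v) for v in vectors]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                # Ward cost: increase in within-cluster variance if a and b merge.
                cost = na * nb / (na + nb) * _sq_dist(centers[a], centers[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        del clusters[b], centers[b]  # b > a, so index a stays valid
        clusters[a] = merged
        centers[a] = _centroid([vectors[i] for i in merged])
    return clusters


def cluster_reduce(vectors, n_clusters, n_neighbors):
    """Keep the n_neighbors samples closest to each cluster center.

    Assumption: neighbors are drawn from the full train set, not only
    from the members of the cluster.
    """
    kept = set()
    for members in ward_clusters(vectors, n_clusters):
        center = _centroid([vectors[i] for i in members])
        ranked = sorted(range(len(vectors)),
                        key=lambda i: _sq_dist(vectors[i], center))
        kept.update(ranked[:n_neighbors])
    return sorted(kept)


if __name__ == "__main__":
    # Toy data: two well-separated groups of 2-D feature vectors.
    toy_vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    toy_labels = ["H", "H", "E", "E", "C", "C"]
    print(stratified_subsample(toy_labels, 0.5))
    print(cluster_reduce(toy_vectors, n_clusters=2, n_neighbors=1))
```

In a pipeline like the one the abstract describes, the returned indices would select the reduced train set passed to the SVM, and n_clusters and n_neighbors would be tuned as hyper-parameters against validation accuracy. For datasets of realistic size, a library implementation of Ward linkage would be a more practical choice than this naive loop.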
