Sample Reduction Strategies for Protein Secondary Structure Prediction

Atasever, Sema; Aydın, Zafer; Erbay, Hasan; Sabzekar, Mostafa

dc.contributor.author	Atasever, Sema
dc.contributor.author	Aydın, Zafer
dc.contributor.author	Erbay, Hasan
dc.contributor.author	Sabzekar, Mostafa
dc.contributor.department	AGÜ, Faculty of Engineering, Department of Computer Engineering	en_US
dc.contributor.institutionauthor
dc.date.accessioned	2020-01-31T13:43:37Z
dc.date.available	2020-01-31T13:43:37Z
dc.date.issued	2019	en_US
dc.description	This work was supported by the 3501 TUBITAK National Young Researchers Career Award [grant number 113E550].	en_US
dc.description.abstract	Predicting secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding protein function. As new genes and proteins are discovered, the protein databases and the datasets that can be used for training prediction models grow considerably. A two-stage hybrid classifier that employs dynamic Bayesian networks and a support vector machine (SVM) has been shown to provide state-of-the-art accuracy for protein secondary structure prediction. However, SVM is not efficient for large datasets because of the quadratic optimization involved in model training. In this paper, two techniques are implemented on the CB513 benchmark for reducing the number of samples in the training set of the SVM. The first method randomly selects a fraction of data samples from the training set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the training set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples with a hierarchical clustering algorithm and replaces the training set samples with the nearest neighbors of the cluster centers in order to reduce the training time. The number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the training set by 26% without reducing the prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data.	en_US
dc.description.sponsorship	3501 TUBITAK National Young Researchers Career Award 113E550	en_US
dc.identifier.citation	39	en_US
dc.identifier.doi	10.3390/app9204429
dc.identifier.issn	2076-3417
dc.identifier.other	10.3390/app9204429
dc.identifier.uri	https://hdl.handle.net/20.500.12573/89
dc.language.iso	eng	en_US
dc.publisher	MDPI, ST ALBAN-ANLAGE 66, CH-4052 BASEL, SWITZERLAND	en_US
dc.relation.ispartofseries	9;
dc.relation.publicationcategory	Article - International Peer-Reviewed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	protein secondary structure prediction	en_US
dc.subject	support vector machine	en_US
dc.subject	bayesian network	en_US
dc.subject	stratified sampling	en_US
dc.subject	hierarchical clustering	en_US
dc.title	Sample Reduction Strategies for Protein Secondary Structure Prediction	en_US
dc.type	article	en_US
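
The stratified selection strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stratified_subsample`, the toy data, and the three class labels (H, E, L) are assumptions made for illustration only. The idea is to keep the same fraction of samples from every class so that the class proportions of the reduced training set match the original.

```python
import random
from collections import defaultdict


def stratified_subsample(samples, labels, fraction, seed=0):
    """Keep roughly `fraction` of the samples from each class label."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    kept = []
    for indices in by_label.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))

    kept.sort()  # preserve the original sample order
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 100 one-dimensional feature vectors, three classes.
labels = ["H"] * 40 + ["E"] * 20 + ["L"] * 40
samples = [[float(i)] for i in range(len(labels))]
sub_x, sub_y = stratified_subsample(samples, labels, fraction=0.5)
```

With `fraction=0.5`, each class in the toy data is halved, which mirrors the roughly 50% reduction reported in the abstract; the reduced set would then be passed to the SVM trainer in place of the full training set.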

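The clustering-based reduction described in the abstract can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' code: the function names, the toy points, and the choice to draw each center's nearest neighbors from that cluster's own members are all assumptions. Clusters are merged with Ward's criterion, where the cost of merging clusters A and B is |A||B| / (|A| + |B|) times the squared distance between their centroids. The number of clusters `k` and the number of neighbors `m` correspond to the hyper-parameters mentioned in the abstract.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward's merge cost.

    Returns a list of k clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    centers = [list(p) for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(centers[a], centers[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and recompute its centroid.
        clusters[a] = clusters[a] + clusters[b]
        centers[a] = centroid([points[i] for i in clusters[a]])
        del clusters[b], centers[b]
    return clusters


def reduce_by_clustering(points, k, m):
    """Return sorted indices of the m members nearest each cluster center."""
    kept = set()
    for members in ward_clusters(points, k):
        center = centroid([points[i] for i in members])
        nearest = sorted(members, key=lambda i: sq_dist(points[i], center))[:m]
        kept.update(nearest)
    return sorted(kept)


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
kept = reduce_by_clustering(points, k=2, m=1)
```

The pairwise search makes this sketch cubic in the number of samples, so it is only suitable for small examples; a library implementation of hierarchical clustering would be needed for a dataset the size of CB513.
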
Files

Original bundle

Name: Sample Reduction Strategies for Protein Secondary Structure Prediction.pdf
Size: 688.17 KB
Format: Adobe Portable Document Format
Description: Article

Collections

Computer Engineering Department Collection
Scopus Indexed Publications Collection
WoS Indexed Publications Collection