Sample Reduction Strategies for Protein Secondary Structure Prediction

dc.contributor.author Atasever, Sema
dc.contributor.author Aydın, Zafer
dc.contributor.author Erbay, Hasan
dc.contributor.author Sabzekar, Mostafa
dc.contributor.department AGÜ, Faculty of Engineering, Department of Computer Engineering en_US
dc.date.accessioned 2020-01-31T13:43:37Z
dc.date.available 2020-01-31T13:43:37Z
dc.date.issued 2019 en_US
dc.description This work was supported by the 3501 TUBITAK National Young Researchers Career Award [grant number 113E550]. en_US
dc.description.abstract Predicting secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding protein function. As new genes and proteins are discovered, the protein databases and the datasets that can be used for training prediction models grow considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM), has been shown to provide state-of-the-art accuracy for protein secondary structure prediction. However, SVM training is not efficient on large datasets because of the quadratic optimization involved. In this paper, two techniques are implemented on the CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of the data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples with a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers in order to reduce the training time. The number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data. en_US
dc.description.sponsorship 3501 TUBITAK National Young Researchers Career Award 113E550 en_US
dc.identifier.citation 39 en_US
dc.identifier.doi 10.3390/app9204429
dc.identifier.issn 2076-3417
dc.identifier.other 10.3390/app9204429
dc.identifier.uri https://hdl.handle.net/20.500.12573/89
dc.language.iso eng en_US
dc.publisher MDPI, ST ALBAN-ANLAGE 66, CH-4052 BASEL, SWITZERLAND en_US
dc.relation.ispartofseries 9;
dc.relation.publicationcategory Article - International Peer-Reviewed Journal - Institutional Faculty Member en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject protein secondary structure prediction en_US
dc.subject support vector machine en_US
dc.subject bayesian network en_US
dc.subject stratified sampling en_US
dc.subject hierarchical clustering en_US
dc.title Sample Reduction Strategies for Protein Secondary Structure Prediction en_US
dc.type article en_US
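
The two sample reduction ideas summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy data, the naive O(n^3) Ward merge loop, and the choice to search for nearest neighbors only among each cluster's own members are all assumptions made for illustration. The record does not describe the feature vectors or the SVM settings, so none are modeled here.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices keeping roughly `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering using Ward's merge cost.

    Merging clusters A and B increases the within-cluster sum of squares by
    |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2; the cheapest pair is
    merged at every step. Only suitable for small illustrative inputs.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            ca = centroid([points[i] for i in clusters[a]])
            for b in range(a + 1, len(clusters)):
                cb = centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_reduce(points, n_clusters, n_neighbors):
    """Return sorted indices of the samples nearest to each cluster center.

    Assumption: neighbors are searched among the cluster's own members.
    """
    kept = set()
    for members in ward_clusters(points, n_clusters):
        center = centroid([points[i] for i in members])
        nearest = sorted(members, key=lambda i: sq_dist(points[i], center))
        kept.update(nearest[:n_neighbors])
    return sorted(kept)


# Toy usage with hypothetical three-state labels (H, E, C) and 2D points.
toy_labels = ["H"] * 4 + ["E"] * 2 + ["C"] * 6
half = stratified_subsample(toy_labels, fraction=0.5)

toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
reduced = cluster_reduce(toy_points, n_clusters=2, n_neighbors=1)
```

In this sketch both functions return indices into the original train set, so the matching labels can be looked up before training a classifier on the reduced set. In practice the number of clusters and the number of neighbors would be tuned on validation data, as the abstract describes, and a library implementation of hierarchical clustering would replace the naive loop.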

Files

Original bundle

Name: Sample Reduction Strategies for Protein Secondary Structure Prediction.pdf
Size: 688.17 KB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission