Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications

dc.contributor.author Subasi, Omer
dc.contributor.author Yalcin, Gulay
dc.contributor.author Zyulkyarov, Ferad
dc.contributor.author Unsal, Osman
dc.contributor.author Labarta, Jesus
dc.date.accessioned 2025-09-25T10:44:26Z
dc.date.available 2025-09-25T10:44:26Z
dc.date.issued 2017
dc.description Subasi, Omer/0000-0002-5373-7570 en_US
dc.description.abstract Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs; however, few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user. en_US
dc.description.sponsorship European Union Mont-blanc 2 Project [610402]; FEDER funds [TIN2015-65316-P] en_US
dc.description.sponsorship This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P. en_US
dc.identifier.doi 10.1109/CCGRID.2017.40
dc.identifier.isbn 9781509066117
dc.identifier.issn 2376-4414
dc.identifier.scopus 2-s2.0-85027467982
dc.identifier.uri https://doi.org/10.1109/CCGRID.2017.40
dc.identifier.uri https://hdl.handle.net/20.500.12573/3595
dc.language.iso en en_US
dc.publisher IEEE en_US
dc.relation.ispartof 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) -- MAY 14-17, 2017 -- Madrid, SPAIN en_US
dc.relation.ispartofseries IEEE-ACM International Symposium on Cluster Cloud and Grid Computing
dc.rights info:eu-repo/semantics/openAccess en_US
dc.title Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications en_US
dc.type Conference Object en_US
dspace.entity.type Publication
gdc.author.id Subasi, Omer/0000-0002-5373-7570
gdc.author.scopusid 57144377900
gdc.author.scopusid 23029394200
gdc.author.scopusid 6505657882
gdc.author.scopusid 35612224700
gdc.author.scopusid 56256013400
gdc.author.wosid Unsal, Osman/B-9161-2016
gdc.bip.impulseclass C4
gdc.bip.influenceclass C4
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::conference output
gdc.collaboration.industrial false
gdc.description.department Abdullah Gül University en_US
gdc.description.departmenttemp [Subasi, Omer; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus] Barcelona Supercomp Ctr, Barcelona, Spain; [Subasi, Omer; Labarta, Jesus] Univ Politecn Cataluna, Barcelona, Spain; [Yalcin, Gulay] Abdullah Gul Univ, Kayseri, Turkey en_US
gdc.description.endpage 457 en_US
gdc.description.publicationcategory Conference Item - International - Institutional Faculty Member en_US
gdc.description.scopusquality N/A
gdc.description.startpage 452 en_US
gdc.description.woscitationindex Conference Proceedings Citation Index - Science
gdc.description.wosquality N/A
gdc.identifier.openalex W2725418265
gdc.identifier.wos WOS:000426912900048
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.diamondjournal false
gdc.oaire.downloads 81
gdc.oaire.impulse 14.0
gdc.oaire.influence 3.6500456E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Fault tolerance (Computing)
gdc.oaire.keywords Parallel processing (Electronic computers)
gdc.oaire.keywords Markov processes
gdc.oaire.keywords Parallel processing (Computers)
gdc.oaire.keywords Computational modeling
gdc.oaire.keywords Computer crashes
gdc.oaire.keywords Reliability theory
gdc.oaire.keywords Fault-tolerant computing
gdc.oaire.keywords Mathematical model
gdc.oaire.keywords Hardware
gdc.oaire.keywords UPC subject areas::Computer science::Computer architecture
gdc.oaire.keywords :Computer science::Computer architecture [UPC subject areas]
gdc.oaire.popularity 7.534923E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 02 engineering and technology
gdc.oaire.sciencefields 0202 electrical engineering, electronic engineering, information engineering
gdc.oaire.views 28
gdc.openalex.collaboration International
gdc.openalex.fwci 5.188891
gdc.openalex.normalizedpercentile 0.95
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 22
gdc.plumx.crossrefcites 7
gdc.plumx.mendeley 14
gdc.plumx.scopuscites 25
gdc.scopus.citedcount 25
gdc.virtual.author Yalçın Alkan, Gülay
gdc.wos.citedcount 21
relation.isAuthorOfPublication e0dc9e40-f936-402f-96c6-f4e668a0b9d3
relation.isAuthorOfPublication.latestForDiscovery e0dc9e40-f936-402f-96c6-f4e668a0b9d3
relation.isOrgUnitOfPublication 665d3039-05f8-4a25-9a3c-b9550bffecef
relation.isOrgUnitOfPublication 52f507ab-f278-4a1f-824c-44da2a86bd51
relation.isOrgUnitOfPublication ef13a800-4c99-4124-81e0-3e25b33c0c2b
relation.isOrgUnitOfPublication.latestForDiscovery 665d3039-05f8-4a25-9a3c-b9550bffecef