Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
| dc.contributor.author | Subasi, Omer | |
| dc.contributor.author | Yalcin, Gulay | |
| dc.contributor.author | Zyulkyarov, Ferad | |
| dc.contributor.author | Unsal, Osman | |
| dc.contributor.author | Labarta, Jesus | |
| dc.date.accessioned | 2025-09-25T10:44:26Z | |
| dc.date.available | 2025-09-25T10:44:26Z | |
| dc.date.issued | 2017 | |
| dc.description | Subasi, Omer/0000-0002-5373-7570 | en_US |
| dc.description.abstract | Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs; however, few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user. | en_US |
| dc.description.sponsorship | European Union Mont-blanc 2 Project [610402]; FEDER funds [TIN2015-65316-P] | en_US |
| dc.description.sponsorship | This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P. | en_US |
| dc.identifier.doi | 10.1109/CCGRID.2017.40 | |
| dc.identifier.isbn | 9781509066117 | |
| dc.identifier.issn | 2376-4414 | |
| dc.identifier.scopus | 2-s2.0-85027467982 | |
| dc.identifier.uri | https://doi.org/10.1109/CCGRID.2017.40 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12573/3595 | |
| dc.language.iso | en | en_US |
| dc.publisher | IEEE | en_US |
| dc.relation.ispartof | 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) -- MAY 14-17, 2017 -- Madrid, SPAIN | en_US |
| dc.relation.ispartofseries | IEEE-ACM International Symposium on Cluster Cloud and Grid Computing | |
| dc.rights | info:eu-repo/semantics/openAccess | en_US |
| dc.title | Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications | en_US |
| dc.type | Conference Object | en_US |
| dspace.entity.type | Publication | |
| gdc.author.id | Subasi, Omer/0000-0002-5373-7570 | |
| gdc.author.scopusid | 57144377900 | |
| gdc.author.scopusid | 23029394200 | |
| gdc.author.scopusid | 6505657882 | |
| gdc.author.scopusid | 35612224700 | |
| gdc.author.scopusid | 56256013400 | |
| gdc.author.wosid | Unsal, Osman/B-9161-2016 | |
| gdc.bip.impulseclass | C4 | |
| gdc.bip.influenceclass | C4 | |
| gdc.bip.popularityclass | C4 | |
| gdc.coar.access | open access | |
| gdc.coar.type | text::conference output | |
| gdc.collaboration.industrial | false | |
| gdc.description.department | Abdullah Gül University | en_US |
| gdc.description.departmenttemp | [Subasi, Omer; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus] Barcelona Supercomp Ctr, Barcelona, Spain; [Subasi, Omer; Labarta, Jesus] Univ Politecn Cataluna, Barcelona, Spain; [Yalcin, Gulay] Abdullah Gul Univ, Kayseri, Turkey | en_US |
| gdc.description.endpage | 457 | en_US |
| gdc.description.publicationcategory | Conference Item - International - Institutional Faculty Member | en_US |
| gdc.description.scopusquality | N/A | |
| gdc.description.startpage | 452 | en_US |
| gdc.description.woscitationindex | Conference Proceedings Citation Index - Science | |
| gdc.description.wosquality | N/A | |
| gdc.identifier.openalex | W2725418265 | |
| gdc.identifier.wos | WOS:000426912900048 | |
| gdc.index.type | WoS | |
| gdc.index.type | Scopus | |
| gdc.oaire.diamondjournal | false | |
| gdc.oaire.downloads | 81 | |
| gdc.oaire.impulse | 14.0 | |
| gdc.oaire.influence | 3.6500456E-9 | |
| gdc.oaire.isgreen | true | |
| gdc.oaire.keywords | Fault tolerance (Computing) | |
| gdc.oaire.keywords | Parallel processing (Electronic computers) | |
| gdc.oaire.keywords | Markov processes | |
| gdc.oaire.keywords | Parallel processing (Computers) | |
| gdc.oaire.keywords | Computational modeling | |
| gdc.oaire.keywords | Computer crashes | |
| gdc.oaire.keywords | Reliability theory | |
| gdc.oaire.keywords | Fault-tolerant computing | |
| gdc.oaire.keywords | Mathematical model | |
| gdc.oaire.keywords | Hardware | |
| gdc.oaire.keywords | UPC subject areas::Computer science::Computer architecture | |
| gdc.oaire.popularity | 7.534923E-9 | |
| gdc.oaire.publicfunded | false | |
| gdc.oaire.sciencefields | 02 engineering and technology | |
| gdc.oaire.sciencefields | 0202 electrical engineering, electronic engineering, information engineering | |
| gdc.oaire.views | 28 | |
| gdc.openalex.collaboration | International | |
| gdc.openalex.fwci | 5.188891 | |
| gdc.openalex.normalizedpercentile | 0.95 | |
| gdc.openalex.toppercent | TOP 10% | |
| gdc.opencitations.count | 22 | |
| gdc.plumx.crossrefcites | 7 | |
| gdc.plumx.mendeley | 14 | |
| gdc.plumx.scopuscites | 25 | |
| gdc.scopus.citedcount | 25 | |
| gdc.virtual.author | Yalçın Alkan, Gülay | |
| gdc.wos.citedcount | 21 | |
| relation.isAuthorOfPublication | e0dc9e40-f936-402f-96c6-f4e668a0b9d3 | |
| relation.isAuthorOfPublication.latestForDiscovery | e0dc9e40-f936-402f-96c6-f4e668a0b9d3 | |
| relation.isOrgUnitOfPublication | 665d3039-05f8-4a25-9a3c-b9550bffecef | |
| relation.isOrgUnitOfPublication | 52f507ab-f278-4a1f-824c-44da2a86bd51 | |
| relation.isOrgUnitOfPublication | ef13a800-4c99-4124-81e0-3e25b33c0c2b | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 665d3039-05f8-4a25-9a3c-b9550bffecef |
