Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
Date
2017
Publisher
IEEE
Open Access Color
Green Open Access
Yes
OpenAIRE Downloads
81
OpenAIRE Views
28
Publicly Funded
No
Abstract
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs, but few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
ORCID
Subasi, Omer/0000-0002-5373-7570
Keywords
Fault tolerance (Computing), Parallel processing (Electronic computers), Markov processes, Parallel processing (Computers), Computational modeling, Computer crashes, Reliability theory, Fault-tolerant computing, Mathematical model, Hardware, UPC subject areas::Computer science::Computer architecture
Fields of Science
02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering
WoS Q
N/A
Scopus Q
N/A

OpenCitations Citation Count
23
Source
17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 14-17, 2017, Madrid, Spain
Start Page
452
End Page
457
PlumX Metrics
Citations
CrossRef : 7
Scopus : 25
Captures
Mendeley Readers : 14