Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
Date
2017
Publisher
IEEE
Open Access Color
Green Open Access
Yes
OpenAIRE Downloads
81
OpenAIRE Views
28
Publicly Funded
No
Abstract
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Existing studies typically address either fail-stop errors or SDCs; few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
Description
Subasi, Omer (ORCID: 0000-0002-5373-7570)
Keywords
Fault tolerance (Computing), Parallel processing (Electronic computers), Markov processes, Computational modeling, Computer crashes, Reliability theory, Fault-tolerant computing, Mathematical model, Hardware, UPC subject areas::Computer science::Computer architecture
Fields of Science
02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering
WoS Q
N/A
Scopus Q
N/A

OpenCitations Citation Count
22
Source
17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 14-17, 2017, Madrid, Spain
Start Page
452
End Page
457
PlumX Metrics
Citations
CrossRef : 7
Scopus : 25
Captures
Mendeley Readers : 14
SCOPUS™ Citations
25
checked on Feb 03, 2026
Web of Science™ Citations
21
checked on Feb 03, 2026
Page Views
8
checked on Feb 03, 2026