Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
Open Access Color
Green Open Access
Yes
Publicly Funded
No
Abstract
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Existing studies typically address either fail-stop errors or SDCs; few address both together. In this paper, we propose a software-based selective replication technique that protects HPC applications against both fail-stop errors and SDCs. Because replicating an application completely is costly in terms of resources, we develop a runtime-based technique that replicates selectively. Selective replication offers an opportunity to meet HPC reliability targets while reducing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
Description
Subasi, Omer/0000-0002-5373-7570
Keywords
Fault-tolerance, Selective Replication, HPC Applications, Fault tolerance (Computing), Parallel processing (Electronic computers), Markov processes, Parallel processing (Computers), Computational modeling, Computer crashes, Reliability theory, Fault-tolerant computing, Mathematical model, Hardware, UPC subject areas::Computer science::Computer architecture
Fields of Science
02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering
Start Page
452
End Page
457