Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications



Open Access Color

Green Open Access

Yes


Abstract

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Existing studies address either fail-stop errors or SDCs, but few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Because complete replication of an application can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.

Description

Subasi, Omer/0000-0002-5373-7570

Keywords

Fault-tolerance, Selective Replication, HPC Applications, Fault tolerance (Computing), Parallel processing (Electronic computers), Markov processes, Computational modeling, Computer crashes, Reliability theory, Fault-tolerant computing, Mathematical model, Hardware, UPC subject areas::Computing::Computer architecture

Fields of Science

02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering


Start Page

452

End Page

457
