Authors: Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus
Repository record dates: 2025-09-25; 2025-09-25
Year: 2017
ISBN: 9781509066117
ISSN: 2376-4414
DOI link: https://doi.org/10.1109/CCGRID.2017.40
Handle: https://hdl.handle.net/20.500.12573/3595
ORCID: Subasi, Omer/0000-0002-5373-7570
Abstract: Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs; however, few address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
Language: en
Access rights: info:eu-repo/semantics/openAccess
Title: Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
Type: Conference Object
DOI: 10.1109/CCGRID.2017.40
Scopus EID: 2-s2.0-85027467982