Title: A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets
Authors: Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus
ORCID: Subasi, Omer/0000-0002-5373-7570
Type: Conference Object
Year: 2016
Repository record date: 2025-09-25
ISBN: 9781509036530
ISSN: 1552-5244
DOI: https://doi.org/10.1109/CLUSTER.2016.54
Handle: https://hdl.handle.net/20.500.12573/3148
Scopus ID: 2-s2.0-85013177229
Language: en
Access rights: info:eu-repo/semantics/openAccess

Abstract:
In this paper we propose a runtime-based selective task replication technique for task-parallel high-performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler, or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic has low overhead and is highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.