Browsing by Author "Unsal, Osman"

CRC-based Memory Reliability for Task-parallel HPC Applications
(IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA, 2016) Subasi, Omer; Unsal, Osman; Labarta, Jesus; Yalcin, Gulay; Cristal, Adrian; AGÜ, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a Cyclic Redundancy Check (CRC)-based software mechanism for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while being highly scalable. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability.
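As a rough illustration of the idea of guarding task data with a software CRC, the following minimal sketch computes a CRC-32 over a buffer and re-checks it later. It is not the mechanism from the paper: the function names `protect` and `is_intact` are hypothetical, the standard-library `zlib.crc32` stands in for whatever polynomial and hardware acceleration the authors use, and the sketch only detects errors, whereas the abstract also reports correction capability.

```python
import zlib


def protect(block: bytes) -> int:
    """Return a CRC-32 checksum to store alongside the block (hypothetical helper)."""
    return zlib.crc32(block)


def is_intact(block: bytes, stored_crc: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored value (hypothetical helper)."""
    return zlib.crc32(block) == stored_crc


# Assumed scenario: a task's input buffer is checksummed before the task runs.
data = bytearray(b"task input buffer for an HPC kernel")
checksum = protect(bytes(data))

# Simulate a 3-bit burst error inside one byte of the buffer.
data[5] ^= 0b00011100
corrupted = not is_intact(bytes(data), checksum)

# Undo the flip; the buffer matches its checksum again.
data[5] ^= 0b00011100
restored = is_intact(bytes(data), checksum)
```

A 32-bit CRC is guaranteed to detect any single burst error of at most 32 bits, which is why the 3-bit burst above is always caught.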
Designing and Modelling Selective Replication for Fault-tolerant HPC Applications
(IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA, 2017) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus; AGÜ, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
A runtime heuristic to selectively replicate tasks for application-specific reliability targets
(IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA, 2016) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus; AGÜ, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü
In this paper, we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
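To make the notion of selecting tasks against a reliability target concrete, here is a minimal sketch of one possible greedy selection. It is not the App_FIT heuristic itself, whose details the abstract does not give. The sketch assumes each task has a known failure-rate estimate, that the application's rate is the sum over unreplicated tasks, and that a replicated task contributes nothing to that sum; the function name, task names and numbers are all hypothetical.

```python
from typing import Dict, Set


def select_tasks_to_replicate(task_fit: Dict[str, float], target_fit: float) -> Set[str]:
    """Greedily pick tasks to replicate until the remaining failure rate meets the target.

    Assumption: replicating a task removes its whole failure-rate contribution.
    """
    remaining = sum(task_fit.values())
    chosen: Set[str] = set()
    # Visit the most failure-prone tasks first so fewer replicas are needed.
    for name, fit in sorted(task_fit.items(), key=lambda kv: kv[1], reverse=True):
        if remaining <= target_fit:
            break
        chosen.add(name)
        remaining -= fit
    return chosen


# Hypothetical per-task failure-rate estimates and application target.
rates = {"a": 40.0, "b": 30.0, "c": 20.0, "d": 10.0}
chosen = select_tasks_to_replicate(rates, target_fit=35.0)
```

With these made-up numbers, replicating the two most failure-prone tasks (half of the tasks) brings the remaining rate from 100 down to 30, below the target of 35, so the other two tasks run unreplicated.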