Browsing by Author "Zyulkyarov, Ferad"

  • Designing and Modelling Selective Replication for Fault-tolerant HPC Applications
    (IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA, 2017) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus; AGÜ, Faculty of Engineering, Department of Computer Engineering
    Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs; however, few address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
  • A runtime heuristic to selectively replicate tasks for application-specific reliability targets
    (IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA, 2016) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus; AGÜ, Faculty of Engineering, Department of Computer Engineering
    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.