CRC-Based Memory Reliability for Task-Parallel HPC Applications
Date
2016
Publisher
IEEE
Green Open Access
Yes
Publicly Funded
No
Abstract
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. Error Correcting Codes (ECCs) are the most commonly used protection mechanism for memory systems. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power, and latency. Software-based solutions are therefore needed to cooperate with hardware. In this work, we propose a software mechanism based on Cyclic Redundancy Checks (CRCs) for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while remaining highly scalable. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces memory vulnerability by 87% on average, with error correction capability of up to 32-bit burst (consecutive) errors and 5-bit arbitrary errors.
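To make the idea of a software CRC over task data concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the polynomial, where checksums are stored, or the correction algorithm. The sketch assumes the standard CRC-32 from Python's `zlib` module, and the helper names (`protect`, `verify`, `flip_bits`, `correct_single_bit`) are hypothetical. It shows a checksum being computed for a data block, a 32-bit burst error being detected, and a single-bit error being repaired by brute-force trial flips.

```python
import zlib


def protect(block: bytes) -> int:
    """Compute a CRC-32 checksum to keep alongside a task's data block."""
    return zlib.crc32(block)


def verify(block: bytes, stored_crc: int) -> bool:
    """Return True if the block still matches its stored checksum."""
    return zlib.crc32(block) == stored_crc


def flip_bits(block: bytes, start_bit: int, length: int) -> bytes:
    """Simulate a burst error by flipping `length` consecutive bits."""
    corrupted = bytearray(block)
    for bit in range(start_bit, start_bit + length):
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


def correct_single_bit(block: bytes, stored_crc: int):
    """Brute-force repair (an assumption, not the paper's method):
    try every single-bit flip and return the first candidate whose
    CRC matches, or None if no single flip restores the checksum."""
    for bit in range(len(block) * 8):
        candidate = flip_bits(block, bit, 1)
        if verify(candidate, stored_crc):
            return candidate
    return None


# Hypothetical 64-byte task input.
data = bytes(range(64))
crc = protect(data)

# A 32-bit burst error and a single-bit error injected into copies.
burst = flip_bits(data, 100, 32)
single = flip_bits(data, 7, 1)

print("clean block verifies:", verify(data, crc))
print("burst error detected:", not verify(burst, crc))
print("single-bit error repaired:", correct_single_bit(single, crc) == data)
```

In a task-parallel runtime, a check like `verify` would plausibly run on a task's inputs before the task executes, with `protect` run on its outputs afterwards. That placement is an assumption about how such a scheme could be integrated, not a description of the paper's design.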
Description
Subasi, Omer/0000-0002-5373-7570; Labarta, Jesus/0000-0002-7489-4727
Keywords
Application programs, Cyclic redundancy check, Parallel processing (Electronic computers), Error correction capability, Errors, Memory reliability, Task parallelism, Dataflow model, Reliability, Mathematical analysis, Reconfigurable hardware, Hardware, Software-based solutions, Hardware acceleration, UPC subject areas::Computer science::Computer architecture::Parallel architectures, Error correction, Data flow analysis
Fields of Science
02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering
OpenCitations Citation Count
4
Source
30th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 23-27, 2016, Illinois Institute of Technology, Chicago, IL
Start Page
1101
End Page
1112


