CRC-Based Memory Reliability for Task-Parallel HPC Applications

Date

2016

Publisher

IEEE

Open Access Color

Green Open Access

Yes

Publicly Funded

No

Abstract

Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECCs) are the most commonly used protection mechanism. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are therefore needed to cooperate with the hardware. In this work, we propose a Cyclic Redundancy Check (CRC)-based software mechanism for task-parallel HPC applications. With hardware acceleration, our mechanism incurs only 1.7% performance overhead and remains highly scalable at large scale. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces memory vulnerability by 87% on average, with error correction capability for bursts (consecutive errors) of up to 32 bits and for up to 5 arbitrary bit errors.
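
As a rough illustration of the general idea of guarding task data with a CRC, the sketch below records a checksum when a producer task finishes and re-checks it before a consumer task reads the buffer. It is not the paper's implementation: the function names (`checksum`, `is_intact`, `flip_burst`) are hypothetical, Python's `zlib.crc32` stands in for a hardware-accelerated CRC, and the sketch only detects corruption. The polynomial, the protection granularity, and the correction procedure used in the paper are not given in this abstract.

```python
import zlib


def checksum(buf: bytes) -> int:
    """CRC-32 of a task's output buffer (zlib's standard polynomial)."""
    return zlib.crc32(buf)


def is_intact(buf: bytes, stored_crc: int) -> bool:
    """Re-check a buffer before a consumer task reads it."""
    return zlib.crc32(buf) == stored_crc


def flip_burst(buf: bytes, start_bit: int, length: int) -> bytes:
    """Return a copy of buf with `length` consecutive bits inverted."""
    corrupted = bytearray(buf)
    for bit in range(start_bit, start_bit + length):
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


# Producer task finishes: record the checksum alongside its output.
output = bytes(range(256))
stored = checksum(output)

# Consumer task starts: a clean buffer passes the check, while every
# injected burst of 1 to 32 consecutive flipped bits is flagged.
clean_ok = is_intact(output, stored)
burst_detected = all(
    not is_intact(flip_burst(output, 100, n), stored) for n in range(1, 33)
)
```

A 32-bit CRC detects every burst no longer than 32 bits, which is why the loop over burst lengths never misses. Turning detection into correction, as the abstract reports, requires additional machinery that this sketch does not attempt.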

Description

Subasi, Omer/0000-0002-5373-7570; Labarta, Jesus/0000-0002-7489-4727

Keywords

Application programs, Cyclic redundancy check, Parallel processing (Electronic computers), Error correction capability, Errors, Memory reliability, Task parallelism, Dataflow model, Reliability, Mathematical analysis, Reconfigurable hardware, Hardware, Software-based solutions, Hardware acceleration, Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures paral·leles, Error correction, Data flow analysis

Fields of Science

02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering

Source

30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) -- MAY 23-27, 2016 -- Illinois Inst Technol, Chicago, IL

Start Page

1101

End Page

1112