CRC-Based Memory Reliability for Task-Parallel HPC Applications

dc.contributor.author Subasi, Omer
dc.contributor.author Unsal, Osman
dc.contributor.author Labarta, Jesus
dc.contributor.author Yalcin, Gulay
dc.contributor.author Cristal, Adrian
dc.date.accessioned 2025-09-25T10:42:04Z
dc.date.available 2025-09-25T10:42:04Z
dc.date.issued 2016
dc.description Subasi, Omer/0000-0002-5373-7570; Labarta, Jesus/0000-0002-7489-4727 en_US
dc.description.abstract Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECCs) are the most commonly used protection mechanism. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a software mechanism based on Cyclic Redundancy Checks (CRCs) for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while remaining highly scalable. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces memory vulnerability by 87% on average, with the capability to correct up to 32-bit burst (consecutive) errors and up to 5-bit arbitrary errors. en_US
dc.identifier.doi 10.1109/IPDPS.2016.70
dc.identifier.isbn 9781509021406
dc.identifier.issn 1530-2075
dc.identifier.scopus 2-s2.0-84983289228
dc.identifier.uri https://doi.org/10.1109/IPDPS.2016.70
dc.identifier.uri https://hdl.handle.net/20.500.12573/3407
dc.language.iso en en_US
dc.publisher IEEE en_US
dc.relation.ispartof 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) -- MAY 23-27, 2016 -- Illinois Inst Technol, Chicago, IL en_US
dc.relation.ispartofseries International Parallel and Distributed Processing Symposium IPDPS
dc.rights info:eu-repo/semantics/closedAccess en_US
dc.title CRC-Based Memory Reliability for Task-Parallel HPC Applications en_US
dc.type Conference Object en_US
dspace.entity.type Publication
gdc.author.id Subasi, Omer/0000-0002-5373-7570
gdc.author.id Labarta, Jesus/0000-0002-7489-4727
gdc.author.scopusid 57144377900
gdc.author.scopusid 35612224700
gdc.author.scopusid 56256013400
gdc.author.scopusid 23029394200
gdc.author.scopusid 55884958300
gdc.author.wosid Labarta, Jesus/G-5256-2015
gdc.author.wosid Unsal, Osman/B-9161-2016
gdc.author.wosid Cristal, Adrian/Aal-9102-2020
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access metadata only access
gdc.coar.type text::conference output
gdc.collaboration.industrial false
gdc.description.department Abdullah Gül University en_US
gdc.description.departmenttemp [Subasi, Omer; Unsal, Osman; Labarta, Jesus; Cristal, Adrian] Barcelona Supercomputing Ctr, Barcelona, Spain; [Subasi, Omer; Labarta, Jesus; Cristal, Adrian] Univ Politecn Cataluna, E-08028 Barcelona, Spain; [Yalcin, Gulay] Abdullah Gul Univ, Kayseri, Turkey; [Cristal, Adrian] CSIC Spanish Natl Res Council, IIIA Artificial Intelligence Res Inst, Madrid, Spain en_US
gdc.description.endpage 1112 en_US
gdc.description.publicationcategory Conference Item - International - Institutional Faculty Member en_US
gdc.description.scopusquality N/A
gdc.description.startpage 1101 en_US
gdc.description.woscitationindex Conference Proceedings Citation Index - Science
gdc.description.wosquality N/A
gdc.identifier.openalex W2492769945
gdc.identifier.wos WOS:000391251800113
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.diamondjournal false
gdc.oaire.downloads 0
gdc.oaire.impulse 3.0
gdc.oaire.influence 2.7558686E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Application programs
gdc.oaire.keywords Cyclic redundancy check
gdc.oaire.keywords Parallel processing (Electronic computers)
gdc.oaire.keywords Error correction capability
gdc.oaire.keywords Errors
gdc.oaire.keywords Parallel processing (Computers)
gdc.oaire.keywords Memory reliability
gdc.oaire.keywords Task parallelism
gdc.oaire.keywords Dataflow model
gdc.oaire.keywords Reliability
gdc.oaire.keywords Mathematical analysis
gdc.oaire.keywords Reconfigurable hardware
gdc.oaire.keywords Hardware
gdc.oaire.keywords Software-based solutions
gdc.oaire.keywords Hardware acceleration
gdc.oaire.keywords UPC subject areas::Computer science::Computer architecture::Parallel architectures
gdc.oaire.keywords Error correction
gdc.oaire.keywords Data flow analysis
gdc.oaire.popularity 1.7001335E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 02 engineering and technology
gdc.oaire.sciencefields 0202 electrical engineering, electronic engineering, information engineering
gdc.oaire.views 35
gdc.openalex.collaboration International
gdc.openalex.fwci 0.79761008
gdc.openalex.normalizedpercentile 0.78
gdc.opencitations.count 4
gdc.plumx.crossrefcites 1
gdc.plumx.mendeley 13
gdc.plumx.scopuscites 10
gdc.scopus.citedcount 10
gdc.virtual.author Yalçın Alkan, Gülay
gdc.wos.citedcount 7
relation.isAuthorOfPublication e0dc9e40-f936-402f-96c6-f4e668a0b9d3
relation.isAuthorOfPublication.latestForDiscovery e0dc9e40-f936-402f-96c6-f4e668a0b9d3
relation.isOrgUnitOfPublication 665d3039-05f8-4a25-9a3c-b9550bffecef
relation.isOrgUnitOfPublication 52f507ab-f278-4a1f-824c-44da2a86bd51
relation.isOrgUnitOfPublication ef13a800-4c99-4124-81e0-3e25b33c0c2b
relation.isOrgUnitOfPublication.latestForDiscovery 665d3039-05f8-4a25-9a3c-b9550bffecef
