Browsing by Author "Yalcin, Gulay"
Conference Object; Citation - WoS: 21; Citation - Scopus: 25
Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
(IEEE, 2017) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.

Article; Citation - WoS: 6; Citation - Scopus: 10
A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCS
(Assoc Computing Machinery, 2020) Cini, Nevin; Yalcin, Gulay
Today, GPUs are widely used as coprocessors/accelerators in High-Performance Heterogeneous Computing due to their many advantages. However, many studies emphasize that GPUs are not yet as reliable as desired. Although GPUs are more vulnerable to hardware errors than CPUs, the use of GPUs in HPCs is steadily increasing. Moreover, due to the native reliability problems of GPUs, combining a great number of GPUs with CPUs can significantly increase HPCs' failure rates. For this reason, analyzing the reliability characteristics of GPU-based HPCs has become a very important issue. Therefore, in this study we evaluate the reliability of GPU-based HPCs. For this purpose, we first examined field data analysis studies for GPU-based and CPU-based HPCs and identified factors that could increase system failure/error rates. We then compared GPU-based HPCs with CPU-based HPCs in terms of reliability with the help of these factors in order to point out the reliability challenges of GPU-based HPCs. Our primary goal is to present a study that can guide researchers in this field by indicating the current state of GPU-based heterogeneous HPCs and the requirements for the future, in terms of reliability. Our second goal is to offer a methodology to compare the reliability of GPU-based HPCs and CPU-based HPCs. To the best of our knowledge, this is the first survey study to compare the reliability of GPU-based and CPU-based HPCs in a systematic manner.

Article
Spec17Tre: A New Dataset in Hardware Security and Using Deep Learning for Detecting Spectre Attacks
(Springer Heidelberg, 2025) Aktas-Aydin, Hatice; Yalcin, Gulay
Computer performance has become a significant subject of study due to the processing of big data, the complexity of calculations and the importance of time efficiency. Many companies are improving processor operating principles to increase performance. The most common methods for this purpose are speculative execution and cache usage. While these techniques improve performance, they also introduce certain security vulnerabilities. Spectre is an attack that exploits vulnerabilities created by speculative execution, affecting all modern processor architectures. Research has shown that using machine learning to detect these attacks can be quite effective, although the features are typically gathered at the software level, which may limit detection since some performance parameters are not conveyed to the software. This study presents an analysis of Spectre attacks and their detection using machine learning and deep learning methods at the hardware level. Experiments are conducted using GEM5, a full-system hardware simulator, to ensure that performance parameters visible only at the hardware level are also collected. Attack detection is performed using Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) methods. The LSTM method is used in conjunction with SVM and Convolutional Neural Network (CNN) techniques, and all models were tested on a new dataset, Spec17Tre, created using "519.lbm" from the SPEC CPU2017 benchmarks. The study achieved a 95% accuracy rate in attack detection using the LSTM + CNN hybrid model, which also yielded an F1 score of 0.999 for detecting applied Spectre attack scenarios.

Article
CompreCity: Accelerating the Traveling Salesman Problem on GPU With Data Compression
(Elsevier, 2025) Yalcin, Salih; Usul, Hamdi Burak; Yalcin, Gulay
The Traveling Salesman Problem (TSP) is a significant problem in computer science that seeks the shortest path for a salesman who needs to visit a set of cities, and it arises in many computing problems such as networks, genome analysis and logistics. Parallel execution paradigms, especially GPUs, are appealing for reducing the solving time of TSP. One of the main issues with GPUs is their limited memory, which may not be enough to hold the entire data. The resulting data transfers from the host device degrade execution time. In this study, we apply three data compression methods to represent cities in the TSP: (1) using the greatest common divisor, (2) shifting cities to the origin, and (3) splitting the surface into grids. As a result, we fit more cities in GPU memory and reduce the number of data transfers from the host device. We implement our methodology in the Iterated Local Search (ILS) algorithm with 2-opt and in the Lin-Kernighan-Helsgaun (LKH) algorithm. We show that our implementation achieves a performance improvement of more than 25% for both algorithms.

Conference Object; Citation - WoS: 7; Citation - Scopus: 10
CRC-Based Memory Reliability for Task-Parallel HPC Applications
(IEEE, 2016) Subasi, Omer; Unsal, Osman; Labarta, Jesus; Yalcin, Gulay; Cristal, Adrian
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a software mechanism based on Cyclic Redundancy Checks (CRCs) for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while remaining highly scalable. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability.

Conference Object; Citation - WoS: 6; Citation - Scopus: 7
A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets
(IEEE, 2016) Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman; Labarta, Jesus
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.

