Session V: Resilience by Design (and not as an Afterthought)

Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale high-performance computing (HPC). The challenge is to build a reliable HPC system within a given cost budget that achieves the expected performance. Every generation of supercomputers deployed at Oak Ridge National Laboratory (ORNL) has had to deal with expected and unexpected faults, errors, and failures. While these supercomputers are designed to deal with expected issues, unexpected reliability problems can lead to severe degradation in operational capabilities. For example, ORNL’s Titan supercomputer experienced an unexpected increase in general-purpose graphics processing unit (GPGPU) failures between 2015 and 2017. At the peak of the problem, Titan was losing an average of 12 GPGPUs (and the corresponding compute nodes) per day. Over 50% of its 18,688 GPGPUs had to be replaced. Neither the system nor the applications using it were designed to handle such a high failure rate efficiently. Other past unexpected reliability issues with supercomputers at US Department of Energy HPC centers were caused by early wear-out, dirty power, bad solder, other manufacturing issues, hardware design errors, software design errors, and user errors. With reliability expected to decrease due to component count increases, process technology challenges, hardware heterogeneity, and software complexity, risk mitigation against unexpected issues is becoming paramount to ensuring the success of future extreme-scale HPC systems. Resilience needs to be provided holistically by the HPC hardware/software ecosystem.
The key challenges are to design and to operate extreme-scale HPC systems with (1) wide-ranging resilience capabilities in hardware, system software, programming models, libraries, and applications, (2) interfaces and mechanisms for coordinating resilience capabilities across diverse hardware and software components, (3) appropriate metrics and tools for assessing performance, resilience, and energy, and (4) an understanding of the trade-offs among performance, resilience, and energy that eventually results in well-informed HPC system design choices and runtime decisions.


Location: Grand Ballroom C
Date: March 28, 2019
Time: 10:30 am - 10:50 am
Christian Engelmann, ORNL