Colloquium – Ignacio Laguna
September 17 @ 10:00 am - 11:30 am
Dr. Ignacio Laguna
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Monday, September 17, 2018
Host: Ganesh Gopalakrishnan
Understanding Resilience to Soft Errors in HPC Scientific Applications
Abstract: Ensuring execution correctness and numerical reliability of high-performance computing (HPC) simulations is becoming increasingly important in extreme-scale systems. As systems scale and the number of system components grows, the chances of experiencing soft errors increase as well. While soft errors can in many cases be detected and corrected by low-level hardware mechanisms, some errors can escape these mechanisms and affect the results of scientific simulations. In this talk, we present a set of models and frameworks that allow us to (1) replicate these errors in a controlled environment, (2) reason about how these errors propagate and are sometimes naturally masked within the application space, and (3) protect applications by preventing these errors from propagating to the final program output.