Abstract
Resiliency is a key issue as we move toward peta-to-exascale HPC
systems that are expected to encounter multiple faults within a day,
with faults ranging from fail-stop failures to silent errors. A
natural concern is the vulnerability of long-running scientific
applications on such HPC systems, often involving computations with
very large sparse matrices.
In this talk, I will illustrate the challenges posed by soft errors on
supercomputing systems, specifically in the context of iterative
methods such as conjugate gradients to solve sparse linear systems.
First, I will analyze the effects of a single soft error during the
solution process and discuss results of an empirical evaluation. Next,
I will present our new checksum-encoded, algorithm-based fault-tolerant
preconditioned conjugate gradients (PCG) method for solving sparse
linear systems. Our checksum-based approach can be applied to all
the key operations in PCG, including sparse matrix-vector
multiplication (SpMV), vector operations, and the application of a
preconditioner through sparse triangular solves. I will discuss the
overheads of our method and compare it with a well-known classical
fault-tolerant algorithm. Finally, I will conclude by discussing some
future research directions.
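
As a rough illustration of the checksum idea for SpMV, the sketch below
uses a generic column-sum checksum: if c = 1^T A, then the entries of
y = A x must sum to c . x, so a mismatch beyond rounding error signals a
corrupted result. This is a minimal sketch of the general technique and
not necessarily the encoding used in the talk; the function names, the
CSR layout, the small test matrix, and the tolerance are all assumptions
made for illustration.

```python
def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def column_checksum(indices, data, ncols):
    """Return c = 1^T A, the vector of column sums (computed once per matrix)."""
    c = [0.0] * ncols
    for k in range(len(data)):
        c[indices[k]] += data[k]
    return c


def checksum_ok(c, x, y, rel_tol=1e-10):
    """Check sum(y) against c . x; rel_tol is an assumed rounding tolerance."""
    expected = sum(ci * xi for ci, xi in zip(c, x))
    observed = sum(y)
    scale = max(1.0, abs(expected), abs(observed))
    return abs(expected - observed) <= rel_tol * scale


# Hypothetical 3x3 example: A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]]
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
x = [1.0, 2.0, 3.0]

c = column_checksum(indices, data, ncols=3)
y = csr_spmv(indptr, indices, data, x)
clean_passes = checksum_ok(c, x, y)

# Simulate a soft error by perturbing one entry of the result.
y_faulty = list(y)
y_faulty[1] += 1.0
fault_detected = not checksum_ok(c, x, y_faulty)
```

The check costs one extra dot product per SpMV. A single scalar checksum
of this kind can only detect an error; it cannot locate or correct it,
and it can miss errors that happen to cancel in the sum.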