Colloquium – Gene Cooperman
September 13 @ 10:00 am - 11:00 am
September 13, 2019
lecture – 10:00am
Host: Ganesh Gopalakrishnan
Checkpointing the Un-checkpointable: MANA for MPI and the Split-Process Approach
Checkpointing is the ability to save the state of a running process to stable storage, and to later restart that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application without modifying the binaries of that target application.
This talk presents an efficient, new software architecture: split processes. The “MANA for MPI” software demonstrates this split-process architecture. The MPI application code resides in “upper-half memory”, and the MPI/network libraries reside in “lower-half memory”. The tight coupling of the upper and lower halves ensures low runtime overhead. Yet when restarting from a checkpoint, “MANA for MPI” allows one to replace the original lower half with a different MPI library implementation.
This approach solves the “m x n” problem. Rather than support checkpointing for all combinations of “m” MPI libraries and “n” network libraries, it suffices to checkpoint the application memory in the upper half, and then load fresh lower-half libraries for MPI and the network at the time of restart. It also supports cross-cluster migration in which the destination cluster may have a different number of cores per node, or a different network (e.g., TCP versus InfiniBand).
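The split-process idea described above can be illustrated with a toy Python analogy. This is only a sketch under stated assumptions, not how MANA itself is implemented: MANA works at the level of process memory, whereas this example simply pickles a dictionary. The names SplitProcess, TcpTransport, and InfiniBandTransport are hypothetical and exist only for illustration. The point it shows is that only the "upper half" (application state) goes into the checkpoint image, while the "lower half" (the library) is discarded and supplied fresh at restart.

```python
import pickle


class TcpTransport:
    """Hypothetical stand-in for one lower-half MPI/network library."""

    def send(self, value):
        return f"tcp:{value}"


class InfiniBandTransport:
    """Hypothetical stand-in for a different lower-half library."""

    def send(self, value):
        return f"ib:{value}"


class SplitProcess:
    """Toy process with an upper half (saved) and a lower half (never saved)."""

    def __init__(self, transport):
        self.upper = {"step": 0, "total": 0}  # application state
        self.lower = transport                # library state, replaceable

    def run_step(self):
        # The application advances its own state, then calls into the library.
        self.upper["step"] += 1
        self.upper["total"] += self.upper["step"]
        return self.lower.send(self.upper["total"])

    def checkpoint(self):
        # Only the upper half is written to the checkpoint image.
        return pickle.dumps(self.upper)

    @classmethod
    def restart(cls, image, transport):
        # A fresh lower half is loaded; the saved upper half is restored.
        proc = cls(transport)
        proc.upper = pickle.loads(image)
        return proc


proc = SplitProcess(TcpTransport())
proc.run_step()
proc.run_step()
image = proc.checkpoint()

# Restart on a "different cluster" with a different lower half.
resumed = SplitProcess.restart(image, InfiniBandTransport())
result = resumed.run_step()
print(result)  # ib:6
```

Because the checkpoint image in this sketch contains no library state, one image format serves any choice of lower half, rather than one format per combination of “m” MPI libraries and “n” network libraries.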
This talk represents joint work with Rohan Garg and Gregory Price.
Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria. He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project.