Transparent Checkpoints of Closed Distributed Systems in Emulab

Anton Burtsev, Prashanth Radhakrishnan*, Mike Hibler, and Jay Lepreau
aburtsev@cs.utah.edu, shanth@netapp.com, mike@cs.utah.edu, and lepreau@cs.utah.edu

University of Utah, School of Computing and *NetApp
www.emulab.net

*Work performed while at the University of Utah

Abstract

Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent tension between these goals, however, and in some aspects of the testbed's design, Emulab's implementers favored realism over control. Thus, Emulab provides wide-ranging control over an experiment's environment and initial conditions, but relatively little control over its execution; in particular, it lacks the ability to suspend, preempt, or replay the experiment.

We have extended Emulab with a new means of control over experiment execution: the ability to cleanly checkpoint the execution of the set of nodes and networks that comprise an experiment. Conventional checkpoint mechanisms can easily degrade the fidelity of experiment results as a consequence of checkpoint downtimes, overheads of background state saving, and unintended distributed checkpoint synchronization effects. In this paper we demonstrate a checkpointing technique that is transparent with respect to the execution of the system under test, almost completely concealing the underlying checkpoint activity.

Building on our checkpoint mechanism, we have implemented two powerful facilities for experiment execution control: the ability to preemptively swap out experiments without losing their run-time state, and the ability to time-travel through the run of a system.

Appeared in Proceedings of the Fourth ACM European Conference on Computer Systems, pages 173–186, Nuremberg, Germany, Apr. 2009.

© ACM, 2009. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the Fourth ACM European Conference on Computer Systems, Nuremberg, Germany, Apr. 2009, http://doi.acm.org/10.1145/1519065.1519084

The slides from the EuroSys 2009 presentation are also available.


Eric Eide <eeide@cs.utah.edu>
Last modified: Tue May 5 14:20:18 MDT 2009