University of Utah
School of Computing
 

Disk Checkpointing for Time Travel in Distributed Systems

by
Siddharth Aggarwal

Advised by
Jay Lepreau

Emulab is a time- and space-shared environment consisting of a cluster of machines. Time sharing means that the same machines can be re-allocated to different experimenters over time. This requires the ability to "swap out" an experiment, that is, to save the complete state of a machine to external storage. An important part of experiment swap out is capturing and saving the current contents of the disk so that they can be restored ("swapped in") later.

In addition to providing for disk rollback, this mechanism is a first step toward a so-called "time travel" system in which the state of a distributed application can be reverted to an earlier time. The swapout scheme can be naturally extended to allow the system to save off intermediate snapshots of the disk, thereby allowing the user to restore to an intermediate image instead of just the last one. This is especially useful for debugging, where a user can revert to previous checkpoints and examine disk contents.

These features are implemented using a copy-on-write disk driver to perform disk checkpointing. The main challenges in the project include time- and space-efficient management of checkpoint information in memory and on disk, fast network transfer of large numbers of disk blocks to and from remote storage, and a minimal latency penalty on normal disk operations.
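
To make the copy-on-write idea concrete, the following is a minimal in-memory sketch in Python. It is not the Emulab driver: the class name `CowDisk`, its methods, and the per-epoch undo-log layout are all hypothetical and chosen only for illustration. A real driver would sit in the kernel's block layer and ship saved blocks to remote storage. The sketch shows only the core mechanism: the first write to a block after a checkpoint saves the block's old contents, so any earlier checkpoint can be restored later.

```python
class CowDisk:
    """Toy block device with copy-on-write checkpoints (illustrative only)."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # One undo log per epoch. Each log maps a block number to the
        # contents that block had when the epoch began.
        self.undo_logs = [{}]

    def read(self, block_no):
        return self.blocks[block_no]

    def write(self, block_no, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        log = self.undo_logs[-1]
        # Copy on write: save the old contents only on the first write
        # to this block in the current epoch.
        if block_no not in log:
            log[block_no] = self.blocks[block_no]
        self.blocks[block_no] = data

    def checkpoint(self):
        """Start a new epoch and return an id naming the current disk state."""
        self.undo_logs.append({})
        return len(self.undo_logs) - 1

    def rollback(self, checkpoint_id):
        """Restore the disk to the state it had when checkpoint_id was taken.

        Checkpoint 0 names the initial (all-zero) disk. Checkpoints taken
        after checkpoint_id are discarded.
        """
        if not 0 <= checkpoint_id < len(self.undo_logs):
            raise ValueError("unknown checkpoint")
        # Undo epochs from newest to oldest.
        while len(self.undo_logs) > checkpoint_id:
            log = self.undo_logs.pop()
            for block_no, old_data in log.items():
                self.blocks[block_no] = old_data
        # Open a fresh epoch so later writes are tracked again.
        self.undo_logs.append({})


disk = CowDisk(num_blocks=4)
disk.write(0, b"aaaa")
first = disk.checkpoint()
disk.write(0, b"bbbb")
disk.write(1, b"cccc")
second = disk.checkpoint()
disk.write(0, b"dddd")
disk.rollback(first)  # disk again holds b"aaaa" in block 0
```

In this sketch the space cost of a checkpoint is proportional to the number of distinct blocks written since that checkpoint, and the extra copy happens only on the first write to each block per epoch. Both are simplified views of the space-efficiency and write-latency concerns named above.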

