Disk Checkpointing for Time Travel in Distributed Systems
by
Siddharth Aggarwal
Advised by
Jay Lepreau
Emulab is a time- and space-shared environment consisting of a cluster of
machines. Time sharing implies the ability to re-allocate the same
machines to different experimenters over time. This requires the ability
to "swap out" or save the complete state of a machine to some external
storage. An important part of experiment swap out is the ability to
capture and save the current contents of the disk so that it can be
restored ("swapped in") later.
In addition to providing for disk rollback, this mechanism is a first step
toward a so-called "time travel" system in which the state of a
distributed application can be reverted to an earlier time. The swap-out
scheme extends naturally to saving intermediate snapshots of the disk,
allowing the user to restore any intermediate image rather than only the
last one. This is especially
useful for debugging, where a user can revert to previous checkpoints and
examine disk contents.
These features are implemented using a copy-on-write disk driver to
perform disk checkpointing. The main challenges in the project include
time- and space-efficient management of checkpoint information in memory
and on disk, fast network transfer of large numbers of disk blocks to
and from remote storage, and minimal added latency on normal disk
operations.
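The copy-on-write idea behind disk checkpointing can be illustrated with a
minimal sketch. This is not the driver described here: it is a toy in-memory
model, and the class CowDisk, its methods, and the 4-byte block size are all
hypothetical names and values chosen for illustration. Each checkpoint keeps
an undo log that saves a block's old contents only the first time that block
is overwritten, so taking a checkpoint copies nothing up front.

```python
class CowDisk:
    """Toy in-memory block device with copy-on-write checkpoints.

    Hypothetical illustration only; a real driver would work on disk
    sectors and ship saved blocks to remote storage.
    """

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # One undo log per checkpoint: block number -> contents the block
        # had when that checkpoint was taken.
        self.checkpoints = []

    def checkpoint(self):
        """Take a checkpoint and return its id. No blocks are copied yet."""
        self.checkpoints.append({})
        return len(self.checkpoints) - 1

    def write(self, block_no, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if self.checkpoints:
            undo_log = self.checkpoints[-1]
            # Copy on write: save the old contents once, on the first
            # write to this block since the latest checkpoint.
            if block_no not in undo_log:
                undo_log[block_no] = self.blocks[block_no]
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks[block_no]

    def restore(self, cp_id):
        """Roll the disk back to its state at checkpoint cp_id."""
        if not 0 <= cp_id < len(self.checkpoints):
            raise IndexError("unknown checkpoint id")
        # Apply undo logs newest first, ending with cp_id's own log.
        for undo_log in reversed(self.checkpoints[cp_id:]):
            for block_no, old in undo_log.items():
                self.blocks[block_no] = old
        # Later checkpoints no longer exist; cp_id starts a fresh log.
        del self.checkpoints[cp_id + 1:]
        self.checkpoints[cp_id] = {}
```

In this sketch, a user would call checkpoint() before a risky step, keep
writing, and later call restore() with the returned id to examine the disk
as it was at that point. The space cost of a checkpoint grows with the
number of distinct blocks written after it, not with the size of the disk.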