A Branching Storage System for Linux

Prashanth Radhakrishnan, Mike Hibler and Jay Lepreau

University of Utah

November 2006

www.cs.utah.edu/flux, www.emulab.net

Overview

Two of our current Emulab-related projects (the Experimentation Workbench and Time-travel) require support for storing versioned data in the form of trees. A storage system with built-in support for "branching" provides these tree semantics most naturally. The building blocks of such a branching storage system are "snapshots" and "clones". A snapshot is an immutable point-in-time "virtual copy" of the original storage. A clone is a point-in-time virtual copy that can be mutated independently of its source. Recursive snapshots and clones produce trees, with clones at the leaves and snapshots at all non-leaf nodes.
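The tree semantics described above can be illustrated with a small in-memory model. This is only a conceptual sketch under stated assumptions, not the LVM implementation: the names Volume, snapshot and clone are hypothetical, blocks are held in a Python dictionary, and a read simply walks toward the root until it finds the block.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field contents
class Volume:
    """One node of the branching tree (hypothetical model, not LVM's format)."""
    name: str
    parent: Optional["Volume"] = None
    writable: bool = True
    # Only the blocks written at this node; everything else is inherited.
    blocks: Dict[int, bytes] = field(default_factory=dict)
    children: List["Volume"] = field(default_factory=list)

    def read(self, block_no: int) -> bytes:
        """Return the nearest version of a block on the path to the root."""
        node: Optional[Volume] = self
        while node is not None:
            if block_no in node.blocks:
                return node.blocks[block_no]
            node = node.parent
        return b"\0"  # never written anywhere on the path

    def write(self, block_no: int, data: bytes) -> None:
        if not self.writable:
            raise PermissionError(f"{self.name} is an immutable snapshot")
        self.blocks[block_no] = data


def snapshot(vol: Volume, name: str) -> Volume:
    """Freeze vol's current state as an inner node; vol continues as its leaf."""
    if not vol.writable:
        raise ValueError("only a writable volume can be snapshotted")
    snap = Volume(name, parent=vol.parent, writable=False, blocks=vol.blocks)
    if vol.parent is not None:
        vol.parent.children.remove(vol)
        vol.parent.children.append(snap)
    vol.parent = snap
    vol.blocks = {}  # later writes land here and leave the snapshot untouched
    snap.children.append(vol)
    return snap


def clone(snap: Volume, name: str) -> Volume:
    """Create a new writable leaf that starts from an immutable snapshot."""
    if snap.writable:
        raise ValueError("clones are taken from snapshots")
    leaf = Volume(name, parent=snap)
    snap.children.append(leaf)
    return leaf


origin = Volume("origin")
origin.write(0, b"v1")
s1 = snapshot(origin, "s1")   # s1 freezes block 0 as b"v1"
origin.write(0, b"v2")        # origin diverges from s1
c1 = clone(s1, "c1")          # c1 still sees b"v1" for block 0
c1.write(1, b"x")
s2 = snapshot(c1, "s2")       # recursive snapshot: s2 becomes a child of s1
```

In this model every writable volume is a leaf and every snapshot is an inner node, which matches the tree shape described above. A real block-level implementation would copy or remap fixed-size chunks rather than store Python byte strings.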

Rather than build a new system from scratch, we opted to enhance a popular open-source storage system with tree semantics, in order to leverage its solid development base. An important criterion was that the branching functionality be implemented at the lowest level of the storage stack, the block layer. As a result, any system that operates above the block layer, such as a file system or a database, can trivially branch. We chose the Linux Logical Volume Manager (LVM), which satisfies these criteria.

LVM is logical volume management software that is part of standard Linux distributions. It virtualizes disks to provide features such as software RAID, flexible storage allocation and mutable snapshots. LVM is built on top of the Linux Device-Mapper kernel driver, which provides a modular framework for constructing layered virtual block devices. User-level LVM tools create the "volume" abstraction by appropriately layering device-mapper devices.
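As a rough illustration of how a virtual block device maps onto underlying storage, the sketch below models a table of linear segments in the spirit of device-mapper's "linear" target, where each segment maps a range of logical sectors to an offset on a backing device. The Segment type, the resolve function and the device names are assumptions made for this example; they are not part of LVM or the device-mapper API.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: int    # first logical sector covered by this segment
    length: int   # number of sectors in the segment
    device: str   # backing device (hypothetical name)
    offset: int   # first sector used on the backing device


def resolve(table: List[Segment], sector: int) -> Tuple[str, int]:
    """Translate a logical sector into (backing device, physical sector)."""
    for seg in table:
        if seg.start <= sector < seg.start + seg.length:
            return seg.device, seg.offset + (sector - seg.start)
    raise ValueError(f"sector {sector} is outside the volume")


# A 1500-sector logical volume stitched together from two physical extents.
volume = [
    Segment(start=0, length=1000, device="/dev/sda1", offset=2048),
    Segment(start=1000, length=500, device="/dev/sdb1", offset=0),
]

first = resolve(volume, 0)
boundary = resolve(volume, 1000)
```

Layering follows from the same idea: the backing device named in a segment may itself be another virtual device with its own table, so a lookup can be repeated level by level until it reaches a physical disk.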

Recent research systems that provide block-level branching are mostly illustrative prototypes (Parallax); others have a broader scope (Olive), which brings with it the complexity and overhead of distributed protocols. Sun's ZFS and Network Appliance filers, which use WAFL (Write Anywhere File Layout), can provide branching capabilities, but they operate at the file-system layer.

Current Status and Future Work

To meet our recursive branching needs, we have logically split the mutable snapshot functionality of LVM into immutable snapshots and mutable clones. The original snapshot implementation was also highly inefficient at handling snapshot chains. We have since optimized it, improving write performance and snapshot scalability by an order of magnitude. The LVM enhancements to support recursive snapshots and clones are also complete, and our initial results with branching are encouraging.

We conceived this system as a robust, production-quality storage system. To that end, our future work will explore techniques to improve copy-on-write (CoW) performance, storage efficiency and system scalability (i.e., allow deeper trees). We may also verify the correctness of our implementation by leveraging recent work on model checking for storage systems (EXPLODE).