Swarm: Scalable Wide Area Replication Middleware

Sai Susarla and John Carter, University of Utah

Figure 1: A centralized enterprise service. Clients in remote campuses access the service via RPCs.

Figure 2: An enterprise service employing a Swarm-based proxy server. Clients in campus 2 access the local proxy server, while those in campus 3 invoke either server.

Figure 3: Control flow in an enterprise service replicated using Swarm.

Modern Internet-based services increasingly operate in dynamic wide area environments spanning diverse networks, and cater to geographically widespread users who access them via devices of varying capabilities. Replicating or caching service state and user data at multiple locations is a well-understood technique for hiding this variability from users and providing a responsive, highly available service. Replicating mutable data, however, requires managing consistency among the replicas. Unlike a clustered environment, a dynamic wide area environment poses several unique challenges to replication algorithms, such as diverse network characteristics (latencies and bandwidths), Internet congestion, and node churn (nodes continually joining and leaving the network). To efficiently manage replicas that are not uniformly accessible to one another, replication algorithms must take these non-uniform communication costs into account and allow applications to make different tradeoffs between consistency, availability, and performance. The complexity of such algorithms motivates a reusable solution for managing data replication and consistency in new distributed services and applications.

Our research addresses the following question: is it feasible to provide reusable middleware support for effectively managing data replication in a wide variety of distributed services, ranging from distributed file services and databases to distributed multi-player games and content distribution systems like Kazaa or Gnutella?

To identify the core features desirable in such a replication solution, we surveyed the data sharing needs of a wide variety of distributed applications, ranging from personal file access (with little data sharing) to widespread real-time collaboration (with fine-grain sharing). Our study yielded three important observations. First, although replicable data is prevalent in many applications, its characteristics (e.g., the unit of data access, its mutability, and the extent of sharing) and consistency requirements vary widely, which warrants a high degree of customizability in all these respects. Second, applications operate in diverse networking environments ranging from well-connected corporate servers to intermittently connected mobile devices. The ability to promiscuously replicate data and synchronize with any available replica, called pervasive replication [6], greatly enhances application availability and performance in such environments. Finally, certain core design choices recur in the consistency management of diverse applications, although the actual combination of choices differs from one application to another. This leads to our hypothesis that the replication needs of a variety of services share enough commonality to be supported by a reusable middleware.

Based on these observations, we have developed a novel approach to flexible consistency management in dynamic environments, called composable consistency. By expressing consistency semantics as combinations of primitive design choices along several largely orthogonal dimensions, the model can express a broader range of consistency requirements than existing models, giving applications a high degree of customizability.

To show how composable consistency paves the way for reusable replicated data management middleware, we designed and prototyped such a middleware called Swarm1, and developed several diverse distributed services using it. Swarm is a wide area peer file service that supports aggressive replication and composable consistency behind a file system-like interface. Swarm can be used to implement and deploy wide area proxies for a variety of distributed services that operate on cached service data. In particular, Swarm has three properties that we believe are key to successful reuse of replication middleware for a variety of sharing needs: (i) customizable consistency to allow applications to tune consistency to their sharing needs and network constraints, (ii) pervasive peer replication to give applications freedom to structure themselves for efficient network usage and scaling, and (iii) a flexible interface to support a variety of useful data access paradigms. Previous wide-area replicated services lacked one or more of these properties, thus restricting their reusability.


Composable Consistency Model

We developed the composable consistency model based on the observation from our application survey that the diverse consistency needs of many applications can be decomposed into primitive consistency requirements. Those requirements can be classified along five major dimensions that are largely orthogonal:


Table 1: Consistency options provided by the Composable Consistency (CC) model. Consistency semantics are expressed for an access session by choosing one of the alternative options in each row; the options within a row are mutually exclusive. Options in italics indicate reasonable defaults that suit many applications, unless otherwise mentioned.

Dimension                  Available Consistency Options
Concurrency Control
  Access mode              concurrent (RD, WR) | excl (RDLK, WRLK)
Replica Synchronization
  Timeliness               manual | time (staleness = 0..∞ secs) | mod (unseen writes = 0..∞)
  Strength                 hard (pull) | soft (push)
  Semantic deps.           causal | atomic
  Update ordering          none | total | serial
Failure Handling           optimistic (ignore distant replicas with RTT > 0..∞) | pessimistic
Update Visibility          session | per-access | manual
View Isolation             session | per-access | manual


The multiple reasonable options along each of these dimensions create a multi-dimensional space for expressing application consistency requirements. Table 1 lists the options we chose to include in the composable consistency model. Combined in various ways, these options yield a rich collection of consistency semantics for reads and updates to shared data, covering the needs of a broad mix of applications. For instance, our approach lets a replicated auction service employ strong consistency for updates across all peer service replicas, while letting different peers answer queries with different levels of accuracy by relaxing consistency for reads to limit synchronization cost. A user can still obtain an accurate answer, at the cost of higher latency, by specifying a stronger consistency requirement for a query.
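To make these combinations concrete, here is a minimal sketch (in C) of one way the Table 1 options could be encoded and composed for the auction example above. All type, field, and constant names are our own illustrative assumptions, not Swarm's actual API.

    /* Hypothetical encoding of the Table 1 consistency options. */
    typedef enum { CC_RD, CC_WR, CC_RDLK, CC_WRLK } cc_access_mode;
    typedef enum { CC_SYNC_MANUAL, CC_SYNC_TIME, CC_SYNC_MOD } cc_timeliness;
    typedef enum { CC_HARD_PULL, CC_SOFT_PUSH } cc_strength;
    typedef enum { CC_ORDER_NONE, CC_ORDER_TOTAL, CC_ORDER_SERIAL } cc_ordering;
    typedef enum { CC_OPTIMISTIC, CC_PESSIMISTIC } cc_failure;
    typedef enum { CC_SESSION, CC_PER_ACCESS, CC_MANUAL } cc_grain;

    typedef struct {
        cc_access_mode mode;           /* concurrency control */
        cc_timeliness  timeliness;     /* replica synchronization */
        double         staleness_secs; /* bound when timeliness == CC_SYNC_TIME */
        unsigned       unseen_writes;  /* bound when timeliness == CC_SYNC_MOD */
        cc_strength    strength;       /* hard (pull) vs. soft (push) */
        cc_ordering    ordering;       /* update ordering */
        cc_failure     failure;        /* optimistic: ignore replicas beyond rtt_ms */
        double         rtt_ms;
        cc_grain       visibility;     /* update visibility */
        cc_grain       isolation;      /* view isolation */
    } cc_options;

    /* Auction bids: strong consistency across all peer replicas. */
    static const cc_options bid_update = {
        .mode = CC_WRLK, .timeliness = CC_SYNC_MOD, .unseen_writes = 0,
        .strength = CC_HARD_PULL, .ordering = CC_ORDER_TOTAL,
        .failure = CC_PESSIMISTIC,
        .visibility = CC_SESSION, .isolation = CC_SESSION,
    };

    /* Browsing queries: tolerate answers up to 30 seconds stale. */
    static const cc_options browse_query = {
        .mode = CC_RD, .timeliness = CC_SYNC_TIME, .staleness_secs = 30.0,
        .strength = CC_SOFT_PUSH, .ordering = CC_ORDER_NONE,
        .failure = CC_OPTIMISTIC, .rtt_ms = 200.0,
        .visibility = CC_PER_ACCESS, .isolation = CC_PER_ACCESS,
    };

Note how the two option sets differ only where it matters to the service: bids demand exclusive, totally ordered, pessimistic updates, while browsing tolerates bounded staleness in exchange for lower latency.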

Using Swarm to Build Distributed Services

We envision Swarm being employed to provide coherent wide area caching to distributed services (such as file and directory services) in two ways. First, a new distributed service can be implemented as a collection of peer-to-peer service agents that operate on Swarm-hosted service data. Alternatively, an existing centralized service (such as an enterprise service) can export its data into Swarm's space at its home site, and deploy wide area service proxies that access the data via Swarm caches to improve responsiveness to wide area clients.

For example, consider an enterprise service such as payroll that serves its clients by accessing information in a centralized database. Figure 1 shows how such a service is typically organized. Clients access the service from geographically dispersed campuses by contacting the primary server in its home campus via RPCs. As the figure shows, the enterprise server is normally implemented on a cluster of machines and consists of two tiers: the application server (labeled 'AS' in the figure) implements the payroll logic, while the database server ('DB'; e.g., mySQL, BerkeleyDB, or Oracle) handles database accesses requested by the 'AS', using the storage provided by a local storage service ('FS').

Figure 2 shows how the same service could be organized to employ wide area proxies based on Swarm. Each enterprise server (home or proxy) is a clone of the original two-tier server with one additional component, a Swarm server. Clients can now access any of the available servers, as they are functionally identical. At each enterprise server cluster, the 'AS' must be modified to issue its database queries and updates to the Swarm server, wrapping them within Swarm sessions with the desired consistency semantics. In response, Swarm brings the local database to the desired consistency by coordinating with the Swarm servers at other proxy locations, and then invokes the operations on the local DB server. The DB server itself remains unchanged. This 3-tier architecture offloads the complexity of data replication and consistency management from the enterprise service.
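The sketch below illustrates this flow for a hypothetical payroll update, reusing the cc_options encoding sketched earlier. The swarm_open, swarm_apply, and swarm_close calls and the handle types are assumed names for illustration, not Swarm's actual interface.

    #include <stdio.h>
    #include <string.h>

    /* Opaque stand-ins for Swarm's handles (illustrative only). */
    typedef struct swarm swarm_t;
    typedef struct swarm_session swarm_session_t;

    /* Assumed session API: open a consistency session on a Swarm file,
       apply operations within it, and close it. */
    swarm_session_t *swarm_open(swarm_t *sw, const char *path,
                                const cc_options *cc);
    int  swarm_apply(swarm_session_t *s, const void *op, size_t len);
    void swarm_close(swarm_session_t *s);

    void update_salary(swarm_t *sw, const char *emp_id, double new_salary)
    {
        /* Request an exclusive, totally ordered (strong) update session,
           using the bid_update-style options defined earlier. */
        swarm_session_t *s = swarm_open(sw, "/payroll/db", &bid_update);

        /* Swarm first brings the local replica up to the requested
           consistency by coordinating with peer Swarm servers, then
           hands the operation to the local DB plugin for execution. */
        char op[256];
        snprintf(op, sizeof op,
                 "UPDATE emp SET salary = %.2f WHERE id = '%s'",
                 new_salary, emp_id);
        swarm_apply(s, op, strlen(op));

        swarm_close(s);  /* updates become visible per the visibility option */
    }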

Figure 3 shows the architecture of the server cluster and its control flow in more detail. The DB plugin encapsulates the DB-specific logic that Swarm needs to manage replicated access. It must be implemented and linked into each Swarm server, and it has two functions: (i) it applies the DB query and update operations that Swarm receives from the AS and from remote Swarm servers to the local database replica; (ii) it detects and resolves conflicts among concurrent updates when requested by Swarm.
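A sketch of that two-function interface appears below; the struct and signatures are our assumptions for illustration, not Swarm's real plugin header.

    #include <stddef.h>

    /* Hypothetical DB plugin interface: one instance is linked into each
       Swarm server and encapsulates all DB-specific logic. */
    typedef struct db_plugin {
        /* (i) Apply a query or update (from the local AS or a remote
           Swarm server) to the local database replica. */
        int (*apply)(void *db, const void *op, size_t len);

        /* (ii) Detect and resolve a conflict between two concurrent
           updates when Swarm asks, producing a single merged operation. */
        int (*resolve)(void *db,
                       const void *op_a, size_t len_a,
                       const void *op_b, size_t len_b,
                       void **merged, size_t *merged_len);

        void *db;   /* handle to the local DB server (e.g., BerkeleyDB) */
    } db_plugin;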

More Information

We have built a prototype of Swarm and used it to build four representative network services with distinct data characteristics and consistency needs:

  1. A peer-to-peer distributed file system that supports, on a per-file basis, the strong consistency semantics of Sprite [4], the close-to-open consistency of the AFS [2] and Coda [3] file systems, and the weak eventual-consistency flavors of the Coda, Pangaea [6], and Ficus [5] file systems.
  2. An online auction service that caches enterprise database objects in a Swarm-hosted persistent object heap and achieves high throughput thanks to Swarm's ability to adapt automatically between aggressive caching and RPCs based on the available locality.
  3. A replicated version of the BerkeleyDB [7] embedded database hosted in a Swarm file, using Swarm's semantic updates to synchronize replicas. By using Swarm, it can support a wide variety of consistency semantics ranging from strong to eventual consistency. We used it to develop a shared music search index modeled after KaZaa (a sketch of such a semantic update appears after this list).
  4. An online chat service that stresses Swarm's ability to efficiently synchronize a large number of replicas in real time.
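To illustrate item 3, the sketch below expresses a music-index insert as a semantic update: the replica logs the logical operation rather than shipping changed database pages, and Swarm propagates the record and replays it at every replica (via the DB plugin's apply callback). It reuses the hypothetical session API sketched earlier; the record layout is our own invention.

    #include <stdio.h>

    /* Hypothetical semantic-update record for the shared music index. */
    typedef struct {
        char op;           /* 'I' = insert, 'D' = delete */
        char title[64];    /* song title (the index key) */
        char peer[128];    /* address of the peer hosting the song */
    } index_update;

    /* Announce a newly shared song: log the logical operation; Swarm
       propagates it and replays it at every replica of the index file. */
    void share_song(swarm_session_t *s, const char *title, const char *peer)
    {
        index_update u = { .op = 'I' };
        snprintf(u.title, sizeof u.title, "%s", title);
        snprintf(u.peer, sizeof u.peer, "%s", peer);
        swarm_apply(s, &u, sizeof u);
    }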


Publications

 o Flexible Consistency for Wide Area Peer Replication. Sai Susarla and John Carter. To appear in Proceedings of the 25th International Conference on Distributed Computing Systems, June 2005.
Also available as Technical Report UUCS-04-016.
This paper describes the composable consistency model.

 o Middleware Support for Locality-Aware Wide Area Replication. Sai Susarla and John Carter. Technical Report UUCS-04-017, Department of Computer Science, University of Utah.
This report presents the design and implementation of the Swarm data store.


Bibliography

1
J.B. Carter, A. Ranganathan, and S. Susarla.
Khazana: An infrastructure for building distributed services.
In Proceedings of the 18th International Conference on Distributed Computing Systems, pages 562-571, May 1998.

2
J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West.
Scale and performance in a distributed file system.
ACM Transactions on Computer Systems, 6(1):51-82, February 1988.

3
J.J. Kistler and M. Satyanarayanan.
Disconnected operation in the Coda file system.
In Proceedings of the 13th Symposium on Operating Systems Principles, pages 213-225, October 1991.

4
M.N. Nelson, B.B. Welch, and J.K. Ousterhout.
Caching in the Sprite network file system.
ACM Transactions on Computer Systems, 6(1):134-154, 1988.

5
P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek.
Resolving file conflicts in the Ficus file system.
In Proceedings of the 1994 Summer USENIX Conference, 1994.

6
Y. Saito, C. Karamanolis, M. Karlsson, and M. Mahalingam.
Taming aggressive replication in the Pangaea wide-area file system.
In Proceedings of the Fifth Symposium on Operating System Design and Implementation, pages 15-30, 2002.

7
Sleepycat Software.
The BerkeleyDB database.
http://sleepycat.com/, 2000.

8
S. Susarla and J. Carter.
Khazana: A flexible wide-area data store.
Technical Report UUCS-03-020 (http://www.cs.utah.edu/~sai/papers/khazana-tr-3-20.pdf), University of Utah School of Computer Science, October 2003.


Footnotes

...Swarm1
Swarm is a redesign of our earlier system, Khazana. Khazana had similar goals, but employed a distributed shared memory (DSM) abstraction [1] with a flat address space. Like many researchers, we found that the DSM abstraction is not well suited to wide area environments, so we designed Swarm from the ground up to support distributed applications and services with a file system abstraction.


Sai Susarla 2004-08-09