Flexible Consistency Management in a Wide-area Caching Data Store

Sai Susarla


As the number and variety of shared, networked information-based services and users' dependence on them increases, there is an increasing need for a global storage management base to support and manage these services effectively. As data gets more distributed for global reach as well as for increased availability, consistency management becomes a major issue. The scalability of consistency management within the global storage management system and its customizability to a variety of application semantics and requirements as they arise and change over time is a key to making such a system useful.

In this thesis, we propose to demonstrate the feasibility of implementing such a flexible consistency management system in the context of a wide-area data storage infrastructure, and its value in catering to a variety of application needs. For this, we will build an experimental prototype file storage system that provides coherent file access at page-granularity across a WAN. We will implement the following consistency options on a per-file basis: strict (pessimistic, lease-based), (optimistic) append-only and last-writer-wins consistency. We will also give applications control over various aspects of consistency implementation such as frequency and direction of update propagation and the level of tolerance to staleness of data.

We demonstrate the value of our consistency management framework by building four representative services from a variety of distinct classes. We show how the set of consistency options and the control parameters that we support are sufficient to provide the appropriate level of data coherence in these applications efficiently. The four services are: 1) a simple distributed file store that supports a variety of file-sharing patterns across the wide-area; 2) a directory implemented as a page-based hash table within a file to illustrate strictly consistent access; 3) a chat room service implemented as concurrent appends to a single file, illustrative of distributed data/event logging and many-to-many data-flow; 4) a scoreboard application that broadcasts the status of an ongoing game updated at a site to registered listeners across the Internet., to illustrate one-to-many data dissemination with ability to tune the amount of update traffic generated based on network conditions and how much staleness can be tolerated by listeners.


1. Introduction

In this proposal, we consider the problem of maintaining data consistency in wide-area data-intensive applications. Many Internet services are data-intensive. They involve the access and delivery of shared data among distributed components, e.g., clients and servers. Some example services include shared file services, the WWW, email, Usenet, directory services, data logging, and streaming and publish-subscribe applications. Many distributed services typically cache/replicate data close to where it is accessed to improve availability and access latency. Caching is especially beneficial across the wide-area to hide the effects of variable communication latencies and bandwidths and intermittent connectivity typical of the wide-area environment.

However, caching introduces the problem of maintaining consistency among data replicas. A key issue in providing consistent access to replicated data is the tension between the degree of replica coherence and the communication costs required to maintain it. This problem of consistency management is complicated in the wide-area environment by non-uniform network characteristics between nodes, the potentially large numbers of clients present, and the infeasibility of obtaining global state for making local decisions. Several consistency schemes have been developed to address this tension in the context of specific wide-area services to suit specific data access patterns [13,20,37,16,28,32]. We observe (in Chapter 3) that many of these services employ different combinations of a common set of mechanisms to handle various aspects of consistency management, such as concurrency control and update propagation. However, each existing consistency solution inseparably bundles together the individual mechanisms that suit the service requirements for efficiency. Although this approach provides good performance for that service, it hinders the reuse of these consistency mechanisms in a different combination to accommodate different service requirements. As a result, wide-area distributed services involving replicated data management are hard to develop and deploy, because the consistency solution for each service is essentially developed and tuned from scratch.

1.1 Problem Statement

Managing the consistency of cached/replicated data requires significant development effort, despite the emergence of many useful mechanisms in the context of individual applications. This is because many existing services integrate the collection of consistency mechanisms they need in a monolithic design. As a result, individual mechanisms cannot be isolated easily from various existing services and reused together in a new service, without significant engineering effort. If the consistency needs of the new service are not almost identical to those of some existing service, it becomes very difficult to reuse existing algorithms or code and achieve good performance.

In this thesis, we address the problem of developing a flexible and reusable consistency management framework that relieves some of the development burden for a variety of Internet services. For the purpose of this thesis, we focus on supporting the data coherence needs of applications involving (i) read/write access to shared data/state, (ii) data dissemination, and/or (iii) data exchange among distributed components over the wide-area.

Existing consistency mechanisms vary with respect to when they allow access to a replica, when and how they propagate updates to synchronize replicas, and how they detect and resolve update conflicts (if they arise). The right set of policies to employ depends heavily on the individual application's data access pattern. Hence, application control over these policies is essential for the efficiency and scalability of the resulting system. We believe that many application-level consistency requirements can be translated into a few core system policy choices that can be modified to tailor the consistency implementation for each individual application. Some choices that we identified are: push- vs. pull-based update transfer, multiple update transfer conditions, and a variety of consistency guarantees. Hence, letting applications directly make these choices helps achieve efficient consistency management.

1.2 Thesis Statement

We hypothesize that:

A consistency management system that provides a small set of customizable coherence mechanisms can efficiently satisfy the data coherence needs of a wide variety of distributed applications.

To support this hypothesis, we propose to develop a system that can efficiently provide consistent shared data access and data delivery to a variety of useful wide-area applications by exporting a common set of mechanisms needed by them. Its mechanisms are reusable and their behavior can be customized to application-specific needs for efficiency. Using our system will relieve application writers of much of the burden of implementing wide-area consistency management, and will enable them to focus on other issues.

1.3 Scope

In this thesis, we only address the issue of maintaining consistency among a given network of data replicas across the wide-area, while handling variable latencies and bandwidths among the network links. There are many other important issues to wide-area data access that are out of the scope of this thesis, including:

We believe that our framework can be easily modified to coexist with other effective techniques to address these issues without reducing its core value.

A replica network can be leveraged not only to minimize the raw communication costs between nodes, but also to limit the consistency management state for scalability. For this thesis, we limit our focus to techniques to leverage a given replica hierarchy for scalable consistency management. However, for our evaluation prototype, we employ a multi-level caching scheme that works well for a variety of applications. This scheme creates and dissolves cached copies based on usage and links them by a dynamic replica hierarchy to reduce the cost of consistency-related communication by taking into account the network quality between copy sites. Although our scheme may not always create an optimal replica topology, the design of our framework allows it to be easily replaced by better schemes (such as those of Bayeux [40] and Scribe [31]) without reducing the effectiveness of our framework.

Replication for fault tolerance is orthogonal to the problems addressed by our thesis. Our focus is on keeping a set of secondary replicas (i.e., transient copies) coherent with each other and a primary replica (i.e., a permanent copy) of a data item in the face of updates. In contrast, replication for fault tolerance typically involves maintaining multiple primary replicas in sync with each other to protect against permanent loss of some of them. We believe that our system can coexist with many fault-tolerance solutions such as clustered services [14] and other quorum/voting replication schemes [19,27]. Their protocols only operate among primary replicas. They have little interaction with secondary replicas and hence little impact on the consistency management issues addressed by this thesis.

Lastly, protection of data against unauthorized access is especially important across the wide-area, and a heterogeneous caching solution is not very useful without an effective security solution. However, this thesis addresses some other important issues that are orthogonal to security. We believe that the consistency management solution that we propose can be easily modified to incorporate a security solution, but we leave the exploration of their interaction for future research.

1.4 Approach

Our approach to achieving scalability and performance in the wide-area is to adopt a multi-level caching network similar to that proposed by Blaze [1], but to make use of it in novel ways. We use this network for supporting multiple coherence policies. Our implementation of these policies takes the non-uniform communication costs between nodes into account for consistency-related communication. We provide multiple policy-independent update propagation mechanisms that leverage the replica hierarchy for scalability. We delegate workload among replicas based on network proximity. We resolve update conflicts at each level of the replica hierarchy to limit their spread. Each replica only needs to maintain consistency management state for its immediate neighbors in the replica hierarchy. This helps our system scale to a large number of replicas and clients. These design issues are unique to the wide area and do not arise in a LAN environment due to its uniform and relatively low communication costs.

Previous systems provide one or the other of scalable wide-area support and flexible consistency in the form of multiple consistency policies. Our approach is unique in that in addition to supporting multiple policies, it also gives applications choice over a variety of mechanisms for enforcing these policies over the wide-area. This choice enables our system to support a wider variety of applications efficiently. To identify the aspects in which application control over consistency mechanisms is beneficial, we examine the problem of consistency management closely in Chapter 3. We identified various techniques based on issues such as when access is allowed to a replica (e.g., optimistic vs. pessimistic access control), where can updates be issued (e.g., single-master vs. multi-master), what direction do updates get transfered (e.g., pull-based vs. push-based updates), and the conditions that can trigger update transfer (e.g., timer-driven, staleness-driven, eager or lazy transfer). These issues are common to many consistency policies. However, existing systems bundle these decisions inseparably into a policy's implementation.

1.5 Validation of Thesis

The success of our thesis lies in demonstrating that scalable flexible consistency management over the wide-area is both viable and useful. We propose to demonstrate its viability by showing that it can be implemented in a manner that is both scalable and resilient to node/network failures and intermittent connectivity, which is essential in a wide-area environment. The availability of an application's data for access in partition situations is determined by the consistency guarantees desired, and is thus application-dependent. We propose to show that our design can scale reasonably well with workload along several dimensions: 1) number of clients, 2) number of servers, 3) number of objects, 4) number of replicas per object, 5) object access rate (operations per sec), and 6) variability of network quality (latency and bandwidth).

We also propose to demonstrate that our approach has the following useful benefits over existing approaches to consistency management. First, it enables the reuse of consistency management techniques across a variety of wide-area applications. Second, support for multiple coexisting consistency policies lets an application meet the specific consistency, availability and intermittent connectivity handling needs of its individual data items. This support also lets the application-writer evaluate the suitability of various options for a given data item without significant implementation effort. Third, the flexibility offered by our approach in the choice of implementation techniques helps an application to tune its consistency to achieve good performance and scalability. Lastly, application control over update propagation strategy on a per-replica basis helps meet the diverse data quality requirements of different clients for a given data item within an application.

To validate these claims, we propose to implement our consistency management system in the context of a wide-area data store that we are developing called Khazana, which we describe in detail in Chapter 4. Khazana is organized as a collection of peer data servers that cooperate to provide coherent access to flat files across a wide-area. It provides user choice over a variety of consistency policies and implementation techniques on a per-file basis, which we describe in Chapter 3. We then propose to develop and evaluate four applications with distinct data access patterns and consistency requirements using Khazana, that are representative of the three classes of applications mentioned in Section 1.1.

We plan to conduct a series of measuring experiments with these applications to validate each of our claims. The first experiment demonstrates how our system scales to a large number of files with different consistency choices coexisting in a single wide-area application. The second experiment shows how the customizability of mechanisms for implementing a given consistency policy can improve application performance. The third experiment evaluates the ability of our system to deal with intermittent connectivity. It demonstrates how a variety of consistency and availability tradeoffs can be achieved using our framework. The fourth experiment demonstrates that our system efficiently supports distributed data structures with stringent consistency requirements, such as those used in a directory service. The fifth experiment tests the latency and efficiency of real-time data dissemination in Khazana and shows that our asynchronous update-propagation approach helps achieve scalable real-time data exchange. The last experiment evaluates how the ability to customize update propagation strategy enables us to provide a variety of data quality levels at individual replica sites. We describe these experiments in detail in Chapter 5.

Lastly, we plan to evaluate the performance and scalability of two real-world applications based on our framework namely, a chat room service and a directory application, relative to a popular existing implementation. We illustrate how the versions based on our system are simplified by using our framework but can still perform reasonably when compared to existing versions of these applications.

1.6 Conclusion

In this thesis, we propose to show the feasibility and value of implementing a wide-area consistency management system that is flexible enough to satisfy the data access and delivery needs of a variety of wide-area applications. We target applications in the three classes, namely those involving 1) read/write access to shared data/state, 2) data dissemination and 3) data exchange among distributed components over the wide-area. Our approach to achieving scalability, flexibility and performance is to provide a set of useful consistency guarantees for our targeted application classes while giving applications significant control over their implementation.

Our intended contributions are:

  1. to identify a set of consistency policies and corresponding implementation strategies that efficiently support the needs of a variety of wide-area applications,
  2. to implement these mechanisms in a wide-area environment that deals with non-uniform network quality between nodes while retaining flexibility, and
  3. to evaluate the effectiveness and scalability of the resulting system for a variety of applications.
In the next chapter, we motivate our design by presenting prior research into data consistency and contrasting it with our approach. In Chapter 3, we examine the problem of consistency management closely to identify areas where application control is beneficial and explain the rationale behind our design. In Chapter 4, we describe the design of a prototype data store called Khazana that we propose to build to validate our thesis, with special emphasis on its consistency management subsystem. In Chapter 5, we describe several representative applications, a set of experiments using them and their expected results that together will serve to validate our thesis' claims. We then sketch an implementation schedule for our thesis in Chapter 6. Lastly, we draw conclusions in Chapter 7.

2. Background

Much research effort has gone into building distributed services ranging from cluster-based services to wide-area services. Many of these services have addressed the important problem of maintaining data consistency in the face of distributed data access. Distributed application developers have traditionally employed some combination of two distinct programming paradigms namely, message-passing and shared data. Applications using the message-passing paradigm explicitly partition shared state among components and perform all data management themselves via explicit communication and only utilize the system's support for communication. This approach results in a more efficient implementation, but leaves the burden of data management on the application programmer. In contrast, shared state-based applications assume an underlying shared data management system and layer application logic on top, while leaving data replication and consistency issues to the system. This approach relieves programmer burden by handling all the hard issues of data management behind a familiar programming interface. However, it often suffers from poor performance due to: 1) mismatch between the techniques employed by the system and the application's needs, and 2) lack of a flexible interface for communication of these needs between the application and the system for better adaptation. To avoid this inefficiency, most distributed service developers still build custom data management solutions on top of a message-passing infrastructure, and hardwire service-specific knowledge into the data management logic to achieve good performance. For the purpose of this thesis, we examine a specific aspect of this problem, namely consistency management of shared data, and see how much support is available for distributed applications in this regard.

Previous systems in this area can be broadly classified into three categories: systems that provide flexible consistency management, systems that invent novel consistency guarantees to suit specific data access patterns, and systems that explore efficient techniques to implement specific data coherence schemes over the wide-area in the context of individual applications. In this Chapter, we examine existing systems in these three categories closely to give a perspective for our research. Our work builds upon the insights gained from observing these systems and the applications using them.

Flexibility in consistency management has often taken the form of supporting multiple selectable consistency options for application data. Munin [3] provides flexible/tunable consistency policies in a LAN environment for distributed shared memory-based parallel programming. WebFS [36] provides a useful set of coherence policies for wide-area file access and update multicasts. Fluid replication [5] provides intelligent data placement and multiple consistency policies for mobile file access across the wide-area. TACT [39] provides optimistic consistency with application-specified notion of coherence in a wide-area environment. The flexibility in consistency options provided by these systems enables their reuse for a variety of applications. However, these systems are limited in the hooks they provide to let an application customize the implementation of their policies. As we will demonstrate in our evaluation, the customizability of the underlying implementation (e.g., pull- vs. push-based update propagation) is important for the performance of some applications. Moreover, these systems do not adequately address the issues of dealing with variable network latencies and/or scalability (to a large set of clients) typical of the wide-area in consistency-related communication. In our work, we propose to build on their success by providing similar flexibility while dealing with some of the wide-area issues listed above.

Providing data coherence across the wide-area while dealing with variable network latencies, disconnections and partitions is not new. However, previous work has been done in the context of specific data access patterns. AFS provides last-writer consistency in a wide-area file system with traditional Unix-style file access. The Coda file system [20] provides support for weakly connected and disconnected operation in a campus environment employing a client-server architecture. Ficus [29] provides replicated file access while handling intermittent connectivity in a more peer-to-peer setting. Bayou [7] provides optimistic consistency for a replicated database in a peer-to-peer mobile environment with highly intermittent connectivity and unstructured communication patterns. It requires application-specific conflict resolution procedures to be supplied. As it relies on ad hoc communication patterns among mobile replicas for update propagation, replicas can take unbounded time to synchronize. In contrast, we propose to employ a hierarchical replica topology to achieve more bounded synchronization. Blaze's PhD thesis [1] showed the value of constructing dynamic cache hierarchies on a per-file basis among clients to support efficient large-scale file sharing and to reduce server load in distributed file systems. These systems have developed many effective techniques to achieve performance and scalability in a wide-area environment. We propose to leverage some of these techniques in a more general-purpose consistency management framework for the benefit of a wider variety of applications.

Numerous consistency options have been developed individually to handle the data coherence needs of specific services such as file systems, directory services [24], databases [26] and persistent object systems [22,9,38,15]. Many algorithms have been proposed to maintain consistency among data replicas for fault-tolerance in clustered as well as wide-area environments in the face of network partitions (e.g., quorum consensus [12], Deno [19]). The goal of our work is to design a framework in which these and other algorithms can be reused by a variety of applications. Saito presents a comprehensive survey [32] of optimistic consistency schemes with a taxonomy for classifying them based on their implementation choices. We have extended his taxonomy to include pessimistic algorithms also, and used it to arrive at a set of tuning parameters to provide flexible consistency management, which we describe in Chapter 3.

Persistent shared object systems such as Thor [22], PerDis [9], Globe [38] and Legion [15] provide object caching support with serializability for transaction semantics. The OceanStore [21] project aims at providing a secure, global scale persistent data utility. Objects in OceanStore are immutable and consistency is maintained based on versioning instead of in-place updates. They employ update procedures to apply and propagate updates among replicas, similar to Bayou.

Recent popularity of Internet-based file sharing (e.g., Napster [18]) has prompted research into peer-to-peer organization of distributed applications, in which any component can dynamically take on any role (e.g., client or server) in the system. Much of this work has focussed mainly on anonymous information dissemination (e.g., Freenet) and/or immutable deep archival storage among peers as opposed to handling concurrent updates to replicated data. Pastry [30], Chord [35] and Tapestry [40] provide efficient object location and request routing across the wide-area based on hashed object IDs. PAST [30] and Farsite [2] provide cooperative file storage and retrieval, but do not address consistency among multiple replicas in the face of updates. Our consistency management work is complementary to these systems, as it can add consistent update functionality to them, making their benefits applicable for a wider range of services.

The application classes that we target have been explored individually by different systems. In particular, applications based on wide-area data dissemination and exchange currently employ message-passing using some form of multicasting support such as MBone [23] or IP multicast [6]. Our approach to achieve data multicast as a side-effect of update propagation among data replicas, though not as efficient as these highly tuned solutions, has several benefits. It allows intermediate data staging along the multicast path to cache data closer to its clients. It enables the quality of multicast-ed data and frequency of update transfer to be customized at intermediate hops to the consistency level needed by individual replicas, unlike IP multicast, where data quality is controlled only at the sender, not at intermediate hops. Scribe [31] and Bayeux [41] construct multicast trees for efficient data/event dissemination among data replicas, by virtue of being built on top of topology-aware data location and routing schemes such as Pastry and Tapestry (respectively). Our focus is on making use of a multicast tree efficiently, once it is built, by employing data staging along the tree nodes.

In summary, our consistency management framework enables efficient reuse of a variety of consistency management techniques and options developed by prior research, by providing them to applications in a wide-area environment with significant control over their implementation.

3. Issues in Flexible Consistency Management

In this Chapter, we examine the issues in providing data consistency and availability in a general-purpose manner across the wide-area.

3.1 The Consistency vs. Availability Tradeoff

A common technique to increase data availability and reduce access latency in a distributed service is to cache/replicate data on nodes close to its access. However, concurrent updates to these copies3.1 requires a mechanism to keep them coherent. Fox et al. [10] postulate that there is a fundamental tension between the degree of data consistency (C) and availability (A) achievable in an application and its ability to handle network Partitions (P). They called it the CAP principle. According to this principle, we can get at most two out of these three properties for a given piece of data. P is important for wide-area applications. Hence, if we need high availability in a wide-area application, we have to degrade its consistency requirement appropriately by choosing a weaker guarantee policy for its data. We cannot achieve these properties independently. Regarding disconnections and partitions, we have to refer back to the CAP principle. It is up to the semantics of a given consistency policy to decide what level of C and A can be provided in a partition situation. For instance, if an application needs strict consistency among copies across a wide-area, it must employ some voting/quorum algorithm such as Deno [19] to restrict data availability in a partition situation. Fortunately, not all Internet services need such strict guarantees; many of them can live with weaker guarantees, thus allowing a wide spectrum of choice in the degrees of C, A and P in practice.

In this thesis, our focus is on providing a wide-area framework for supporting a variety of consistency policies, each of which provide a specific combination of degrees of consistency, availability and behavior in intermittent connectivity situations. In particular, we propose to explore the issues in actually implementing multiple policies in the wide-area, exploiting any common mechanisms they can take advantage of, and providing application control for effectively dealing with variations in network quality.

A consistency policy is a contract between the application and the system regarding data access. If the application adheres to system-imposed conventions for data access, the system guarantees to maintain certain invariants about the quality of data accessed. A consistency mechanism is a set of actions taken by the system's components to enforce this contract. These actions include the following: 1) controlling admission of access and update operations on data replicas, 2) propagation of updates among replicas to maintain desired invariants about data quality, and 3) handling conflicting updates (if any).

3.2 A Taxonomy for Classifying Consistency Schemes

Consistency mechanisms vary with respect to when they allow access to a replica, when and how they propagate updates to synchronize replicas, and how they detect and resolve update conflicts (if they arise). The right policy for each of these design issues depends heavily on the application's data access patterns. Hence, application control over these issues is essential for the efficiency and scalability of the resulting system. In this Section, we classify consistency mechanisms along a few aspects listed below. For this purpose, we extend the taxonomy described by Saito [32] by several dimensions.

3.2.1 When is access allowed?: Access control

This determines whether an access/update request is satisfied after or before bringing it to desired consistency level. Pessimistic schemes (either lazily or eagerly) propagate updates to achieve desired level of coherence before allowing access. Optimistic schemes allow immediate access to any data replica, but perform update propagation after data access in the hope that conflicting updates are rare and/or can be easily resolved. With optimistic schemes there is a risk of stale reads and conflicting writes, as updates may not propagate in time to prevent conflicting access on other nodes. The need for conflict detection makes them more complicated to implement than pessimistic schemes.

Pessimistic consistency schemes are attractive when updates are relatively frequent but cannot handle disconnections, whereas optimistic schemes provide high availability in the face of disconnections but perform poorly when update conflicts are frequent. One possible strategy (adopted by Coda [20]) is to employ pessimistic consistency during periods of good connectivity, while switching to an optimistic scheme at other times.

3.2.2 Where are updates issued?: Update transfer model

This determines where an update can be issued. Single-master systems statically designate a master replica where all updates should be made, and are propagated to others (e.g., DNS [25], active directory schema [24], web content). These are simple to implement, butthe master is a single-point of failure. They are suitable when consistency is more important than availability. Multi-master systems allow updates to be issued at any replica, and coordinate to exchange each other's updates (e.g., active directory data [24], AFS [17], Bayou [7]). They may involve coordination of multiple replicas, are more complex and may introduce update conflicts in an optimistic setting. Hence they are more suitable when data availability is more desirable than consistency.

3.2.3 Who triggers update transfer?: Direction of transfer

Pull-based schemes make each replica responsible for polling others for updates (e.g., Coda, WWW), whereas push-based schemes make a replica with a pending update responsible for delivering it to other replicas (e.g., Usenet [34], Porcupine [33], Active Directory [24]). Pull-based schemes are suitable for read/write data sharing environments involving large scale but infrequent read-sharing. Push-based schemes are more suited to frequently read data and for data distribution.

3.2.4 When is update transfer triggered?: Update transfer condition

This determines the condition under which update propagation is initiated. It may be user-initiated (e.g., when user presses ``refresh'' button), timer-driven (e.g., every 3 hrs in active directory), topology-based (i.e., frequency decided based on communication costs and connectivity situation), data-driven (i.e., based on tolerated staleness of data as in TACT [39]; this is application-specific), eager (i.e., upon update arrival), or lazy (e.g., when an access request arrives).

3.2.5 Discussion

All the dimensions in the above classification are largely orthogonal to each other, and most existing consistency mechanisms adopt a combination of options in each of these dimensions. Hence the above dimensions naturally serve as effective consistency control parameters. By allowing applications to specify various options along these dimensions, the system can tune its implementation and achieve good application performance.

3.3 Consistency Needs of Targeted Application Classes

In this Section, we examine the consistency management needs of the three application classes mentioned in Section 1.1 that we target for our thesis. Examples of the kinds of shared data applications that we are targeting include distributed file service, directory services, the WWW, and distributed simulations. Such applications typically have significant locality to exploit and hence can benefit from caching. However, their data access patterns (e.g., relative frequency and spread of read and write operations) and availability requirements vary widely. For instance, system binary files and static web content are infrequently but widely read, rarely updated and can tradeoff consistency for availability. Hence single-master optimistic consistency with pull-based update propagation works well for these files. System configuration files have similar properties, but with one difference: they are frequently read. A pull-based update propagation approach overwhelms the master with frequent pull requests from many clients. Hence for these files, a master-initiated push through the replica topology performs and scales better than the pull-based approach. Similarly, the availability needs of data in a partition situation varies from one type of data to another. Directory schema updates should be disallowed in all but one partition, whereas read-only access is more desirable for shared project files. These variations can be enforced with optimistic vs. pessimistic consistency schemes.

Data dissemination involves one-to-many communication of data such as is commonly performed as part of multimedia streaming, stock ticker distribution and distributed sensor logging. Data exchange involves many-to-many communication among nodes such as in the case of internet-based chat rooms, Usenet newsgroups, internet games. Both of these application classes need an efficient data dissemination mechanism from producers to registered consumers and hence can benefit from system support for asynchronous update propagation based on multicasting. An intermediate network of data caches can help absorb the effects of transient disconnections and variable network latencies. A single-master approach suits the former class, while a multi-master approach provides better update latencies for the latter class. Some of these applications, such as remote status monitoring, need control over the quality of reported data on a per-client basis to suit individual clients' bandwidth availability. For this they need control over update propagation strategies (such as timer-driven and staleness-driven propagation).

In summary, the diverse consistency requirements of these classes can be supported well by the combination of a few consistency policies with control over their implementation along the dimensions that we identified above.

4. Khazana

In this Chapter, we describe the design of a prototype wide-area data store called Khazana that we are developing as a test-bed to evaluate our thesis. We first give an overview of Khazana and move on to describe its consistency management subsystem in detail, which is the focus of our thesis.

4.1 Overview

Khazana exports flat files called regions4.1 and provides session-oriented read/write access to these files at page-granularity. Khazana is organized as a collection of cooperating peer data server processes called Kservers. Each Kserver utilizes a portion of its persistent local store for permanent copies of some files and the rest as a cache of remotely stored files. Administrators can run Kservers on many machines across the wide-area. Kservers discover each other through some external means such as a directory service and/or as a side-effect of file lookups. Each Kserver monitors its connection quality (latency, bandwidth, connectivity) to other Kservers with which it communicates and uses this information for controlling its protocol-related communication. Applications access Khazana files through a library called Kclient that encapsulates all communication with Kservers while exporting the Khazana session interface.

A file's permanent copy site is called its home node; there may be many cached copies (secondary replicas) at other nodes. A Kserver creates a local file copy when a client contacts it to access a file that it does not store locally.

4.1.1 File Naming and Location Tracking

Khazana files are named by unique numeric IDs (called KIDs) assigned to them by their home node at the time of their creation. To simplify file location and management of ID space for the purpose of our prototype, we employ a simple scheme wherein each Kserver manages its own local ID space and a file's KID is a combination of its home Kserver's network address and its ID within the Kserver's local ID space. A Kserver finds a file's permanent copy site based on the address hard-coded in the file's KID. We choose this scheme mainly for its simplicity and because naming and location management are not the main focus of our thesis. Khazana's design allows this scheme to be easily replaced by more sophisticated file naming and dynamic location tracking schemes such as those of Pastry [30], Chord [35] and Tapestry [40], without affecting the contribution of our thesis.

Even though a file can be always located by the home node address hard-coded in its file name, each Kserver maintains a cache of hints to quickly locate nearby Kservers containing copies of a file, before contacting its home node. This reduces access latency if the requesting Kserver is weakly connected to the file's home node relative to other copy sites. Our idea is analogous to cooperative proxy web caching [11], but is intended to work for write-shared data as well.

4.1.2 Creating and Destroying Replicas

When a Kserver queries another Kserver for a file copy, the responder can either supply the requestor a copy itself, or supply a list of other replicas to query in turn (e.g., if it is overloaded). In the latter case, the querying Kserver sends its request to one of these sites based on its perceived connection quality to them (by contacting nodes with high connection quality first). As a side-effect, the supplier Kserver adds the requestor to its list of child copy sites and the requestor makes the supplier its parent. This replica creation mechanism dynamically forms a hierarchical network of replicas rooted at the file's home node, similar to the dynamic hierarchical caching introduced by Blaze [1]. This network tends to be organized such that nearby nodes are directly connected to each other. Alternatively, administrators could also specify more optimal topologies to be used. The fanout of any node in the hierarchy (i.e., number of child copy sites) is limited by the load-handling capacity of that node. When a link or node in the hierarchy goes down, the orphaned nodes try to attach themselves to the hierarchy again, starting at a known copy site (home node by default), effectively repairing the replica hierarchy. A Kserver is free to delete locally cached copies of files to reclaim local store space at any time, after propagating their updates to other copy sites.

4.1.3 Khazana Interface

Khazana exports the following operations to its clients via its library interface:

alloc_region(page_size, other_attributes):
In response to this request, a Kserver allocates a region with given page size and attributes (described below), and returns its KID. Applications should supply this KID for subsequent operations on the file.
This operation destroys all the copies of the region, frees the region's contents and makes the KID available for reallocation.
get_attr, set_attr(KID, region_attr):
This operation can be used to set various region attributes such as region size, consistency policy and various consistency control parameters on the region as a whole, which affect all replicas of the region. We describe the various consistency options and control parameters that we propose to support, in Section 4.2.
get_copy_attr, set_copy_attr(KID, copy_attr):
This operation sets various consistency-related attributes on the receiving Kserver's copy of the region, creating a copy if it doesn't exist. Clients can use this operation to specify how updates to this copy need to be propagated and how often this copy should be synchronized with other copies. They override the generic region attributes set using the set_attr() operation above. We describe these per-copy attributes in Section 4.2.2.
open_region(KID, offset, size, mode):
This starts an access session for the given portion of a region in specified mode, and returns a session ID. A size of zero indicates whole-file access. In response to an open request, the receiving Kserver first obtains a copy of the region portion and then performs access control before allowing access to the region. As part of this, the consistency manager brings the local copy up-to-date based on the consistency policy set for the region.
read_region, write_region(session_id, offset, size, data):
In response to these calls, region data is transferred to/from client buffers. These calls do not trigger any further consistency management actions.
snoop_region(session_id, update_callback):
This operation registers the caller to be notified whenever a remote update arrives to a Kserver's local copy of the region portion opened as part of given session. When an update arrives, the supplied callback routine is invoked specifying the range updated. A request with a null callback removes a previous registration. All registrations are terminated when the session closes. This call can be used by clients to receive remote updates asynchronously.
This call terminates the access session and the consistency management system can perform update propagation if required by the region's consistency policy and the control parameters set for this replica.

4.2 Consistency Management

Khazana's consistency management (CM) subsystem is responsible for maintaining coherence among file copies created as described above. Khazana provides user choice over a set of consistency policies on a per-file basis, and over the various mechanisms to be used in implementing these policies. All the consistency schemes utilize a common set of underlying update propagation mechanisms that operate via the replica network. These mechanisms only affect the performance but not the guarantees provided by the consistency policy. The decision on when to allow read/write access and propagate updates thus depends on the choice of consistency policy and the implementation strategy. In contrast to our approach, other existing systems (such as WebFS [36] and Fluid replication [5]) bundle these mechanisms as part of the implementation of each individual policy, and do not provide application control over their selection.

4.2.1 Supported Consistency Policies

For the validation of our thesis, we plan to support the following consistency policies on a per-file basis in our prototype, and evaluate their suitability for a variety of applications in our targeted application classes. None of these policies is new, nor is our design restricted to use only these policies. Alternate policies can be implemented using our consistency framework while still using its basic mechanisms. All replicas of a region enforce the same consistency policy set for the region in contrast to systems like Fluid replication [5] that enable different clients to see the same file at different consistency levels.

Strict pessimistic consistency:
Only one writer is allowed at any time. A reader always gets the latest value written. This implies that in a partitioned network, access is disallowed in all but one partition (that holds write privileges). We also implement a minor variation of this, which allows read-write access in one partition, but read-only access in others.
Optimistic last-writer-wins consistency:
Reads and writes are allowed on any replica, any time [5]. Updates are propagated based on control parameters chosen. Write conflicts are resolved along the way to the root replica in the hierarchy. The write that reaches the root node of the copy hierarchy last wins and is propagated back to all nodes.
Append-only consistency:
Appends are allowed on any replica, any time [36]. They are never lost. Append conflicts (concurrent appends to the same version of a file) are resolved by applying both appends in the order they arrive at a common parent node in the replica hierarchy.

4.2.2 Supported Consistency Control Parameters

We provide the following ways for applications to guide consistency implementation at a given replica site. Some of these parameters are settable on a per-region basis and affect all replicas. These can be set using the set_attr() interface operation described in Section 4.1.3. Some of them can be set on a specific replica and affect only that replica. They can be set using the set_copy_attr() interface operation. For explanation of these parameters and their values, please refer to Section 3.

Update transfer model:
We provide two options on a per-file basis: single-master and multi-master. In contrast, Blaze [1] supports single-master updates.
Direction of update transfer:
We offer two alternatives on a per-replica basis: push-based and pull-based.
Update transfer condition:
We offer several update transfer options on a per-replica basis: user-initiated, timer-driven (with specified timeout), topology-driven, eager (as soon as an update arrives) and lazy (e.g., upon access request). We also offer two special options driven by different notions of staleness: maximum time elapsed since last change, and number of unpropagated updates tolerated.
Each of our targeted applications can work with multiple combinations of the above options. However, some combinations perform better than others. We discuss these combinations in more detail as part of our thesis evaluation in Chapter 5.

4.2.3 Update Propagation

To aid in update propagation and detection of update conflicts, each replica maintains the following persistent state (common to all consistency policies) similar to the versioning state maintained in Bayou [7]:

In addition to the above state, a replica site also maintains the state of ongoing local sessions and any asynchronous update registrations, which are non-persistent. The amount of persistent state to be maintained at a replica site is thus proportional to the number of other replicas with which it directly communicates (i.e., the node's degree in the replica hierarchy), not the total number of replicas. This, coupled with the fact that replicas are restricted to communicate only along the hierarchy, ensures scalability to a large number of replicas. In contrast, systems such as Ficus [29] and Bayou [7] that employ more peer-to-peer replica topologies have to maintain state that is proportional to the total number of replicas, theoretically limiting their scalability.

The consistency manager at a replica site gets activated just before opening a session to perform admission control and to bring the local copy up-to-date for access, and just before closing the session to propagate updates (based on its propagation strategy). A replica is said to ``commit'' its local updates when it successfully sends them to its parent. How often and under what conditions a replica propagates local updates to its neighboring replicas (parent and children) depends on the consistency policy and the control parameters set for the replica. However, all the policies we implement use common underlying mechanisms for update propagation, which we describe in the following Sections.

Propagating Updates to a Parent Replica

A replica R has uncommitted updates if its local_version is higher than its parent.version_sent. To commit local updates, R sends the updates to its parent along with its parent.version_recved and local_version. The parent compares incoming parent.version_recved with its own local_version. If local_version is higher, then this indicates an update conflict, as the parent already received a previous update on the copy that the child R has independently updated. To incorporate the remote update, the parent receives and applies the update locally, increments its own local_version and saves it in R.version_sent. It then stores R's local_version in its R.version_recved, and sends its own new local_version to R. R stores this number in its parent.version_recved, completing the update commitment from R to its parent.

Propagating Updates to a Child Replica

A parent replica has outstanding local updates with respect to a child replica C, when its local_version is higher than its C.version_sent. When the parent decides to propagate outstanding updates to C, it sends the updates along with its own local_version and C.version_sent. C compares the incoming C.version_sent with its own local_version. If the local_version is higher, it indicates an update conflict, as C has some uncommitted updates, whereas its parent has independently updated the region. To incorporate the incoming update , C receives and applies incoming update on local copy, increments its own local_version, and saves it in parent.version_sent. It then stores parent's local_version in its parent.version_recved, and sends its own new local_version to parent. The parent in turn stores this number in its C.version_recved, completing the update propagation from parent to C.

Once a child replica's updates have been accepted by its parent, the parent treats those updates as its own for the purpose of propagating them to other replicas. The child should not propagate those updates to any other new parent that it contacts later (due to disconnection with old parent and re-attachment to replica hierarchy) Otherwise, they will become duplicate updates.

Reconnecting to the Replica Hierarchy after a partition

When a replica R loses connection to its parent and needs to re-attach itself to the hierarchy, it looks up for a new accessible replica P using Khazana's region location mechanism described in Section 4.1.1. It then sets P as its parent, obtains a fresh new copy from P (along with its version number) and re-applies its local uncommitted updates (if any) on this copy. Otherwise, if R just propagates its local updates to P as described above, there is a chance that R propagates updates that it received from its old parent to its new parent, causing duplicate updates. There is no need to re-apply updates when operation logging is not being performed. Handling Update Conflicts

An update conflict arises when two replicas are independently modified, starting from the same version. Pessimistic locking-based consistency schemes prevent such conflicts by delaying conflicting operations, whereas optimistic schemes allow them, but employ versioning to detect and resolve conflicts. In our update propagation scheme, a conflict can be detected either while committing a child replica's updates at its parent, or when applying a parent's committed updates at a child replica site.

The action to take when a conflict is detected is consistency policy-specific. Conflict resolution can either require manual intervention, rely on resolution rules enforceable by the system itself (e.g., last-writer wins, or using merge procedures), or need application-specific techniques. The rule-based approach is simple to implement, but may not successfully resolve all conflicts. In case of the last-writer-wins optimistic policy, the incoming updates are incorporated into local copy. In case of the append-only consistency policy, the updates are applied such that parent's updates appear first and the child's uncommitted updates appear next.

Controlling Update Propagation

The frequency of propagation of updates from a replica is partly determined by the consistency policy and partly by the control parameters set for the replica. In case of pull-based update transfer, updates at a replica are sent to others only upon explicit request. In pessimistic consistency schemes, this update request is issued before allowing read access to a replica, whereas in optimistic schemes, this is performed periodically at a frequency settable on a per-replica basis via the control parameters described in Section 4.2.2. When a client registers for asynchronous notification of updates to a replica, the replica site in turn sends a registration request to its parent to push updates to it based on a transfer condition set on the replica in a previous set_copy_attr() call. The parent can in turn forward this to its other neighbors and so on. Later, when an update arrives at a replica site, it propagates the update to all registered neighboring replicas (i.e., parent and children) whose update transfer condition is satisfied.

4.2.4 Implementation of Consistency Policies

We implement the pessimistic strict consistency policy in Khazana as follows. At each replica site, we maintain an access token that indicates what operations (among read, update and append) are allowed on that replica. Strict consistency requires that only one replica hold the update access token at any time. In a single-master implementation, one replica site (the home node in our prototype) always keeps the update tokens and other replicas forward all update requests to this replica. In a multi-master implementation, this write token gets circulated among the replicas based on write requests arriving at their sites. Multiple replicas can hold read tokens, but they are all revoked (via the replica hierarchy) before allowing write access. When a replica site transfers its write token to another replica, it also triggers propagation of its updates to the receiver of the token, via the replica hierarchy.

We implement the optimistic policies (last-writer-wins and append-only) in Khazana as follows. All read operations are allowed to proceed immediately on all replicas. Updates requests are forwarded to to home node in case of single-master approach. When an update is made at a replica, it is propagated to others based on the update transfer condition set at various replica sites. Update conflicts are resolved as described in Section 4.2.3. Append operations are logged explicitly as they need to be re-applied on a replica during conflict resolution.

4.3 Discussion

In this Chapter, we described a prototype wide-area data store called Khazana, that we are building to evaluate our thesis. Our design splits up consistency management into a collection of core mechanisms that serve as building blocks for composing a variety consistency schemes to suit specific application needs. Our hierarchical organization of replicas and multi-level update propagation and conflict resolution using the hierarchy limits the amount of state that needs to be maintained on a replica site, making our implementation scalable to a large number of replicas per object.

5. Proposed Evaluation

5.1 Success Criteria

The success of our thesis lies in demonstrating that flexible consistency management over the wide-area has the following properties:

  1. It is viable, i.e., we can implement a wide-area data management system with the following properties:

    1. Failure resilience: It can handle node/network failures, disconnections and partitions (intermittent connectivity) gracefully. However, the availability of an application's data in such situations is determined by the consistency guarantees desired, and is thus application-dependent.
    2. Scalability: It can scale reasonably well with workload along several dimensions: 1) number of clients, 2) number of servers, 3) number of objects, 4) number of replicas per object, 5) object access rate (operations per sec), and 6) variability of network quality (latency and bandwidth).

      In our prototype, we propose to evaluate our system's scalability to about 512 clients served by up to 128 servers (this limits the number of replica per object to 128), with about 1000 objects (files) per server, over a simulated WAN spanning up to 10 LANs connected by variable speed links into various topologies. We choose these limits considering the resources available in the Utah network test-bed [8] which we plan to use for our experiments. We also plan to conduct a wide-area experiment with up to 5 servers geographically distributed across the U.S., to evaluate our system's performance in real wide-area conditions.

      We will measure the access rate sustained by our system at various scales.

  2. It is useful, i.e., our approach to consistency management improves upon existing support for distributed applications in the following ways:

    1. It enables the reuse of consistency management techniques across a variety of wide-area application classes.
    2. It will support multiple coexisting consistency policies, which will let applications meet the specific consistency, availability and intermittent connectivity handling needs of individual data items. This support will also let an application writer evaluate the suitability of various options for a given data item, without significant implementation effort.
    3. It provides flexibility in choice of implementation techniques. This flexibility helps an application achieve good performance and scalability by enabling it to tune the consistency management implementation for specific data usage patterns.
    4. It gives an application control over update propagation strategy on a per-replica basis, which helps meet the diverse data quality requirements of different clients for a given data item within an application.
In order to validate these claims, we propose to develop and evaluate four applications with distinct data access patterns and consistency requirements using Khazana, that are representative of three classes of applications namely those involving 1) read/write access to shared data, 2) data dissemination and 3) data exchange among distributed components over the wide-area. We plan to conduct a series of measuring experiments with these applications to validate each of our claims.

5.2 Sample Applications

In the following Sections, we describe four representative Khazana-based applications and the set of experiments we propose to conduct to validate the above claims. The data coherence needs of these application vary widely. We will show that the various combinations of policies and implementation techniques that we provide are sufficient to support these needs well. We will also demonstrate how some of the consistency policies are useful for more than one application. These together validate our reusability claim 2(a).

5.2.1 DataStations

DataStations is a distributed file service that uses Khazana's caching and consistency management to provide coherent read/write file sharing across the wide-area. This application belongs to the data-sharing class. For our thesis, we focus on supporting files with five different access patterns: type 1 - infrequently updated by a few clients, and infrequently read by many (e.g., system files); type 2 - infrequently updated by a few clients, but frequently read by many (e.g., configuration files); type 3 - frequently read/written by a single mobile user, but rarely shared among multiple users (e.g., personal files); type 4 - frequently updated among a few users, but read by many (shared project repository files such as those in a CVS repository); type 5 - widely read and updated append-only files (such as log files). An application can individually set different consistency policies and control parameters for a Khazana file to suit its behavior, any time. Khazana does not infer these parameters by observing the file's usage patterns. Experiment 1: Multiple Coherence Policies in a Single Wide-area System

With this experiment, we propose to demonstrate how our flexible multi-policy consistency management system can support access to a large number of files with the variety of access patterns described above, and evaluate its scalability. We employ single-master optimistic consistency with pull-based updates for type 1 files and push-based updates for type 2 files. We employ a multi-master pull-based pessimistic consistency with token-based read/write access for type 3 files. We employ multi-master pull-based pessimistic consistency with lock-based writes for type 4 files, and multi-master pull-based optimistic append-only consistency for files of type 5. The consistency policy and control parameters for a Khazana file can be set and/or changed any time by using its set_attr() and set_copy_attr() interface operations.

Our experimental setup will consist of 128 servers spread across 10 LANs. We will create many files per server uniformly belonging to the 5 types listed above. Clients spread uniformly across these LANs perform random read and write operations on these files, adhering to the usage patterns described above for each type of file. We will measure the average access latency for read and write operations for each type of file and the average throughput (operations per second) perceived by clients for various numbers of clients from 2 to 512 in increasing powers of 2. We expect the variation of access latency with load for each type of file to be reasonable and not to be drastically affected by simultaneous activity on other types of files. This evaluates the overall scalability of our system with number of clients.

We will repeat this experiment with various numbers of files per server from 8 to 1024 in powers of 2 and measure the access latencies as above. This evaluates the scalability of our system with the number of files. Experiment 2: Customizing Implementation Helps Performance

The techniques needed to efficiently maintain consistency varies with the file's usage patterns. For instance, since files of type 1 are typically written by a few clients (e.g., administrators), a single-master optimistic approach to consistency works well. Also, since they are infrequently read by many users, a pull-based mechanism for propagating updates to readers suffices. On the other hand, type 2 files slightly differ from type 1 files in that they are frequently (continually) read by a lot of users (e.g., consider configuring 1000's of workstations in corporate Intranets). A pull-based mechanism in their case overwhelms the single-master with pull requests from many readers, much of the time. Hence for type 2 files, a master-initiated push through a hierarchical replica topology performs and scales better than pull-based approach.

We propose to demonstrate this as follows. We create a file of type 1 on a Kserver and perform updates to it from a few clients randomly (once every 5-10 sec), while randomly reading it (once every 30-60 seconds) from a set of clients on other LANs. We measure the average access latency and overall bandwidth utilization for read and write operations when employing push vs. pull-based update propagation using single-master optimistic consistency. We do this while increasing clients in powers of 2, from 8 to 512. We repeat the experiment with the type 2 file being read by each client once every second. We expect the bandwidth utilization to be lower for the pull-based approach in case of type 1 file and for the push-based approach in case of type 2 file. We expect the gap in utilization for the two approaches to increase with the number of clients. This experiment validates our claim 2(c). Experiment 3: Multiple Consistency-Availability Options during Intermittent Connectivity

Different types of files have different combinations of requirements for consistency(C) and availability(A) in the presence of intermittent connectivity and partitions(P). In fact, the CAP principle (stated by Fox et al. [10]) says that at most two out of these three properties can be achieved for a given piece of data.  The level of availability that can be achieved in intermittent connectivity environments is limited by the strength of consistency guarantee needed by application data. To support a variety of combinations of C and A requirements, the system needs to provide consistency policies with a variety of consistency and availability combinations. Many choices have been explored by prior research. For the purpose of our thesis, we describe three useful C-A choices with distinct behavior in the presence of network partitions and show that they can be supported as part of our consistency management system. They are: option 1 - allow unrestricted read/write access to data within a partition, and resolve update conflicts either with a simple rule, or refer them to user; option 2 - allow read-only access in all partitions, but read-write access in at most one partition; lastly; option 3 - disallow access in all but one partition. Option 1 is suitable for files of type 3 and 5 and can be achieved with multi-master optimistic consistency. Option 2 can be enforced using a slight variation on multi-master pessimistic lock-based consistency that allows write access in the partition containing the write privilege, and is suitable for files of type 4. Option 3 is provided by strict consistency policy with read and write locks, and is suitable when either stale data is useless (e.g. game state in a distributed multi-player game), or it is important to ascertain internal integrity of data before allowing access (such as in databases, data structures like hash tables and B-trees). It is suitable for our directory application described in the next section.

We demonstrate the variety of options for disconnected operation by repeating experiment 1 with a fixed number of files and clients, but forcing inter-LAN link failures and node failures. We expect to demonstrate that option 1 files can still be accessed in unrestricted manner, option 2 files can only be read, while option 3 files cannot be accessed at all. We show this by indicating a very large access latency for inaccessible operations in our graphs. We show how our system behaves when the links and nodes come back up.

5.2.2 Distributed Directory Service

In its essence, a directory stores mappings between (fixed-length) names and values and exports four operations: add, delete, modify, lookup. This service falls under the data-sharing class and can be used as a building block for a variety of useful applications such as user profile databases and text search indexes. We will implement a prototype directory as a fixed-size page-based hash table inside a Khazana file. The directory is widely shared and all operations are frequent. Users access the directory through a library that manipulates its cached copies of the shared file's pages using the Khazana interface. Consistency among the cached copies of the directory's pages can be maintained using Khazana's pessimistic strict consistency policy or an optimistic last-writer-wins policy with operation logging. Performance varies depending on how widely the directory is shared and updated. The latter policy is complex to implement, but provides better concurrency and allows large-scale sharing. For our prototype directory implementation, we will employ a multi-master strict pessimistic consistency scheme with pull-based update propagation among replicas. Experiment 4: Distributed Data Structure with Strict Coherence

With this application, we propose to demonstrate how our consistency management system can be used to support widespread cached access to a page-based data structure in a way that provides strict coherence guarantees. To show this, we create a single directory and perform various operations on it from multiple clients in different LANs, each of which operate on the directory through a nearby Kserver (at most one per LAN). Kservers cache the directory's pages as a side-effect of access. Each client performs various mixes of 1000 operations at the rate of 4 or 5 operations per minute. The first mix consists of relatively large number of additions, to populate the directory. The second mix consists of a lot of lookups spanning many LANs, with relatively fewer updates. The third mix contains an equal proportion of all operations spanning many LANs, but with clients exhibiting significant locality within a LAN. This last mix illustrates typical usage scenario for a directory in an Intranet spanning multiple campuses. For each mix, we measure the average response time and bandwidth utilization per operation, while increasing the number of clients from 2 to 128 and creating replicas as a side effect. We expect the directory's performance to scale reasonably well for the second and third cases, but not for the first one. This is inherently due to the consistency policy. Scalability in the third case is due to the aggressive caching of working sets of directory pages in various LANs avoiding WAN communication.

We will repeat the experiment an existing directory implementation such as Microsoft's Active Directory [], configured with the same number of directory replicas as Khazana-based version and compare their performance. Existing directories employ static replication, but not dynamic caching. Although we cannot expect our untuned prototype to perform as well as the existing highly tuned directory in absolute terms, we expect performance of our version to be within a reasonable factor of that of the existing version. However, as clients exhibit significant locality, the aggressive caching employed by our implementation helps improve the directory's response time with increasing load. Also, implementing our directory requires much less effort by virtue of being Khazana-based.

5.2.3 Chat room

In addition to traditional read/write data access, the flexibility of our consistency management system allows it to efficiently support distributed data/event logging and many-to-many data exchange as well, across the wide-area. To demonstrate this, we implement an Internet chat room as concurrent appends to a single transcript file in Khazana employing its append-only consistency scheme. The file containing the chat transcript can be replicated sufficiently to provide low-latency nearby access to its widely dispersed chat clients. We employ multi-master append-only optimistic consistency with push-based update propagation along the replica topology to achieve efficient multicast of appends. Clients register at a replica site to notify it of remote updates asynchronously. Experiment 5: Scalable, Real-time Data Exchange

This service tests the latency and efficiency of real-time data propagation in Khazana. We evaluate the responsiveness and scalability of the chat room by starting clients on multiple LANs. Each client registers for asynchronous updates to its copy and writes to the chat room randomly once every 30 seconds. We measure the performance of the chat room in terms of the time taken for an append to be accepted by the system, the minimum and maximum taken for an append to reach another chat client. and the bandwidth utilized for the chat session. We measure this while gradually adding clients (increasing from 2 to 512) on more and more LANs (up to 10).

We repeat this experiment with the single-master non-hierarchical caching and consistency scheme and measure the above parameters. This approach is similar to that of existing chat implementations, but uses more bandwidth. We also repeat the experiment with an existing single-server-based popular chat implementation such as IRC [4].

We expect to show that propagation time is reasonable when compared to a single-master implementation as well as IRC, but the bandwidth utilization is much better. Propagation time does not change much as more clients are added to a LAN, but gracefully degrades as clients spread across LANs. Also, we expect to show that a network partition affects only those inside the partition; others can continue to chat.

We also propose to measure the time taken for a new chat client to join the session and obtain the entire transcript both in the single-master and multi-master implementations, and show that the latter takes less time.

5.2.4 Scoreboard

Data dissemination differs from real-time data exchange in that data is typically updated by only one entity, whereas it is potentially monitored by a large number of clients, typically with varying consistency requirements and tolerance to staleness of data. Even though IP multicasting [6] performs better for this, achieving data multicast as a side-effect of update propagation among replicas has several benefits. It allows intermediate data staging along the multicast path to cache data closer to its clients, reducing access latency and sender's load. It enables the quality of multicast-ed data and frequency of data updates to be customized at intermediate hops to the consistency level needed by individual replicas, unlike IP multicasting, where data quality is controlled only at the sender. Client control over data quality level is desirable in applications such as multimedia streaming because there is cost (e.g., bandwidth utilization) associated with maintaining a given data quality level and it depends on the bandwidth available to the client.

We demonstrate the value of adapting update propagation to varying consistency levels on a per-replica basis by implementing a scoreboard. The scoreboard application broadcasts the status of an ongoing game updated (sporadically) at a site to registered listeners across a WAN. Each listener can set how often her copy needs to be refreshed (based on either time or the maximum number of unnotified scoreboard updates tolerated). Each listener can change this refresh rate dynamically as the game progresses at the remote site. User can also explicitly request a refresh. The scoreboard state is kept in a Khazana file. The file's replica consistency policy can be set to single-master push-based last-writer-wins optimistic consistency. The consistency management system can batch updates before propagating them and tune the amount of update traffic generated based on network conditions and tolerated staleness by listeners. This application illustrates one-to-many non-real-time data dissemination using Khazana's consistency management system. In addition to asynchronous updates, this application requires the ability to tune the update propagation rate at a replica site, as well as to explicitly request update propagation. Experiment 6: Customizing Data Quality to Replica-specific Needs

We evaluate how Khazana's update propagation can be adapted to provide a variety of data quality levels at individual replica sites. For this, we create a scoreboard and update it randomly once every 10-15 sec. We run clients on multiple LANs, with each client registering for updates. Some of them register for updates every D seconds, where D is chosen randomly from 15 to 120 sec. Some others choose maximum number of unnotified updates randomly from 1 to 5. We measure the overall bandwidth utilization for propagation of 1000 updates and contrast it with the case where updates are immediately propagated to everybody. We also measure the time taken by a new client to register and receive the first update with and without intermediate caching. We repeat this experiment varying the number of clients from 2 to 512 in powers of 2.

We expect the bandwidth utilization for the customized propagation case to be lower than the eager propagation case and the gap to increase for larger client sets. We also expect the time taken for first update not to degrade with increasing number of clients in case of intermediate caching, and expect it to degrade in the absence of intermediate caching. This experiment validates our claim 2(d).

5.3 Discussion

In this Chapter, we described the success criteria for our thesis by stating a set of claims which together validate our thesis statement. We proposed to implement a set of useful wide-area applications selected from distinct application classes. We then described a series of experiments and their expected results that serve to validate each of our claims. These experiments also demonstrate the performance and scalability of the applications using our consistency management system, thereby establishing its feasibility and usefulness.

6. Schedule

We currently have a working implementation of our prototype, Khazana that runs a LAN and provides API for allocating, freeing and session-based read/write access to Khazana regions. Khazana nodes (currently limited to 64) cooperate to locate regions, and create a hierarchy of cached replicas of a region as a side effect of access. We have a single-master pull-based last-writer-wins optimistic consistency algorithm working in Khazana. With this, we have implemented a prototype file system called KidFS to use a Khazana region per file. It provides a traditional hierarchical naming layer using directories implemented as fixed-size closed hash table data structures inside Khazana regions.

August, September:

With these, we will have enough support to run experiments 1 and 2.





Finish thesis and defend.

7. Conclusion

In this thesis, we proposed to show the feasibility and value of implementing a consistency management system that is flexible enough to serve the data access and delivery needs of a variety of wide-area applications, in the context of a wide-area data storage and caching infrastructure called Khazana. We target applications in the three classes, namely those involving 1) read/write access to shared data/state, 2) data dissemination and 3) data exchange among distributed components over the wide-area. Our approach to achieving scalability, flexibility and performance is to provide a set of useful consistency guarantees for our targeted application classes while exposing enough hooks to allow applications to guide the system's consistency management mechanisms. Unlike other systems that provide one or the other of scalable wide-area support and flexible consistency, our approach is to provide both in a single system with control over its mechanisms.

We proposed a plan to evaluate the effectiveness of our approach and its scalability in the number of objects and clients. Our approach provides several benefits over existing systems. Specifically, it enables reuse of consistency mechanisms for a variety of applications with coexistence of different consistency schemes in the same system; it enables customization of implementation for performance; it also helps achieve scalable data delivery across the wide-area; it conserves bandwidth by enabling applications to customize data quality during update propagation to replica-specific needs.


M.A. Blaze.
Caching in Large Scale Distributed File Systems.
PhD thesis, Princeton University, 1993.

W.J. Bolosky, J.R. Douceur, D. Ely, and M. Theimer.
Feasibility of a serverless distributed file system deployed on an existing set of desktop pcs.
In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '01), pages 34-43, 2001.

J.B. Carter.
Design of the munin distributed shared memory system.
Journal of Parallel and Distributed Computing, 29(2):219-227, September 1995.

Internet Relay Chat.

L.P. Cox and B.D. Noble.
Fast reconciliations in Fluid Replication.
In Proceedings of the 21st International Conference on Distributed Conputing Systems, April 2001.

S.E. Deering.
Multicast routing in internetworks and extended LANs.
In Proceedings of the Sigcomm '88 Symposium, pages 55-64, August 1988.

A. Demers, K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and B. Welch.
The Bayou architecture: Support for data sharing among mobile users.
In Proceedings of the Workshop on Mobile Computing Systems and Applications, December 1994.

The Utah Network Testbed.
http://www.emulab.net/, 2001.

P. Ferreira and M. Shapiro et. al.
PerDis: Design, implementation, and use of a PERsistent DIstributed Store.
In Submitted to The Third Symposium on Operating System Design and Implementation, February 1999.

A. Fox, , and E.A. Brewer.
Harvest, yield and scalable tolerant systems.
In Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, March 1999.

S. Gadde, J. Chase, and M. Rabinovich.
Web caching and content distribution: A view from the interior.
In 5th International Web Caching and Content Delivery Workshop, May 2000.

D.K. Gifford.
Weighted voting for replicated data.
In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 150-162, 1979.

A. Goel, C. Pu, and G. Popek.
View consistency for optimistic replication.
In 17th IEEE Symposium on Reliable Distributed Systems, October 1998.

Steven Gribble, Alon Halevy, Zachary Ives, Maya Rodrig, and Dan Suciu.
Scalable, distributed data structures for internet service construction.
In Proceedings of the Fourth Symposium on Operating System Design and Implementation, 2000.

A.S. Grimshaw and W.W. Wulf.
The Legion vision of a worldwide virtual computer.
Communications of the ACM, 40(1), January 1997.

J. Gwertzman and M. Seltzer.
World-wide web cache consistency.
In Proceedings of the USENIX 1996 Annual Technical Conf., January 1996.

J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West.
Scale and performance in a distributed file system.
ACM Transactions on Computer Systems, 6(1):51-82, February 1988.

Napster Inc.
http://www.napster.com/, 2000.

P.J. Keleher.
Decentralized replicated-object protocols.
In Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing, April 1999.

J.J. Kistler and M. Satyanarayanan.
Disconnected operation in the Coda file system.
In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 213-225, October 1991.

J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, , and B. Zhao.
OceanStore: An architecture for global-scale persistent storage.
In Proceedings of the 9th Symposium on Architectural Support for Programming Languages and Operating Systems, November 2000.

B. Liskov, A. Adya, M. Castro, M. Day, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, and L. Shrira.
Safe and efficient sharing of persistent objects in Thor.
In Proceedings of SIGMOD '96, June 1996.

M. Macedonia and D. Brutzman.
Mbone provides audio and video across the internet.
IEEE Computer, pages 30-36, April 1994.

Microsoft Corp.
Windows 2000 server resource kit.
Microsoft Press, 2000.

P.V. Mockapetris and K. Dunlap.
Development of the domain name system.
In Proceedings of the Sigcomm '88 Symposium, August 1988.

Oracle Corp.
Oracle 7 Server Distributed Systems Manual, Vol. 2, 1996.

J.-F. Pâris.
Voting with bystanders.
In Proceedings of the 9th International Conference on Distributed Computing Systems, pages 394-401, May 1989.

Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin T. Theimer, and Alan J. Demers.
Flexible update propagation for weakly consistent replication.
http://www.parc.xerox.com/csl/projects/bayou/pubs/sosp-97/index.html, 1997.

P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek.
Resolving file conflicts in the Ficus file system.
In Proceedings of the 1994 Summer Usenix Conference, 1994.

A. Rowstron and P. Druschel.
Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility.
In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 2001.

A. Rowstron, A-M. Kermarrec, P. Druschel, and M. Castro.
SCRIBE: The design of a large-scale event notification infrastructure.
Submitted for publication, June 2001.

Y. Saito.
Consistency management in optimistic replication algorithms.
http://www.hpl.hp.com/personal/Yasushi_Saito/replica.pdf, June 2001.

Y. Saito, B.N. Bershad, and H. Levy.
Manageability, availability and performance in Porcupine: A highly scalable internet mail service.
ACM Transactions on Computer Systems, August 2000.

Y. Saito, J. Mogul, and B. Verghese.
A usenet performance study.
http://www.research.digital.com/wrl/projects/newsbench/, September 1998.

Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications.
In Proceedings of the Sigcomm '01 Symposium, San Diego, California, August 2001.

A. Vahdat.
Operating System Services For Wide Area Applications.
PhD thesis, University of California, Berkeley, CA, 1998.

R. van Renesse and A.S. Tanenbaum.
Voting with ghosts.
In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 456-461, May 1988.

M. van Steen, P. Homburg, and A.S. Tanenbaum.
Architectural design of Globe: A wide-area distributed system.
Technical Report IR-422, Vrije Universiteit, Department of Mathematics and Computer Science, March 1997.

H. Yu and A. Vahdat.
Design and evaluation of a continuous consistency model for replicated services.
In Proceedings of the Fourth Symposium on Operating System Design and Implementation, October 2000.

B.Y. Zhao, J. Kubiatowicz, and A.D. Joseph.
Tapestry: An infrastructure for wide-area fault-tolerant location and routing.
Submitted for publication, 2001.

S.Q. Zhuang, B.Y. Zhao, A.D. Joseph, R.H. Katz, and J. Kubiatowicz.
Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination.
In Proceedings of the Eleventh International Workshop on Network and Operating System Support for Digital Audio and Video, 2001.

About this document ...

Flexible Consistency Management in a Wide-area Caching Data Store

This document was generated using the LaTeX2HTML translator Version 99.2beta8 (1.46)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html proposal.tex

The translation was initiated by Sai Rama Krishna Susarla on 2001-10-16


... copies3.1
We use the terms copy and replica interchangeably in the rest of this document.
We use the terms Khazana file and Khazana region interchangeably in the rest of this document.

Sai Rama Krishna Susarla 2001-10-16