CS6963 Distributed Systems

Lecture 19 COPS

Don’t Settle for Eventual:
Scalable Causal Consistency for Wide-Area Storage with COPS
Lloyd, Freedman, Kaminsky, and Andersen

  • Why this paper?

    • Gives us a chance to finally orient all of the consistency models we've seen in class so far.
    • Causal consistency was an interesting point of debate in the 90s in the community.
    • So far eventually consistent and linearizable models seem to be winning out.
    • A solid attempt to solve the problem that Facebook faced.
    • Gives strong local semantics, without completely giving up when replicating across datacenters.
  • Same setting as others:

    • Partitioned KVS
    • Many datacenters
    • Clients at each site
    • Want to replicate keys at each site
      • for availability
      • for "low latency"
    • Dynamo, Bayou, Spanner... all have different takes with a similar goal.
  • Big idea/picture:

    • Want to write in the local cluster and asynchronously replicate to remote clusters.
    • Can we still provide sensible semantics?
    • This has been a big theme in the second half of class.
    • Take on Eventual Consistency:
      • "not only might subsequent reads not reflect the latest value, reads across multiple objects might reflect an incoherent mix of old and new values."
    • Causal+
      • If you see something, you also see every effect that causally preceded it.

A Detour on CAP

  • CAP Theorem

    • Consistency: "Linearizability"
    • Availability: Any node that receives a request must be able to respond immediately.
    • Partition tolerance: the system doesn't trash things when nodes get disconnected in arbitrary patterns.
    • Idea: You need P or your system doesn't work under faults.
    • Given P, only one of A or C is possible.
    • Which should you choose?
  • ALPS

    • Availability
    • Low latency
    • Partition tolerance
    • High scalability: adding N resources improves perf by O(N).
      • This isn't at odds with C from above...
    • Stronger Consistency
      • The punchline for their work.
    • Bayou isn't H or S?
    • Dynamo isn't S?
    • Spanner isn't A or L by their definition.

Causal+

  • What is the strongest model we can provide in a disconnected way?

    • Causal+
  • Example (Page 1):

    • Upload a picture to site.
    • Reference added from album.
    • Are there cases where we can "see" the reference but not the picture?
    • Under Linearizability?
      • No anomaly. Why? Poster did Upload then did AddRef.
      • Upload finished before AddRef started.
      • If I can see AddRef, then my Read started after Upload applied.
      • And, I know I "see" the same order as the poster. QED.
    • Under Sequential Consistency.
      • No anomaly.
      • Same argument for Poster.
      • If I "see" AddRef, I must see the same order as Poster. QED.
    • Under Eventual Consistency?
      • Think of Dynamo.
      • Poster does Upload to one set of N, AddRef to another.
      • I can get an old (or null) value for the photo, and the fresh reference.
    • What about this Causal+?
      • Suppose I'm reading the album from a remote DC.
      • I'm working from replicas that may be missing the most recent updates.
      • If I see the AddRef, then I'll be guaranteed to see the Upload.
      • Why? Upload causally preceded AddRef, so I'm guaranteed to see it.
  • Relationship of all the consistency models we've seen so far.

    • [Figure 3 (Page 4).]
    • Linearizable > Sequential Consistency > Causal+ > Eventual Consistency
    • Linearizability
      • "Operations appear to happen atomically between invocation and response."
      • Requires a total order: part of "atomically".
    • Sequential consistency
      • The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
      • Same as above but drop the "between invocation and response".
      • Still requires a total order, but that order may not be consistent with request/response.
    • Eventual consistency
      • Keys on nodes may contain a mix of old and new values.
      • Given enough time values will progress in history.
      • Not necessarily in the same order or in a consistent way.
      • May never be a point where "everyone" knows "everything", so this model doesn't even require any pair of nodes are ever "up-to-date" with respect to each other.
      • May not even "read your own writes".
      • Dynamo: write to one set of N, then read from different set...

COPS

  • KVS
    • Put(ctxt, key, val), Get(ctxt, key), CreateContext(), DeleteContext(ctxt)
    • Somewhat like Dynamo.
    • When we read, some causal context is added to our thread.
    • When we write, that causal context is stored with the value.
    • Future reads of that value will use the dependencies to ensure that if they "see" that value, they will also "see" everything that causally preceded it.
    • BUT: local writes are linearizable! Why not just send them to the remote store in the same order? We can do that for local, but what about writes at the remote site? They can be 'in parallel' with us.
Client 1:  put(x, 1) -> put(y, 2) -> put(x, 3)
                           V
Client 2:               get(y)=2  -> put(x, 4)
                                        V
Client 3:                            get(x)=4 -> put(z, 5)
  • C2 reading y=2 means a later read of x by C2 must see x=1 or a newer value.

  • How is causality tracked (Page 3)?

    • Execution thread (similar to sequential consistency): gives read your writes.
    • Gets from: a -> b if a is a put and b is a get that returns the value a wrote.
    • Transitivity. if a -> b and b -> c, then a -> c.
  • Point: must ensure that clients only communicate through the data store or all bets are off.

    • e.g. We are sitting next to each other. I post my picture and add it to the album. I refresh the page: looks good.
    • I turn to you and say: I did it! Refresh now.
    • You refresh: perhaps nothing!!!
    • Why? Because things are lagging in the DB.
    • The guarantee you have is that once you can see the ref, you can see the photo.
    • Same scenario except I send you a Facebook message to tell you that I posted the photo? Would you be sure to see it?
    • Yes. Photo -> AddRef -> Message -> get(): Message, so you will.
    • This is a key complaint/controversy over causal consistency.
    • Q: Would this be fixed with seq consistency or serializability?
      • Nope. Only Linearizability (which is why it's called external consistency in the txn world).
    • Can't tell the difference between these models if the DB is the only way to communicate.
  • Above example:

    • get(y)=2 -> put(x, 4): execution thread dep
    • put(y,2) -> get(y)=2: gets-from dep
    • put(y,2) -> put(x,4): transitivity
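
The three rules can be checked mechanically on the three-client example. The sketch below is only an illustration: the operation names are hand-written labels, and the direct edges (execution thread and gets-from) are entered by hand from the diagram. Transitivity is then computed as a transitive closure.

```python
from itertools import product

# Direct edges from the example: execution-thread order within each client,
# plus gets-from edges where a get returns the value written by a put.
direct = {
    ("c1.put(x,1)", "c1.put(y,2)"),   # execution thread
    ("c1.put(y,2)", "c1.put(x,3)"),   # execution thread
    ("c2.get(y)=2", "c2.put(x,4)"),   # execution thread
    ("c3.get(x)=4", "c3.put(z,5)"),   # execution thread
    ("c1.put(y,2)", "c2.get(y)=2"),   # gets from
    ("c2.put(x,4)", "c3.get(x)=4"),   # gets from
}

def transitive_closure(edges):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

happens_before = transitive_closure(direct)

def concurrent(a, b):
    """Two operations are concurrent if neither causally precedes the other."""
    return (a, b) not in happens_before and (b, a) not in happens_before

print(("c1.put(y,2)", "c2.put(x,4)") in happens_before)   # the transitivity dep
print(concurrent("c1.put(x,3)", "c2.put(x,4)"))           # the conflicting pair
```

Note that put(x, 3) and put(x, 4) come out unordered, which is exactly the pair of writes that conflict handling has to deal with.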
  • What are the contexts for?

    • Why does COPS need these when others didn't?
    • Avoids false dependencies.
  • Ok: so weird problem above.

    • Causal consistency would allow put(x, 3) by C1 above and put(x, 4) by C2 to live on indefinitely.
    • C1 (and other clients) can create a whole world of state based on x=3 and another set of clients can work assuming x=4.
    • Really same as in Bayou: conflict can lead to forks of the world.
    • How do we reconcile?
  • Convergent conflict handling

    • Require a commutative and associative conflict handler
    • Thomas's write rule: last-writer-wins?
    • How do we define last?
    • "Doesn't matter" as long as it is deterministic.
    • If it isn't, then when DC A sends its updates to B and B sends its updates to A, the two may remain forked...
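
A minimal sketch of a last-writer-wins handler, assuming versions are compared as (Lamport clock, node id) pairs with the node id breaking ties. The `Versioned` type and the sample writes are made up for illustration. Because the handler just takes a maximum under a total order, it is commutative and associative, so every delivery order reaches the same winner.

```python
from functools import reduce
from itertools import permutations
from typing import NamedTuple

class Versioned(NamedTuple):
    clock: int      # Lamport timestamp
    node_id: int    # unique node id, breaks ties between equal clocks
    value: str

def lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: keep the write with the larger (clock, node_id)."""
    return a if (a.clock, a.node_id) >= (b.clock, b.node_id) else b

# Three writes to the same key; the first two have equal clocks (a conflict).
writes = [
    Versioned(3, 1, "x=3"),
    Versioned(3, 2, "x=4"),
    Versioned(1, 1, "x=1"),
]

# Fold the handler over every possible arrival order and collect the results.
outcomes = {reduce(lww, order) for order in permutations(writes)}
print(outcomes)
```

"Last" here means nothing more than "largest version"; the choice between the two conflicting writes is arbitrary, but every datacenter makes the same choice.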
  • Let's take a second to reflect on what this means.

    • [Example middle-left Page 4.]
    • Event on a calendar.
    • Carol changes start time from 9 PM to 8 PM.
    • Dan changes start time from 9 PM to 10 PM.
    • Could this have happened with Linearizability or SC?
      • Yes, but we could CAS.
    • Can't do that here: still looks like 9 PM when update is done.
    • Causal consistency: diverged state forever is ok.
    • Causal+: just pick 8 or 10 arbitrarily but deterministically.
  • Ok: the actual KVS now. Let's stick to the simple one without GT.

    • [Figure 4 (Page 5).]
    • Dependencies: reverse of causal relationship.
    • Client library maintains set of values read and their version numbers.
    • On put, it sends along the version numbers of all of the values in the current context.
    • Server executes the put
    • Returns a new version number (Lamport clock, server id)
    • Client context is cleared and populated with just that put.
      • Why? Imagine another put. If that put depends on the prior, then it must precede all of the other things the prior put added to its dep list.
    • Meanwhile, the server puts the operation in a remote replication queue bound toward the replica for the same key on the remote side.
    • The remote side receives the put operation along with its dependencies and its version number.
    • Nodes on the remote side wait to apply the put until all dependencies are satisfied.
      • e.g. Some of the values it depends on might be on other nodes in the remote cluster. And some of those may not yet have applied the updates that this update depends on.
    • After that, the update is applied.
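
The client-side bookkeeping above can be sketched as follows. This is an illustration under stated assumptions, not the COPS code: `FakeServer`, `ClientContext`, and the method names are hypothetical, the server is a single in-memory object with server id 0, and replication is left out entirely.

```python
class FakeServer:
    """Stand-in for the local cluster (hypothetical, in-memory only)."""
    def __init__(self):
        self.clock = 0
        self.store = {}                      # key -> (value, version, deps)

    def put_after(self, key, value, deps):
        self.clock += 1
        version = (self.clock, 0)            # (Lamport clock, server id)
        self.store[key] = (value, version, dict(deps))
        return version

    def get(self, key):
        value, version, _ = self.store[key]
        return value, version


class ClientContext:
    """Client library state: the versions this thread currently depends on."""
    def __init__(self, server):
        self.server = server
        self.deps = {}                       # key -> version

    def get(self, key):
        value, version = self.server.get(key)
        self.deps[key] = version             # reads add to the context
        return value

    def put(self, key, value):
        version = self.server.put_after(key, value, self.deps)
        self.deps = {key: version}           # context collapses to this put
        return version


server = FakeServer()
poster = ClientContext(server)
poster.put("photo", "bytes")
poster.put("album", "ref:photo")             # carries a dep on photo

reader = ClientContext(server)
reader.get("album")
reader.put("comment", "nice")                # carries a dep on album only

print(server.store["album"][2])
print(server.store["comment"][2])
```

The comment depends only on the album, yet it still transitively depends on the photo, because the album's own dependency list already names it.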
  • Think back to the Photo album.

    • AddPhoto might get Photo@1.
    • AddRef might get RefToPhoto@2.
    • RefToPhoto@2 might arrive at its remote first!!!
    • Need to stall until Photo@1 is applied.
    • Remote put for RefToPhoto@2 runs "dep_check(Photo@1)" which blocks until true.
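
A small sketch of the remote side for the photo album case. Several simplifications are assumptions of this sketch: the whole remote cluster is one object rather than many nodes, versions are plain integers, and instead of a blocking dep_check the replica parks the put in a pending list and retries when something new is applied. `RemoteCluster` and its methods are hypothetical names.

```python
class RemoteCluster:
    def __init__(self):
        self.applied = {}    # key -> (version, value) of the newest applied write
        self.pending = []    # replicated puts still waiting on a dependency

    def dep_check(self, key, version):
        """True once this cluster has applied `key` at `version` or newer."""
        return key in self.applied and self.applied[key][0] >= version

    def receive(self, key, version, value, deps):
        """Accept a replicated put; apply it only when its deps are satisfied."""
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        # Keep sweeping the pending list until a full pass applies nothing.
        progress = True
        while progress:
            progress = False
            for put in list(self.pending):
                key, version, value, deps = put
                if all(self.dep_check(k, v) for k, v in deps.items()):
                    if key not in self.applied or self.applied[key][0] < version:
                        self.applied[key] = (version, value)
                    self.pending.remove(put)
                    progress = True


remote = RemoteCluster()
remote.receive("album", 2, "ref:photo", {"photo": 1})   # the ref arrives first
visible_early = "album" in remote.applied               # still parked
remote.receive("photo", 1, "bytes", {})                 # unblocks the ref
print(visible_early, remote.applied)
```

A reader at the remote site can never observe the album reference without the photo, even though the two puts arrived out of order.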
  • What have we gained?

    • Updates are only applied at the remote site in an order that respects causality.
    • Notice we didn't have to sequence or serialize all ops to the remote site anywhere.
    • No single point of coordination/ordering among record updates
  • Question: what about a lot of interleaved writes?

    • Site 1: put(x, 1), put(y, 1), put(z, 1)
    • Site 2: put(x, 2), put(y, 2), put(z, 2)
    • What are the possible values we can read for x, y, z after these?
    • Little hard to say: causal relationship across each thread.
    • But, last-writer timestamps might be interleaved?
    • Based on how timestamps are allocated (middle-left, Page 7), it seems likely that one node's values will dominate the other's in this case.
    • It seems we might see either value for any of them, and some may flip after a bit, but each should then settle on one of the two values.
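
To make the question concrete, here is a toy simulation. It assumes plain Lamport clocks (increment on a local put, take the max on receipt) with the node id as tie-breaker, and last-writer-wins on (clock, node id). Whether this matches the paper's timestamp rule in every detail is an assumption, and `Site` and `run` are hypothetical names. The two runs differ only in when Site 2's writes reach Site 1.

```python
class Site:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.store = {}      # key -> ((clock, node_id), value)

    def _apply(self, key, version, value):
        # Last-writer-wins on the (clock, node_id) version.
        if key not in self.store or self.store[key][0] < version:
            self.store[key] = (version, value)

    def put(self, key, value):
        self.clock += 1
        version = (self.clock, self.node_id)
        self._apply(key, version, value)
        return key, version, value           # message to replicate

    def receive(self, key, version, value):
        self.clock = max(self.clock, version[0])
        self._apply(key, version, value)


def run(early_delivery):
    s1, s2 = Site(1), Site(2)
    from_s2 = [s2.put(k, 2) for k in "xyz"]
    from_s1 = [s1.put("x", 1)]
    if early_delivery:                       # Site 2's writes arrive mid-stream
        for msg in from_s2:
            s1.receive(*msg)
    from_s1 += [s1.put("y", 1), s1.put("z", 1)]
    if not early_delivery:                   # or only after Site 1 is done
        for msg in from_s2:
            s1.receive(*msg)
    for msg in from_s1:
        s2.receive(*msg)
    assert s1.store == s2.store              # both sites converge
    return {k: v for k, (_, v) in s1.store.items()}


print(run(early_delivery=False))
print(run(early_delivery=True))
```

With no communication during the writes, the clocks tie and the higher node id wins every key. When Site 2's writes land at Site 1 partway through, Site 1's later puts get larger clocks, so the final state is a mix. Either way both sites settle on the same values.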
  • Recap:

    • We want fast local writes/reads.
    • We want remote replication.
    • We want disconnected operation.
    • But we want some way to reason about what we see/order of operations.
    • Don't want to sequence at some central point.
    • How well will this work with disconnects?
      • A story for app-level conflict resolution à la Bayou.