Lecture 19 COPS
Don’t Settle for Eventual:
Scalable Causal Consistency for Wide-Area Storage with COPS
Lloyd, Freedman, Kaminsky, and Andersen
Why this paper?
- Gives us a chance to finally orient all of the consistency models we've seen
in class so far.
- Causal consistency was an interesting point of debate in the 90s in the
community.
- So far eventually consistent and linearizable models seem to be winning
out.
- A solid attempt to solve the problem that Facebook faced.
- Gives strong local semantics without completely giving up on consistency when
replicating across datacenters.
Same setting as others:
- Partitioned KVS
- Many datacenters
- Clients at each site
- Want to replicate keys at each site
- for availability
- for "low latency"
- Dynamo, Bayou, Spanner... all have different takes with a similar goal.
Big idea/picture:
- Want write in local cluster and asynchronously replicate to remotes.
- Can we still provide sensible semantics?
- This has been a big theme in the second half of class.
- Take on Eventual Consistency:
- "non only might subsequent reads not reflect the latest
value, reads across multiple objects might reflect an incoherent mix of
old and new values."
- Causal+
- If you see something, you also see every effect that causally preceded
it.
A Detour on CAP
CAP Theorem
- Consistency: "Linearizability"
- Availability: a request to any (non-failed) node must be able to get a
response immediately.
- Partition tolerance: the system doesn't break its guarantees when nodes get
disconnected in arbitrary patterns.
- Idea: You need P or your system doesn't work under faults.
- Given P, only A or C is possible.
- Which should you choose?
ALPS
- Availability
- Low latency
- Partition tolerance
- High scalability: adding N resources improves perf by O(N).
- This isn't at odds with C from above...
- Stronger Consistency
- The punchline for their work.
- Bayou isn't H or S?
- Dynamo isn't S?
- Spanner isn't A or L by their definition.
Causal+
COPS
- KVS
- Put(ctxt, key, val), Get(ctxt, key), CreateContext(), DeleteContext(ctxt)
- Somewhat like Dynamo.
- When we read, some causal context is added to our thread.
- When we write, that causal context is stored with the value.
- Future reads of that value will use the dependencies to ensure that if they
"see" that value, they will also "see" everything that causally preceded
it.
- BUT: local writes are linearizable! Why not just send them to the remote
store in the same order? We can do that for local writes, but what about
writes issued at the remote site? They can be 'in parallel' with ours.
Client 1: put(x, 1) -> put(y, 2) -> put(x, 3)
Client 2: get(y)=2 -> put(x, 4)            (get(y)=2 reads Client 1's put(y, 2))
Client 3: get(x)=4 -> put(z, 5)            (get(x)=4 reads Client 2's put(x, 4))
C2 reading y=2 means a later read of x must see at least x=1 (the put that
causally preceded put(y, 2)).
How is causality tracked (Page 3)?
- Execution thread (program order within a thread, as in sequential
consistency): gives you read-your-writes.
- Gets From: a -> b if b is a get that reads the value written by put a.
- Transitivity. if a -> b and b -> c, then a -> c.
Point: must ensure that clients only communicate through the data store or
all bets are off.
- e.g. We are sitting next to each other. I post my picture and add it to the
album. I refresh the page: looks good.
- I turn to you and say: I did it! Refresh now.
- You refresh: perhaps nothing!!!
- Why? Because things are lagging in the DB.
- The guarantee you have is that once you can see the ref, you can see the
photo.
- Same scenario except I send you a Facebook message to tell you that I
posted the photo? Would you be sure to see it?
- Yes. Photo -> AddRef -> Message -> get(): Message, so you will.
- This is a key complaint/controversy over causal consistency.
- Q: Would this be fixed with seq consistency or serializability?
- Nope. Only Linearizability (which is why it's called external
consistency in the txn world).
- Can't tell the difference between these models if the DB is the only way to
communicate.
Above example:
- get(y)=2 -> put(x, 4): execution thread dep
- put(y,2) -> get(y)=2: gets from dep
- put(y,2) -> put(x,4): transitivity
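To see how the three rules compose, here is a toy transitive-closure
computation over the trace above (the event labels and the code are mine, not
from the paper):

    # Toy happens-before closure over the example trace; illustrative only.
    edges = {
        # execution-thread order within each client
        ("C1:put(x,1)", "C1:put(y,2)"), ("C1:put(y,2)", "C1:put(x,3)"),
        ("C2:get(y)=2", "C2:put(x,4)"),
        ("C3:get(x)=4", "C3:put(z,5)"),
        # gets-from: a get reads the value some put wrote
        ("C1:put(y,2)", "C2:get(y)=2"),
        ("C2:put(x,4)", "C3:get(x)=4"),
    }

    def happens_before(edges):
        # transitivity: if a -> b and b -> c, then a -> c
        closure = set(edges)
        changed = True
        while changed:
            changed = False
            for (a, b) in list(closure):
                for (c, d) in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        changed = True
        return closure

    hb = happens_before(edges)
    print(("C1:put(y,2)", "C2:put(x,4)") in hb)   # True, via transitivity
    print(("C1:put(x,3)", "C2:put(x,4)") in hb)   # False: those puts are concurrent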
What are the contexts for?
- Why does COPS need these when others didn't?
- Avoids false dependencies.
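A rough sketch of what a context holds and why it stays small; the function
names are illustrative, not the actual COPS client library:

    # Toy per-thread context: only what this thread read or wrote becomes a
    # dependency, and a put collapses the context to that single version.
    def create_context():
        return {}                           # key -> version this thread observed

    def on_get(ctx, key, version):
        ctx[key] = version                  # a read becomes a causal dependency

    def on_put(ctx, key, new_version):
        deps = list(ctx.items())            # shipped along with the write
        ctx.clear()
        ctx[key] = new_version              # the new put already "covers" the old deps
        return deps

    ctx = create_context()
    on_get(ctx, "y", 2)
    print(on_put(ctx, "x", 4))              # [('y', 2)]
    print(on_put(ctx, "w", 5))              # [('x', 4)]  -- no need to re-list y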
Ok: so weird problem above.
- Causal consistency would allow put(x, 3) by C1 above and
- put(x, 4) by C2 to live on indefinitely.
- C1 (and other clients) can create a whole world of state based on x=3
and another set of clients can work assuming x=4.
- Really same as in Bayou: conflict can lead to forks of the world.
- How do we reconcile?
Convergent conflict handling
- Require a commutative and associative conflict handler
- Thomas's write rule: last-writer-wins?
- How do we define last?
- "Doesn't matter" as long as it is determinstic.
- If it isn't then if DC A sends updates to B and B to A, then they may
remain forked...
Let's take a second to reflect on that this means.
- [Example middle-left Page 4.]
- Event on a calendar.
- Carol changes start time from 9 PM to 8 PM.
- Dan changes start time from 9 PM to 10 PM.
- Could this have happened with Linearizability or SC?
- Can't do that here: still looks like 9 PM when update is done.
- Causal consistency: diverged state forever is ok.
- Causal+: just pick 8 or 10 randomly but deterministically.
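A minimal sketch of last-writer-wins over (Lamport clock, node id) versions for
the calendar example; the class, node ids, and clock values here are made up
for illustration, not the paper's code:

    # Toy deterministic last-writer-wins resolution.
    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class Version:
        lamport: int    # compared first
        node_id: int    # deterministic tie-break, so every datacenter agrees

    def resolve(a, b):
        # "last" is just the larger version; any deterministic rule converges
        return a if a[0] > b[0] else b

    carol = (Version(12, 1), "8 PM")        # Carol's edit at one datacenter
    dan   = (Version(12, 2), "10 PM")       # Dan's concurrent edit at another
    print(resolve(carol, dan)[1])           # every datacenter converges to "10 PM"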
Ok: the actual KVS now. Let's stick to the simple one without get transactions (GT).
- [Figure 4 (Page 5).]
- Dependencies: reverse of causal relationship.
- Client library maintains set of values read and their version numbers.
- On put, it sends along the version numbers of all of the values in the
current context.
- Server executes the put
- Returns a new version number (Lamport clock, server id)
- Client context is cleared and populated with just that put.
- Why? Imagine another put. If that put depends on the prior one, then
everything in the old dependency list already precedes the prior put, so
listing just the prior put is enough.
- Meanwhile, the server puts the operation in a remote replication queue
bound toward the replica for the same key on the remote side.
- The remote side receives the put operation along with its dependencies and
its version number.
- Nodes on the remote side wait to apply the put until all dependencies are
satisfied.
- e.g. Some of the values it depends on might be on other nodes in the
remote cluster, and some of those nodes may not yet have applied the
updates that this update depends on.
- Once they have, the update is applied.
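A sketch of the local (primary) side of a put under these assumptions; the
class and field names are mine, not the paper's implementation:

    # Assign a (Lamport clock, node id) version, apply locally, and queue the
    # write for asynchronous replication to the remote datacenter.
    import itertools

    class LocalNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.clock = itertools.count(1)
            self.store = {}                 # key -> (version, value)
            self.replication_queue = []     # writes bound for the remote datacenter

        def put_after(self, key, value, deps):
            version = (next(self.clock), self.node_id)
            self.store[key] = (version, value)            # visible locally right away
            self.replication_queue.append((key, version, value, deps))
            return version                                 # client resets its context to this

    node = LocalNode(node_id=1)
    v_photo = node.put_after("photo:42", "<jpeg>", deps=[])
    node.put_after("album:7", "ref to photo:42", deps=[("photo:42", v_photo)])
    print(node.replication_queue[-1])       # the ref carries its photo dependency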
Think back to the Photo album.
- AddPhoto might get Photo@1.
- AddRef might get RefToPhoto@2.
- RefToPhoto@2 might arrive at its remote first!!!
- Need to stall until Photo@1 is applied.
- Remote put for RefToPhoto@2 runs "dep_check(Photo@1)" which blocks until
true.
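A toy sketch of the remote-side apply with dep_check, using polling instead of
blocking; illustrative only, not the paper's actual code:

    # Hold a replicated write until every version it depends on is applied locally.
    applied = {}                            # key -> highest version applied locally

    def dep_check(key, version):
        return applied.get(key, 0) >= version

    def try_apply(pending):
        still_pending = []
        for key, version, value, deps in pending:
            if all(dep_check(k, v) for k, v in deps):
                applied[key] = max(applied.get(key, 0), version)
                print("applied", key, "@", version)
            else:
                still_pending.append((key, version, value, deps))
        return still_pending

    # RefToPhoto@2 arrives before Photo@1, so it stalls on dep_check(Photo@1):
    pending = try_apply([("RefToPhoto", 2, "ref", [("Photo", 1)])])   # nothing applies
    try_apply([("Photo", 1, "jpeg", [])] + pending)                   # Photo@1, then the ref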
What have we gained?
- Updates are only applied at the remote site in an order that respects
causality.
- Notice we didn't have to sequence or serialize all ops to the remote site
anywhere.
- No single point of coordination/ordering among record updates
Question: what about a lot of interleaved writes?
- Site 1: put(x, 1), put(y, 1), put(z, 1)
- Site 2: put(x, 2), put(y, 2), put(z, 2)
- What are the possible values we can read for x, y, z after these?
- A little hard to say: there's a causal relationship within each thread, but
the two threads are concurrent with each other.
- But, last-writer timestamps might be interleaved?
- Based on how timestamps are allocated (middle-left, Page 7), it seems likely
that one node's values will dominate the other's in this case.
- Seems like we might see either value for any of them; some may flip after a
bit, but then each should settle on one of the two values.
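A sketch of why one site tends to win every key in this scenario, assuming
versions are (Lamport clock, node id) pairs and the two sites haven't exchanged
any messages yet (my encoding of the paper's description):

    # With no cross-site traffic, both clocks advance from the same point, so the
    # higher node id breaks every tie and one site's values dominate.
    def run_site(node_id, keys):
        clock, versions = 0, {}
        for k in keys:
            clock += 1
            versions[k] = (clock, node_id)
        return versions

    site1 = run_site(1, ["x", "y", "z"])    # (1,1), (2,1), (3,1)
    site2 = run_site(2, ["x", "y", "z"])    # (1,2), (2,2), (3,2)
    for k in ["x", "y", "z"]:
        winner = 1 if site1[k] > site2[k] else 2
        print(k, "settles on site", winner)  # site 2 wins every key here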
Recap:
- We want fast local writes/reads.
- We want remote replication.
- We want disconnected operation.
- But we want some way to reason about what we see/order of operations.
- Don't want to sequence at some central point.
- How well will this work with disconnects?
- A story for app-level conflict resolution, a la Bayou.