CS6963 Distributed Systems

Lecture 18 Eventual Consistency, Dynamo

Dynamo: Amazon's Highly Available Key-value Store
DeCandia et al, SOSP 2007

  • Why are we reading this paper?

    • Database, eventually consistent, write any replica.
      • Like Bayou - but a database! A surprising design.
    • A real system: used for e.g. shopping cart at Amazon.
      • Less clear whether/how much this is used at Amazon today.
    • More available than Spanner.
    • Less consistent than Spanner.
    • Influential design; inspired e.g. Cassandra
  • Their Obsessions

    • SLA, e.g. 99.9th percentile of delay < 300 ms
    • constant failures
    • "data centers being destroyed by tornadoes"
    • "always writeable"
  • Big picture

    • [lots of data centers, Dynamo nodes]
    • each item replicated at a few random nodes, by key hash
  • Why replicas at just a few sites? Why not replica at every site?

    • with two data centers, site failure takes down 1/2 of nodes
      • so need to be careful that everything replicated at both sites
    • with 10 data centers, site failure affects small fraction of nodes
      • so just need copies at a few sites
  • Consequences of mostly remote access (since no guaranteed local copy)

    • most puts/gets may involve WAN traffic - high delays
      • maybe distinct Dynamo instances with limited geographical scope?
      • paper quotes low average delays in graphs but does not explain
    • more vulnerable to network failure than Bayou
      • again since no local copy
  • Consequences of "always writeable"

    • always writeable => no master! must be able to write locally.
    • always writeable + failures = conflicting versions
  • Idea #1: eventual consistency

    • accept writes at any replica
    • allow divergent replicas
    • allow reads to see stale or conflicting data
    • resolve multiple versions when failures go away
      • latest version if no conflicting updates
      • if conflicts, reader must merge and then write
    • like Bayou and Ficus - but in a DB
  • Unhappy consequences of eventual consistency

    • May be no unique "latest version"
    • Read can yield multiple conflicting versions
    • Application must merge and resolve conflicts
    • No atomic operations (e.g. no compare-and-swap)
    • Also keep in mind: this is a simple key/value store (KVS); it isn't transactional (unlike most of the systems we've talked about so far).
  • Idea #2: sloppy quorum

    • try to get the consistency benefits of a single master when there are no failures
      • but allow progress even if the coordinator fails
    • when no failures, send reads/writes through single node
      • the coordinator
      • causes reads to see writes in the usual case
    • but don't insist! allow reads/writes to any replica if failures
  • Where to place data - consistent hashing

    • [ring, and physical view of servers]
    • node ID = random
    • key ID = hash(key)
    • coordinator: successor of key
      • clients send puts/gets to coordinator
    • replicas at successors - "preference list"
    • coordinator forwards puts (and gets...) to nodes on preference list
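
    A minimal sketch of this placement scheme (illustration only, not Dynamo's code; the node names, the MD5 ring hash, and N=3 are assumptions, and virtual nodes are ignored):

      import hashlib
      from bisect import bisect_right

      def ring_hash(s):
          # position on the ring = 128-bit hash of the string
          return int(hashlib.md5(s.encode()).hexdigest(), 16)

      class Ring:
          def __init__(self, node_names, n=3):
              self.n = n
              # each node sits at a (pseudo-)random point on the ring
              self.nodes = sorted((ring_hash(name), name) for name in node_names)

          def preference_list(self, key):
              ids = [h for h, _ in self.nodes]
              i = bisect_right(ids, ring_hash(key)) % len(self.nodes)  # successor of hash(key)
              # coordinator first, then the next n-1 successors around the ring
              return [self.nodes[(i + j) % len(self.nodes)][1] for j in range(self.n)]

      ring = Ring(["n1", "n2", "n3", "n4", "n5"])
      print(ring.preference_list("cart:alice"))   # coordinator first, then replicas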
  • Why consistent hashing?

    • Pro
      • naturally somewhat balanced
      • decentralized - both lookup and join/leave
    • Con (section 6.2)
      • not really balanced (why not?), need virtual nodes
      • hard to control placement (balancing popular keys, spread over sites)
      • join/leave changes partition, requires data to shift
      • Assumes uniform op cost.
  • Failures

    • Tension: temporary or permanent failure?
      • node unreachable - what to do?
      • if temporary, store new puts elsewhere until node is available
      • if permanent, need to make new replica of all content
    • Dynamo itself treats all failures as temporary
  • Temporary failure handling: quorum

    • goal: do not block waiting for unreachable nodes
    • goal: put should always succeed
    • goal: get should have high prob of seeing most recent put(s)
    • quorum: R + W > N
      • never wait for all N
      • but R and W will overlap
      • cuts tail off delay distribution and tolerates some failures
    • the N nodes are the first N reachable nodes in the preference list
      • each node pings successors to keep rough estimate of up/down
      • "sloppy" quorum, since nodes may disagree on reachable
    • sloppy quorum means R/W overlap not guaranteed
  • coordinator handling of put/get:

    • sends put/get to first N reachable nodes, in parallel
    • put: waits for W replies
    • get: waits for R replies
    • if failures aren't too crazy, get will see all recent put versions
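
    A sketch of the coordinator's wait-for-W/R logic (assumptions: a long-lived thread pool and hypothetical rpc_put/rpc_get helpers; timeouts, hinted handoff, and read repair are omitted):

      from concurrent.futures import ThreadPoolExecutor, as_completed

      N, R, W = 3, 2, 2
      pool = ThreadPoolExecutor(max_workers=16)   # long-lived; stragglers finish in background

      def coordinator_put(pref_list, key, value, context, rpc_put):
          # send the put to the first N reachable nodes in parallel; succeed after W acks
          futures = [pool.submit(rpc_put, node, key, value, context) for node in pref_list[:N]]
          acks = 0
          for f in as_completed(futures):
              try:
                  if f.result():           # True = replica stored the value
                      acks += 1
              except Exception:
                  continue                 # unreachable replica; keep waiting for others
              if acks >= W:
                  return True              # don't wait for the remaining replies
          return False                     # fewer than W acks: the put fails

      def coordinator_get(pref_list, key, rpc_get):
          # ask the first N reachable nodes; return every version seen after R replies
          futures = [pool.submit(rpc_get, node, key) for node in pref_list[:N]]
          versions, replies = [], 0
          for f in as_completed(futures):
              try:
                  versions.extend(f.result())   # each replica returns a list of versions
                  replies += 1
              except Exception:
                  continue
              if replies >= R:
                  break                    # the caller reconciles/merges these versions
          return versions                  # (a real get would fail if replies < R)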
  • When might this quorum scheme not provide R/W intersection?

  • What if a put() leaves data far down the ring?

    • after the failures are repaired, that data sits beyond the first N nodes, so ordinary reads may miss it
    • the server holding it remembers a "hint" about where the data really belongs
    • it forwards the data once the real home is reachable again ("hinted handoff"; sketch below)
    • also: periodic "Merkle tree" sync of key ranges between replicas
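
    A cartoon of hinted handoff (the reachable() check and rpc_put helper are assumed; the Merkle-tree sync is not shown):

      # values this node is holding on behalf of an unreachable "real home"
      hinted = []                       # list of (home_node, key, value, context)

      def store_with_hint(home_node, key, value, context):
          hinted.append((home_node, key, value, context))

      def handoff_pass(reachable, rpc_put):
          # run periodically: forward each hinted value once its real home is back
          global hinted
          keep = []
          for home, key, value, ctx in hinted:
              if reachable(home) and rpc_put(home, key, value, ctx):
                  continue              # delivered, so drop our temporary copy
              keep.append((home, key, value, ctx))
          hinted = keep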
  • How can multiple versions arise?

    • Maybe a node missed the latest write due to network problem
    • So it has old data, should be superseded by newer put()s
    • get() consults R, will likely see newer version as well as old
  • How can conflicting versions arise?

    • N=3 R=2 W=2
    • shopping cart, starts out empty ""
    • preference list n1, n2, n3, n4
    • client 1 wants to add item X
      • get() from n1, n2, yields ""
      • n1 and n2 fail
      • put("X") goes to n3, n4
    • client 2 wants to delete X
      • get() from n3, n4, yields "X"
      • put("") to n3, n4 - but suppose only n4 (plus a hinted stand-in) receives it, satisfying W=2; n3 still holds "X"
    • n1, n2 revive
    • client 3 wants to add Y
      • get() from n1, n2 yields ""
      • put("Y") to n1, n2
    • client 3 wants to display cart
      • get() from n1, n3 yields two values!
        • "X" and "Y"
        • neither supersedes the other - the put()s conflicted
  • How should clients resolve conflicts on read?

    • Depends on the application
    • Shopping basket: merge by taking union?
      • Would un-delete item X
      • Weaker than Bayou (which gets deletion right), but simpler
    • Some apps probably can use latest wall-clock time
      • E.g. if I'm updating my password
      • Simpler for apps than merging
    • Write the merged result back to Dynamo
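
    A sketch of the client-side read/merge/write-back pattern for a cart, assuming a hypothetical client API where get() returns (versions, context) and put() takes that context:

      def read_cart(dynamo, key):
          versions, context = dynamo.get(key)       # versions: list of carts (sets of items)
          carts = [set(v) for v in versions]
          merged = set().union(*carts) if carts else set()
          if len(carts) > 1:
              # conflicting versions: write the union back; the context tells the
              # coordinator that this new value supersedes everything we just read
              dynamo.put(key, merged, context)
          return merged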
  • How to detect whether two versions conflict?

    • As opposed to a newer version superseding an older one
    • If they are not bit-wise identical, must client always merge+write?
    • We have seen this problem before...
  • Version vectors

  Example tree of versions (a chain):
    [a:1]
      |
    [a:1,b:2]
  • the VVs indicate that [a:1,b:2] supersedes [a:1]
  • Dynamo nodes automatically drop [a:1] in favor of [a:1,b:2]
  Example tree with a branch:
        [a:1]
       /     \
    [a:2]   [a:1,b:2]
  • neither VV supersedes the other, so the client must merge
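
  A small sketch of the comparison rule, with version vectors as dicts from node name to counter (the values are just the examples above):

    def descends(a, b):
        # a supersedes b if a's counter is >= b's for every node in b
        return all(a.get(node, 0) >= count for node, count in b.items())

    def compare(a, b):
        if descends(a, b) and descends(b, a): return "identical"
        if descends(a, b):                    return "first supersedes second"
        if descends(b, a):                    return "second supersedes first"
        return "conflict - client must merge"

    print(compare({"a": 1, "b": 2}, {"a": 1}))   # first supersedes second -> drop [a:1]
    print(compare({"a": 1, "b": 2}, {"a": 2}))   # conflict -> client must merge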
  • get(k) may return multiple versions, along with "context"

    • and put(k, v, context)
    • put context tells coordinator which versions this put supersedes/merges
  • Won't the VVs get big?

    • Yes, but slowly, since key mostly served from same N nodes
    • Dynamo deletes least-recently-updated entry if VV has > 10 elements
  • Impact of deleting a VV entry?

    • won't realize one version subsumes another, will merge when not needed:
      • put@b: [b:2]
      • put@a: [a:3, b:2]
      • forget b:2: [a:3]
      • now, if you sync w/ [b:2], looks like a merge is required
    • forgetting the oldest entry is the safer choice
      • the oldest entry is the one most likely to also appear in other branches
      • so dropping it at worst makes versions look concurrent, forcing an unneeded merge
      • forgetting the newest would erase evidence of the most recent differences
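
    Using a descends() check like the one sketched earlier, the [b:2] example above plays out like this:

      def descends(a, b):
          return all(a.get(n, 0) >= c for n, c in b.items())

      full      = {"a": 3, "b": 2}     # put@a after put@b, full VV
      truncated = {"a": 3}             # the same version after forgetting b:2
      old       = {"b": 2}

      print(descends(full, old))       # True: [a:3,b:2] supersedes [b:2], no merge needed
      print(descends(truncated, old))  # False: looks concurrent, a needless merge is forced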
  • Is client merge of conflicting versions always possible?

    • Suppose we're keeping a counter, x
    • x starts out 0
    • incremented twice
    • but failures prevent clients from seeing each others' writes
    • After heal, client sees two versions, both x=1
    • What's the correct merge result?
    • Can the client figure it out?
  • What if two clients concurrently write w/o failure?

    • e.g. two clients add diff items to same cart at same time
    • Each does get-modify-put
    • They both see the same initial version
    • And they both send put() to same coordinator
    • Will coordinator create two versions with conflicting VVs?
      • We want that outcome, otherwise one was thrown away
      • Paper doesn't say, but coordinator could detect problem via put() context
  • Permanent server failures / additions?

    • Admin manually modifies the list of servers
    • System shuffles data around - this takes a long time!
  • The Question:

    • It takes a while for news of an added/deleted server to reach all the other servers. Does this cause trouble?
    • Deleted server might get put()s meant for its replacement.
    • Deleted server might receive get()s after missing some put()s.
    • Added server might miss some put()s b/c not known to coordinator.
    • Added server might serve get()s before fully initialized.
    • Dynamo probably will do the right thing:
      • Quorum likely causes get() to see fresh data as well as stale.
      • Replica sync (4.7) will eventually fill in the put()s that a server missed.
  • Is the design inherently low delay?

    • No: client may be forced to contact distant coordinator
    • No: some of the R/W nodes may be distant, coordinator must wait
  • What parts of design are likely to help limit 99.9th pctile delay?

    • This is a question about variance, not mean
    • Bad news: waiting for multiple servers takes max of delays, not e.g. avg
    • Good news: Dynamo only waits for W or R out of N
      • cuts off tail of delay distribution
      • e.g. if nodes have 1% chance of being busy with something else
      • or if a few nodes are broken, network overloaded, &c
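
    A toy simulation of this effect (all numbers here are made up: replicas usually answer in about 10 ms, but are "busy" 1% of the time and take 300 ms):

      import random

      def request_delay(wait_for, n_sent=3, slow_prob=0.01):
          # the request completes when the wait_for-th fastest replica has replied
          delays = [300.0 if random.random() < slow_prob else random.gauss(10, 2)
                    for _ in range(n_sent)]
          return sorted(delays)[wait_for - 1]

      def pctile(samples, p):
          return sorted(samples)[int(p * len(samples))]

      wait_all = [request_delay(3) for _ in range(100000)]   # wait for all N
      wait_two = [request_delay(2) for _ in range(100000)]   # wait for W or R = 2 of N = 3
      print(pctile(wait_all, 0.999))   # ~300 ms: one busy replica drags the whole request
      print(pctile(wait_two, 0.999))   # much smaller (~15 ms): the tail is cut off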
  • No real Eval section, only Experience

  • How does Amazon use Dynamo?

    • shopping cart (merge)
    • session info (maybe Recently Visited &c?) (most recent TS)
    • product list (mostly r/o, replication for high read throughput)
  • They claim main advantage of Dynamo is flexible N, R, W

    • What do you get by varying them?
  N-R-W
  3-2-2 : default, reasonably fast R/W, reasonable durability
  3-3-1 : fast W, slow R, not very durable, not useful?
  3-1-3 : fast R, slow W, durable
  3-3-3 : ??? reduce chance of R missing W?
  3-1-1 : not useful?
  • They had to fiddle with the partitioning / placement / load balance (6.2)

    • Old scheme:
      • Random choice of node ID meant new node had to split old nodes' ranges
      • Which required expensive scans of on-disk DBs
    • New scheme:
      • Pre-determined set of Q evenly divided ranges
      • Each node is coordinator for a few of them
      • New node takes over a few entire ranges
      • Store each range in a file, can xfer whole file
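
    A sketch of the new scheme's key-to-range mapping (Q, the node names, and the hash are made up; the point is that the Q ranges are fixed no matter which nodes exist):

      import hashlib

      Q = 8                                     # number of fixed, equal-size key ranges
      NODES = ["n1", "n2", "n3", "n4", "n5"]    # assumed node names

      def range_of(key):
          # hash the key onto [0, 2^128) and map it to one of the Q fixed ranges
          h = int(hashlib.md5(key.encode()).hexdigest(), 16)
          return h * Q // (2 ** 128)

      # ranges are assigned to nodes as whole units, a few per node; a joining
      # node takes over whole ranges (whole files) instead of splitting old ones
      assignment = {r: [NODES[(r + i) % len(NODES)] for i in range(3)] for r in range(Q)}

      print(range_of("cart:alice"), assignment[range_of("cart:alice")])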
  • How useful is ability to have multiple versions? (6.3)

    • i.e. how useful is eventual consistency
    • This is a Big Question for them
    • 6.3 claims 0.001% of reads see divergent versions
      • I believe they mean conflicting versions (not benign multiple versions)
      • Is that a lot, or a little?
    • So perhaps 0.001% of writes benefitted from always-writeable?
      • i.e. would have blocked in primary/backup scheme?
    • Very hard to guess:
      • They hint that the cause was concurrent writers, for which a better solution is a single master
      • But also maybe their measurement doesn't count situations where availability would have been worse if single master
  • Performance / throughput (Figure 4, 6.1)

    • Figure 4 says average 10ms read, 20 ms writes
      • the 20 ms must include a disk write
      • 10 ms probably includes waiting for R/W of N
    • Figure 4 says 99.9th pctile is about 100 or 200 ms
      • Why?
      • "request load, object sizes, locality patterns"
      • does this mean sometimes they had to wait for a coast-to-coast msg?
  • Puzzle: why are the average delays in Figure 4 and Table 2 so low?

    • Implies they rarely wait for WAN delays
    • But Section 6 says "multiple datacenters"
      • you'd expect most coordinators and most nodes to be remote!
      • Maybe all datacenters are near Seattle?
  • Wrap-up

    • Big ideas:
      • eventual consistency
      • always writeable despite failures
      • allow conflicting writes, client merges
    • Awkward model for some applications (stale reads, merges)
      • this is hard for us to tell from paper
    • Maybe a good way to get high availability + no blocking on WAN
      • but PNUTS's per-record master scheme suggests Yahoo didn't think blocking on a master was a problem
    • No agreement on whether eventual consistency is good for storage systems