CS6963 Distributed Systems

Lecture 18 Eventual Consistency, Dynamo

Dynamo: Amazon's Highly Available Key-value Store
DeCandia et al, SOSP 2007

  • Why are we reading this paper?

    • Database, eventually consistent, write any replica.
      • Like Bayou - but a database! A surprising design.
    • A real system: used for e.g. shopping cart at Amazon.
      • Less clear whether/how much this is used at Amazon today.
    • More available than Spanner.
    • Less consistent than Spanner.
    • Influential design; inspired e.g. Cassandra
  • Their Obsessions

    • SLA, e.g. 99.9th percentile of delay < 300 ms
    • constant failures
    • "data centers being destroyed by tornadoes"
    • "always writeable"
  • Big picture

    • [lots of data centers, Dynamo nodes]
    • each item replicated at a few random nodes, by key hash
  • Why replicas at just a few sites? Why not replica at every site?

    • with two data centers, site failure takes down 1/2 of nodes
      • so need to be careful that everything replicated at both sites
    • with 10 data centers, site failure affects small fraction of nodes
      • so just need copies at a few sites
  • Consequences of mostly remote access (since no guaranteed local copy)

    • most puts/gets may involve WAN traffic - high delays
      • maybe distinct Dynamo instances with limited geographical scope?
      • paper quotes low average delays in graphs but does not explain
    • more vulnerable to network failure than Bayou
      • again since no local copy
  • Consequences of "always writeable"

    • always writeable => no master! must be able to write locally.
    • always writeable + failures = conflicting versions
  • Idea #1: eventual consistency

    • accept writes at any replica
    • allow divergent replicas
    • allow reads to see stale or conflicting data
    • resolve multiple versions when failures go away
      • latest version if no conflicting updates
      • if conflicts, reader must merge and then write
    • like Bayou and Ficus - but in a DB
  • Unhappy consequences of eventual consistency

    • May be no unique "latest version"
    • Read can yield multiple conflicting versions
    • Application must merge and resolve conflicts
    • No atomic operations (e.g. no compare-and-swap)
    • Also keep in mind: this is a simple key/value store (KVS); it isn't transactional (unlike most of the systems we've talked about so far).
  • Idea #2: sloppy quorum

    • try to get the consistency benefits of a single master when there are no failures
      • but allow progress even if the coordinator fails
    • when no failures, send reads/writes through single node
      • the coordinator
      • causes reads to see writes in the usual case
    • but don't insist! allow reads/writes to any replica if failures
  • Where to place data - consistent hashing

    • [ring, and physical view of servers]
    • node ID = random
    • key ID = hash(key)
    • coordinator: successor of key
      • clients send puts/gets to coordinator
    • replicas at successors - "preference list"
    • coordinator forwards puts (and gets...) to nodes on preference list
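
    A minimal sketch of this placement scheme (illustration only, not Dynamo's code; the node names, the MD5 ring hash, and N=3 are assumptions, and virtual nodes are ignored):

      import hashlib
      from bisect import bisect_right

      def ring_hash(s):
          # position on the ring = 128-bit hash of the string
          return int(hashlib.md5(s.encode()).hexdigest(), 16)

      class Ring:
          def __init__(self, node_names, n=3):
              self.n = n
              # each node sits at a (pseudo-)random point on the ring
              self.nodes = sorted((ring_hash(name), name) for name in node_names)

          def preference_list(self, key):
              ids = [h for h, _ in self.nodes]
              i = bisect_right(ids, ring_hash(key)) % len(self.nodes)  # successor of hash(key)
              # coordinator first, then the next n-1 successors around the ring
              return [self.nodes[(i + j) % len(self.nodes)][1] for j in range(self.n)]

      ring = Ring(["n1", "n2", "n3", "n4", "n5"])
      print(ring.preference_list("cart:alice"))   # coordinator first, then replicas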
  • Why consistent hashing?

    • Pro
      • naturally somewhat balanced
      • decentralized - both lookup and join/leave
    • Con (section 6.2)
      • not really balanced (why not?), need virtual nodes
      • hard to control placement (balancing popular keys, spread over sites)
      • join/leave changes partition, requires data to shift
      • Assumes uniform op cost.
  • Failures

    • Tension: temporary or permanent failure?
      • node unreachable - what to do?
      • if temporary, store new puts elsewhere until node is available
      • if permanent, need to make new replica of all content
    • Dynamo itself treats all failures as temporary
  • Temporary failure handling: quorum

    • goal: do not block waiting for unreachable nodes
    • goal: put should always succeed
    • goal: get should have high prob of seeing most recent put(s)
    • quorum: R + W > N
      • never wait for all N
      • but R and W will overlap
      • cuts tail off delay distribution and tolerates some failures
    • the N nodes are the first N reachable nodes in the preference list
      • each node pings successors to keep rough estimate of up/down
      • "sloppy" quorum, since nodes may disagree on reachable
    • sloppy quorum means R/W overlap not guaranteed
  • coordinator handling of put/get:

    • sends put/get to first N reachable nodes, in parallel
    • put: waits for W replies
    • get: waits for R replies
    • if failures aren't too crazy, get will see all recent put versions
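
    A sketch of the coordinator's wait-for-W/R logic (assumptions: a long-lived thread pool and hypothetical rpc_put/rpc_get helpers; timeouts, hinted handoff, and read repair are omitted):

      from concurrent.futures import ThreadPoolExecutor, as_completed

      N, R, W = 3, 2, 2
      pool = ThreadPoolExecutor(max_workers=16)   # long-lived; stragglers finish in background

      def coordinator_put(pref_list, key, value, context, rpc_put):
          # send the put to the first N reachable nodes in parallel; succeed after W acks
          futures = [pool.submit(rpc_put, node, key, value, context) for node in pref_list[:N]]
          acks = 0
          for f in as_completed(futures):
              try:
                  if f.result():           # True = replica stored the value
                      acks += 1
              except Exception:
                  continue                 # unreachable replica; keep waiting for others
              if acks >= W:
                  return True              # don't wait for the remaining replies
          return False                     # fewer than W acks: the put fails

      def coordinator_get(pref_list, key, rpc_get):
          # ask the first N reachable nodes; return every version seen after R replies
          futures = [pool.submit(rpc_get, node, key) for node in pref_list[:N]]
          versions, replies = [], 0
          for f in as_completed(futures):
              try:
                  versions.extend(f.result())   # each replica returns a list of versions
                  replies += 1
              except Exception:
                  continue
              if replies >= R:
                  break                    # the caller reconciles/merges these versions
          return versions                  # (a real get would fail if replies < R)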
  • When might this quorum scheme not provide R/W intersection?

  • What if a put() leaves data far down the ring?

    • after the failures are repaired, that data sits beyond the first N nodes, so ordinary reads may miss it
    • the server holding it remembers a "hint" about where the data really belongs
    • it forwards the data once the real home is reachable again ("hinted handoff"; sketch below)
    • also: periodic "Merkle tree" sync of key ranges between replicas
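
    A cartoon of hinted handoff (the reachable() check and rpc_put helper are assumed; the Merkle-tree sync is not shown):

      # values this node is holding on behalf of an unreachable "real home"
      hinted = []                       # list of (home_node, key, value, context)

      def store_with_hint(home_node, key, value, context):
          hinted.append((home_node, key, value, context))

      def handoff_pass(reachable, rpc_put):
          # run periodically: forward each hinted value once its real home is back
          global hinted
          keep = []
          for home, key, value, ctx in hinted:
              if reachable(home) and rpc_put(home, key, value, ctx):
                  continue              # delivered, so drop our temporary copy
              keep.append((home, key, value, ctx))
          hinted = keep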
  • How can multiple versions arise?

    • Maybe a node missed the latest write due to network problem
    • So it has old data, should be superseded by newer put()s
    • get() consults R, will likely see newer version as well as old
  • How can conflicting versions arise?

    • N=3 R=2 W=2
    • shopping cart, starts out empty ""
    • preference list n1, n2, n3, n4
    • client 1 wants to add item X
      • get() from n1, n2, yields ""
      • n1 and n2 fail
      • put("X") goes to n3, n4
    • client 2 wants to delete X
      • get() from n3, n4, yields "X"
      • put("") to n3, n4 - but suppose only n4 (plus a hinted stand-in) receives it, satisfying W=2; n3 still holds "X"
    • n1, n2 revive
    • client 3 wants to add Y
      • get() from n1, n2 yields ""
      • put("Y") to n1, n2
    • client 3 wants to display cart
      • get() from n1, n3 yields two values!
        • "X" and "Y"
        • neither supersedes the other - the put()s conflicted
  • How should clients resolve conflicts on read?

    • Depends on the application
    • Shopping basket: merge by taking union?
      • Would un-delete item X
      • Weaker than Bayou (which gets deletion right), but simpler
    • Some apps probably can use latest wall-clock time
      • E.g. if I'm updating my password
      • Simpler for apps than merging
    • Write the merged result back to Dynamo
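
    A sketch of the client-side read/merge/write-back pattern for a cart, assuming a hypothetical client API where get() returns (versions, context) and put() takes that context:

      def read_cart(dynamo, key):
          versions, context = dynamo.get(key)       # versions: list of carts (sets of items)
          carts = [set(v) for v in versions]
          merged = set().union(*carts) if carts else set()
          if len(carts) > 1:
              # conflicting versions: write the union back; the context tells the
              # coordinator that this new value supersedes everything we just read
              dynamo.put(key, merged, context)
          return merged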
  • How to detect whether two versions conflict?

    • As opposed to a newer version superseding an older one
    • If they are not bit-wise identical, must client always merge+write?
    • We have seen this problem before...
  • Version vectors

  Example tree of versions (a chain):
    [a:1]
      |
    [a:1,b:2]
  • the VVs indicate that [a:1,b:2] supersedes [a:1]
  • Dynamo nodes automatically drop [a:1] in favor of [a:1,b:2]
  Example tree with a branch:
        [a:1]
       /     \
    [a:2]   [a:1,b:2]
  • neither VV supersedes the other, so the client must merge
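
  A small sketch of the comparison rule, with version vectors as dicts from node name to counter (the values are just the examples above):

    def descends(a, b):
        # a supersedes b if a's counter is >= b's for every node in b
        return all(a.get(node, 0) >= count for node, count in b.items())

    def compare(a, b):
        if descends(a, b) and descends(b, a): return "identical"
        if descends(a, b):                    return "first supersedes second"
        if descends(b, a):                    return "second supersedes first"
        return "conflict - client must merge"

    print(compare({"a": 1, "b": 2}, {"a": 1}))   # first supersedes second -> drop [a:1]
    print(compare({"a": 1, "b": 2}, {"a": 2}))   # conflict -> client must merge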
  • get(k) may return multiple versions, along with "context"

    • and put(k, v, context)
    • put context tells coordinator which versions this put supersedes/merges
  • Won't the VVs get big?

    • Yes, but slowly, since key mostly served from same N nodes
    • Dynamo deletes least-recently-updated entry if VV has > 10 elements
  • Impact of deleting a VV entry?

    • won't realize one version subsumes another, will merge when not needed:
      • put@b: [b:2]
      • put@a: [a:3, b:2]
      • forget b:2: [a:3]
      • now, if you sync w/ [b:2], looks like a merge is required
    • forgetting the oldest entry is the safer choice
      • the oldest entry is the one most likely to also appear in other branches
      • so dropping it at worst makes versions look concurrent, forcing an unneeded merge
      • forgetting the newest would erase evidence of the most recent differences
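
    Using a descends() check like the one sketched earlier, the [b:2] example above plays out like this:

      def descends(a, b):
          return all(a.get(n, 0) >= c for n, c in b.items())

      full      = {"a": 3, "b": 2}     # put@a after put@b, full VV
      truncated = {"a": 3}             # the same version after forgetting b:2
      old       = {"b": 2}

      print(descends(full, old))       # True: [a:3,b:2] supersedes [b:2], no merge needed
      print(descends(truncated, old))  # False: looks concurrent, a needless merge is forced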
  • Is client merge of conflicting versions always possible?

    • Suppose we're keeping a counter, x
    • x starts out 0
    • incremented twice
    • but failures prevent clients from seeing each others' writes
    • After heal, client sees two versions, both x=1
    • What's the correct merge result?
    • Can the client figure it out?
  • What if two clients concurrently write w/o failure?

    • e.g. two clients add diff items to same cart at same time
    • Each does get-modify-put
    • They both see the same initial version
    • And they both send put() to same coordinator
    • Will coordinator create two versions with conflicting VVs?
      • We want that outcome, otherwise one was thrown away
      • Paper doesn't say, but coordinator could detect problem via put() context
  • Permanent server failures / additions?

    • Admin manually modifies the list of servers
    • System shuffles data around - this takes a long time!
  • The Question:

    • It takes a while for news of an added/deleted server to reach all the other servers. Does this cause trouble?
    • Deleted server might get put()s meant for its replacement.
    • Deleted server might receive get()s after missing some put()s.
    • Added server might miss some put()s b/c not known to coordinator.
    • Added server might serve get()s before fully initialized.
    • Dynamo probably will do the right thing:
      • Quorum likely causes get() to see fresh data as well as stale.
      • Replica sync (4.7) will eventually fill in the put()s that a server missed.
  • Is the design inherently low delay?

    • No: client may be forced to contact distant coordinator
    • No: some of the R/W nodes may be distant, coordinator must wait
  • What parts of design are likely to help limit 99.9th pctile delay?

    • This is a question about variance, not mean
    • Bad news: waiting for multiple servers takes max of delays, not e.g. avg
    • Good news: Dynamo only waits for W or R out of N
      • cuts off tail of delay distribution
      • e.g. if nodes have 1% chance of being busy with something else
      • or if a few nodes are broken, network overloaded, &c
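
    A toy simulation of this effect (all numbers here are made up: replicas usually answer in about 10 ms, but are "busy" 1% of the time and take 300 ms):

      import random

      def request_delay(wait_for, n_sent=3, slow_prob=0.01):
          # the request completes when the wait_for-th fastest replica has replied
          delays = [300.0 if random.random() < slow_prob else random.gauss(10, 2)
                    for _ in range(n_sent)]
          return sorted(delays)[wait_for - 1]

      def pctile(samples, p):
          return sorted(samples)[int(p * len(samples))]

      wait_all = [request_delay(3) for _ in range(100000)]   # wait for all N
      wait_two = [request_delay(2) for _ in range(100000)]   # wait for W or R = 2 of N = 3
      print(pctile(wait_all, 0.999))   # ~300 ms: one busy replica drags the whole request
      print(pctile(wait_two, 0.999))   # much smaller (~15 ms): the tail is cut off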
  • No real Eval section, only Experience

  • How does Amazon use Dynamo?

    • shopping cart (merge)
    • session info (maybe Recently Visited &c?) (most recent TS)
    • product list (mostly r/o, replication for high read throughput)
  • They claim main advantage of Dynamo is flexible N, R, W

    • What do you get by varying them?
  N-R-W
  3-2-2 : default, reasonably fast R/W, reasonable durability
  3-3-1 : fast W, slow R, not very durable, not useful?
  3-1-3 : fast R, slow W, durable
  3-3-3 : ??? reduce chance of R missing W?
  3-1-1 : not useful?
  • They had to fiddle with the partitioning / placement / load balance (6.2)

    • Old scheme:
      • Random choice of node ID meant new node had to split old nodes' ranges
      • Which required expensive scans of on-disk DBs
    • New scheme:
      • Pre-determined set of Q evenly divided ranges
      • Each node is coordinator for a few of them
      • New node takes over a few entire ranges
      • Store each range in a file, can xfer whole file
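
    A sketch of the new scheme's key-to-range mapping (Q, the node names, and the hash are made up; the point is that the Q ranges are fixed no matter which nodes exist):

      import hashlib

      Q = 8                                     # number of fixed, equal-size key ranges
      NODES = ["n1", "n2", "n3", "n4", "n5"]    # assumed node names

      def range_of(key):
          # hash the key onto [0, 2^128) and map it to one of the Q fixed ranges
          h = int(hashlib.md5(key.encode()).hexdigest(), 16)
          return h * Q // (2 ** 128)

      # ranges are assigned to nodes as whole units, a few per node; a joining
      # node takes over whole ranges (whole files) instead of splitting old ones
      assignment = {r: [NODES[(r + i) % len(NODES)] for i in range(3)] for r in range(Q)}

      print(range_of("cart:alice"), assignment[range_of("cart:alice")])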
  • How useful is ability to have multiple versions? (6.3)

    • i.e. how useful is eventual consistency
    • This is a Big Question for them
    • 6.3 claims 0.001% of reads see divergent versions
      • I believe they mean conflicting versions (not benign multiple versions)
      • Is that a lot, or a little?
    • So perhaps 0.001% of writes benefitted from always-writeable?
      • i.e. would have blocked in primary/backup scheme?
    • Very hard to guess:
      • They hint that the cause was concurrent writers, for which a better solution is a single master
      • But also maybe their measurement doesn't count situations where availability would have been worse if single master
  • Performance / throughput (Figure 4, 6.1)

    • Figure 4 says average 10ms read, 20 ms writes
      • the 20 ms must include a disk write
      • 10 ms probably includes waiting for R/W of N
    • Figure 4 says 99.9th pctile is about 100 or 200 ms
      • Why?
      • "request load, object sizes, locality patterns"
      • does this mean sometimes they had to wait for a coast-to-coast msg?
  • Puzzle: why are the average delays in Figure 4 and Table 2 so low?

    • Implies they rarely wait for WAN delays
    • But Section 6 says "multiple datacenters"
      • you'd expect most coordinators and most nodes to be remote!
      • Maybe all datacenters are near Seattle?
  • Wrap-up

    • Big ideas:
      • eventual consistency
      • always writeable despite failures
      • allow conflicting writes, client merges
    • Awkward model for some applications (stale reads, merges)
      • this is hard for us to tell from paper
    • Maybe a good way to get high availability + no blocking on WAN
      • but PNUTS's per-record master scheme suggests Yahoo didn't think blocking on a master was a problem
    • No agreement on whether eventual consistency is good for storage systems