CS6963 Distributed Systems

Lecture 16: Chord, DHTs, and Naming

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Stoica, Morris, Karger, Kaashoek, and Balakrishnan

  • Why this paper?

    • Good chance to discuss a topic we haven't talked about head on yet:
      • Naming
    • Good chance to discuss consistent hashing.
      • In wide use today in many systems.
    • Tilt conversation a bit toward p2p.
    • p2p kind of old school now, but really important since it taught us how to build loosely coupled systems.
    • We'll spend a lot of today on detours, since this paper provides a good opportunity to discuss some important issues we haven't hit head on yet.
  • Significant Topics/Tensions so far

    • Reliability/replication
    • Consistency/Concurrency model
      • Though more to come on this.
    • Partitioning/scaling
    • Time/clocks
    • Notably absent topic: naming
  • Naming/routing

    • Given some name determine a location of the corresponding object.
      • lookup(name) -> address
    • Approaches
      • Centralized, Decentralized, Hierarchical
      • Explicit, Implicit
    • Examples of naming systems:
      • DNS: hostname -> IP
        • Hierarchical, delegated authority, but centralized trust
      • Lab 4: explicit, two-level mapping
        • h(k) % nshards -> shardId, shardId -> host (see the code sketch after this list)
      • FDS: (blob id, tractnumber) -> host
        • tlt[(h(blob id) + tractnumber) % nservers]
      • Another idea: arbitrary map
        • m[k] -> host
        • Make it easy to group 'near' keys on the same machine.
        • Can move things arbitrarily.
    • Things to consider:
      • Size of routing table, if centralized
      • Amount of per-client/server routing state
      • Locality: a large centralized mapping isn't too bad if clients only need a small part at a time.
      • Churn: if high, need fast convergence, else inefficient.
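
A minimal sketch of the Lab 4-style two-level mapping above (NSHARDS, config, and lookup_host are illustrative names, not the lab's actual API):

NSHARDS = 16                                    # fixed shard count (assumed)
config = {s: ("hostA" if s % 2 == 0 else "hostB")
          for s in range(NSHARDS)}              # shardId -> host, set by a master

def lookup_host(key):
    # Built-in hash() is fine for a sketch; a real system uses a stable hash.
    shard = hash(key) % NSHARDS                 # level 1: stable, client-computable
    return config[shard]                        # level 2: small table that can change
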
  • Chord: trying to solve naming in an environment we haven't talked about yet.

    • p2p: random machines scattered all over the world.
    • High churn.
    • High latency.
    • Flaky.
    • Terrified of centralized control.
    • Napster shutdown...
  • First, let's cover a core idea used in Chord that's common in data center systems.

    • Then, we can see how the p2p goals change things.
  • Often in data center systems, hard to exploit locality, moderately large number of hosts.

    • e.g. Facebook memcached.
    • Don't care about grouping keys too intelligently.
    • In fact, breaking correlation on key distance may create hotspots.
    • Idea: hash keys to choose server.
  • Problem: what if we need to add capacity, or a node crashes?

    • mod forces all data to be reshuffled: each key now maps to a random server.
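
A quick illustration of the reshuffle (the hash function and keys here are arbitrary): with h(k) % n placement, going from 4 to 5 servers remaps most keys.

import hashlib

def h(k):
    return int(hashlib.sha1(k.encode()).hexdigest(), 16)

keys = ["key%d" % i for i in range(10000)]
moved = sum(1 for k in keys if h(k) % 4 != h(k) % 5)
print(moved / len(keys))   # roughly 0.8: almost every key changes servers
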
  • Consistent Hashing

    • Idea: don't map hashes to serverId directly.
    • Map serverIds and keys into a single hash space.
    • Then, map keys to servers based on proximity of their hashes.
    • [draw rings example]

Example:

Hash space: 0 to 2**3 - 1, i.e., assume h(x) in [0, 8) for both serverIds and keys.

lookup table:
h(serverId) serverId
2            B
5            A
6            C

lookup(k):
  # table holds (h(serverId), serverId) pairs sorted by hash.
  # Return the first server at or after h(k); wrap around if none.
  for hs, s in table:
    if hs >= h(k):
      return s
  return table[0].serverId

h(k) -> 3, then return A
h(k) -> 2, then return B
h(k) -> 6, then return C
h(k) -> 7, then return B

Now what if we add D, h(D) -> 0

lookup table:
h(serverId) serverId
0            D
2            B
5            A
6            C

h(k) -> 3, then return A (same)
h(k) -> 2, then return B (same)
h(k) -> 6, then return C (same)
h(k) -> 7, then return D

Node B gives ownership of hashes 7 and 0 to D; nothing else shifts around.
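
A minimal runnable version of the ring above (toy integer hashes to match the table; a real system would hash serverIds and keys with SHA-1 or similar):

import bisect

class Ring:
    def __init__(self):
        self.hashes = []    # sorted server hashes
        self.servers = {}   # server hash -> serverId

    def add(self, h_server, server_id):
        bisect.insort(self.hashes, h_server)
        self.servers[h_server] = server_id

    def lookup(self, h_key):
        # First server at or after h_key; wrap to the start if none.
        i = bisect.bisect_left(self.hashes, h_key)
        if i == len(self.hashes):
            i = 0
        return self.servers[self.hashes[i]]

r = Ring()
for hs, sid in [(2, "B"), (5, "A"), (6, "C")]:
    r.add(hs, sid)
print([r.lookup(h) for h in (3, 2, 6, 7)])   # ['A', 'B', 'C', 'B']
r.add(0, "D")
print([r.lookup(h) for h in (3, 2, 6, 7)])   # ['A', 'B', 'C', 'D']
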
  • When we add/remove a node only about 1/Nth of the key space moves.

  • Popular, especially for spreading load on KVS/memcached.

    • Naturally spreads key/values and load.
    • Low rebalance cost on join/leave.
  • Chord uses this at its core, but this isn't enough for them.

    • How do we deal with the fact that we can't keep this central table?
    • If all nodes knew about all nodes, this could work.
    • In p2p, maybe 1e6 nodes... coming and going all the time.
    • Also, how does a node join/leave without a central authority?
  • First, let's ask: how well does this really spread load?

    • Keys are scattered randomly, so this has to be optimal, right?
    • Imbalance in key/value sizes.
      • What if someone tries to store 1 TB in a value and the rest of the values are 1KB?
    • Imbalance in access rates.
    • Even without that: assume same access rate, same sizes?
    • [Figure 8a, b]
    • 8a: as more and more values are added, the 99th percentile of keys per node grows faster than the average keys per node! What's going on here?
    • n balls into n bins? |worst bucket| ~= (log n)/(log log n)
    • Even if the number of servers grows at the same rate as the number of keys!
    • 2^32 keys and 2^32 servers? Expect some servers to have 10 or more.
  • Sidebar: power of two choices.

    • Cool load balancing technique that can be used in tons of places.
    • Look in two buckets, pick the less loaded of the two and place ball there.
    • Worst case bucket growth goes to (log log n)/(log 2), or less than 3.
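
A quick simulation sketch of the effect (illustrative, not from the paper):

import random

def max_load(n, choices):
    bins = [0] * n
    for _ in range(n):                                   # n balls into n bins
        picks = [random.randrange(n) for _ in range(choices)]
        bins[min(picks, key=lambda b: bins[b])] += 1     # place in the emptier pick
    return max(bins)

n = 100000
print(max_load(n, 1))   # one choice: worst bin grows like log n / log log n
print(max_load(n, 2))   # two choices: worst bin grows like log log n / log 2
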
  • But there's no choice to exploit here: each server's share of the ring is a deterministic function of the hashes.

    • Idea: place each server at multiple points in the hash space.
      • Virtual nodes.
    • The sum of many small pieces of the hash space is much more likely to be close to 1/N.
    • [Adhoc diagram on a ring.]
    • Very common optimization.
    • Table state grows linearly with the number of virtual nodes.
    • [Figure 9.]
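
A hedged sketch of virtual-node placement (the hash and the per-server count are arbitrary choices here):

import hashlib

VNODES = 100   # virtual nodes per physical server (a tuning knob)

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def positions(server_id):
    # Each physical server appears at VNODES points on the ring; it owns the
    # union of the arcs ending at those points, so its total share concentrates
    # near 1/N even though any single arc can be badly sized.
    return [h("%s#%d" % (server_id, i)) for i in range(VNODES)]
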
  • Ok - end side bar on data center naming.

  • Back to Chord's problem.

    • Use this consistent hashing ring.
    • Need to decentralize table state.
    • Need to tolerate decentralized join/leave.
  • Idea: what if each node only keeps track of its successor?

    • Can basically run the same algorithm we had before.
    • Given some k with h(k) and the address of any node n in the ring.
    • Start at n.
    • If between 'me' and 'successor' then return successor, else iterate.
    • if h(k) in (h(n), h(n.successor)] on the ring (wrapping around), then return n.successor
    • else retry at n.successor (sketched in code below).
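
A minimal sketch of this successor-only walk (the Node fields and between() test are illustrative, not the paper's pseudocode):

class Node:
    def __init__(self, h):
        self.hash = h
        self.successor = self    # filled in once the ring is assembled
        self.predecessor = None

def between(x, a, b):
    # True if x lies in the ring interval (a, b], handling wrap-around past 0.
    return (a < x <= b) if a < b else (x > a or x <= b)

def lookup_linear(n, kh):
    # Hop one successor at a time; kh's owner is the first node at or past kh.
    while not between(kh, n.hash, n.successor.hash):
        n = n.successor           # O(N) hops in the worst case
    return n.successor
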
  • Problem: if > 10,000 hosts, > 10,000 round trips to find an object:

    • 16 minutes to do a single lookup with 100 ms RTT
  • Idea: keep 'fingers'/chords that allow us to 'cut across' the ring as a shortcut.

    • Q: Where do we point the fingers?
    • Q: What about evenly across the key space?
      • Only gives a linear speedup for linear increase in per-node state.
    • Idea: have nodes keep more information on nearby neighbors, less on those 'far away'.
    • Just make sure each next node queried gets us a bit closer, guaranteed to eventually find what we want.
    • [Chord diagram.]
    • Ask node n about a key far away: n forwards you to n', a node in that far half of the ring.
    • Ask node n about a key next door: just sends you one hop over.
  • Each node keeps log(N) entries, each covering twice as much of the ring as the entry before.

    • In each entry, track the first node at or past the point to which the entry refers.
    • Every interval of the key space has some node assigned to it.
    • Map the requests for keys into the hash space.
    • Use the local table to find the successor for that hash.
    • Route the request there, skipping over up to half of the nodes.
    • Figure 3b, walkthrough lookup h(k) -> 4
      • 0.lookup(4) -> 0
      • 3.lookup(4) -> 0
      • 1.lookup(4) -> forwards to 3, then 3.lookup(4) -> 0 (4 isn't between 1 and its successor 3, so 1 forwards to its finger 3)
    • [Draw visualization from Figure 4.]
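
A hedged sketch of finger-based routing, continuing the Node/between() sketch above and assuming each node also keeps a fingers list, where fingers[i] points at the first node at or past n.hash + 2^i:

def closest_preceding_finger(n, kh):
    # Highest finger that lands strictly between n and kh: the biggest safe jump.
    for f in reversed(n.fingers):
        if between(f.hash, n.hash, kh) and f.hash != kh:
            return f
    return n

def find_successor(n, kh):
    # Each hop either finishes or roughly halves the remaining ring distance,
    # so lookups take O(log N) hops instead of O(N).
    while not between(kh, n.hash, n.successor.hash):
        nxt = closest_preceding_finger(n, kh)
        n = nxt if nxt is not n else n.successor   # fall back to a single hop
    return n.successor
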
  • Node joins:

    • Simple approach:
    • For new node n, lookup(h(n)) -> ns (n's successor).
    • Find ns.predecessor -> np (Track predecessors to make this fast.)
    • n.successor = ns
    • n.predecessor = np
    • ns.predecessor = n
    • np.successor = n
    • Tell KVS above to transfer state.
    • Two remaining issues
    • n's finger table.
      • Mostly n can steal np's table.
      • Why? Many entries in np's table already report the right successor for n too.
      • For any entry of n's table subsumed by an entry of np's, it can copy the successor.
      • e.g. np = 1, and says [2, 4) -> 8, then n = 2, can put [3, 4) -> 8, etc.
    • Finger table of earlier nodes.
      • Not too big a deal, since system will still work fine.
      • Lookups just undershoot by one node sometimes, for roughly 1/N of queries.
      • Roughly, work backward.
      • First, find the predecessor of the point halfway across the ring.
      • Make sure its 'furthest' finger entry lists the new node as successor.
      • Do this for all fingers on new node.
      • Whenever an update is needed, work backward.
      • [Figure 5a shows this well, some diagram on paper.]
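
A sketch of the simple join above, in the same style (state transfer and concurrent joins are ignored; the real protocol does each step over RPC):

def join(n, existing):
    # Use any existing node to find where n belongs on the ring.
    ns = find_successor(existing, n.hash)
    np = ns.predecessor
    n.successor, n.predecessor = ns, np
    np.successor = n
    ns.predecessor = n
    n.fingers = list(np.fingers)   # "steal" np's table; mostly already correct
    # The KVS layer would then move keys in (np.hash, n.hash] from ns to n.
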
  • Node failures make this hard.

    • In practice, this is complicated enough that they basically switch to periodic polling (stabilization).
      • Has to work with concurrent joins/leaves.
    • Goal: Just make sure the successor pointers stay ok. Rest can be fixed up.
    • Occasionally, do a findSuccessor 'lookup' on things in each node's finger table.
    • If a different successor is reported, record it instead.
    • [Can walk through bottom para of left column on page 7 if time.]
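
Roughly what the periodic fixup looks like, continuing the same sketch (stabilize/notify/fix_fingers follow the paper's flavor, but the details here are illustrative):

import random

M = 32   # identifier bits (illustrative)

def stabilize(n):
    # Periodically: has someone slipped in between n and its successor?
    x = n.successor.predecessor
    if x is not None and x is not n.successor and between(x.hash, n.hash, n.successor.hash):
        n.successor = x
    notify(n.successor, n)

def notify(ns, n):
    # ns learns that n thinks it is ns's predecessor.
    if ns.predecessor is None or between(n.hash, ns.predecessor.hash, ns.hash):
        ns.predecessor = n

def fix_fingers(n):
    # Periodically redo the lookup for one finger and adopt whatever comes back.
    if n.fingers:
        i = random.randrange(len(n.fingers))
        n.fingers[i] = find_successor(n, (n.hash + 2**i) % 2**M)
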
  • What happens in the case of a network partition?

    • Paper says it's unclear whether disjoint cycles will emerge.
    • It seems like this must be the case?
  • Keep next log N successors at each node as well.

    • Successor is the one true way to ensure queries work.
    • If some go away, can patch up quickly without getting disconnected.
  • Performance

    • Figure 13.
    • 200 nodes.
    • About 3x slower than keeping a complete table: 60 ms -> 180 ms.
    • Savings? Full table 200 * 32 bits = 800 bytes on each server.
    • Chord: lg(200) ~= 8 entries * 2 * 4 bytes = 64 bytes per server + 4 for pred = 68 bytes
      • Extra 2x for the r successors used in failure cases.
    • Worthwhile in 2001?
    • What about for 10,000 nodes?
  • What's really going on here?

    • This is a big distributed index.
    • Can lookup in log N time.
    • Only need log N space for naming.

Upcoming lectures: consistency/concurrency models

  • We need to think about what it means when operations interleave.
  • Weaker models admit more schedules because more orders are ok.
  • Problem: do we get 'correct' results?
  • Depends on the application/algorithms.
  • In general, the weaker the model, the harder it is to reason about.

  • Linearizability/Serializability+External Consistency

    • Equivalent to some total order, must match real time.
  • Sequential Consistency/Serializability

    • Equivalent to some total order, but may not match real time.
  • Causal Consistency

    • If you do something, and I observe the effect, then if others observe my effects, they need to see your effect as well.
  • Eventual Consistency

    • Operations may apply out of order, but that's ok.
    • e.g. Commutative and associative operations.
    • Set addition (not bag addition)
    • What if we include set removal?...
  • Where is a modern CPU on this spectrum?