Lecture 16 Chord, DHTs, and Naming
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Stoica, Morris, Karger, Kaashoek, and Balakrishnan
Why this paper?
- Good chance to discuss a topic we haven't talked about head on yet: naming.
- Good chance to discuss consistent hashing.
- In wide use today in many systems.
- Tilt conversation a bit toward p2p.
- p2p kind of old school now, but really important since it taught us how to
build loosely coupled systems.
- We'll spend a lot of time today detoured, since this paper provides a good
opportunity to discuss some important issues we haven't hit head on yet.
Significant Topics/Tensions so far
- Consistency/Concurrency model
- Though more to come on this.
- Notably absent topic: naming
- Given some name determine a location of the corresponding object.
- Centralized, Decentralized, Hierarchical
- Explicit, Implicit
- Examples of naming systems:
- DNS: hostname -> IP
- Hierarchical, delegated authority, but centralized trust
- Lab 4: explicit, two-level mapping
- h(k) % nshards -> shardId, shardId -> host
- FDS: (blob id, tractnumber) -> host
- tlt[(h(blob id) + tractnumber) % nservers]
- Another idea: arbitrary map
- m[k] -> host
- Make it easy to group 'near' keys on the same machine.
- Can move things arbitrarily.
- Things to consider:
- Size of routing table, if centralized
- Amount of per-client/server routing state
- Locality: large centralized mapping not too bad, if clients only need
a small part at a time.
- Churn: if high, need fast convergence, else inefficient
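The explicit two-level mapping from the Lab 4 example above can be sketched roughly as follows. The shard count, host names, and the `locate` helper are hypothetical, and SHA-1 stands in for whatever hash the lab actually uses.

```python
import hashlib

NSHARDS = 8  # fixed number of shards (hypothetical value)

def h(key: str) -> int:
    """Stable hash of a key (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

# Second level: explicit, movable shardId -> host table (hypothetical hosts).
shard_to_host = {s: f"host{s % 3}" for s in range(NSHARDS)}

def locate(key: str) -> str:
    shard_id = h(key) % NSHARDS      # first level: implicit, computed from the key
    return shard_to_host[shard_id]   # second level: explicit table lookup
```

Reassigning one entry, e.g. `shard_to_host[5] = "host9"`, moves only the keys in shard 5; the routing table stays at NSHARDS entries no matter how many keys exist.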
Chord: trying to solve naming in an environment we haven't talked about yet.
- p2p: random machines scattered all over the world.
- High churn.
- High latency.
- Terrified of centralized control.
- Napster shutdown...
First, let's cover a core idea used in Chord that's common in data center systems.
- Then, we can see how the p2p goals change things.
Often in data center systems, hard to exploit locality, moderately large
number of hosts.
- e.g. Facebook memcached.
- Don't care about grouping keys too intelligently.
- In fact, breaking correlation between nearby keys may help avoid hotspots.
- Idea: hash keys to choose server.
Problem: what if we need to add capacity, or a node crashes?
- mod forces nearly all data to be reshuffled: each key now maps to an effectively random server.
- Idea: don't map hashes to serverId directly.
- Map serverIds and keys into a single hash space.
- Then, map keys to servers based on proximity of their hashes.
- [draw rings example]
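Hash space: 0 to 2**3 - 1.
Assume h(k) in [0, n), and table is a list of (hs, s) pairs sorted by server hash hs.
c = table[0][1]          # default: wrap around to the first server
for hs, s in table:
    if hs >= h(k):       # first server at or past the key's hash owns it
        c = s
        break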
h(k) -> 3, then return A
h(k) -> 2, then return B
h(k) -> 6, then return C
h(k) -> 7, then return B
Now what if we add D, h(D) -> 0
h(k) -> 3, then return A (same)
h(k) -> 2, then return B (same)
h(k) -> 6, then return C (same)
h(k) -> 7, then return D
Node B gives ownership of hashes 7 and 0 to D, nothing else shifts around.
When we add/remove a node only about 1/Nth of the key space moves.
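The ring example above can be reproduced with a small sketch. The server positions are assumptions picked to match the lookups listed (B at 2, C at 6, and A at 4, though 3 or 5 would also fit), and the key's hash is used directly as its ring position.

```python
import bisect

RING_SIZE = 2 ** 3  # hash space 0..7, as in the example

def lookup(ring, key_hash):
    """Return the server owning key_hash: the first server at or past it.

    ring is a sorted list of (server_hash, server_name) pairs.
    """
    hashes = [hs for hs, _ in ring]
    i = bisect.bisect_left(hashes, key_hash % RING_SIZE)
    return ring[i % len(ring)][1]   # past the last server: wrap to the first

# Assumed positions; only B=2 and C=6 are pinned down by the example.
ring = sorted([(2, "B"), (4, "A"), (6, "C")])
before = {k: lookup(ring, k) for k in range(RING_SIZE)}

ring_with_d = sorted(ring + [(0, "D")])
after = {k: lookup(ring_with_d, k) for k in range(RING_SIZE)}

moved = [k for k in range(RING_SIZE) if before[k] != after[k]]
print(moved)  # [0, 7]
```

Only hashes 7 and 0 change owner when D joins; every other hash keeps its server.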
Popular, especially for spreading load on KVS/memcached.
- Naturally spreads key/values and load.
- Low rebalance cost on join/leave.
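To make the rebalance cost concrete, here is a rough comparison of how many keys change servers when an 11th server is added, under mod placement versus a hash ring. The key and server names are made up, and truncated SHA-1 is an assumed stand-in for any uniform hash.

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Stable 32-bit hash (assumption: truncated SHA-1; any uniform hash works)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")

def mod_table(servers):
    """Placement by h(k) % nservers."""
    return lambda key: servers[h(key) % len(servers)]

def ring_table(servers):
    """Placement on a ring: a key goes to the first server at or past its hash."""
    ring = sorted((h(s), s) for s in servers)
    points = [p for p, _ in ring]
    def owner(key):
        i = bisect.bisect_left(points, h(key))
        return ring[i % len(ring)][1]
    return owner

def fraction_moved(make_table, keys, old, new):
    before, after = make_table(old), make_table(new)
    return sum(before(k) != after(k) for k in keys) / len(keys)

keys = [f"key-{i}" for i in range(10_000)]
old = [f"server-{i}" for i in range(10)]
new = old + ["server-10"]

moved_mod = fraction_moved(mod_table, keys, old, new)
moved_ring = fraction_moved(ring_table, keys, old, new)
print(moved_mod, moved_ring)
```

With mod, a key stays put only when h(k) % 10 == h(k) % 11, so expect roughly 10/11 of keys to move. On the ring, only keys landing in the new server's arc move, about 1/11 on average, though a single random placement can be well off that.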
Chord uses this at its core, but this isn't enough for them.
- How do we deal with the fact we can't keep this central table?
- If all nodes knew about all nodes, this could work.
- In p2p, maybe 1e6 nodes... coming and going all the time.
- Also, how does a node join/leave without a central authority?
First, let's ask: how well does this really spread load?
- Keys are scattered randomly so this has to be optimal, right?
- Imbalance in key/value sizes.
- What if someone tries to store 1 TB in a value and the rest of the
values are 1KB?
- Imbalance in access rates.
- Even without that: assume same access rate, same sizes?
- [Figure 8a, b]
- 8a: as more and more values added, 99th percentile keys per node is growing
faster than average keys per node! What's going on here?
- n balls into n bins? |worst bucket| ~= (log n)/(log log n)
- Even if number of servers grows at same rate as key space!
- 2^32 keys and 2^32 servers? Expect some servers to have 10 or more.
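The balls-into-bins effect is easy to check by simulation. This is a minimal sketch with an arbitrary seed; the closed form is only an asymptotic estimate, so the simulated maximum will not match it exactly.

```python
import math
import random
from collections import Counter

def max_bin_load(n, seed=0):
    """Throw n balls into n bins uniformly at random; return the fullest bin's count."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())

for n in (1_000, 10_000, 100_000):
    estimate = math.log(n) / math.log(math.log(n))
    print(n, max_bin_load(n), round(estimate, 1))
```

The average load is exactly 1 ball per bin in every run, yet the fullest bin holds several, and that count creeps up as n grows.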
Sidebar: power of two choices.
- Cool load balancing technique that can be used in tons of places.
- Look in two buckets, pick the less loaded of the two and place ball there.
- Worst case bucket growth goes to (log log n)/(log 2), or less than 3.
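A quick simulation of the two-choices trick, as a sketch: `choices=1` is plain random placement and `choices=2` samples two bins and takes the emptier one. The seed and n are arbitrary.

```python
import random

def max_load(n, choices, seed=0):
    """Place n balls into n bins; each ball samples `choices` bins and joins the emptiest."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(choices)]
        target = min(candidates, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins)

one = max_load(100_000, choices=1)
two = max_load(100_000, choices=2)
print(one, two)
```

The second sample costs one extra probe per placement but should noticeably cut the worst bin's load.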
But, no choice here: bucket size deterministic function.
- Idea: place each server at multiple points in the hash space.
- Sum of hash space pieces more likely to be balanced.
- [Adhoc diagram on a ring.]
- Very common optimization.
- Table state grows linearly with the number of virtual nodes.
- [Figure 9.]
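The effect of virtual nodes on balance can be sketched by measuring how much of the hash space each server owns. The `server#v` naming for virtual nodes and the truncated SHA-1 hash are assumptions for illustration, not Chord's actual scheme.

```python
import hashlib

RING = 2 ** 32

def h(s: str) -> int:
    """Stable 32-bit hash (assumption: truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")

def shares(servers, vnodes):
    """Fraction of the hash space owned by each server with `vnodes` points apiece."""
    points = sorted((h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
    owned = dict.fromkeys(servers, 0)
    prev = points[-1][0] - RING          # wrap: first point owns the arc past the last
    for pos, server in points:
        owned[server] += pos - prev      # arc (prev, pos] belongs to this point
        prev = pos
    return {s: n / RING for s, n in owned.items()}

def imbalance(servers, vnodes):
    """Largest share divided by the ideal share 1/len(servers)."""
    return max(shares(servers, vnodes).values()) * len(servers)

servers = [f"server-{i}" for i in range(50)]
print(imbalance(servers, 1), imbalance(servers, 100))
```

With one point per server the busiest server typically owns several times its fair share; with many points per server the ratio should sit much closer to 1, at the cost of a table with len(servers) * vnodes entries.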
Ok - end side bar on data center naming.
Back to Chord's problem.
- Use this consistent hashing ring.
- Need to decentralize table state.
- Need to tolerate decentralized join/leave.
Idea: what if each node only keeps track of its successor?
- Can basically run the same algorithm we had before.
- Given some k with h(k) and the address of any node n in the ring.
- Start at n.
- If between 'me' and 'successor' then return successor, else iterate.
- if h(n) < h(k) <= h(n.successor) (interval taken around the ring) then return n.successor
- else retry at n.successor.
Problem: with 10,000 hosts, worst case is ~10,000 round trips to find an object:
- About 16 minutes (10,000 * 100 ms = 1,000 s) to do a single lookup with 100 ms RTT
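A minimal sketch of the successor-only walk, counting hops. The `Node` class, `make_ring`, and the ring of 100 evenly spaced ids are hypothetical, and each hop stands in for one network round trip.

```python
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = None

def in_range(x, a, b, ring):
    """True if x lies in the ring interval (a, b], wrapping past zero."""
    if a == b:                       # single-node ring: the node owns everything
        return True
    return 0 < (x - a) % ring <= (b - a) % ring

def lookup(start, key_hash, ring):
    """Walk successor pointers until key_hash falls between a node and its successor."""
    node, hops = start, 0
    while not in_range(key_hash, node.id, node.successor.id, ring):
        node = node.successor
        hops += 1
    return node.successor, hops

def make_ring(ids):
    nodes = [Node(i) for i in sorted(ids)]
    for n, nxt in zip(nodes, nodes[1:] + nodes[:1]):
        n.successor = nxt
    return nodes

nodes = make_ring(range(0, 1000, 10))        # 100 nodes on a ring of size 1000
owner, hops = lookup(nodes[0], 985, ring=1000)
print(owner.id, hops)                         # 990 98
print(10_000 * 0.100 / 60)                    # worst-case minutes at 100 ms per hop
```

The hop count grows linearly with the number of nodes between the starting point and the key, which is what the fingers are meant to fix.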
Idea: keep 'fingers'/chords that allow us to 'cut across' the ring as a shortcut.
- Q: Where do we point the fingers?
- Q: What about evenly across the key space?
- Only gives a linear speedup for linear increase in per-node state.
- Idea: have nodes keep more information on nearby neighbors, less on those far away.
- Just make sure each next node queried gets us a bit closer, guaranteed to
eventually find what we want.
- [Chord diagram.]
- Ask node n about a key far away: ask n' -> he's in that half.
- Ask node n about a key next door: just sends you one hop over.
Each node keeps log(N) entries, each covering twice as much of the ring as
the entry before.
- In each entry, track the first node at or past the point where the entry's interval starts.
- Every interval of the key space has some node assigned to it.
- Map the requests for keys into the hash space.
- Use the local table to find the successor for that hash.
- Route the request there, skipping over up to half of the nodes.
- Figure 3b, walkthrough lookup h(k) -> 4
- 0.lookup(4) -> 0
- 3.lookup(4) -> 0
- 1.lookup(4) -> 3, 3.lookup(4) -> 0 (3's interval doesn't contain 4)
- [Draw visualization from Figure 4.]
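The walkthrough above can be replayed with a small sketch of the interval-style lookup these notes describe. It is not the paper's pseudocode: the tables are built here with global knowledge, and `build`, `lookup`, and `dist` are hypothetical helper names. It uses nodes 0, 1, 3 on a 3-bit ring.

```python
M = 3                 # identifier bits, as in the small example
RING = 2 ** M

def dist(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % RING

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.fingers = []   # entry i: (start, first node at or past start)

def build(ids):
    """Build every finger table with global knowledge (no join protocol here)."""
    nodes = {i: Node(i) for i in ids}
    def successor(point):
        return nodes[min(ids, key=lambda i: dist(point, i))]
    for n in nodes.values():
        starts = [(n.id + 2 ** i) % RING for i in range(M)]
        n.fingers = [(s, successor(s)) for s in starts]
    return nodes

def lookup(node, key):
    """Return (owner, path), where path lists the ids of the nodes asked."""
    path = [node.id]
    while key != node.id:
        i = dist(node.id, key).bit_length() - 1   # entry whose interval holds key
        start, nxt = node.fingers[i]
        if dist(start, nxt.id) >= dist(start, key):
            return nxt, path                       # nxt is at or past key: it owns it
        node = nxt                                 # otherwise hop closer and ask again
        path.append(node.id)
    return node, path

nodes = build([0, 1, 3])
for start in (0, 3, 1):
    owner, path = lookup(nodes[start], 4)
    print(start, path, owner.id)
# 0 [0] 0
# 3 [3] 0
# 1 [1, 3] 0
```

Each forwarding step lands in the entry covering the key, so the remaining clockwise distance to the key shrinks on every hop.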
- Simple approach:
- For node n, lookup(n) -> ns.
- Find ns.predecessor -> np (Track predecessors to make this fast.)
- n.successor = ns
- n.predecessor = np
- ns.predecessor = n
- np.successor = n
- Tell KVS above to transfer state.
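The pointer updates in the simple approach amount to a doubly linked list insertion. A sketch, assuming no concurrent joins or failures and ignoring finger tables; the class and helper names are hypothetical, and the state transfer to the KVS is not modeled.

```python
RING = 2 ** 3

def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b], wrapping past zero."""
    if a == b:                       # single-node ring: the node owns everything
        return True
    return 0 < (x - a) % RING <= (b - a) % RING

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self        # a lone node is its own successor
        self.predecessor = self

    def find_successor(self, key):
        node = self
        while not in_half_open(key, node.id, node.successor.id):
            node = node.successor
        return node.successor

    def join(self, existing):
        ns = existing.find_successor(self.id)   # lookup(n) -> ns
        np = ns.predecessor                     # ns.predecessor -> np
        self.successor, self.predecessor = ns, np
        ns.predecessor = self
        np.successor = self
        # A real system would now ask the KVS layer to move keys in (np, n] from ns.

a = Node(0)
b = Node(3); b.join(a)
c = Node(1); c.join(a)
print(a.successor.id, c.successor.id, b.successor.id)   # 1 3 0
```

The four assignments are not atomic, which is part of why the approach gets harder once joins and failures overlap.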
- Two remaining issues
- n's finger table.
- Mostly n can steal np's table.
- Why? Many entries in np's table already report the same successor.
- For any entry of n's table subsumed by an entry of np's it can copy the successor.
- e.g. np = 1, and says [2, 4) -> 8, then n = 2, can put [3, 4) -> 8, etc.
- Finger table of earlier nodes.
- Not too big a deal, since system will still work fine.
- Just undershoot by one node sometimes for 1/N queries.
- Roughly, work backward.
- First, find the predecessor for the point halfway across the ring.
- Make sure his 'furthest' entry lists self as successor.
- Do this for all fingers on new node.
- Whenever an update is needed, work backward.
- [Figure 5a shows this well, some diagram on paper.]
Node failures make this hard.
- In practice, complicated enough, that they basically switch to polling.
- Has to work with concurrent joins/leaves.
- Goal: Just make sure the successor pointers stay ok. Rest can be fixed up.
- Occasionally, do a findSuccessor 'lookup' on things in each node's finger table.
- If a different successor is reported, record it instead.
- [Can walk through bottom para of left column on page 7 if time.]
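The polling approach can be sketched as a periodic stabilize/notify exchange that repairs successor and predecessor pointers after joins. This is a simplified single-process sketch in the spirit of the paper's stabilization idea, with hypothetical names, no failures, and no finger repair.

```python
RING = 2 ** 3

def between(x, a, b):
    """True if x is strictly inside the ring interval (a, b)."""
    if a == b:                       # interval wraps the whole ring except a itself
        return x != a
    return 0 < (x - a) % RING < (b - a) % RING

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self
        self.predecessor = None

    def find_successor(self, key):
        node = self
        while not (between(key, node.id, node.successor.id)
                   or key == node.successor.id):
            node = node.successor
        return node.successor

    def join(self, existing):
        # Learn only a successor; polling repairs everything else.
        self.predecessor = None
        self.successor = existing.find_successor(self.id)

    def stabilize(self):
        # Ask the successor who it thinks precedes it; adopt that node if it is closer.
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        # candidate believes it might be our predecessor.
        if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

a = Node(0)
b = Node(3); b.join(a)
c = Node(1); c.join(a)               # joins before anyone has stabilized
for _ in range(5):                   # stand-in for periodic polling
    for n in (a, b, c):
        n.stabilize()
print(a.successor.id, c.successor.id, b.successor.id)   # 1 3 0
```

Both joins initially point at node 0, and a few polling rounds are enough here for the pointers to settle into the 0 -> 1 -> 3 -> 0 ring.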
What happens in the case of a network partition?
- Paper says it's unclear if disjoint cycles will emerge.
- It seems like this must be the case?
Keep next log N successors at each node as well.
- Successor is the one true way to ensure queries work.
- If some go away, can patch up quickly without getting disconnected.
- Figure 13.
- 200 nodes.
- About 3x slower than keeping complete table 60 ms -> 180 ms.
- Savings? Full table 200 * 32 bits = 800 bytes on each server.
- Chord: ceil(lg 200) = 8 entries * 2 fields * 4 bytes = 64 bytes per server + 4 for pred = 68 bytes
- Extra 2x for the r successors used in failure cases.
- Worthwhile in 2001?
- What about for 10,000 nodes?
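The state-size comparison can be redone for any N with a few lines. This sketch assumes 4 bytes per address or identifier and two fields per finger entry, which reproduces the 800 and 68 byte figures above; both are assumptions about what the notes count.

```python
import math

ADDR_BYTES = 4   # assumption: one 32-bit address or identifier per field

def full_table_bytes(n):
    """Every node stores the address of every node."""
    return n * ADDR_BYTES

def chord_bytes(n):
    """ceil(lg n) finger entries of 2 fields each, plus one predecessor pointer."""
    entries = math.ceil(math.log2(n))
    return entries * 2 * ADDR_BYTES + ADDR_BYTES

for n in (200, 10_000):
    print(n, full_table_bytes(n), chord_bytes(n))
# 200 800 68
# 10000 40000 116
```

Even at 10,000 nodes the full table is only about 40 KB per server, so the stronger argument for Chord is the cost of keeping that table current under churn rather than its raw size.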
What's really going on here?
- This is a big distributed index.
- Can lookup in log N time.
- Only need log N space for naming.
Upcoming lectures: consistency/concurrency models
- We need to think about what it means when operations interleave.
- Weaker models admit more schedules because more orders are ok.
- Problem: do we get 'correct' results?
- Depends on the application/algorithms.
In general, the weaker the model, the harder it is to reason about.
- Linearizability: equivalent to some total order, must match real time.
- Sequential consistency: equivalent to some total order, but may not match real time.
- Causal consistency: if you do something, and I observe the effect, then if others observe my
effects, they need to see your effect as well.
- Eventual consistency: operations may apply out of order, but the result is still ok.
- e.g. Commutative and associative operations.
- Set addition (not bag addition)
- What if we include set removal?...
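A tiny check of why set addition tolerates reordering while removal does not: enumerate every order of a handful of operations and count the distinct final states. The operation encoding is made up for the sketch.

```python
from itertools import permutations

def apply_all(ops):
    """Apply (kind, value) operations to an empty set, in the given order."""
    state = set()
    for kind, value in ops:
        if kind == "add":
            state.add(value)
        else:
            state.discard(value)
    return frozenset(state)

def outcomes(ops):
    """All distinct final states over every ordering of ops."""
    return {apply_all(order) for order in permutations(ops)}

adds = [("add", "x"), ("add", "y"), ("add", "x")]
mixed = [("add", "x"), ("remove", "x")]
print(len(outcomes(adds)), len(outcomes(mixed)))   # 1 2
```

Adds alone always converge to the same set, duplicates included; once a remove is mixed in, replicas that apply the operations in different orders can disagree on whether x is present.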
Where is a modern CPU on this spectrum?