Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Stoica, Morris, Karger, Kaashoek, and Balakrishnan

Why this paper?

- Good chance to discuss a topic we haven't talked about head on yet:
- Naming

- Good chance to discuss consistent hashing.
- In wide use today in many systems.

- Tilt conversation a bit toward p2p.
- p2p kind of old school now, but really important since it taught us how to build loosely coupled systems.
- We'll spend a lot of time today on detours, since this paper provides a good opportunity to discuss some important issues we haven't hit head on yet.

Significant Topics/Tensions so far

- Reliability/replication
- Consistency/Concurrency model
- Though more to come on this.

- Partitioning/scaling
- Time/clocks
- Notably absent topic: naming

Naming/routing

- Given some name determine a location of the corresponding object.
- lookup(name) -> address

- Approaches
- Centralized, Decentralized, Hierarchical
- Explicit, Implicit

- Examples of naming systems:
- DNS: hostname -> IP
- Hierarchical, delegated authority, but centralized trust

- Lab 4: explicit, two-level mapping
- h(k) % nshards -> shardId, shardId -> host

- FDS: (blob id, tractnumber) -> host
- tlt[(h(blob id) + tractnumber) % nservers]

- Another idea: arbitrary map
- m[k] -> host
- Make it easy to group 'near' keys on the same machine.
- Can move things arbitrarily.

- Things to consider:
- Size of routing table, if centralized
- Amount of per-client/server routing state
- Locality: a large centralized mapping isn't too bad if clients only need a small part at a time.
- Churn: if high, need fast convergence, else inefficient.

Chord: trying to solve naming in an environment we haven't talked about yet.

- p2p: random machines scattered all over the world.
- High churn.
- High latency.
- Flaky.
- Terrified of centralized control.
- Napster shutdown...

First, let's cover a core idea used in Chord that's common in data center systems.

- Then, we can see how the p2p goals change things.

Often in data center systems, it's hard to exploit locality, and there's a moderately large number of hosts.

- e.g. Facebook memcached.
- Don't care about grouping keys too intelligently.
- In fact, keeping correlated (nearby) keys together may create hotspots; hashing breaks that correlation.
- Idea: hash keys to choose server.

Problem: what if we need to add capacity, or a node crashes?

- mod forces all data to be reshuffled: each key now maps to a random server.
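A quick illustration of that reshuffle (not from the paper; the key names and server counts below are made up):

```
import hashlib

def h(key):
    # Stable hash so the result doesn't change between runs.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

keys = ["key-%d" % i for i in range(10000)]
before = {k: h(k) % 10 for k in keys}  # 10 servers
after = {k: h(k) % 11 for k in keys}   # add an 11th server
moved = sum(1 for k in keys if before[k] != after[k])
print("%.0f%% of keys changed servers" % (100.0 * moved / len(keys)))  # roughly 90%
```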

Consistent Hashing

- Idea: don't map hashes to serverId directly.
- Map serverIds and keys into a single hash space.
- Then, map keys to servers based on proximity of their hashes.
- [draw rings example]

Example:

```
Hash space: 0 to 2**3 - 1, so h(x) in [0, 8).
lookup table (sorted by hash):
h(serverId) serverId
2           B
5           A
6           C
lookup(k):
    # Successor: first server whose hash >= h(k); wrap to the first entry.
    for hs, s in table:
        if h(k) <= hs:
            return s
    return table[0].serverId
h(k) -> 3, then return A
h(k) -> 2, then return B
h(k) -> 6, then return C
h(k) -> 7, then return B (wraps around)
```

Now what if we add D, h(D) -> 0

```
lookup table:
h(serverId) serverId
0 D
2 B
5 A
6 C
h(k) -> 3, then return A (same)
h(k) -> 2, then return B (same)
h(k) -> 6, then return C (same)
h(k) -> 7, then return D
Node B gives ownership of hashes 7 and 0 to D; nothing else shifts around.
```

When we add/remove a node only about 1/Nth of the key space moves.
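A minimal runnable version of the ring above (same made-up server hashes; a real system would hash into a much larger space):

```
import bisect

class Ring:
    def __init__(self):
        self.hashes = []    # sorted server hashes
        self.servers = []   # serverId at the matching index

    def add(self, server_hash, server_id):
        i = bisect.bisect_left(self.hashes, server_hash)
        self.hashes.insert(i, server_hash)
        self.servers.insert(i, server_id)

    def lookup(self, key_hash):
        # Successor: first server hash >= key hash, wrapping to index 0.
        i = bisect.bisect_left(self.hashes, key_hash)
        return self.servers[i % len(self.servers)]

ring = Ring()
for hs, s in [(2, "B"), (5, "A"), (6, "C")]:
    ring.add(hs, s)
print([ring.lookup(k) for k in (3, 2, 6, 7)])  # ['A', 'B', 'C', 'B']
ring.add(0, "D")
print([ring.lookup(k) for k in (3, 2, 6, 7)])  # ['A', 'B', 'C', 'D'] -- only 7 (and 0) moved
```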

Popular, especially for spreading load on KVS/memcached.

- Naturally spreads key/values and load.
- Low rebalance cost on join/leave.

Chord uses this at its core, but this isn't enough for them.

- How do we deal with the fact that we can't keep this central table?
- If all nodes knew about all nodes, this could work.
- In p2p, maybe 1e6 nodes... coming and going all the time.
- Also, how does a node join/leave without a central authority?

First, let's ask: how well does this really spread load?

- Keys are scattered randomly, so this has to be optimal, right?
- Imbalance in key/value sizes.
- What if someone tries to store 1 TB in a value and the rest of the values are 1KB?

- Imbalance in access rates.
- Even without that: assume same access rate, same sizes?
- [Figure 8a, b]
- 8a: as more and more values are added, the 99th-percentile node's key count grows faster than the average node's! What's going on here?
- n balls into n bins? |worst bucket| ~= (log n)/(log log n)
- Even if number of servers grows at same rate as key space!
- 2^32 keys and 2^32 servers? Expect some servers to have 10 or more.

Sidebar: power of two choices.

- Cool load balancing technique that can be used in tons of places.
- Look in two buckets, pick the less loaded of the two and place ball there.
- Worst case bucket growth goes to (log log n)/(log 2), or less than 3.
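A quick simulation of both claims (made-up parameters, just to see the shapes): one random bin per ball versus picking the emptier of two random bins.

```
import math
import random

def max_load(n, choices):
    bins = [0] * n
    for _ in range(n):  # throw n balls into n bins
        candidates = random.sample(range(n), choices)
        target = min(candidates, key=lambda b: bins[b])  # emptier of the sampled bins
        bins[target] += 1
    return max(bins)

n = 100000
print("one choice: ", max_load(n, 1), "~ ln n / ln ln n =", round(math.log(n) / math.log(math.log(n)), 1))
print("two choices:", max_load(n, 2), "~ ln ln n / ln 2 =", round(math.log(math.log(n)) / math.log(2), 1))
```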

But there's no choice here: which bucket a key lands in is a deterministic function of its hash.

- Idea: place each server at multiple points in the hash space.
- Virtual nodes.

- Sum of hash space pieces more likely to be balanced.
- [Adhoc diagram on a ring.]
- Very common optimization.
- Table state grows linearly with the number of virtual nodes.
- [Figure 9.]
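A sketch of virtual nodes (the hash, vnode count, and server names are illustrative assumptions): each server appears at many points on the ring, and the sum of its slices evens out.

```
import bisect
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

VNODES = 100  # virtual points per physical server
points = sorted((h("%s#%d" % (server, v)), server)
                for server in ("A", "B", "C")
                for v in range(VNODES))
hashes = [p[0] for p in points]

def lookup(key):
    # Successor of the key's hash among all virtual points, wrapping around.
    i = bisect.bisect_left(hashes, h(key)) % len(points)
    return points[i][1]

load = Counter(lookup("key-%d" % i) for i in range(30000))
print(load)  # much closer to a 1/3 split than with one point per server
```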

Ok - end side bar on data center naming.

Back to Chord's problem.

- Use this consistent hashing ring.
- Need to decentralize table state.
- Need to tolerate decentralized join/leave.

Idea: what if each node only keeps track of its successor?

- Can basically run the same algorithm we had before.
- Given some k with h(k) and the address of any node n in the ring.
- Start at n.
- If between 'me' and 'successor' then return successor, else iterate.
- if h(k) in (h(n), h(n.successor)] (on the ring, wrapping past 0), then return n.successor
- else retry at n.successor (sketch below).
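A sketch of that walk (not the paper's pseudocode; the node ids reuse the earlier 3-bit example):

```
class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None

def between(x, lo, hi):
    # True if x is in the half-open ring interval (lo, hi].
    if lo < hi:
        return lo < x <= hi
    return x > lo or x <= hi  # interval wraps past 0

def lookup(n, key_hash):
    # Walk the ring one successor at a time: O(N) hops, one RTT each.
    while not between(key_hash, n.id, n.successor.id):
        n = n.successor
    return n.successor

b, a, c = Node(2), Node(5), Node(6)
b.successor, a.successor, c.successor = a, c, b
print(lookup(b, 4).id)  # 5: the successor of hash 4
print(lookup(b, 7).id)  # 2: wraps around the ring
```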

Problem: if > 10,000 hosts, > 10,000 round trips to find an object:

- 16 minutes to do a single lookup with 100 ms RTT

Idea: keep 'fingers'/chords that allow us to 'cut across' the ring as a shortcut.

- Q: Where do we point the fingers?
- Q: What about evenly across the key space?
- Only gives a linear speedup for linear increase in per-node state.

- Idea: have nodes keep more information on nearby neighbors, less on those 'far away'.
- Just make sure each next node queried gets us a bit closer, guaranteed to eventually find what we want.
- [Chord diagram.]
- Ask node n about a key far away: it forwards you to a node n' roughly halfway around the ring toward the key.
- Ask node n about a key next door: just sends you one hop over.

Each node keeps log(N) entries, each covering twice as much of the ring as the entry before.

- In each entry, track the first node at or past the point to which the entry refers.
- Every interval of the key space has some node assigned to it.
- Map the requests for keys into the hash space.
- Use the local table to find the successor for that hash.
- Route the request there, skipping over up to half of the nodes.
- Figure 3b, walkthrough lookup h(k) -> 4
- 0.lookup(4) -> 0
- 3.lookup(4) -> 0
- 1.lookup(4) -> 3, 3.lookup(4) -> 0 (4 isn't in (1, 3], so node 1 forwards to its closest preceding finger, 3, which returns its successor 0)
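A sketch of the finger-based lookup on the Figure 3 ring (nodes 0, 1, 3 in a 3-bit space); it follows the shape of the paper's find_successor / closest-preceding-finger routines, but the code itself is illustrative:

```
M = 3                          # 3-bit id space, as in Figure 3
RING = 2 ** M
NODE_IDS = [0, 1, 3]

def successor_of(h):
    # First node id >= h, wrapping around the ring.
    for nid in NODE_IDS:
        if nid >= h:
            return nid
    return NODE_IDS[0]

def between(x, lo, hi):
    # Open ring interval (lo, hi).
    if lo < hi:
        return lo < x < hi
    return x > lo or x < hi

class Node:
    def __init__(self, nid):
        self.id = nid
        # finger[i] = successor((id + 2**i) mod RING): each entry covers twice as much ring.
        self.finger = [successor_of((nid + 2 ** i) % RING) for i in range(M)]
        self.successor = self.finger[0]

    def closest_preceding_finger(self, key):
        for f in reversed(self.finger):       # try the furthest finger first
            if between(f, self.id, key):
                return NODES[f]
        return self

    def find_successor(self, key):
        n = self
        while not (between(key, n.id, n.successor) or key == n.successor):
            n = n.closest_preceding_finger(key)  # skip up to half the remaining ring
        return n.successor

NODES = {nid: Node(nid) for nid in NODE_IDS}
print([NODES[nid].find_successor(4) for nid in NODE_IDS])  # [0, 0, 0]
```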

- [Draw visualization from Figure 4.]

Node joins:

- Simple approach:
- For node n, lookup(n) -> ns.
- Find ns.predecessor -> np (Track predecessors to make this fast.)
- n.successor = ns
- n.predecessor = np
- ns.predecessor = n
- np.successor = n
- Tell KVS above to transfer state.
- Two remaining issues
- n's finger table.
- Mostly n can steal np's table.
- Why? Many entries in np's table already report the same successor.
- For any entry of n's table subsumed by an entry of np's, it can copy the successor.
- e.g. np = 1, and says [2, 4) -> 8, then n = 2, can put [3, 4) -> 8, etc.

- Finger table of earlier nodes.
- Not too big a deal, since system will still work fine.
- Lookups just undershoot by one node for roughly 1/N of queries.
- Roughly, work backward.
- First, find the predecessor for the point halfway across the ring.
- Make sure its 'furthest' finger entry lists the new node as successor.
- Do this for all fingers on new node.
- Whenever an update is needed, work backward.
- [Figure 5a shows this well, some diagram on paper.]
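A sketch of the pointer splice for the simple join (illustrative only; it uses a plain linear find_successor and omits key transfer and finger fix-up):

```
class Node:
    def __init__(self, nid):
        self.id = nid
        self.successor = self
        self.predecessor = self

def find_successor(start, nid):
    # Linear walk is enough for a sketch; Chord would use fingers here.
    n = start
    while True:
        lo, hi = n.id, n.successor.id
        if lo < hi and lo < nid <= hi:
            return n.successor
        if lo >= hi and (nid > lo or nid <= hi):  # wrapping interval (lo, hi]
            return n.successor
        n = n.successor

def join(new, bootstrap):
    ns = find_successor(bootstrap, new.id)  # new node's successor...
    np = ns.predecessor                     # ...and predecessor
    new.successor, new.predecessor = ns, np
    np.successor = new
    ns.predecessor = new
    # The KVS above would now move keys in (np.id, new.id] from ns to new.

a, b, c = Node(5), Node(2), Node(6)
join(b, a)
join(c, a)
n, ids = b, []
for _ in range(3):
    ids.append(n.id)
    n = n.successor
print(ids)  # [2, 5, 6]: nodes splice into sorted ring order
```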

Node failures make this hard.

- In practice, this is complicated enough that they basically switch to periodic polling (stabilization).
- Has to work with concurrent joins/leaves.

- Goal: just make sure the successor pointers stay ok; the rest can be fixed up (sketch after this list).
- Occasionally, do a findSuccessor 'lookup' on things in each node's finger table.
- If a different successor is reported, record it instead.
- [Can walk through bottom para of left column on page 7 if time.]
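A sketch of what that polling looks like, roughly following the shape of the paper's stabilize/notify routines (the ids and the lazy join below are made up; fix_fingers and failure handling are omitted):

```
class Node:
    def __init__(self, nid, successor=None):
        self.id = nid
        self.successor = successor or self
        self.predecessor = None

def between(x, lo, hi):
    # Open ring interval (lo, hi).
    return lo < x < hi if lo < hi else (x > lo or x < hi)

def stabilize(n):
    # Ask our successor for its predecessor; adopt it if it sits between us.
    x = n.successor.predecessor
    if x is not None and between(x.id, n.id, n.successor.id):
        n.successor = x
    notify(n.successor, n)

def notify(s, n):
    # s learns that n might be its predecessor.
    if s.predecessor is None or between(n.id, s.predecessor.id, s.id):
        s.predecessor = n

a, b = Node(2), Node(5)
a.successor, b.successor = b, a
a.predecessor, b.predecessor = b, a
c = Node(3, successor=b)       # joins knowing only some successor
for _ in range(3):             # a few rounds of polling patch the ring
    for n in (a, b, c):
        stabilize(n)
print(a.successor.id, c.successor.id, b.successor.id)  # 3 5 2
```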

What happens in the case of a network partition?

- The paper says it's unclear whether disjoint cycles will emerge.
- It seems like this must be the case?

Keep next log N successors at each node as well.

- Successor is the one true way to ensure queries work.
- If some go away, can patch up quickly without getting disconnected.

Performance

- Figure 13.
- 200 nodes.
- About 3x slower than keeping a complete table: 60 ms -> 180 ms.
- Savings? Full table 200 * 32 bits = 800 bytes on each server.
- Chord: lg(200) ~= 8 finger entries * 8 bytes = 64 bytes per server + 4 for pred = 68 bytes
- Extra 2x for the r successors used in failure cases.

- Worthwhile in 2001?
- What about for 10,000 nodes?

What's really going on here?

- This is a big distributed index.
- Can lookup in log N time.
- Only need log N space for naming.

- We need to think about what it means when operations interleave.
- Weaker models admit more schedules because more orders are ok.
- Problem: do we get 'correct' results?
- Depends on the application/algorithms.
- In general, the weaker the model, the harder it is to reason about.

Linearizability/Serializability+External Consistency

- Equivalent to some total order, must match real time.

Sequential Consistency/Serializability

- Equivalent to some total order, but may not match real time.

Causal Consistency

- If you do something, and I observe the effect, then if others observe my effects, they need to see your effect as well.

Eventual Consistency

- Operations apply out of order, but ok.
- e.g. Commutative and associative operations.
- Set addition (not bag addition)
- What if we include set removal?...
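A tiny illustration (not from the paper): set union is commutative, associative, and idempotent, so add-only replicas converge regardless of delivery order; removal breaks this without extra state (e.g., tombstones).

```
replica1, replica2 = set(), set()
ops = ["a", "b", "c"]
for x in ops:              # replica 1 applies the adds in order
    replica1.add(x)
for x in reversed(ops):    # replica 2 sees them out of order, one of them twice
    replica2.add(x)
replica2.add("a")
print(replica1 == replica2)  # True: order and duplicates don't matter for add-only sets
# Add remove(x) and order matters again: add-then-remove != remove-then-add,
# so you'd need something like a tombstone set to reconcile.
```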

Where is a modern CPU on this spectrum?