Scalability! But at what COST?
McSherry, Isard, and Murray
Overall, this paper should increase your skepticism when reading recent distributed systems work and help you separate important improvements from misleading measurements.
Key idea: recent software systems have focused on "scaling" for performance, but some of these systems running on many nodes are actually slower than a simple implementation running on one node (or even a single thread).
Big data/data parallelism
Scaling is a popular topic in recent top systems conferences.
Raises the question: what's the speedup potential with parallelism?
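One standard way to bound that potential is Amdahl's law (not stated in these notes or the paper, but the usual framing): if a fraction s of the job is inherently sequential, n workers can speed it up by at most 1/(s + (1-s)/n). A tiny sketch:

```rust
// Amdahl's law (standard formula, not from the paper): with a
// sequential fraction `s` of the work and `n` parallel workers,
// the best possible speedup is 1 / (s + (1 - s) / n).
fn amdahl_speedup(s: f64, n: f64) -> f64 {
    1.0 / (s + (1.0 - s) / n)
}
```

Even 5% sequential work (setup, coordination) caps the speedup at 20x no matter how many workers you add.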
Parallel (each worker runs one slice, after setup/coordination overhead):
  |setup|-----|
  |setup|-----|
  |setup|-----|
Sequential (one worker runs the slices back to back):
  |-----|-----|-----|
Time
| o
|
|
| o
|
| o
| o
|-----------------------
1 Workers n
Is this 1/n curve what we'll get?
Time
| x x
| x
| x
| x
| x
| x
|
|-----------------------
1 Workers n
Time
| x x
|  x      z
|   x     z
|    x
| C   x
|      x
|-----------------------
 1      Workers       n
(x, z = two "scalable" systems; C = the time of a simple single-threaded
implementation. A system's COST is the Configuration that Outperforms a
Single Thread: the number of cores it needs before its curve drops below
C. The z system plateaus above C, so its COST is unbounded.)
Note: when this paper talks about cores, the cores aren't necessarily all in the same machine.
[Figure 1]
Surprising result of this paper: a simple single-threaded implementation may be faster than some scalable systems at every point on the graph, no matter how many workers they use.
First, we're going to talk about graph analytics frameworks, since that's the example from this paper. Don't get hung up on graph processing itself; the high-level point is about scaling and efficiency. We just need this background for context.
For simplicity, assume synchronous rounds.
What is this good for?
o--o o---o
| / |
o o o---o
fn pagerank<G: EdgeMapper>(graph: &G, nodes: u32, alpha: f32) {
    let mut src: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut dst: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut deg: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    // Initialize deg of all nodes to count of outbound neighbors.
    graph.map_edges(|x, _| { deg[x as usize] += 1f32 });
    for _iteration in 0..20 {
        for node in 0..nodes as usize {
            // Divide up current weight of the node among neighbors.
            src[node] = alpha * dst[node] / deg[node];
            // Each node starts with 1 - alpha weight.
            dst[node] = 1f32 - alpha;
        }
        // Then add on the weight coming in from neighbors.
        // i.e. Each neighbor y of x receives a share of x's original weight.
        graph.map_edges(|x, y| { dst[y as usize] += src[x as usize]; });
    }
}
Look at the last map_edges: it does the heavy part, a scan over all the edges of the graph.
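The pagerank code above assumes an EdgeMapper trait; the paper doesn't show its definition here, so this is a minimal sketch of what it could look like, with a toy in-memory edge list as one implementation:

```rust
// Hypothetical minimal EdgeMapper trait matching the pagerank sketch:
// the graph just knows how to run a callback over every edge (x, y).
trait EdgeMapper {
    fn map_edges<F: FnMut(u32, u32)>(&self, action: F);
}

// Toy implementation: a plain in-memory edge list.
struct EdgeList {
    edges: Vec<(u32, u32)>,
}

impl EdgeMapper for EdgeList {
    fn map_edges<F: FnMut(u32, u32)>(&self, mut action: F) {
        for &(x, y) in &self.edges {
            action(x, y);
        }
    }
}
```

The real implementations in the paper stream edges from a file instead of holding a Vec, but the interface is the same idea: the framework controls the edge scan, the algorithm supplies the per-edge action.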
Table 2
Label propagation is easy to express in the vertex-centric model.
Start
1--3 4---6
| / |
2 5 7---8---9
After round 1
1--1 4---4
| / |
1 4 7---7---8
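One synchronous round of the propagation above can be sketched as follows (my minimal version, assuming labels start as each node's own id and each round reads the previous round's labels, which is why 9 only reaches 8, not 7, after round 1):

```rust
// Sketch: one synchronous round of label propagation over an
// undirected edge list. Each node's new label is the min of its own
// old label and its neighbors' labels from the *previous* round.
fn propagate_round(old: &[u32], edges: &[(usize, usize)]) -> Vec<u32> {
    let mut new = old.to_vec();
    for &(x, y) in edges {
        new[x] = new[x].min(old[y]);
        new[y] = new[y].min(old[x]);
    }
    new
}
```

Repeating this until nothing changes gives every node the smallest id in its connected component.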
The paper used super-naive single-threaded implementations; simple improvements (like smarter edge orderings) make them even faster.
0 --> 1
| ^
V |
2 --> 3
Source order:   Interleaved bits (x and y bits alternated):
0, 1            0001b
0, 2            0100b
2, 3            1101b
3, 1            1011b
Z-curve order (edges sorted by interleaved bits):
0, 1
0, 2
3, 1
2, 3
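The bit interleaving above can be sketched in Rust (a hypothetical helper; the paper actually uses a Hilbert curve ordering, but the Z-order/Morton version shown in these notes is the simpler cousin):

```rust
// Sketch: Morton (Z-order) key for an edge (x, y), built by
// interleaving the bits of x and y: x bits in the odd positions,
// y bits in the even positions, matching the 4-bit example above.
fn morton(x: u32, y: u32) -> u64 {
    let mut key = 0u64;
    for i in 0..32 {
        key |= (((x >> i) & 1) as u64) << (2 * i + 1);
        key |= (((y >> i) & 1) as u64) << (2 * i);
    }
    key
}
```

Sorting the edge list with `edges.sort_by_key(|&(x, y)| morton(x, y))` produces exactly the Z-curve order listed above.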
In the limit (edges from sources 0 and 2 to many destinations), source order gives:
0, 1
0, 10
0, 20
...
0, 1000
...
0, 100000
...
2, 1
2, 10
...
2, 100000
Z-curve order interleaves them:
0, 1
2, 1
0, 10
2, 10
0, 20
...
0, 1000
...
0, 100000
2, 100000
...
With this ordering, edges whose x's are close and whose y's are close end up near each other, so the per-node data they touch is likely to hit in cache.
Source order: an edge can go from any source to any destination, so reads of the destination nodes jump all over memory. Z-curve: if node 10 has an edge to node 200 and so does node 20, the two accesses to node 200 happen close together and are likely to hit in the buffer cache.
Union-find - remember this from algorithms class? The paper's improved single-threaded implementation uses it to find connected components in a single scan over the edges.
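As a refresher, here's a minimal union-find sketch (my version, with path halving but no union-by-rank, for brevity):

```rust
// Sketch: union-find. find() returns a node's component
// representative; union() merges two components. Scanning the edge
// list once and calling union(x, y) per edge labels all components.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        // Every node starts as its own component.
        UnionFind { parent: (0..n).collect() }
    }
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            // Path halving: point x at its grandparent as we walk up.
            self.parent[x] = self.parent[self.parent[x]];
            x = self.parent[x];
        }
        x
    }
    fn union(&mut self, x: usize, y: usize) {
        let (rx, ry) = (self.find(x), self.find(y));
        if rx != ry {
            self.parent[rx] = ry; // no union-by-rank, for brevity
        }
    }
}
```

Unlike label propagation, this needs no iteration to convergence: one pass over the edges suffices, which is a big part of why the single-threaded version is so fast.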
Table 5