Why MapReduce?
- A second look at the paper, this time for fault tolerance and performance
- Starting point in current enthusiasm for big cluster computing
- A triumph of simplicity for programmer
- Bulk orientation well matched to cluster with slow network
- Very influential, inspired many successors (Hadoop, Spark, &c)
Cluster computing for Big Data
- 1000 computers + disks
- a LAN
- split up data+computation among machines
- communicate as needed
- similar to DSM vision but much bigger, no desire for compatibility
Example: inverted index
- e.g. index terabytes of web pages for a search engine
- Input:
- A collection of documents, e.g. crawled copy of entire web
- doc 31: i am alex
- doc 32: alex at 8 am
- Output:
- alex: 31/3 32/1 ...
- am: 31/2 32/4 ...
- Map(document file i):
- split into words
- for each offset j
- emit key=word[j] value=i/j
- Reduce(word, list of d/o)
- emit word, sorted list of d/o
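- A minimal sketch of the two functions above in Go (the signatures and
  the emit-as-a-slice style are my assumptions, not the paper's actual
  C++ interface):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // KeyValue is one intermediate pair: Key is a word,
    // Value is "doc/offset".
    type KeyValue struct {
        Key   string
        Value string
    }

    // Map splits one document into words and emits
    // (word, "doc/offset") for every occurrence.
    func Map(docID int, contents string) []KeyValue {
        var kvs []KeyValue
        for j, word := range strings.Fields(contents) {
            // offsets are 1-based, matching the example above
            kvs = append(kvs, KeyValue{word, fmt.Sprintf("%d/%d", docID, j+1)})
        }
        return kvs
    }

    // Reduce gets every "doc/offset" value for one word and emits
    // them as a sorted list. (Plain string sort is good enough for
    // this toy example's one-digit offsets.)
    func Reduce(word string, values []string) string {
        sort.Strings(values)
        return word + ": " + strings.Join(values, " ")
    }

    func main() {
        kvs := append(Map(31, "i am alex"), Map(32, "alex at 8 am")...)
        byWord := map[string][]string{} // the framework's shuffle+sort step
        for _, kv := range kvs {
            byWord[kv.Key] = append(byWord[kv.Key], kv.Value)
        }
        for word, vals := range byWord {
            fmt.Println(Reduce(word, vals))
        }
    }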
Diagram:
- input partitioned into M splits on GFS: A, B, C, ...
- Maps read local split, produce R local intermediate files (A0, A1, ..., A(R-1))
- Reduce # = hash(key) % R
- Reduce task i fetches Ai, Bi, Ci -- from every Map worker
- Sort the fetched files to bring same key together
- Call Reduce function on each key's values
- Write output to GFS
- Master controls all:
- Map task list
- Reduce task list
- Location of intermediate data (which Map worker ran which Map task)
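- A sketch of the partition rule above in Go (the FNV hash is my
  choice; the paper says only hash(key) mod R):

    package mr // hypothetical package

    import "hash/fnv"

    // reduceTaskFor says which of the R intermediate files (A0 ... A(R-1))
    // a key goes to. Every Map worker uses the same function, so all
    // values for a given key land at the same Reduce task.
    func reduceTaskFor(key string, R int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        // mask the sign bit so the result is non-negative on 32-bit ints
        return int(h.Sum32()&0x7fffffff) % R
    }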
Notice:
- Input is huge -- terabytes
- Info from all parts of input contributes to each output index entry
- So terabytes must be communicated between machines
- Output is huge -- terabytes
The main challenge: communication bottleneck
- Three kinds of data movement needed:
- Read huge input
- Move huge intermediate data
- Store huge output
- How fast can one move data?
- RAM: 1000x1 GB/sec = 1000 GB/sec
- disk: 1000x0.1 GB/sec = 100 GB/sec
- net cross-section: 10 GB/sec
- Host link b/w (what one machine's NIC can send) vs net cross-section
  b/w (total rate if one half of the machines send to the other half,
  limited by the core switches)
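- Back-of-the-envelope, using the numbers above: shuffling 10 TB of
  intermediate data across the 10 GB/sec cross-section takes at least
  10,000 GB / 10 GB/sec = 1,000 seconds; the 1000 disks together could
  read that same 10 TB in 100 seconds, so the network (not the disks)
  is the bottleneck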
What are the crucial design decisions in MapReduce?
- Contrast to KVS get/put
- They allow arbitrary random interaction among threads/clients.
- But: latency sensitive, poor throughput efficiency.
- Maps and Reduces work on local data -> reduced network communication.
- For Map, split storage and computation in the same way, use local disk.
- Maps and Reduces work on big batches of data -> no small latency-sensitive network messages.
- Very little interaction:
- Maps and Reduces can't interact with each other directly.
- No interaction across phase boundaries.
- -> Can re-execute single Map/Reduce independently, no need for e.g. global checkpoint.
- (Why would this be hard in a general distributed program?)
- Programmer can't directly cause network communication,
but has indirect control since Map specifies key.
Where does MapReduce input come from?
- Input is striped+replicated over GFS in 64 MB chunks
- But in fact Map always reads from a local disk
- They run the Maps on the GFS server that holds the data
- Tradeoff:
- Good: Map reads at disk speed, much faster than over net from GFS server
  - Bad: only two or three choices of where a given Map can run;
    a potential problem for load balance and stragglers
Where does MapReduce store intermediate data?
- On the local disk of the Map server (not in GFS)
- Tradeoff:
- Good: local disk write is faster than writing over network to GFS server
- Bad: only one copy, potential problem for fault-tolerance and
load-balance
Where does MapReduce store output?
- In GFS, replicated, separate file per Reduce task
- So output requires network communication -- slow
- The reason: replicated output survives failures, and can then be used
  as input for a subsequent MapReduce
The Question: How soon after it receives the first file of intermediate data
can a reduce worker start calling the application's Reduce function?
Why does MapReduce postpone choice of which worker runs a Reduce?
- After all, it might run faster if Map output were streamed directly to the Reduce worker
- Dynamic load balance!
- If fixed in advance, one machine 2x slower -> 2x delay for whole
computation and maybe the rest of the cluster idle/wasted half the time
Will MR scale?
- Will buying 2x machines yield 1/2 the run-time, indefinitely?
- Map calls probably scale
- 2x machines -> each Map's input 1/2 as big -> done in 1/2 the time
- but: input may not be infinitely partitionable
- but: tiny input and intermediate files have high overhead
- Reduce calls probably scale
- 2x machines -> each handles 1/2 as many keys -> done in 1/2 the time
- but: can't have more workers than keys
- but: limited if some keys have more values than others
- e.g. "the" has vast number of values for inverted index so 2x
machines -> no faster, since limited by key w/ most values
- Network may limit scaling, if large intermediate data
- Must spend money on faster core switches as well as more machines
- Not easy -- a hot R+D area now
- Stragglers are a problem, if one machine is slow, or load imbalance
- Can't solve imbalance w/ more machines
- Start-up time is about a minute!!!
  - Can't reduce it w/ more machines (probably makes it worse)
- More machines -> more failures
Now let's talk about fault tolerance
- The challenge: paper says one server failure per job!
- Too frequent for whole-job restart to be attractive
The main idea: Map and Reduce are deterministic, functional, and independent,
so MapReduce can deal with failures by re-executing
- Often a choice:
- Re-execute big tasks, or
- Save output, replicate, use small tasks
- Best tradeoff depends on frequency of failures and expense of communication
What if a worker fails while running Map?
- Can we restart just that Map on another machine?
- Yes: GFS keeps copy of each input split on 3 machines
- Master knows, tells Reduce workers where to find intermediate files
If a Map finishes, then that worker fails, do we need to re-run that Map?
- Intermediate output now inaccessible on worker's local disk.
- Thus need to re-run Map elsewhere unless all Reduce workers have already
fetched that Map's output.
What if Map had started to produce output, then crashed?
- Will some Reduces see Map's output twice?
- And thus produce e.g. word counts that are too high?
- (A: no; the master accepts only the first worker to notify it of a
  Map task's completion -- first-to-notify wins)
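- A sketch of the master-side bookkeeping for the cases above (the type
  and method names are my invention; the paper describes the state, not
  the code):

    package mr

    // State of one Map task as the master sees it.
    type TaskState int

    const (
        Idle TaskState = iota // unassigned, or must be re-run
        InProgress
        Completed
    )

    type MapTask struct {
        State  TaskState
        Worker string // who ran it; its local disk holds the intermediate files
    }

    type Master struct {
        mapTasks map[int]*MapTask
    }

    // mapDone records a completion. First-to-notify wins: a duplicate
    // execution's output is ignored, so no Reduce sees a Map's output twice.
    func (m *Master) mapDone(task int, worker string) {
        t := m.mapTasks[task]
        if t.State == Completed {
            return // another execution already won
        }
        t.State, t.Worker = Completed, worker
    }

    // workerFailed marks the dead worker's Map tasks Idle so they get
    // re-executed: their intermediate output is gone with the local disk.
    // (Could be skipped for tasks whose output every Reduce already fetched.)
    func (m *Master) workerFailed(worker string) {
        for _, t := range m.mapTasks {
            if t.Worker == worker && t.State != Idle {
                t.State, t.Worker = Idle, ""
            }
        }
    }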
What if a worker fails while running Reduce?
- Where can a replacement worker find Reduce input?
- If a Reduce finishes, then worker fails, do we need to re-run?
- No: Reduce output is stored+replicated in GFS.
- (Rely on atomic rename of output files.)
Load balance
- What if some Map machines are faster than others?
- Or some input splits take longer to process?
- Don't want lots of idle machines and lots of work left to do!
- Solution: many more input splits than machines
- Master hands out more Map tasks as machines finish
- Thus faster machines do bigger share of work
- But there's a constraint:
- Want to run Map task on machine that stores input data
- GFS keeps 3 replicas of each input data split
- So only three efficient choices of where to run each Map task
  - If none of those three machines is free, try to run the task in the
    same rack as one of the replicas
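- Extending the Master sketch above, one way the handout rule might
  look (localSplits, saying which splits have a GFS replica on this
  worker's disk, is a hypothetical argument):

    // assignMapTask gives an Idle task to a worker asking for work,
    // preferring a task whose input split is on that worker's own disk.
    func (m *Master) assignMapTask(worker string, localSplits map[int]bool) (int, bool) {
        for id, t := range m.mapTasks { // first pass: local input
            if t.State == Idle && localSplits[id] {
                t.State, t.Worker = InProgress, worker
                return id, true
            }
        }
        for id, t := range m.mapTasks { // second pass: anything idle
            if t.State == Idle {
                t.State, t.Worker = InProgress, worker
                return id, true
            }
        }
        return 0, false // no work left to hand out
    }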
Stragglers
- Often one machine is slow at finishing the very last task:
  h/w or s/w wedged, or overloaded with some other work
- Load balance only balances newly assigned tasks
- Solution: always schedule multiple copies of very last tasks!
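- In the same Master sketch, the backup-task rule could be as simple as
  this (when to call it -- "close to completion" -- is the paper's only
  guidance):

    // scheduleBackups picks every still-InProgress task for a second,
    // redundant execution. Whichever copy finishes first wins via the
    // master's first-to-notify rule, so duplicates are harmless.
    func (m *Master) scheduleBackups() []int {
        var ids []int
        for id, t := range m.mapTasks {
            if t.State == InProgress {
                ids = append(ids, id)
            }
        }
        return ids
    }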
How many Map/Reduce tasks vs workers should we have?
- They use M = 10x number of workers, R = 2x.
- More => finer grained load balance.
- More => less redundant work for straggler reduction.
- More => spread tasks of failed worker over more machines, re-execute faster.
- More => overlap Map and shuffle, shuffle and Reduce.
- Fewer => bigger intermediate files w/ less per-file overhead.
- M and R also maybe constrained by how data is striped in GFS.
  - e.g. 64 MB GFS chunks means M needs to be total data size / 64 MB
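  - Worked example: 1 TB of input in 64 MB chunks -> M = 1,000,000 MB / 64 MB,
    about 16,000 Map tasks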