CS6963 Distributed Systems

Lecture 01 Introduction

Agenda

  • Try to get people to move to the front?
  • Introduce self.
  • Write name on board, pronounce name.
  • Today
    • Break the ice.
    • Lay out the structure of the course, including the boring stuff.
    • Jump right in and start discussing MapReduce.

Intro

  • What is a distributed system?

    • multiple networked cooperating computers
    • Examples:
      • Email
      • Gmail
      • NFS
      • HTTP
      • DNS
      • ARP
      • Databases?
      • MapReduce
  • Why distribute?

    • Performance
      • Parallel CPUs/mem/disk/net
    • Capacity
    • Fault-tolerance
    • Connect physically separate entities
    • Security via physical isolation?
  • But:

    • complex, hard to debug
    • new classes of problems, e.g. partial failure (did server accept my e-mail?)
    • advice: don't distribute if a central system will work
  • Why take this course?

    • interesting -- hard problems, non-obvious solutions
    • active research area -- lots of progress + big unsolved problems
    • used by real systems -- driven by the rise of big Web sites
    • hands-on -- you'll build a real system in the labs

Assessment Exercise

Organization

Do [organization stuff](01-org).

Main Topics

  • The calendar flows through these topics (we'll revisit each of these ideas several times, in various orders, but this is the intended order in which the papers focus on them):
    • Messaging, remote interaction (RPC)
    • Fault-tolerance, replication, and consensus (Raft)
    • Primary-backup replication (GFS)
    • Fault-tolerant large-scale compute (MapReduce, Spark)
    • Consistency/consistency models (Bayou, Dynamo)
    • Real-world consistency and scaling (Scaling Memcached at Facebook)
    • Transactions (Thor, Spanner, Argus)
    • Byzantine fault-tolerance, P2P (PBFT, Bitcoin)
    • Other possible topics: verifying distributed systems (Verdi)

Discussion

  • Example:

    • a shared file system, so users can cooperate, like NFS
    • lots of client computers
    • [diagram: clients, network, vague set of servers]
  • There are many possible designs; let go of how you expect it to work, and consider the whole design space.

  • Topic: architecture

    • What interface?
      • Clients talk to servers -- what do they say?
      • File system (files, file names, directories, etc.)?
      • Disk blocks, with FS in client?
      • Separate naming + file servers?
      • Separate FS + block servers?
    • Single machine room or unified wide area system?
      • Wide-area more difficult.
    • Transparent?
      • i.e. should it act exactly like a local disk file system?
      • or is it OK if apps/users have to cope with distribution, e.g. know what server files are on, or deal with failures.
    • Client/server or peer-to-peer?
    • All these interact w/ performance, usefulness, fault behavior.
  • Topic: implementation

    • How to simplify network communication?
      • Can be messy (msg formatting, re-transmission, host names, etc.)
      • Frameworks can help: RPC, MapReduce, etc.
    • How to cope with inherent concurrency?
      • Threads, locks, etc.
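
For example, in Go (the lab language), the smallest version of this is goroutines sharing state under a mutex. This is a generic sketch, not lab code:

  package main

  import (
      "fmt"
      "sync"
  )

  func main() {
      var mu sync.Mutex     // guards count
      var wg sync.WaitGroup // waits for all goroutines to finish
      count := 0

      for i := 0; i < 10; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              mu.Lock() // without the lock, the increments race
              count++
              mu.Unlock()
          }()
      }
      wg.Wait()
      fmt.Println(count) // always 10 with the lock; unpredictable without it
  }
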
  • Topic: performance

    • Distribution can hurt: network b/w and latency bottlenecks
      • Lots of tricks, e.g. caching, concurrency, pre-fetch
    • Distribution can help: parallelism, pick server near client
    • Idea: scalable design
      • Nx servers -> Nx total performance
    • Need a way to divide the load by N
      • Divide data over many servers ("sharding" or "partitioning") -- see the hashing sketch below
      • By hash of file name?
      • By user?
      • Move files around dynamically to even out load?
      • "Stripe" each file's blocks over the servers?
    • Performance scaling is rarely perfect
      • Some operations are global and hit all servers (e.g. search)
        • Nx servers -> 1x performance
      • Load imbalance
        • Everyone wants to get at a single popular file
        • one server 100%, added servers mostly idle
        • Nx servers -> 1x performance
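
To make "divide data over many servers" concrete, here is a minimal Go sketch of sharding by hash of file name, mod the number of servers. It illustrates the idea only; it is not any particular system's placement scheme:

  package main

  import (
      "fmt"
      "hash/fnv"
  )

  // serverFor picks which of n servers stores a file, by hashing its name.
  // Simple mod-N placement: note that changing n moves most files around,
  // which is one reason real systems often prefer consistent hashing.
  func serverFor(filename string, n int) int {
      h := fnv.New32a()
      h.Write([]byte(filename))
      return int(h.Sum32()) % n
  }

  func main() {
      for _, f := range []string{"grades.txt", "hw1.pdf", "notes/l01.md"} {
          fmt.Printf("%s -> server %d\n", f, serverFor(f, 5))
      }
  }
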
  • Topic: fault tolerance

    • Big system (1000s of servers, complex net) -> always something broken
    • We might want:
      • Availability -- I can keep using my files despite failures
      • Durability -- my files will come back to life someday
    • Availability idea: replicate
      • Servers form pairs, each file on both servers in the pair
      • Client sends every operation to both
      • If one server down, client can proceed using the other
    • Opportunity: operate from both "replicas" independently if partitioned?
    • Opportunity: can 2 servers yield 2x availability AND 2x performance?
  • Topic: consistency

    • Assume a contract w/ apps/users about meaning of operations
    • e.g. "read yields most recently written value"
    • Consistency is about fulfilling the contract despite failures, replication/caching, concurrency, etc.
    • Problem: keep replicas identical
    • If one is down, it will miss operations
      • Must be brought up to date after reboot
    • If net is broken, both replicas may be live, and see different ops
      • Delete file, still visible via other replica
      • "split brain" -- usually bad
    • Problem: clients may see updates in different orders
      • Due to caching or replication
      • I make a class directory private, then TA creates grades file
      • What if the operations run in different orders on different replicas? (see the sketch below)
    • Consistency often hurts performance (communication, blocking)
      • Many systems cut corners -- "relaxed consistency"
      • Shifts burden to applications
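
A toy Go sketch of the reordering problem above: two replicas of a key/value store receive the same two writes but apply them in different orders, and end up disagreeing. Purely illustrative; not how any of the lab systems work:

  package main

  import "fmt"

  type op struct{ key, value string }

  // apply runs a sequence of writes against one replica's state.
  func apply(replica map[string]string, ops []op) {
      for _, o := range ops {
          replica[o.key] = o.value
      }
  }

  func main() {
      w1 := op{"x", "1"}
      w2 := op{"x", "2"}

      r1 := map[string]string{}
      r2 := map[string]string{}
      apply(r1, []op{w1, w2}) // replica 1 sees w1 then w2
      apply(r2, []op{w2, w1}) // replica 2 sees them in the other order

      fmt.Println(r1["x"], r2["x"]) // prints "2 1" -- the replicas have diverged
  }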

Labs

  • Lab submission is weird; walk through that.

  • focus: fault tolerance and consistency -- central to distrib sys

    • lab 1: MapReduce
    • labs 2 through 4: storage servers
    • progressively more sophisticated (tolerate more kinds of faults)
      • progressively harder too!
    • patterned after real systems, e.g. MongoDB
    • end up with core of a real-world design for 1000s of servers
  • what you'll learn from the labs

    • easy to listen to lecture / read paper and think you understand
    • building forces you to really understand
    • you'll have to do some design yourself
    • we supply skeleton, requirements, and tests
    • you'll have substantial scope to solve problems your own way
    • you'll get experience debugging distributed systems
    • tricky due to concurrency, unreliable messages
  • we've tried to ensure that the hard problems have to do w/ distrib sys

    • not e.g. fighting against language, libraries, etc.
    • thus Go (type-safe, garbage collected, slick RPC library)
    • thus fairly simple services (mapreduce, key/value store)
  • grades depend on how many test cases you pass

    • we give you the tests, so you know whether you'll do well
    • careful: if it usually passes, but occasionally fails, chances are it will fail when we run it
  • Lab 1: MapReduce

    • framework for parallel programming on 1000s of computers
    • help you get up to speed on Go and distributed programming
    • first exposure to some fault tolerance
    • motivation for better fault tolerance in later labs
    • motivating app for many papers
    • popular distributed programming framework
    • with many intellectual children
  • MapReduce computational model

    • programmer defines Map and Reduce functions
    • input is key/value pairs, divided into splits
    • perhaps lots of files, k/v is filename/content
    • Where do the k/v pairs come from?
      • Usually massive shared FS (GFS, see FDS lecture).
      • MR needs to know how to parse the files to convert them into k/v pairs.
// Apply a function to each input key/value pair; each application produces a
// list of key/value pairs, perhaps with different types than the input.
map :: (k1, v1) -> [(k2, v2)]

// For all v2's from map that share a common k2, apply a function that 'merges'
// them, producing a list of v2's.
reduce :: (k2, [v2]) -> [v2]
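
In Go terms (the lab language), the same shapes might look roughly like the following; the type names are illustrative, not the lab's actual definitions:

  package mr // illustrative package name

  // KeyValue is one (k2, v2) pair emitted by Map.
  type KeyValue struct {
      Key   string
      Value string
  }

  // Map: (k1, v1) -> [(k2, v2)]
  type MapFunc func(key, value string) []KeyValue

  // Reduce: (k2, [v2]) -> merged result for that key.
  // The signature above allows a list of v2's; many uses need only one.
  type ReduceFunc func(key string, values []string) string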

Distributed grep

map :: (linenum, string) -> [(linenum, string)]
map (l, s) = if contains(s, "search-term") then [(l, s)] else []

reduce :: (linenum, [string]) -> [string]
reduce (l, ss) = ["Match on line " ++ show(l) ++ ": " ++ head(ss)]

Sum values for all matching keys:

  Input Map -> a,1 b,7 c,9
  Input Map ->     b,2
  Input Map -> a,3     c,7
                |   |   |
                |   |   -> Reduce -> c,16
                |   -----> Reduce -> b,9
                ---------> Reduce -> a,4
  • MR framework calls Map() on each split, produces set of k2,v2
  • MR framework gathers all Maps' v2's for a given k2, and passes them to a Reduce call
  • final output is set of pairs from Reduce()

  • Example: word count
    • input is thousands of text files
  Map(k, v)
    split v into words
    for each word w
      emit(w, "1")
  Reduce(k, v)
    emit(len(v))
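
A sketch of the same word count in Go. The exact signatures in the lab's main/wc.go differ a bit, so treat the names and types here as illustrative:

  package mr // illustrative; the lab defines its own KeyValue type

  import (
      "strconv"
      "strings"
      "unicode"
  )

  type KeyValue struct {
      Key   string
      Value string
  }

  // Map splits its chunk of input into words and emits a (word, "1") pair
  // for each occurrence.
  func Map(document, contents string) []KeyValue {
      words := strings.FieldsFunc(contents, func(r rune) bool {
          return !unicode.IsLetter(r)
      })
      kvs := make([]KeyValue, 0, len(words))
      for _, w := range words {
          kvs = append(kvs, KeyValue{Key: w, Value: "1"})
      }
      return kvs
  }

  // Reduce receives every value emitted for one word and returns the count.
  func Reduce(word string, values []string) string {
      return strconv.Itoa(len(values))
  }
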
  • What does MR framework do for word count?
    • [master, input files, map workers, map output, reduce workers, output files]
  input files:
    f1: a b
    f2: b c
  send "f1" to map worker 1
    Map("f1", "a b") -> <a 1> <b 1>
  send "f2" to map worker 2
    Map("f2", "b c") -> <b 1> <c 1>
  framework waits for Map jobs to finish
  workers sort Map output by key
  framework tells each reduce worker what key to reduce
    worker 1: a
    worker 2: b
    worker 2: c
  each reduce worker pulls needed Map output from Map workers
    worker 1 pulls "a" Map output from every worker
  each reduce worker calls Reduce once for each of its keys
    worker 1: Reduce("a", [1]) -> 1
    worker 2: Reduce("b", [1, 1]) -> 2
              Reduce("c", [1]) -> 1
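
The grouping step the framework performs between Map and Reduce can be sketched sequentially in a few lines of Go, assuming the Map, Reduce, and KeyValue definitions from the word-count sketch above; the real framework does the same thing with intermediate files and many workers:

  // Sequential sketch: run Map over every input, group the emitted values
  // by key, then call Reduce once per key. Assumes the Map, Reduce, and
  // KeyValue definitions from the word-count sketch above.
  func runSequential(inputs map[string]string) map[string]string {
      grouped := map[string][]string{} // k2 -> all v2's emitted for it
      for name, contents := range inputs {
          for _, kv := range Map(name, contents) {
              grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
          }
      }
      out := map[string]string{}
      for key, values := range grouped {
          out[key] = Reduce(key, values)
      }
      return out
  }

  // runSequential(map[string]string{"f1": "a b", "f2": "b c"})
  //   -> map[a:1 b:2 c:1]
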
  • Why is the MR framework convenient?

    • programmer only needs to think about the core work, the Map and Reduce functions, and does not have to worry about network communication, failure, etc.
    • the grouping by key between Map and Reduce fits some applications well (e.g., word count), since it brings together data needed by the Reduce.
    • but some applications don't fit well, because MR allows only this one pattern of communication between the parts of the application, e.g. word count where the output must be sorted by frequency.
  • Why might MR have good performance?

    • Map and Reduce functions run in parallel on different workers
      • Nx workers -> divide run-time by N
    • But rarely quite that good:
      • move map output to reduce workers
      • stragglers
      • read/write network file system
  • What about failures?

    • People use MR with 1000s of workers and vast inputs
    • Suppose each worker crashes only once per year
      • With 1000 workers, that's about 1000/365 ≈ 3 crashes per day!
    • So a big MR job is very likely to suffer worker failures
    • Other things can go wrong:
      • Worker may be slow
      • Worker CPU may compute incorrectly
      • Master may crash
      • Parts of the network may fail, lose packets, etc.
      • Map or Reduce or framework may have bugs in software
  • Tools for dealing with failure?

    • retry -- if worker fails, run its work on another worker
    • replicate -- run each Map and Reduce on two workers
    • replace -- for long-term health
    • MapReduce uses all of these
  • Puzzles for retry

    • how do we know when to retry?
    • can we detect when Map or Reduce worker is broken?
    • can we detect incorrect worker output?
    • can we distinguish worker failure from worker up, network lossy?
    • why is retry correct?
    • what if Map produces some output, then crashes?
      • will we get duplicate output?
    • what if we end up with two of the same Map running?
    • in general, calling a function twice is not the same as calling it once
    • why is it OK for Map and Reduce?
  • Helpful assumptions

    • One must make assumptions, otherwise too hard
    • No bugs in software
    • No incorrect computation: worker either produces correct output,
      • or nothing -- assuming fail-stop.
    • Master doesn't crash
    • Map and Reduce are pure functions on their arguments
      • they don't secretly read/write files, talk to each other,
      • send/receive network messages, etc.
  • lab 1 has four parts:

    • Part I: do I/O for Map and Reduce
    • Part II: just Map() and Reduce() for word count
    • Part III: we give you most of a distributed multi-server framework,
      • you fill in the master code that hands out the work to a set of worker threads.
    • Part IV: make master cope with crashed workers by re-trying.
  • Part II: main/wc.go

    • stubs for Map and Reduce
    • you fill them out to implement word count
    • Map argument is a string, a big chunk of the input file

demo of solution to Part I

  ./wc master kjv12.txt sequential
  more mrtmp.kjv12.txt-1-2
  more mrtmp.kjv12.txt
  • Part I sequential framework: mapreduce/mapreduce.go RunSingle()

    • split, maps, reduces, merge
  • Part III parallel framework:

    • master
    • workers...
    • shared file system
    • our code splits the input before calling your master,
      • and merges the output after your master returns
    • our code only tells the master the number of map and reduce splits (jobs)
    • each worker sends Register RPC to master
      • your master code must maintain a list of registered workers
    • master sends DoJob RPCs to workers
      • if 10 map jobs and 3 workers,
      • send out 3, wait until one worker says it's done,
      • send it another, until all 10 done
    • then the same for reduces
    • master only needs to send job # and map vs reduce to worker
      • worker reads input from files
    • so your master code only needs to know the number of
      • map and reduce jobs!
      • which it can find from the "mr" argument
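
Here is a hedged sketch of the scheduling idea above: treat registered workers as a pool, hand each idle worker one job at a time, and retry a job on another worker if a call fails (which also previews Part IV). This is not the lab's actual API; callDoJob below is a stand-in for the lab's DoJob RPC:

  package mr // illustrative

  import "sync"

  // schedule hands out njobs jobs to the given workers, one at a time per
  // worker, retrying on another worker when a call fails. It assumes at
  // least one worker stays alive.
  func schedule(workers []string, njobs int, callDoJob func(worker string, job int) bool) {
      idle := make(chan string, len(workers)) // pool of idle workers
      for _, w := range workers {
          idle <- w
      }

      var wg sync.WaitGroup
      for job := 0; job < njobs; job++ {
          wg.Add(1)
          go func(job int) {
              defer wg.Done()
              for {
                  w := <-idle // wait for an idle worker
                  if callDoJob(w, job) {
                      idle <- w // success: return the worker to the pool
                      return
                  }
                  // failure: don't return this worker; loop and retry the
                  // same job on a different worker
              }
          }(job)
      }
      wg.Wait() // all map jobs done; the master then repeats this for reduces
  }
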
  • Thursday:

    • master and workers talk via RPC, which hides network complexity
    • more about RPC on Thursday
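
As a preview of Thursday, here is roughly what Go's standard net/rpc library looks like; the lab's RPC setup is a variation on this, and the Args/Reply/WordCounter names below are made up for the example:

  package main

  import (
      "fmt"
      "log"
      "net"
      "net/rpc"
      "strings"
  )

  // Args and Reply are hypothetical RPC argument/result types.
  type Args struct{ Line string }
  type Reply struct{ Words int }

  // WordCounter is a hypothetical RPC service with one remotely callable method.
  type WordCounter struct{}

  func (WordCounter) Count(args *Args, reply *Reply) error {
      reply.Words = len(strings.Fields(args.Line))
      return nil
  }

  func main() {
      // Server side: register the service and accept connections.
      rpc.Register(WordCounter{})
      l, err := net.Listen("tcp", "127.0.0.1:12345")
      if err != nil {
          log.Fatal(err)
      }
      go rpc.Accept(l)

      // Client side: connect and make a synchronous call, as if remote.
      client, err := rpc.Dial("tcp", "127.0.0.1:12345")
      if err != nil {
          log.Fatal(err)
      }
      var reply Reply
      err = client.Call("WordCounter.Count", &Args{Line: "a b c"}, &reply)
      if err != nil {
          log.Fatal(err)
      }
      fmt.Println(reply.Words) // prints 3
  }
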
  • Extra time:

    • Lab setup
    • Lab submission status
    • Git workflow
      • Explain about Gitlab account name in detail
    • tour.golang.org