CS6963 Distributed Systems

Lecture 04: GFS, primary-backup replication, fault-tolerance, and consistency

  • The Google File System
  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
  • SOSP 2003

  • Why are we reading this paper?

    • the file system for map/reduce
    • case study of handling storage failures
      • trading consistency for simplicity and performance
      • motivation for subsequent designs
    • good performance -- in particular, huge parallel I/O throughput
    • good systems paper -- details from apps all the way to network
    • all main themes of class show up in this paper
      • performance, fault-tolerance, consistency
  • What is consistency?

    • A correctness condition
    • Important when data is replicated and concurrently accessed by applications
      • if an application performs a write, what will a later read observe?
      • what if the read is from a different application?
    • Weak consistency
      • read() may return stale data --- not the result of the most recent write
    • Strong consistency
      • read() always returns the data from the most recent write()
    • General trade-off:
      • strong consistency is nice for application writers
      • strong consistency is bad for performance
    • Many correctness conditions (often called consistency models)
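
    A minimal sketch in Go of the difference between the two guarantees;
    the single-register types here are illustrative, not any real API:

        package main

        import "fmt"

        // A register with one copy of the data: reads trivially return
        // the most recent write (the "strong" behavior).
        type Register struct{ value string }

        func (r *Register) Write(v string) { r.value = v }
        func (r *Register) Read() string   { return r.value }

        func main() {
            primary := &Register{}
            backup := &Register{} // a replica that lags behind the primary

            primary.Write("v1") // application performs a write...

            // Strong consistency: a later read observes that write.
            fmt.Println(primary.Read()) // "v1"

            // Weak consistency: a read served by the lagging replica
            // returns stale data -- not the most recent write.
            fmt.Println(backup.Read()) // "" (stale)
        }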
  • History of consistency models

    • Much independent development in architecture, systems, and database communities
      • Concurrent processors with private caches accessing a shared memory
      • Concurrent clients accessing a distributed file system
      • Concurrent transactions on distributed database
    • Many different models with different trade-offs
      • serializability
      • sequential consistency
      • linearizability
      • entry consistency
      • release consistency
      • ....
    • Today is a first peek; consistency will show up in almost every paper we read this term
  • "Ideal" consistency model

    • A replicated file system behaves like a non-replicated file system
      • picture: many clients on the same machine accessing files on a single disk
    • If one application writes, later reads will observe that write
    • What if two applications concurrently write to the same file?
      • In file systems often undefined --- file may have some mixed content
    • What if two applications concurrently write to the same directory?
      • One goes first, the other goes second
  • Sources of inconsistency

    • Concurrency
    • Machine failures
    • Network partitions
  • Example from GFS paper:

    • primary is partitioned from backup B
    • client 1 appends 1
    • primary sends the append to itself and backup A
    • primary reports failure to client 1 (backup B never acknowledged)
    • meanwhile client 2 may read from backup B and observe the old value
  • Why is the ideal difficult to achieve in a distributed file system?

    • Protocols can become complex --- see next week
      • Difficult to implement system correctly
    • Protocols require communication between clients and servers
      • May cost performance
  • GFS designers give up on ideal to get better performance and simpler design

    • Can make life of application developers harder
      • applications observe behaviors that cannot occur in an ideal system
      • e.g., reading stale data
      • e.g., duplicate append records
      • But the data isn't your bank account, so maybe ok
    • Today's paper is an example of the struggle between:
      • consistency
      • fault-tolerance
      • performance
      • simplicity of design
  • GFS goal

    • create a shared file system
    • across hundreds or thousands of (commodity, Linux-based) physical machines, to store massive data sets
  • What does GFS store?

    • authors don't actually say
    • guesses for 2003:
      • search indexes & databases
      • all the HTML files on the web
      • all the images on the web
      • ...
  • Properties of files:

    • Multi-terabyte data sets
    • Many of the files are large
    • Authors suggest 1M files x 100 MB = 100 TB
      • but that was in 2003
    • Files are generally append only
  • Central challenge:

    • With so many machines, failures are common
      • assume a machine fails once per year
      • w/ 1000 machines, ~3 will fail per day (1000 / 365 ≈ 2.7)
    • High-performance: many concurrent readers and writers
      • Map/Reduce jobs read and store final result in GFS
      • Note: not the temporary, intermediate files
    • Use network efficiently
  • High-level design

    • Directories, files, names, open/read/write
      • But not POSIX
    • 100s of Linux chunk servers with disks
      • store 64MB chunks (an ordinary Linux file for each chunk)
      • each chunk replicated on three servers
      • Q: why 3x replication?
      • Q: Besides availability of data, what does 3x replication give us?
        • load balancing for reads to hot files
        • affinity
    • Q: why not just store one copy of each file on a RAID'd disk?
      • RAID isn't commodity
      • Want fault-tolerance for whole machine; not just storage device
    • Q: why are the chunks so big?
      • reduces per-chunk metadata at the master and client-master traffic
    • GFS master server knows the directory hierarchy (sketched below)
      • for a dir, what files are in it
      • for a file, which chunk servers hold each 64 MB chunk
    • master keeps this state in memory
      • 64 bytes of metadata per chunk
    • master has private recoverable database for metadata
      • master can recover quickly from power failure
    • shadow masters that lag a little behind master
      • can be promoted to master
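
    A rough sketch, in Go, of the master state described above; GFS itself
    is C++, and all names here are guesses for illustration:

        package gfs

        type ChunkHandle uint64

        type ChunkInfo struct {
            Version uint64   // persisted; used to detect stale replicas
            Servers []string // replica locations; NOT persisted, rebuilt
                             // at recovery by asking the chunk servers
            Primary string   // current lease holder, if any
        }

        type Master struct {
            namespace  map[string][]string        // dir -> entries (persisted via log)
            fileChunks map[string][]ChunkHandle   // file -> handle per 64 MB chunk (persisted via log)
            chunks     map[ChunkHandle]*ChunkInfo // ~64 bytes of metadata per chunk, in memory
        }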
  • Basic operation

    • client read:
      • send file name and offset to master
      • master replies with set of servers that have that chunk
      • clients cache that information for a little while
      • ask nearest chunk server
    • client write:
      • ask master where to store
      • master may choose a new set of chunk servers if the write crosses a 64 MB chunk boundary
      • one chunk server is primary
      • it chooses the order of updates and forwards them to the two backups
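
    A minimal sketch of the client read path; askMaster, readChunk, and
    nearest are hypothetical stand-ins for the real RPCs:

        package gfs

        const chunkSize = 64 << 20 // 64 MB

        func askMaster(file string, chunkIndex int64) (handle uint64, servers []string, err error) {
            // RPC to master; the client caches the reply for a little while
            return 0, nil, nil
        }

        func readChunk(server string, handle uint64, off int64, n int) ([]byte, error) {
            // RPC to one chunk server
            return nil, nil
        }

        func nearest(servers []string) string { return servers[0] } // closest by network topology

        func Read(file string, offset int64, n int) ([]byte, error) {
            handle, servers, err := askMaster(file, offset/chunkSize)
            if err != nil {
                return nil, err
            }
            return readChunk(nearest(servers), handle, offset%chunkSize, n)
        }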
  • Two different fault-tolerance plans

    • One for master
    • One for chunk servers
  • Master fault tolerance

    • Single master
      • Clients always talk to master
      • Master orders all operations
    • Stores limited information persistently
      • name spaces (directories)
      • file-to-chunk mappings
    • Record changes to these two in a log
      • log is replicated on several backups
      • client operations that modify state return only after the changes are recorded in the log
      • logs play a central role in many systems we will read about
      • logs play a central role in labs
    • Limiting the size of the log (see the sketch at the end of this section)
      • Make a checkpoint of the master state
      • Remove all log entries from before the checkpoint
      • Checkpoint is replicated to backups
    • Recovery
      • replay log starting from last checkpoint
      • chunk location information is recreated by asking chunk servers
    • Master is single point of failure
      • recovery is fast, because master state is small
      • so maybe unavailable for short time
      • shadow masters
        • lag behind master
        • they replay from the log that is replicated
        • can serve read-only operations, but may return stale data
      • if the master cannot recover, a master is started somewhere else
        • must be done with great care to avoid two masters
    • We will see schemes with stronger guarantees, but more complex
      • see next few lectures
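
    A sketch of the log-plus-checkpoint pattern itself, against a generic
    in-memory state; this is the general technique (as in the labs), not
    GFS's actual code:

        package master

        type Op struct{ Name, Args string } // a logged metadata mutation

        type State struct{ /* namespace, file->chunk map, ... */ }

        func (s *State) Apply(op Op)      { /* mutate metadata */ }
        func (s *State) Snapshot() []byte { return nil /* serialize state */ }
        func (s *State) Restore(b []byte) { /* deserialize state */ }

        type Master struct {
            state State
            log   []Op // in reality: written to disk and replicated to backups
        }

        // Mutations reply to the client only after the op is durably logged.
        func (m *Master) Mutate(op Op) {
            m.log = append(m.log, op) // must be durable (and replicated) first
            m.state.Apply(op)
        }

        // Checkpoint: snapshot the state; log entries before it can be dropped.
        func (m *Master) Checkpoint() []byte {
            snap := m.state.Snapshot()
            m.log = nil // truncate the log
            return snap // replicated to backups
        }

        // Recovery: load the last checkpoint, then replay the log suffix.
        func Recover(checkpoint []byte, suffix []Op) *Master {
            m := &Master{}
            m.state.Restore(checkpoint)
            for _, op := range suffix {
                m.state.Apply(op)
            }
            return m
        }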
  • Chunk fault tolerance

    • Master grants a chunk lease to one of the replicas
      • That replica is the primary chunk server
    • Primary determines the order of operations
    • Client pushes data to the replicas
      • Replicas form a chain
      • Chain respects network topology
      • Allows fast replication
    • Client sends write request to primary (see the sketch after this list)
      • Primary assigns sequence number
      • Primary applies change locally
      • Primary forwards request to replicas
      • Primary responds to client after receiving acks from all replicas
    • If one replica doesn't respond, client retries
    • Master re-replicates chunks if the number of replicas drops below a threshold
    • Master rebalances replicas
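
    A sketch of the primary's side of this protocol, with hypothetical
    helpers; it assumes the data was already pushed along the replica chain:

        package chunkserver

        type WriteReq struct {
            Handle uint64
            DataID string // names data already buffered at every replica
            Off    int64
        }

        type Primary struct {
            nextSeq  int64
            replicas []string // the backup chunk servers
        }

        func (p *Primary) HandleWrite(req WriteReq) error {
            p.nextSeq++ // the primary picks a single order for all mutations
            seq := p.nextSeq
            applyLocally(req, seq) // apply in sequence-number order
            for _, r := range p.replicas {
                if err := forward(r, req, seq); err != nil {
                    return err // a missing ack -> error back to the client
                }
            }
            return nil // ack the client only after all replicas acked
        }

        func applyLocally(req WriteReq, seq int64)                  {}
        func forward(replica string, req WriteReq, seq int64) error { return nil }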
  • Consistency of chunks

    • Some chunks may get out of date
      • they miss mutations
    • Detect stale data with chunk version number
      • before handing out a lease, the master
        • increments the chunk version number
        • sends it to the primary and backup chunk servers
      • master and chunk servers store version persistently
    • Master also sends the version number to the client
    • Version number allows master and client to detect stale replicas
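
    A sketch of stale-replica detection with version numbers; the types
    and names are assumptions for illustration:

        package gfs

        type ReplicaInfo struct {
            Server  string
            Version uint64 // stored persistently at that chunk server
        }

        // Before granting a lease, the master bumps the version and tells
        // the primary and backups; a replica that misses this is stale.
        func grantLease(masterVersion *uint64, replicas []ReplicaInfo) uint64 {
            *masterVersion++
            // ...send the new version to all replicas of the chunk...
            return *masterVersion
        }

        // Knowing the current version, master or client can filter out
        // stale replicas before reading.
        func freshReplicas(current uint64, replicas []ReplicaInfo) []ReplicaInfo {
            var fresh []ReplicaInfo
            for _, r := range replicas {
                if r.Version == current {
                    fresh = append(fresh, r)
                }
            }
            return fresh
        }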
  • Concurrent writes/appends

    • clients may write to the same region of a file concurrently
    • the result is some mix of those writes--no guarantees
      • few applications do this anyway, so it is fine
      • concurrent writes on Unix can also result in a strange outcome
    • many clients may want to append concurrently to, e.g., a log file
      • GFS supports atomic, at-least-once append (see the retry sketch below)
      • the primary chunk server chooses the offset where to append a record
      • sends it to all replicas.
      • if it fails to contact a replica, the primary reports an error to client
      • client retries; if retry succeeds:
        • some replicas will have the record twice (the ones that succeeded the first time)
      • the file may have a "hole" (padding) too
        • GFS pads to the chunk boundary when an append would cross it
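
    A client-side sketch of record append; appendOnce is a hypothetical
    RPC to the primary. At-least-once semantics fall directly out of the
    retry loop:

        package gfs

        func appendOnce(file string, rec []byte) (offset int64, err error) {
            // primary: choose the offset, apply locally, forward to all
            // replicas; if any replica fails to ack, return an error
            return 0, nil
        }

        // RecordAppend retries until the primary reports success. If an
        // earlier attempt already succeeded at some replicas, those
        // replicas hold the record twice -- hence "at-least-once".
        func RecordAppend(file string, rec []byte) (int64, error) {
            for {
                if off, err := appendOnce(file, rec); err == nil {
                    return off, nil
                }
                // retry: may duplicate the record at replicas that
                // already applied it
            }
        }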
  • Consistency model

    • Strong consistency for directory operations
      • Master performs changes to metadata atomically
      • Directory operations follow the "ideal"
      • But when the master is off-line, only shadow masters are available
      • They serve read-only operations only, which may return stale data
    • Weak consistency for chunk operations
      • A failed mutation leaves chunks inconsistent
        • e.g., the primary chunk server updated its chunk
        • but then failed, and the replicas are out of date
      • A client may read a not-up-to-date chunk from a stale replica
      • When the client refreshes its cached chunk information, it will learn the new version #
    • Authors claim weak consistency is not a big problem for apps
      • Most file updates are append-only updates
      • Application can use a UID in append records to detect duplicates (sketched below)
      • Application may just read less data (but not stale data)
      • Application can use temporary files and atomic rename
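
    A sketch of the duplicate-detection idea: writers embed a UID in each
    record, and readers drop records they have already seen. The record
    layout is an assumption, not from the paper:

        package app

        type Record struct {
            UID  string // unique per logical record, chosen by the writer
            Data []byte
        }

        // Dedup filters out records duplicated by retried appends.
        func Dedup(records []Record) []Record {
            seen := make(map[string]bool)
            var out []Record
            for _, r := range records {
                if seen[r.UID] {
                    continue // duplicate from an at-least-once append
                }
                seen[r.UID] = true
                out = append(out, r)
            }
            return out
        }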
  • Performance (Figure 3)

    • huge aggregate throughput for read (3 copies, striping)
      • 125 MB/sec in aggregate
      • Close to saturating network
    • writes to different files lower than possible maximum
      • authors blame their network stack
      • it causes delays in propagating chunks from one replica to next
    • concurrent appends to single file
      • limited by the server that stores the last chunk of the file
  • Summary

    • Important FT techniques used by GFS
      • Logging & checkpointing
      • Primary-backup replication for chunks
      • but with relaxed consistency
      • We will see these in many other systems
    • what works well in GFS?
      • huge sequential reads and writes
      • appends
      • huge throughput (3 copies, striping)
      • fault tolerance of data (3 copies)
    • what works less well in GFS?
      • fault-tolerance of master
      • small files (master is a bottleneck)
      • concurrent updates to same file from many clients (except appends)