CS6963 Distributed Systems

Lecture 08 Consensus, Paxos

  • Objectives
    • Understand what problem Paxos solves.
    • Understand how (Basic) Paxos provides consistency despite the loss of a minority of servers.
    • Understand how Paxos can be extended to support general high-availability systems (Multi-Paxos and State Machine Replication).

Why Paxos?

  • Two key problems with Paxos

    • "Two phase" protocol where phase purposes are intertwined
    • Full detail needed for systems left as execercise to reader
    • It's easy to get these details wrong!
  • Complicated, so why Paxos?

    • Ubiquitous: the brain of many (most?) large scale datacenter systems.
    • One of the greatest/most important results of distributed systems.
    • Likely to encounter it in the future.
    • Needed for Lab 3.

Goal: Replicated Log (2)

  • State machine: state with deterministic methods.
  • Run the same state machine on all servers for reliability.
  • Deterministic: so feed same commands in same order to all state machines and they'll all have identical state at each logical point in time.
  • Why is the log needed? Why not feed each state machine synchronously?

    • Unavailable servers need to 'catch up' when they rejoin the majority.
    • Also gives an easy way to create a total order of commands.
  • Walkthrough:

    • Client sends command to a server.
    • Server records the operation in its log and uses the consensus module to replicate the operation to the logs of the other servers.
    • Once servers agree on a prefix of the log, that prefix can be fed through the state machines.
    • The state machine processes the commands and responds to the client.
  • Consensus module makes sure this replication happens safely.

    • And works even if a minority of the nodes are down.
    • e.g., a 5-node cluster works with 3 up.
  • Failure model: fail-stop/restart, arbitrary network partitions
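The walkthrough above can be illustrated with a toy replicated state machine. This is a sketch only: the class name, the command tuple format, and the key-value semantics are assumptions for illustration, not from the lecture.

```python
# Toy illustration of the replicated-log idea: every server applies the
# same chosen log prefix, in order, to a deterministic state machine.
# Names (KVStateMachine, apply) and command shapes are illustrative.

class KVStateMachine:
    """Deterministic key-value store driven by log commands."""
    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "put":
            self.state[key] = value
        return self.state.get(key)

# The consensus module's job is to get every server the same prefix...
chosen_prefix = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]

replicas = [KVStateMachine() for _ in range(3)]
for sm in replicas:
    for cmd in chosen_prefix:
        sm.apply(cmd)

# ...because determinism then guarantees identical state on every server.
assert all(sm.state == {"x": 3, "y": 2} for sm in replicas)
```

Determinism is doing the work here: the same commands in the same order produce the same state, so agreement on the log implies agreement on the state.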

The Paxos Approach (3)

  • Basic Paxos:
    • Nodes agree on a single value.
    • Note: this is not like a read/write register.
    • One-time use, monotonic.
    • Initially this doesn't seem that useful, so need Multi-Paxos

Requirements for Basic Paxos (4)

  • Safety: never do anything bad
    • Omits validity: system only chooses a value that has been proposed.
  • Liveness: eventually does something good

Paxos Components (5)

Strawman: Single Acceptor (6)

Problem: Split Votes (7)

  • (Could probably use more detail here on why majority quorum is a good idea.)
  • Q: Why a majority quorum?
    • It's the smallest possible set that is guaranteed to overlap with all others of the same size.
    • That is, if two operations each require a majority quorum and use maximally non-overlapping sets, they still can't miss each other's changes.
    • It's also 'small enough' that it tolerates some failures while remaining available.
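The overlap claim can be checked exhaustively for a small cluster; this brute-force check is just an illustration, not part of the lecture.

```python
# Check the quorum-overlap property for a 5-server cluster: any two
# majorities (3 of 5) must share at least one server.
from itertools import combinations

servers = range(5)
majority = len(servers) // 2 + 1   # 3 of 5

for q1 in combinations(servers, majority):
    for q2 in combinations(servers, majority):
        # Pigeonhole: 3 + 3 > 5, so the two sets cannot be disjoint.
        assert set(q1) & set(q2), "two majorities must overlap"
```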

  • Acceptors will have to change their mind to make this work.
    • Seems dangerous!?
  • Can't guarantee agreement in a single round.
  • Accepted != chosen

Problem: Conflicting Choices (8)

  • Two values chosen: violates the safety property of single-valuedness
  • Solution: servers must first 'look around' to see if there are other values out there that have already been chosen. If they find one, they can only propose that value rather than their own.
  • Creates a two-phase prepare/accept protocol:
    • First, find chosen values,
    • Second, propose a new value or the value found already chosen.
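The proposer's side of this two-phase rule can be sketched as a pure function over prepare replies. The reply format `(accepted_n, accepted_value)` and the function name are assumptions for illustration.

```python
# Sketch of the proposer's phase-1 decision, assuming each acceptor's
# prepare reply is (accepted_proposal_number, accepted_value), with
# value None if it has accepted nothing. Names are illustrative.

def choose_value(my_value, prepare_replies):
    """Adopt any already-accepted value (highest proposal number wins);
    only propose our own value if nothing was found."""
    accepted = [(n, v) for (n, v) in prepare_replies if v is not None]
    if accepted:
        return max(accepted)[1]   # value from the highest-numbered accept
    return my_value

# Majority reported nothing accepted: free to propose our own value.
assert choose_value("X", [(1, None), (2, None), (3, None)]) == "X"
# Some acceptor already accepted "Y": we are forced to propose "Y".
assert choose_value("X", [(1, None), (4, "Y"), (2, None)]) == "Y"
```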

Problem: Conflicting Choices, cont'd (9)

  • Still busted.
  • Even if s1 and s5 look around first, they see nothing accepted or chosen.
  • In the end we still end up choosing two different values again.
  • Need to order proposals and have acceptors reject old proposals.

    • Idea: use the 'lookaround' phase to order proposals.
    • 'Later' blue lookaround will be fatal to red's accepts.
  • Make point here: with this, once blue gets its value chosen, red's accept (at s3) is dead.

Proposal Numbers (10)
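A common construction for unique, totally ordered proposal numbers (assumed here; the slide may present it slightly differently) pairs a local round counter with the server id, compared lexicographically.

```python
# Proposal numbers as (round, server_id) tuples: the round dominates,
# and the server id breaks ties, so no two servers ever generate the
# same number. This construction is an assumption for illustration.

def proposal_number(round_num, server_id):
    return (round_num, server_id)

# A higher round always wins, regardless of server id.
assert proposal_number(2, 1) > proposal_number(1, 5)
# Within a round, the server id disambiguates.
assert proposal_number(2, 3) > proposal_number(2, 1)
assert proposal_number(1, 1) != proposal_number(1, 2)
```

A server that sees proposal number `(r, s)` can generate a strictly larger one by using round `r + 1` with its own id.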

Basic Paxos (11)

  • Prepare
    • Forces proposer to propose any already chosen value.
    • Blocks acceptors from accepting older proposals to prevent them from becoming chosen while this one is 'in-flight'.
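The two effects of prepare described above can be sketched as acceptor state plus two handlers. Field names (`min_proposal`, etc.) are illustrative, and proposal numbers are assumed to be totally ordered tuples.

```python
# Minimal sketch of an acceptor, assuming totally ordered proposal
# numbers. Field and method names are assumptions for illustration.

class Acceptor:
    def __init__(self):
        self.min_proposal = (0, 0)   # highest prepare promised so far
        self.accepted_n = None       # proposal number of accepted value
        self.accepted_value = None

    def prepare(self, n):
        # Promise to reject proposals older than n, and report any
        # accepted value so the proposer is forced to re-propose it.
        if n > self.min_proposal:
            self.min_proposal = n
        return (self.accepted_n, self.accepted_value)

    def accept(self, n, value):
        # Accept unless a newer prepare has been promised.
        if n >= self.min_proposal:
            self.min_proposal = n
            self.accepted_n, self.accepted_value = n, value
            return True
        return False

a = Acceptor()
a.prepare((1, 1))
assert a.accept((1, 1), "X")       # accepted
a.prepare((2, 5))                  # newer prepare blocks older proposals
assert not a.accept((1, 3), "Y")   # stale accept rejected: "in-flight"
                                   # proposal (2, 5) stays protected
```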

Basic Paxos (12)

  • Starts with client call and proposer wanting to choose a value.

Basic Paxos Examples (13)

  • Competing proposals to understand correctness
  • Focus on when second proposal prepares and enumerate cases
  • Explain notation

  • Majorities overlap, so s5 must see s3's accept.

    • s5 must propose X, so consensus sticks on a single value.

Basic Paxos Examples, cont'd (14)

  • Same as previous slide
  • Majorities overlap, so s5 must see s3's accept.
    • s5 must propose X, so consensus sticks on a single value.
  • Server must assume any accept it sees might be chosen, since it only issues proposals to a majority of the cluster.

Basic Paxos Examples, cont'd (15)

  • Prepare of s5 kills off the ability of s1 to get its proposal accepted.
  • This time Y is chosen, not X.
  • Competing proposers must overlap in at least one server, so they'll always 'see' each other.

  • End of safety discussion: does everyone feel OK with this so far?

    • If so, congratulations.

Liveness (16)

Other Notes (17)

  • Proposer might not even know!

Multi-Paxos (18)

Multi-Paxos Issues (19)

  • Ensuring full replication

    • Want log entries on all available servers, not just a majority
    • Have to repair logs that were partitioned away
    • How do servers find out which entries are chosen?
  • Lab 3 asks you to do things a bit differently.

Selecting Log Entries (20)

  • Assume s3 is offline.

Selecting Log Entries, cont'd (21)

Improving Efficiency (22)

Leader Election (23)

  • Unlikely that there are two leaders.
    • But safe even if there are.

Eliminating Prepares (24)

  • Idea: upon a prepare for the current log slot, return whether this acceptor has accepted anything into any later log slots.

    • If it hasn't, then the leader doesn't need to issue prepares to it anymore.
    • The proposal number covers the entire log.
  • Q: Lamport suggests something a bit different in Paxos Made Simple.

    • His is more of a range-based prepare phase.
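The idea from the slide can be sketched as the acceptor's side of a prepare reply; the flag name `no_more_accepted` and the dict-as-log representation are assumptions for illustration.

```python
# Sketch of the prepare-elimination idea: a prepare reply for a slot
# also reports whether the acceptor has accepted anything in any LATER
# slot. If it hasn't, the leader can skip future prepares to this
# acceptor and run only accept rounds under its proposal number.
# (Representation and names are assumptions.)

def prepare_reply(log, slot):
    """log maps slot -> accepted value; report the value at `slot` and
    a flag saying that everything after `slot` is empty."""
    accepted = log.get(slot)
    no_more_accepted = all(s <= slot for s in log)
    return accepted, no_more_accepted

log = {3: "X"}   # accepted "X" in slot 3, nothing later
assert prepare_reply(log, 3) == ("X", True)
assert prepare_reply(log, 5) == (None, True)
assert prepare_reply(log, 2) == (None, False)   # slot 3 is later than 2
```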

Full Disclosure (25)

  • 1/4 breaks down when leader crashes and gives up on forcing accepts

Full Disclosure (26)

  • 3/4: Why can't acceptor use firstUnchosenIndex from leader to mark all earlier entries than that as chosen in its log?

    • Because the leader doesn't know what the state of the acceptor's log is.
    • The acceptor could have gotten a partially replicated op from an old leader.
    • If the acceptor marked it chosen it might apply a different op than the rest of the cluster.
    • But the leader doesn't know whether the acceptor is up to date with respect to prior leaders.
  • 3/4: Can do this, but is it needed?

    • Yes. This is the normal case flow. It avoids extra Success messages.
    • On each accept, we get firstUnchosenIndex from the leader.
    • This lets each acceptor know that all earlier slots with the same proposal number are chosen.
  • Why not just use 4/4?

    • This will work but it'll generate extra messages in the normal case.
    • 3/4 is a way to piggyback chosen status to acceptors without having to send extra messages.
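The same-proposal-number condition is the whole trick, and it can be sketched directly; the log representation and function name here are assumptions for illustration.

```python
# Sketch of the 3/4 piggyback: an Accept carries the leader's
# firstUnchosenIndex, and the acceptor marks an earlier entry chosen
# ONLY if it holds the same proposal number as the incoming accept,
# which proves the entry came from this same leader's log.
# (Representation and names are assumptions.)

def on_accept(log, n, slot, value, first_unchosen):
    log[slot] = {"n": n, "value": value, "chosen": False}
    for i, entry in log.items():
        # Same proposal number => entry matches the leader's log, so
        # "before firstUnchosenIndex" safely implies chosen.
        if i < first_unchosen and entry["n"] == n:
            entry["chosen"] = True

log = {1: {"n": (1, 2), "value": "A", "chosen": False},
       2: {"n": (3, 1), "value": "B", "chosen": False}}
on_accept(log, (3, 1), slot=3, value="C", first_unchosen=3)
assert log[2]["chosen"]        # same proposal number: safe to mark chosen
assert not log[1]["chosen"]    # old leader's entry: leader can't vouch,
                               # so it waits for an explicit Success (4/4)
```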

Full Disclosure (27)

Client Protocol (28)

Client Protocol, cont'd (29)

  • Q: Is this correct?
    • What if a client request is long delayed?
    • Fix: remember all ids, make them monotonic, or require that the client specify a pair of (old, new) unique ids so the effect only happens if that unique transition hasn't occurred before.
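The baseline duplicate-detection scheme these fixes refine can be sketched as a state machine that caches the last result per client; the structure and names here are assumptions for illustration.

```python
# Sketch of exactly-once semantics via client-supplied request ids: the
# state machine remembers the last id executed per client and replays
# the cached result for duplicates instead of re-executing. The
# lecture's fixes (monotonic ids, (old, new) id pairs) harden this
# against long-delayed duplicates. (Names are assumptions.)

class DedupStateMachine:
    def __init__(self):
        self.counter = 0
        self.last = {}   # client -> (request_id, result)

    def execute(self, client, request_id):
        if client in self.last and self.last[client][0] == request_id:
            return self.last[client][1]   # duplicate: replay cached result
        self.counter += 1                 # the actual non-idempotent op
        self.last[client] = (request_id, self.counter)
        return self.counter

sm = DedupStateMachine()
assert sm.execute("c1", 7) == 1
assert sm.execute("c1", 7) == 1   # retry of the same request: no re-execute
assert sm.execute("c1", 8) == 2
```

The weakness the question points at: since only the *last* id per client is kept, a request delayed past a newer one from the same client would be re-executed.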

Configuration Changes (30)

  • α limits concurrency: can't choose entry i + α until entry i is chosen
    • Because we don't even know the set of servers in the cluster at i + α.
  • Q: What configuration changes make sense?
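The α window above amounts to a simple admission check on the proposer; this sketch and its names are assumptions for illustration.

```python
# Sketch of the α concurrency window: the configuration chosen at slot i
# governs slot i + α, so a proposer may not start slot s until slot
# s - α is chosen (otherwise the governing configuration is unknown).
# (Names and representation are assumptions.)

ALPHA = 3

def may_propose(slot, chosen_through):
    """chosen_through = highest slot such that all slots <= it are chosen."""
    return slot - ALPHA <= chosen_through

assert may_propose(5, chosen_through=2)        # config for slot 5 is known
assert not may_propose(6, chosen_through=2)    # must wait for slot 3
```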