CS6963 Distributed Systems

Lecture 07b Replicated State Machines

  • Objectives
    • Understand the concept of availability.
    • Understand how replication improves availability.
    • Understand how replication threatens consistency.
    • Understand how primary/backup replication breaks down under network partitions.
      • Cannot safely continue operation and guarantee both availability and consistency without operator intervention.
    • Understand that any completely "synchronous" replication will not be able to remain consistent and available.
    • Understand the need for replicated state machines and totally ordered operation logs to provide availability and consistency together.

Intro to the Problem that Consensus Solves

  • We want services that remain available even if machines crash or are partitioned away.
    • For this lecture partitions are network partitions, not splitting state.
    • Replicated key-value stores, databases, cluster filesystems, etc.
    • [quick diagram, see sequence below]
    • Network partition: a split in the network that separates away some nodes.
    • Note: this is indistinguishable from a crash.
    • We'll see this creates challenges.
    • Availability here means that operations can continue as if the crash/partition hadn't happened.
o - o   x - o   o x o
 \ /     \ /     x /
  o       o       o
  • We'd like to remain available in all these cases.

  • A key problem: split-brain (due to network partitions).

    • What if clients on one side of the split keep talking to some replicas?
    • On the other side, clients keep talking to other replicas.
    • Users see a "fork" in the state of the world.
    • e.g. There is one item left in stock.
    • [diagram this quickly using the above diagrams]
    • I buy a product that's in stock.
    • You buy a product that's in stock.
    • We were both on different sides of the partition =(
    • Once the network partition subsides the system will still be inconsistent.
  • Lab 2 'solved' this: how?

    • Stop servers on the 'wrong' side of the partition from operating.
    • If primary is partitioned away from the viewservice, then switch to the backup.

Lab 2 scenarios:

   V
 /   \
P --- B

Primary partitioned away:

   V
 x   \
P -x- B

OK - when the backup is promoted it won't accept any new ops from the old primary.
Clients on the "side" with the old primary won't function, but clients that can
reach the promoted backup keep working, and the state stays consistent.

Backup partitioned away:

   V
 /   x
P -x- B

Doesn't matter - clients are 'unaware' of the backup.

Primary and Backup partitioned away:

   V
 x   x
P --- B

Existing clients ok - they keep using old P/B.
New clients are stuck. Not so good.

P' -- B'
 \   /
   V
 x   x
P --- B

Problem if we move to new set of primaries/backups.
Clients could still be talking to old pair and ops would be succeeding.

  • Q: How does Lab 2 prevent this?

    • Force view to go from P as primary to B.
    • This guarantees that P can't continue to operate.
      • Since B rejects P's sync replication operations (see the sketch after this list).
    • But V can't reach B.
    • If it chooses a new P', then state may diverge.
    • If it doesn't choose a new P', then system hangs indefinitely.
      • Lab 2's fix is to hang in this case.
      • But unavailability when one node is down is what we want to avoid.
      • Here this is equivalent to the viewservice failing.
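
A minimal sketch (in Go; Forward, Viewnum, etc. are made-up names, not the actual
Lab 2 code) of why forcing the view forward fences off the old primary: the
promoted backup remembers the newest view it has heard from the viewservice and
rejects forwarded operations tagged with an older view, so the old primary's
synchronous replication fails and it can't complete client operations.

    package main

    import "fmt"

    type ForwardArgs struct {
        Viewnum uint // view the sender believes it is primary in
        Key     string
        Value   string
    }

    type Server struct {
        viewnum uint              // newest view this server has heard of
        data    map[string]string
    }

    // Forward is the backup-side handler for the primary's synchronous replication.
    func (s *Server) Forward(args ForwardArgs) error {
        // A forward from a primary stuck in an older view is rejected, so that
        // primary can never finish an operation on its own.
        if args.Viewnum < s.viewnum {
            return fmt.Errorf("stale view %d (current view %d)", args.Viewnum, s.viewnum)
        }
        s.data[args.Key] = args.Value
        return nil
    }

    func main() {
        b := &Server{viewnum: 2, data: map[string]string{}}            // promoted in view 2
        err := b.Forward(ForwardArgs{Viewnum: 1, Key: "x", Value: "1"}) // old primary, view 1
        fmt.Println(err) // stale view 1 (current view 2)
    }
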
  • Related/second problem: viewservice is still a single point of failure.

    • But if we try to use multiple machines to make it reliable, we have to solve the same problem again!!!
  • This may seem like an artifact of Lab 2 being a toy, but it's fundamental to primary/backup.

    • In practice, many primary/backup systems hang on partitions.
    • At least as often, they just result in silent data corruption.
    • One solution: have the operator guarantee that the systems that have lost track of the viewserver are dead (e.g. powered down).
      • Sinfonia relied on this.
    • Another common solution: have servers commit suicide automatically when they can't ping the viewservice.
      • This can work, but it's easy to get wrong (see the sketch after this list).
      • e.g. The thread that commits suicide gets delayed and RPC handlers continue a bit longer than they are supposed to.
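
A rough sketch of the self-destruct approach and the race it invites. The
watchdog loop, ping callback, and timeouts below are all hypothetical, but they
show the shape of the bug: a goroutine is supposed to stop the server once the
viewservice is unreachable, yet if that goroutine is delayed, the RPC handlers
keep answering a little past the deadline.

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    type Server struct {
        dead int32 // set to 1 once we give up on the viewservice
    }

    func (s *Server) alive() bool { return atomic.LoadInt32(&s.dead) == 0 }

    // watchdog pings the viewservice; after deadAfter of silence it marks the
    // server dead so the RPC handlers stop answering clients.
    func (s *Server) watchdog(ping func() error, deadAfter time.Duration) {
        lastOK := time.Now()
        for s.alive() {
            if ping() == nil {
                lastOK = time.Now()
            } else if time.Since(lastOK) > deadAfter {
                atomic.StoreInt32(&s.dead, 1) // "commit suicide"
            }
            time.Sleep(100 * time.Millisecond)
            // Danger: if this goroutine stalls (GC pause, scheduling), Get keeps
            // serving past the deadline -- handlers "continue a bit longer than
            // they are supposed to".
        }
    }

    func (s *Server) Get(key string) (string, error) {
        if !s.alive() {
            return "", fmt.Errorf("server has shut itself down")
        }
        return "value-for-" + key, nil
    }

    func main() {
        s := &Server{}
        // Simulate a viewservice that is never reachable.
        go s.watchdog(func() error { return fmt.Errorf("unreachable") }, time.Second)
        time.Sleep(1500 * time.Millisecond)
        fmt.Println(s.Get("x")) // rejected once the watchdog has fired
    }
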
  • Q: Is it even possible to solve this problem?

    • Must we choose between availability and consistency just to tolerate even one node failure?
    • Turns out a wicked genius solved this. This is Turing Award material.
    • Common problem: genius is a poor communicator.
    • Idea: use quorum majority to "vote" on operations.
      • [sketch this against the prior drawing]
      • Servers in the majority remain consistent.
      • When servers rejoin the majority, they catch up.
    • Tolerates f failures in a cluster of 2f + 1 servers (quorum arithmetic sketched after this list).
      • 3 servers can run with 1 down.
      • 5 with 2 down, etc.
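
A tiny sketch of the quorum arithmetic, assuming nothing beyond the 2f + 1 rule
above: a majority is f + 1 servers, so the cluster tolerates f failures, and any
two majorities overlap in at least one server.

    package main

    import "fmt"

    // majority returns the smallest quorum size for n servers.
    func majority(n int) int { return n/2 + 1 }

    func main() {
        for _, n := range []int{3, 5, 7} {
            f := (n - 1) / 2
            fmt.Printf("n=%d servers: majority=%d, tolerates f=%d failures\n",
                n, majority(n), f)
        }
        // Two majorities of size f+1 out of 2f+1 servers share at least
        // 2*(f+1) - (2f+1) = 1 server; that overlap is what keeps the
        // survivors consistent.
    }
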
  • Lots of mechanism to discuss before we get to the algorithms.

    • Why? Isn't it something like 2PC?
    • No. This 'catch-up' part is fundamental and adds an element of asynchrony.
    • Whenever anyone is separated, no matter how long, they need to be able to figure out what the majority agreed on after rejoining (catch-up sketched after this list).
      • If they can't they will remain inconsistent.
    • Another way to think of it: if we can't communicate with a down server we have to place the messages to it somewhere.
      • Q: Where can we place the messages that is guaranteed to be accessible and will survive until the server comes back?
      • A: On the available majority.
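
A hypothetical sketch of catch-up: a server that was cut off asks a member of
the available majority for the committed log entries it is missing and applies
them in slot order (the Entry type and missingEntries helper are made up for
illustration).

    package main

    import "fmt"

    type Entry struct {
        Slot int
        Op   string
    }

    // missingEntries stands in for an RPC to a member of the majority asking
    // for every committed entry at or after slot `from`.
    func missingEntries(peerLog []Entry, from int) []Entry {
        var out []Entry
        for _, e := range peerLog {
            if e.Slot >= from {
                out = append(out, e)
            }
        }
        return out
    }

    func main() {
        majorityLog := []Entry{{1, "Put(x,1)"}, {2, "Put(y,2)"}, {3, "Put(x,3)"}}
        nextSlot := 2 // the rejoining server has only applied slot 1
        for _, e := range missingEntries(majorityLog, nextSlot) {
            fmt.Printf("catching up: apply slot %d: %s\n", e.Slot, e.Op)
            nextSlot = e.Slot + 1
        }
    }
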
  • To summarize:

    • We use redundancy/replication to try to remain available after a failure.
    • But on a "failure", if we choose a new primary, the old one might still be operating.
    • We can't use a single leader/viewservice to solve this.
    • We get stuck in cases where the viewservice can't guarantee old primaries aren't operating.
    • The viewservice is a single point of failure anyway.
    • We're going to use a quorum majority voting scheme, where nodes figure out among themselves the operations that can be safely applied and the order they should be applied in.

Want to handle partitions safely and have no single point of failure.

The Pieces

First, we need to start with a few pieces of mechanism.

[draw diagram for RSMs from Slide]

  • We'll see three key pieces we can solve independently (interfaces sketched below).
    • How do we manage the real data/state on each server?
      • Servers may be at different "points in logical time" depending on network delays and partitions.
      • Replicated State Machines
    • How can we move the data/state forward in some total order consistent across all state machines?
      • Slotted operation log
    • How do we determine what operation should be in each log slot?
      • Quorum consensus: Paxos
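
One way to picture the three pieces before digging into each: a rough Go sketch
of their interfaces (names are made up for illustration, not the lab's API). The
state machine applies operations, the log gives each operation a slot in a total
order, and consensus decides which operation wins each slot.

    package rsm

    // Op is whatever operation the service supports (Put, Inc, ...).
    type Op struct {
        Name string
        Args []string
    }

    // StateMachine is the deterministic object being replicated.
    type StateMachine interface {
        Apply(op Op) (result string)
    }

    // Log is the slotted, totally ordered sequence of operations.
    type Log interface {
        Append(op Op) (slot int)
        Get(slot int) (op Op, decided bool)
    }

    // Consensus decides, per slot, which proposed operation the majority agreed on.
    type Consensus interface {
        Propose(slot int, op Op)
        Decided(slot int) (op Op, ok bool)
    }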

State-machine Replication

  • What is a replicated state machine?

    • Just an instance of a class or ADT with deterministic methods.
    • Examples:
      • Hashtable: buckets, Get(), Put()
      • Btree: nodes, Insert(), Remove(), Scan()
      • Counter: count, Inc()
      • Register: word, Read(), Write()
    • Key: methods are deterministic (see the sketch after this list).
    • Given many RSMs in the same initial state:
      • Applying operations in the same order with the same arguments will produce the same state and the same return values.
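
A minimal sketch of such a state machine: a tiny hashtable in Go (the Op
encoding is made up). Apply is deterministic, so two replicas given the same
operations in the same order end with the same buckets and return the same
values.

    package main

    import "fmt"

    type Op struct {
        Kind  string // "Get" or "Put"
        Key   string
        Value string
    }

    type KV struct {
        data map[string]string
    }

    func NewKV() *KV { return &KV{data: map[string]string{}} }

    // Apply is deterministic: no clocks, randomness, or I/O -- the result
    // depends only on the current state and the op.
    func (kv *KV) Apply(op Op) string {
        switch op.Kind {
        case "Put":
            kv.data[op.Key] = op.Value
            return ""
        case "Get":
            return kv.data[op.Key]
        }
        return ""
    }

    func main() {
        ops := []Op{{"Put", "x", "1"}, {"Get", "x", ""}, {"Put", "x", "2"}}
        a, b := NewKV(), NewKV()
        for _, op := range ops { // same ops, same order -> same state, same results
            fmt.Println(a.Apply(op) == b.Apply(op)) // always true
        }
    }
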
  • Idea: get machines to agree on a log of operations and feed it to the RSMs (sketched after this list)

    • Slotted log totally orders operations
    • Nodes can agree on the order and arguments
    • Once each node knows it agrees with the majority up through a prefix of the log,
    • feed that prefix to the RSM.
    • This keeps all the RSMs 'logically' in sync.
      • In practice, they may be operating at different rates.
      • Network delay/partitions, for example.
      • But this is precisely why we're using them. If someone gets separated, they have a clean way to rejoin and catch up.
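
A small sketch of feeding the agreed log to the RSM, assuming consensus has
already filled the slots: each replica applies slots in order, but only up
through the prefix it knows the majority agreed on, so a delayed replica simply
converges later.

    package main

    import "fmt"

    // Counter is the RSM being replicated.
    type Counter struct{ count int }

    func (c *Counter) Apply(op string) {
        if op == "Inc" {
            c.count++
        }
    }

    type Replica struct {
        log     []string // one op per slot; assumed already agreed by consensus
        applied int      // number of slots fed to the RSM so far
        sm      Counter
    }

    // advance feeds the RSM everything up through the agreed prefix.
    func (r *Replica) advance(agreedPrefix int) {
        for r.applied < agreedPrefix && r.applied < len(r.log) {
            r.sm.Apply(r.log[r.applied])
            r.applied++
        }
    }

    func main() {
        log := []string{"Inc", "Inc", "Inc"}
        fast := &Replica{log: log}
        slow := &Replica{log: log}
        fast.advance(3) // has heard the majority agreed through slot 3
        slow.advance(1) // lagging behind (delayed or partitioned)
        fmt.Println(fast.sm.count, slow.sm.count) // 3 1
        slow.advance(3)            // after catching up it reaches the same state
        fmt.Println(slow.sm.count) // 3
    }
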
  • Key question: how do we get nodes to agree on what's in the log?

    • That's where consensus comes in.

Paxos

  • Next class, you'll want to watch the Paxos video again.
  • It quickly covers what we talked about here.
  • Gets right into Basic Paxos.
  • Read Paxos Made Simple and make a good effort.
  • Really make sure the video clicks for you.
  • You'll be implementing a simplified version of Multi-paxos.
  • If you can get your head around Basic Paxos you'll be well on your way.