CS6963 Distributed Systems

Lecture 04: Argus, Atomicity, and Two-Phase Commit

  • Lab 1
  • FDS
  • 2PC
  • Argus
  • Lab 2

  • Topics

    • distributed commit, two-phase commit
    • distributed transactions
    • Argus - language for distributed programming

Distributed commit and 2PC

  • The problem: how do we provide atomicity when multiple parties have to agree?
    • And when concurrent operations may affect whether we agree.
  • A bunch of computers are cooperating on some task, e.g. bank transfer
  • Each computer has a different role, e.g. src and dst bank account
  • Want to ensure atomicity: all execute, or none execute
    • "distributed transaction"
  • Challenges: crashes and network failures

    • What to do if part of a distributed computation crashes?
  • We want three properties for a distributed commit protocol

    • These properties are also known as "consensus"
    • This will come into play more as we talk about Raft and Paxos
      • Often called "consensus" protocols, though 2PC that we are talking about today is one also.
    • C, A, and B start in state "unknown"
    • Each can move to state "abort" or "commit"
    • But then each never changes mind
    • Agreement: all nodes decide on the same value.
      • If any commit, none abort.
      • If any abort, none commit.
    • Termination: all nodes eventually decide.
      • If no failures, and A and B can commit, then commit.
      • If failures, come to some conclusion ASAP?
    • Validity: the decided-on value must have been proposed by one of the nodes.
      • This will come into play when we get to generalized consensus.
      • (since doing nothing is correct...)
  • We're going to develop a protocol called "two-phase commit"

    • Used by distributed databases for multi-server transactions
    • And by Spanner and Argus
  • Simplest idea: single entity unilaterally decides whether to commit for all operations.

    • Problem 1: one node may not be able to track all state (distributed)
    • e.g. May not be possible to put all bank accounts on one node.
    • Problem 2: performance
    • One node may not be able to "decide" all transactions quickly enough.
      • Most operations probably don't even operate over related data
      • Hence don't need to coordinate with one another.
  • Next simple idea: single entity decides whether to commit, but with agreement of participants.

    • Prevents any chance of disagreement.
    • Call the Transaction Coordinator C.
    • Participants A and B.
    • C/A/B execute distributed commit protocol...
    • C reports "commit" or "abort" to client
  • Example:

    • Schedule a time to eat with friends.
    • Idea: Have one coordinator make the choice.
    • But it can't just dictate a time; people may have conflicts and not everyone would end up there at the same time.
    • "We are eating at 6 PM."
    • Keep in mind, this is true even if the coordinator could look at friends' schedules.
    • Idea: first get commitment to a tentative time ("prepare")
    • Each person reserves the tentative time in their calendar.
    • If anyone says no, start over.
    • Once everyone agrees, no one can back out.
    • Coordinator informs everyone of outcome ("commit")
    • This is a distributed commit protocol: 2PC.
  • Two-phase commit without failures:

    • [time diagram: client, C, A, B]
    • Client sends request to C.
    • C sends prepare messages to A and B.
    • A and B respond, saying whether they're willing to commit.
    • Respond "yes" if no conflicting operations, crashes, time outs.
    • If both say "yes", C sends "commit" messages.
    • If either says "no", C sends "abort" messages.
    • A/B "decide to commit" if they get a commit message.
    • i.e. they actually modify the user's calendar.
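    • A rough Go sketch of this happy-path coordinator flow (the Participant interface and its method names are made up for illustration; timeouts and the durable logging discussed below are omitted):

      package twopc

      // Participant is a stand-in for A or B; the names are illustrative.
      type Participant interface {
          Prepare() bool      // true means the participant votes "yes"
          Decide(commit bool) // deliver the final commit/abort decision
      }

      type Coordinator struct {
          participants []Participant
      }

      // RunTransaction asks every participant to prepare, then commits
      // only if all of them voted "yes"; otherwise it aborts.
      func (c *Coordinator) RunTransaction() bool {
          allYes := true
          for _, p := range c.participants {
              if !p.Prepare() {
                  allYes = false
              }
          }
          for _, p := range c.participants {
              p.Decide(allYes)
          }
          return allYes
      }
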
  • Why is this correct so far?

    • Neither can commit unless they both agreed.
    • Crucial that neither changes mind after responding to prepare
    • Not even if failure
  • What about failures?

    • Network broken/lossy
    • Server crashes
    • Both visible as timeout when expecting a message.
    • Crash models: fail-stop/fail-restart, Byzantine
  • Where do hosts wait for messages?

    • C waits for yes/no.
    • A and B wait for prepare and commit/abort.
  • Termination protocol summary:

    • C timeout for yes/no -> abort
    • B timeout for prepare -> abort
    • B timeout for commit/abort, B voted no -> abort
    • B timeout for commit/abort, B voted yes -> block
  • C timeout while waiting for yes/no from A/B.

    • C has not sent any "commit" messages.
    • So C can safely abort, and send "abort" messages.
  • A/B timeout while waiting for prepare from C

    • Have not yet responded to prepare
    • So can abort
    • Respond "no" to future prepare
  • A/B timeout while waiting for commit/abort from C.

    • Let's talk about just B (A is symmetric).
    • If B voted "no", it can unilaterally abort.
    • So what if B voted "yes"?
    • Can B unilaterally decide to abort?
    • No! C might have gotten "yes" from both,
    • and sent out "commit" to A, but crashed before sending to B.
    • So then A would commit and B would abort: incorrect.
    • B can't unilaterally commit, either:
    • A might have voted "no".
  • If B voted "yes", it must "block": wait for C decision.

    • Question should be echoing in your mind: What if C is dead and gone?
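    • A small Go sketch of these termination rules (state names are invented); the important case is the last one: a participant that voted "yes" has no safe unilateral choice and must wait for C:

      package twopc

      import "errors"

      // Possible participant states when a timeout fires.
      type pstate int

      const (
          waitingForPrepare pstate = iota // no prepare received yet
          votedNo                         // responded "no" to prepare
          votedYes                        // responded "yes" to prepare
      )

      var errMustBlock = errors.New("voted yes: must wait for C's decision")

      // onTimeout returns the unilateral decision when one is safe, or
      // errMustBlock when the participant has to keep waiting for C.
      func onTimeout(s pstate) (string, error) {
          switch s {
          case waitingForPrepare, votedNo:
              // Safe: C cannot have committed without this participant's "yes".
              return "abort", nil
          default: // votedYes
              // C may already have told another participant to commit.
              return "", errMustBlock
          }
      }
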
  • What if B crashes and restarts?

    • If B sent "yes" before crash, B must remember!
    • Can't change to "no" (and thus abort) after restart
    • Since C may have seen previous yes and told A to commit
    • Thus:
    • B must remember on disk before saying "yes", including modified data.
    • B reboots, disk says "yes" but no "commit", must ask C.
    • If C says "commit", copy modified data to real data.
  • What if C crashes and restarts?

    • If C might have sent "commit" or "abort" before crash, C must remember!
    • And repeat that if anyone asks (i.e. if A/B/client didn't get msg).
    • Thus C must write "commit" to disk before sending commit msgs.
    • Can't change mind since A/B/client have already acted.
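    • A Go sketch of this write-ahead rule (the log format here is invented, not from any paper): before a participant replies "yes" or C sends "commit", the corresponding record must be forced to disk so it survives a crash and restart:

      package twopc

      import "os"

      // logBeforeSend appends a record such as "yes txn42" or "commit txn42"
      // to a log file and forces it to disk. Only after it returns is it
      // safe to send the corresponding message.
      func logBeforeSend(path, record string) error {
          f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
          if err != nil {
              return err
          }
          defer f.Close()
          if _, err := f.WriteString(record + "\n"); err != nil {
              return err
          }
          return f.Sync() // the promise is durable once Sync returns
      }
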
  • This protocol is called "two-phase commit".

    • What properties does it have?
    • All hosts that decide reach the same decision.
    • No commit unless everyone says "yes".
    • C failure can make servers block until repair.
    • Key problem with 2PC: in the face of failures it may block forever (else moving forward might result in inconsistent outcomes between participants).
    • Happens when a participant and the coordinator are unavailable.
    • Coordinator and participant roles are usually fused (the person scheduling the dinner also attends it).
    • Hence, C crash generally means no progress.

Transactions

  • What about concurrent transactions?
    • We really want atomic distributed transactions,
    • not just single atomic commit.
    • x and y are bank balances
    • x and y start out as $10
    • T1 is doing a transfer of $1 from y to x
    • [code listing]
  T1:
    add(x, 1)  -- server A
    add(y, -1) -- server B
  T2:
    tmp1 = get(x)
    tmp2 = get(y)
    print tmp1, tmp2
  • Problem:

    • what if T2 runs between the two add() RPCs?
    • then T2 will print 11, 10
    • money will have been created!
    • T2 should print 10,10 or 11,9
  • The traditional approach is to provide "serializability"

    • results appear as if transactions ran one at a time in some order
    • either T1, then T2; or T2, then T1
  • Why serializability?

    • it allows transaction code to ignore the possibility of concurrency
    • just write the transaction to take system from one legal state to another
    • internally, the transaction can temporarily violate invariants
      • but serializability guarantees no-one will notice
    • Think of this as providing the same guarantees as every transaction running in a critical section.
      • Except the system is only enforcing order where you'd otherwise be able to notice.
  • One way to implement serializability is with "two-phase locking"

    • this is what Argus does
    • each database record has a lock
    • the lock is stored at the server that stores the record
    • no need for a central lock server
    • each use of a record automatically acquires the record's lock
    • thus add() handler implicitly acquires lock when it uses record x or y
    • locks are held until after commit or abort
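    • A minimal Go sketch of this per-record locking at one server (record layout and names are invented; it assumes the record exists and ignores re-locking by the same transaction and deadlock):

      package twopl

      import "sync"

      type record struct {
          lock  sync.Mutex
          value int
      }

      type Server struct {
          mu      sync.Mutex
          records map[string]*record
      }

      // Txn remembers which locks it holds so none are released early.
      type Txn struct {
          held []*record
      }

      // Add is what an add() RPC handler might do: implicitly take the
      // record's lock on first use, then apply the update. The lock is
      // kept until Finish.
      func (s *Server) Add(t *Txn, key string, delta int) {
          s.mu.Lock()
          r := s.records[key]
          s.mu.Unlock()
          r.lock.Lock() // blocks if another transaction holds this record
          t.held = append(t.held, r)
          r.value += delta
      }

      // Finish releases every lock, after the transaction commits or aborts.
      func (t *Txn) Finish() {
          for _, r := range t.held {
              r.lock.Unlock()
          }
          t.held = nil
      }
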
  • Why hold locks until after commit/abort?

    • Why not release as soon as done with the record?
    • Need all results to show up atomically.
    • e.g. why not have T2 release x's lock after first get()?
    • T1 could then execute between T2's get()s
    • T2 would print 10,9
    • but that is not a serializable execution: neither T1;T2 nor T2;T1
  • 2PC perspective

    • Used in sharded DBs when a transaction uses data on multiple shards.
    • But it has a bad reputation:
    • Slow because of multiple phases / message exchanges.
    • Locks are held over the prepare/commit exchanges.
    • C crash can cause indefinite blocking, with locks held
    • Thus usually used only in a single small domain
    • e.g. not between banks, not between airlines, not over wide area
  • Paxos and two-phase commit solve different problems!

    • Use Paxos to get high availability by replicating
    • i.e. to be able to operate when some servers are crashed
    • the servers must have identical state (to first approximation)
    • Use 2PC when each participant does something different
    • And all of them must do their part
    • 2PC does not help availability
    • since all servers must be up to get anything done
    • Paxos does not ensure that all servers do something (in real time)
    • Since only a majority have to be alive
    • Though, they will in the limit if they are eventually up
      • And they will according to "virtual time"
  • What if you want high availability and distributed commit?

    • [diagram]
    • Each "server" should be a Paxos-replicated service
    • And the TC should be Paxos-replicated
    • Run two-phase commit where each participant is a replicated service
    • Then you can tolerate failures and still make progress
    • This is what Spanner does (for update transactions)
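    • A Go-style sketch of this layering (the interface and names are invented): 2PC runs as before, but each vote is replicated through the participant's own consensus group before it is sent, so no single machine's crash blocks the transaction:

      package layered

      // ReplicatedGroup hides a Paxos/Raft-replicated service behind one call.
      type ReplicatedGroup interface {
          // Propose returns once a majority of replicas has durably logged op.
          Propose(op string) error
      }

      type Shard struct {
          group ReplicatedGroup
      }

      // Prepare replicates the "prepared" record before voting yes, so the
      // vote survives the crash of any single replica in the shard.
      func (s *Shard) Prepare(txnID string) bool {
          return s.group.Propose("prepared "+txnID) == nil
      }
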

Case study: Argus

  • Argus's big ideas:

    • Language support for distributed programs
    • Very cool: language abstracts away ugly parts of distributed systems
    • Aimed at services interacting via RPC
    • Clean handling of RPC and server failure
    • Transactional updates via 2PC
    • So crash results in entire transaction un-done, not partial update
    • Easy persistence ("stable"):
    • Ordinary variables automatically persisted to disk
    • Automatic crash recovery
    • Easy concurrency control:
    • Multiple clients means multiple distributed transactions
    • Automatic locking of language objects
  • The overall design story seems very sensible

    • Starting point: you want to handle RPC failures cleanly
    • Clean failure handling means Argus needs transactions
    • Transaction roll-back means Argus must manage program objects
    • Crash recovery means Argus must handle persisting program objects
  • Picture

    • "guardian" is like an RPC server
    • has state (variables) and handlers
    • "handler" is an RPC handler
    • reads and writes local variables
    • "action" is a distributed atomic transaction
    • action on A
    • A RPC to B
      • B RPC to C
    • A RPC to D
    • A finishes action
    • prepare msgs to B, C, D
    • commit msgs to B, C, D
  • The style is to send RPC to where the data is

    • Not to fetch the data
    • Argus is not a storage system
  • Look at bank example

    • page 309 (and 306): bank transfer
  • Points to notice

    • stable keyword (programmer never writes to disk &c)
    • atomic keyword (programmer almost never locks/unlocks)
    • enter topaction (in transfer)
    • coenter (in transfer)
    • RPCs are hidden (e.g. f.withdraw())
    • RPC error handling hidden (just aborts)
  • what if deposit account doesn't exist?

    • but f.withdraw(from) has already been called?
    • how to un-do?
    • what's the guardian state when withdraw() handler returns?
    • lock, temporary version, just in memory
  • what if an audit runs during a transfer?

    • how does the audit not see the tentative new balances?
  • if a guardian crashes and reboots, what happens to its locks?

    • can it just forget about pre-crash locks?
  • Subactions

    • each RPC is actually a sub-action
    • the RPC can fail or abort w/o aborting surrounding action
    • this lets actions e.g. try one server, then another
    • if RPC reply lost, subaction will abort, undo
    • much cleaner than e.g. Go RPC
  • Is Argus's implicit locking the right thing?

    • Very convenient!
    • Don't have to worry about forgetting to lock!
    • (though deadlocks are easy)
    • Databases work (and worked) this way; it's a successful idea
    • Why might it be less successful in a procedural language?
  • Is transactions + RPC + 2PC a good design point?

    • Programmability pro:
    • Very easy to get nice fault tolerance semantics
    • Performance con:
    • Lots of msgs and disk writes
    • 2PC and 2PL hold locks for a while, block if failure
  • Is Argus's language integration the right thing?

    • i.e. persisting and locking language objects
    • It looks very convenient (and it is)
  • Why didn't more systems pick up on Argus' language-based approach?

    • Java RMI is perhaps the closest in common use.
    • Akka is similar in many ways as well.
    • Perhaps people prefer to build distributed systems around data
    • Not around RPC
    • Stable data and computation usually decoupled.
      • e.g. big web sites are very storage-centric
    • Database provides transactions, persistence, etc.
    • Tables, records, and queries are more powerful than Argus' data
    • Maybe there is a better language-based scheme waiting to be found
  • Big questions

    • At the start, the paper claims Argus is well-suited to distributed programs, since parts of a program may be running while other parts have crashed.
      • Does it make good on this promise? Does it have features that would complicate this?
    • Any explicit support for replication?
    • Seems like you have to roll your own.
    • What about the edge cases in 2PC?
    • Does this seem like something you could use at small scale? Large scale?

Lab 2

  • Similar test framework to Lab 1 Part 2/3.
  • Code significantly different.
    • Conventional RPC client/server architecture.
    • Normal locking, less of a need for exotic concurrency.
    • But still highly concurrent!
  • Lab 2 diagram
  • Key/Value servers
    • Use primary/backup replication scheme.
    • All reads/writes are sent to the primary.
    • Primary forwards them to the backup to ensure consistency.
    • All servers ping the "viewservice" to signal that they are alive.
    • If backup fails, new backup chosen, seeded from primary.
    • If primary fails, backup is promoted, new backup is chosen
  • View service
    • Decides who is primary and who is backup.
    • On each ping, it returns the current view (see the sketch at the end of this section):
    • (ViewId, Primary Addr, Backup Addr)
    • Follows promotion rules.
    • failed backup -> new backup
    • failed primary -> backup to primary
    • Always wait for ack from new primary before advancing view.
      • Why?
      • This ensures that the old primary's backup won't accept writes from it.
      • Prevents split brain: clients can't read/write from deposed primaries while other clients talk to the new primary.
      • Creates a problem, though: what if new primary doesn't ack?
      • We're stuck, can't advance safely.
      • Paxos will help us fix this soon.
  • Lab in two parts: viewservice first, then KVS.
  • Sanity check: why don't we use 2PC for this lab?
    • Would 2PC make sense in a KVS?
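  • View/promotion sketch (referenced above) in Go; field and function names are illustrative, so check the lab skeleton for the real ones:

      package viewservice

      // View is the tuple the viewservice hands out on each ping.
      type View struct {
          ViewNum uint   // incremented every time the view changes
          Primary string // address of the current primary
          Backup  string // address of the current backup ("" if none)
      }

      // nextView applies the promotion rules when a server is declared dead,
      // but only if the primary of the current view has acked it; otherwise
      // the viewservice must stay put (the blocking case noted above).
      func nextView(cur View, dead string, primaryAcked bool, idle string) View {
          if !primaryAcked {
              return cur // stuck until the current primary acks
          }
          switch dead {
          case cur.Primary:
              return View{cur.ViewNum + 1, cur.Backup, idle} // promote backup
          case cur.Backup:
              return View{cur.ViewNum + 1, cur.Primary, idle} // replace backup
          }
          return cur
      }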