Lecture 04 Argus, Atomicity, and Two-Phase Commit
- Lab 1
- FDS
- 2PC
- Argus
Lab 2
Topics
- distributed commit, two-phase commit
- distributed transactions
- Argus - language for distributed programming
Distributed commit and 2PC
- The problem: how to provide atomicity when multiple parties have to agree?
- And when concurrent operations may affect whether we agree.
- A bunch of computers are cooperating on some task, e.g. bank transfer
- Each computer has a different role, e.g. src and dst bank account
- Want to ensure atomicity: all execute, or none execute
- "distributed transaction"
Challenges: crashes and network failures
- What to do if part of a distributed computation crashes?
We want three properties for distributed commit protocol
- These properties are also known as "consensus"
- This will come into play more as we talk about Raft and Paxos
- Often called "consensus" protocols, though 2PC that we are talking
about today is one also.
- C, A, and B start in state "unknown"
- Each can move to state "abort" or "commit"
- But then each never changes mind
- Agreement: all nodes decide on the same value.
- If any commit, none abort.
- If any abort, none commit.
- Termination: all nodes eventually decide.
- If no failures, and A and B can commit, then commit.
- If failures, come to some conclusion ASAP?
- Validity: the decided on value must have been proposed by one of the nodes.
- This will come into play when we get to generalized consensus.
- (since doing nothing is correct...)
We're going to develop a protocol called "two-phase commit"
- Used by distributed databases for multi-server transactions
- And by Spanner and Argus
Simplest idea: single entity unilaterally decides whether to commit for all
operations.
- Problem 1: one node may not be able to track all state (distributed)
- e.g. May not be possible to put all bank accounts on one node.
- Problem 2: performance
- One node may not be able to "decide" all transactions quickly enough.
- Most operations probably don't even operate over related data
- Hence don't need to coordinate with one another.
Next simple idea: single entity decides whether to commit, but with agreement
of participants.
- Prevents any chance of disagreement.
- Call the Transaction Coordinator C.
- Participants A and B.
- C/A/B execute distributed commit protocol...
- C reports "commit" or "abort" to client
Example:
- Schedule a time to eat with friends.
- Idea: Have one coordinator make the choice.
- But can't dictate directly, may not end up there at the same time.
- "We are eating at 6 PM."
- Keep in mind, this is true even if the coordinator could look at friend's
schedules.
- Idea: first get commitment to a tentative time ("prepare")
- Each person reserves the tentative time in their calendar.
- If anyone says no, start over.
- Once everyone agrees, no one can back out.
- Coordinator informs everyone of outcome ("commit")
- This is a distributed commit protocol: 2PC.
Two-phase commit without failures:
- [time diagram: client, C, A, B]
- Client sends request to C.
- C sends "prepare" messages to A and B.
- A and B respond, saying whether they're willing to commit.
- Respond "yes" if no conflicting operations, crashes, or timeouts.
- If both say "yes", C sends "commit" messages.
- If either says "no", C sends "abort" messages.
- A/B "decide to commit" if they get a commit message.
- i.e. they actually modify the user's calendar.
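The failure-free exchange above can be sketched in a few lines of Python. The class and method names (Coordinator, Participant, prepare, decide) are illustrative, not from any real codebase:

```python
# Minimal sketch of failure-free two-phase commit.

class Participant:
    def __init__(self, willing=True):
        self.willing = willing      # any conflict/crash/timeout => not willing
        self.state = "unknown"

    def prepare(self):
        # Phase 1: vote, and never change this vote afterwards.
        return "yes" if self.willing else "no"

    def decide(self, outcome):
        # Phase 2: adopt the coordinator's decision.
        self.state = outcome

class Coordinator:
    def run(self, participants):
        votes = [p.prepare() for p in participants]               # phase 1
        outcome = "commit" if all(v == "yes" for v in votes) else "abort"
        for p in participants:                                    # phase 2
            p.decide(outcome)
        return outcome
```

If both participants vote "yes" the outcome is "commit"; a single "no" forces "abort" for everyone, which is the agreement property.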
Why is this correct so far?
- Neither can commit unless they both agreed.
- Crucial that neither changes mind after responding to prepare
- Not even if failure
What about failures?
- Network broken/lossy
- Server crashes
- Both visible as timeout when expecting a message.
- Crash models: fail-stop/fail-restart, Byzantine
Where do hosts wait for messages?
- C waits for yes/no.
- A and B wait for prepare and commit/abort.
Termination protocol summary:
- C timeout for yes/no -> abort
- B timeout for prepare -> abort
- B timeout for commit/abort, B voted no -> abort
- B timeout for commit/abort, B voted yes -> block
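The termination rules above fit in a single function; "block" means the participant must wait for the coordinator's decision. Names and string constants here are illustrative:

```python
# The 2PC termination rules, encoded as a decision function.

def on_timeout(role, waiting_for, voted=None):
    if role == "C" and waiting_for == "yes/no":
        return "abort"       # C never sent "commit", so aborting is safe
    if role == "B" and waiting_for == "prepare":
        return "abort"       # B hasn't voted yet, so aborting is safe
    if role == "B" and waiting_for == "commit/abort":
        # After voting "yes", B can't decide unilaterally: it must block.
        return "abort" if voted == "no" else "block"
    raise ValueError("unknown case")
```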
C timeout while waiting for yes/no from A/B.
- C has not sent any "commit" messages.
- So C can safely abort, and send "abort" messages.
A/B timeout while waiting for prepare from C
- Have not yet responded to prepare
- So can abort
- Respond "no" to future prepare
A/B timeout while waiting for commit/abort from TC.
- Let's talk about just B (A is symmetric).
- If B voted "no", it can unilaterally abort.
- So what if B voted "yes"?
- Can B unilaterally decide to abort?
- No! C might have gotten "yes" from both,
- and sent out "commit" to A, but crashed before sending to B.
- So then A would commit and B would abort: incorrect.
- B can't unilaterally commit, either:
- A might have voted "no".
If B voted "yes", it must "block": wait for C decision.
- Question should be echoing in your mind: What if C is dead and gone?
What if B crashes and restarts?
- If B sent "yes" before crash, B must remember!
- Can't change to "no" (and thus abort) after restart
- Since C may have seen previous yes and told A to commit
- Thus:
- B must remember on disk before saying "yes", including modified data.
- B reboots, disk says "yes" but no "commit", must ask C.
- If C says "commit", copy modified data to real data.
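A sketch of the participant-side logging rule: record the "yes" vote and the tentative data durably before replying, and on reboot consult the log; a dict stands in for the disk, and all names are made up for illustration:

```python
# Sketch of participant crash recovery via a durable vote record.

class ParticipantLog:
    def __init__(self):
        self.disk = {}      # stands in for stable storage

    def vote_yes(self, txn, tentative):
        # Must reach "disk" BEFORE the "yes" message is sent.
        self.disk[txn] = {"vote": "yes", "tentative": tentative,
                          "outcome": None}
        return "yes"

    def record_outcome(self, txn, outcome, data):
        entry = self.disk[txn]
        entry["outcome"] = outcome
        if outcome == "commit":
            data.update(entry["tentative"])   # install tentative writes

    def recover(self, txn):
        entry = self.disk.get(txn)
        if entry is None:
            return "abort"                    # never voted "yes": safe to abort
        if entry["outcome"] is not None:
            return entry["outcome"]           # repeat the recorded decision
        return "ask coordinator"              # voted "yes", outcome unknown
```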
What if C crashes and restarts?
- If C might have sent "commit" or "abort" before crash, C must remember!
- And repeat that if anyone asks (i.e. if A/B/client didn't get msg).
- Thus C must write "commit" to disk before sending commit msgs.
- Can't change mind since A/B/client have already acted.
This protocol is called "two-phase commit".
- What properties does it have?
- All hosts that decide reach the same decision.
- No commit unless everyone says "yes".
- C failure can make servers block until repair.
- Key problem with 2PC: in the face of failures it may block forever (else
moving forward might result in inconsistent outcomes between participants).
- Happens when a participant and the coordinator are unavailable.
- Coordinator and participant roles are usually fused (the person scheduling
  the dinner is also attending it)
- Hence, C crash generally means no progress.
Transactions
- What about concurrent transactions?
- We really want atomic distributed transactions,
- not just single atomic commit.
- x and y are bank balances
- x and y start out as $10
- T1 is doing a transfer of $1 from x to y
- [code listing]
T1:
add(x, 1) -- server A
add(y, -1) -- server B
T2:
tmp1 = get(x)
tmp2 = get(y)
print tmp1, tmp2
Problem:
- what if T2 runs between the two add() RPCs?
- then T2 will print 11, 10
- money will have been created!
- T2 should print 10,10 or 9,11
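The anomaly can be reproduced directly; without concurrency control, T2 can run between T1's two add() RPCs and observe the intermediate state:

```python
# Interleaving T2 between T1's two add() RPCs (no locking).

balances = {"x": 10, "y": 10}

def add(key, n):
    balances[key] += n

def get(key):
    return balances[key]

# T1 starts: add $1 to x...
add("x", 1)
# ...T2 runs in the gap and reads both balances...
snapshot = (get("x"), get("y"))     # sees $21 total: money created!
# ...then T1 finishes.
add("y", -1)
```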
The traditional approach is to provide "serializability"
- results appear as if transactions ran one at a time in some order
- either T1, then T2; or T2, then T1
Why serializability?
- it allows transaction code to ignore the possibility of concurrency
- just write the transaction to take system from one legal state to another
- internally, the transaction can temporarily violate invariants
- but serializability guarantees no-one will notice
- Think of this as providing the same guarantees as every transaction running
in a critical section.
- Except the system is only enforcing order where you'd otherwise be able
to notice.
One way to implement serializability is with "two-phase locking"
- this is what Argus does
- each database record has a lock
- the lock is stored at the server that stores the record
- no need for a central lock server
- each use of a record automatically acquires the record's lock
- thus add() handler implicitly acquires lock when it uses record x or y
- locks are held until after commit or abort
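A sketch of these rules, with `threading.Lock` standing in for the per-record lock at the server that stores the record; the class and method names are invented for illustration:

```python
# Sketch of two-phase locking: implicit lock on first use,
# released only at commit/abort.

import threading

class LockedStore:
    def __init__(self, data):
        self.data = dict(data)
        self.locks = {k: threading.Lock() for k in self.data}

    def begin(self):
        return set()                      # lock set held by this transaction

    def get(self, held, key):
        self._acquire(held, key)
        return self.data[key]

    def add(self, held, key, n):
        self._acquire(held, key)
        self.data[key] += n

    def _acquire(self, held, key):
        if key not in held:               # implicit acquisition on first use
            self.locks[key].acquire()
            held.add(key)

    def commit(self, held):
        for key in held:                  # release only after commit/abort
            self.locks[key].release()
        held.clear()
```

Run serially, T1 then T2 yields the serializable result 11, 9; a concurrent T2 would simply block on x's lock until T1 commits.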
Why hold locks until after commit/abort?
- Why not release as soon as done with the record?
- Need all results to show up atomically.
- e.g. why not have T2 release x's lock after first get()?
- T1 could then execute between T2's get()s
- T2 would print 10,9
- but that is not a serializable execution: neither T1;T2 nor T2;T1
2PC perspective
- Used in sharded DBs when a transaction uses data on multiple shards.
- But it has a bad reputation:
- Slow because of multiple phases / message exchanges.
- Locks are held over the prepare/commit exchanges.
- C crash can cause indefinite blocking, with locks held
- Thus usually used only in a single small domain
- e.g. not between banks, not between airlines, not over wide area
Paxos and two-phase commit solve different problems!
- Use Paxos to get high availability by replicating
- i.e. to be able to operate when some servers are crashed
- the servers must have identical state (to first approximation)
- Use 2PC when each participant does something different
- And all of them must do their part
- 2PC does not help availability
- since all servers must be up to get anything done
- Paxos does not ensure that all servers do something (in real time)
- Since only a majority have to be alive
- Though, they will in the limit if they are eventually up
- And they will according to "virtual time"
What if you want high availability and distributed commit?
- [diagram]
- Each "server" should be a Paxos-replicated service
- And the TC should be Paxos-replicated
- Run two-phase commit where each participant is a replicated service
- Then you can tolerate failures and still make progress
- This is what Spanner does (for update transactions)
Case study: Argus
Argus's big ideas:
- Language support for distributed programs
- Very cool: language abstracts away ugly parts of distrib systems
- Aimed at services interacting via RPC
- Clean handling of RPC and server failure
- Transactional updates via 2PC
- So crash results in entire transaction un-done, not partial update
- Easy persistence ("stable"):
- Ordinary variables automatically persisted to disk
- Automatic crash recovery
- Easy concurrency control:
- Multiple clients means multiple distributed transactions
- Automatic locking of language objects
The overall design story seems very sensible
- Starting point: you want to handle RPC failures cleanly
- Clean failure handling means Argus needs transactions
- Transaction roll-back means Argus must manage program objects
- Crash recovery means Argus must handle persisting program objects
Picture
- "guardian" is like an RPC server
- has state (variables) and handlers
- "handler" is an RPC handler
- reads and writes local variables
- "action" is a distributed atomic transaction
- action on A
- A RPC to B
- A RPC to D
- A finishes action
- prepare msgs to B, C, D
- commit msgs to B, C, D
The style is to send RPC to where the data is
- Not to fetch the data
- Argus is not a storage system
Look at bank example
- page 309 (and 306): bank transfer
Points to notice
- stable keyword (programmer never writes to disk &c)
- atomic keyword (programmer almost never locks/unlocks)
- enter topaction (in transfer)
- coenter (in transfer)
- RPCs are hidden (e.g. f.withdraw())
- RPC error handling hidden (just aborts)
what if deposit account doesn't exist?
- but f.withdraw(from) has already been called?
- how to un-do?
- what's the guardian state when withdraw() handler returns?
- lock, temporary version, just in memory
what if an audit runs during a transfer?
- how does the audit not see the tentative new balances?
if a guardian crashes and reboots, what happens to its locks?
- can it just forget about pre-crash locks?
Subactions
- each RPC is actually a sub-action
- the RPC can fail or abort w/o aborting surrounding action
- this lets actions e.g. try one server, then another
- if RPC reply lost, subaction will abort, undo
- much cleaner than e.g. Go RPC
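The try-one-server-then-another pattern can be modeled as follows; Argus is a CLU-derived language, so this is only a Python analogy, with every name invented:

```python
# Sketch of subactions: a failed RPC undoes only itself, and the
# surrounding action can try another server.

class Abort(Exception):
    pass

def subaction(store, body):
    # Run body against a tentative copy; install it only on success.
    tentative = dict(store)
    try:
        body(tentative)
    except Abort:
        return False            # undo: tentative copy is discarded
    store.clear()
    store.update(tentative)
    return True

def try_servers(store, bodies):
    # Failure of one subaction doesn't abort the surrounding action.
    for body in bodies:
        if subaction(store, body):
            return True
    raise Abort("all servers failed")
```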
Is Argus's implicit locking the right thing?
- Very convenient!
- Don't have to worry about forgetting to lock!
- (though deadlocks are easy)
- Databases work (and worked) this way; it's a successful idea
- Why might it be less successful in a procedural language?
Is transactions + RPC + 2PC a good design point?
- Programmability pro:
- Very easy to get nice fault tolerance semantics
- Performance con:
- Lots of msgs and disk writes
- 2PC and 2PL hold locks for a while, block if failure
Is Argus's language integration the right thing?
- i.e. persisting and locking language objects
- It looks very convenient (and it is)
Why didn't more systems pick up on Argus' language-based approach?
- Java RMI is perhaps the closest in common use.
- Akka is similar in many ways as well.
- Perhaps people prefer to build distributed systems around data
- Not around RPC
- Stable data and computation usually decoupled.
- e.g. big web sites are very storage-centric
- Database provides transactions, persistence, etc.
- Tables, records, and queries are more powerful than Argus' data
- Maybe there is a better language-based scheme waiting to be found
Big questions
- At the start the paper claims Argus is good for coping with distributed
programs since they may be partly running and partly crashed.
- Does it make good on this promise? Does it have features that would
complicate this?
- Any explicit support for replication?
- Seems like you have to roll your own.
- What about the edge cases in 2PC?
- Does this seem like something you could use at small scale? Large scale?
Lab 2
- Similar test framework to Lab 1 Part 2/3.
- Code significantly different.
- Conventional RPC client/server architecture.
- Normal locking, less of a need for exotic concurrency.
- But still highly concurrent!
- Lab 2 diagram
- Key/Value servers
- Use primary/backup replication scheme.
- All reads/writes are sent to the primary.
- Primary forwards them to the backup to ensure consistency.
- All servers ping the "viewservice" to keep alive.
- If backup fails, new backup chosen, seeded from primary.
- If primary fails, backup is promoted, new backup is chosen
- View service
- Decides who is primary and who is backup.
- On each ping returns view.
- (ViewId, Primary Addr, Backup Addr)
- Follows promotion rules.
- failed backup -> new backup
- failed primary -> backup to primary
- Always wait for ack from new primary before advancing view.
- Why?
- This ensures that the old primary's backup won't accept writes from it.
- Prevents split brain: clients can't read/write from deposed primaries
while other clients talk to the new primary.
- Creates a problem, though: what if new primary doesn't ack?
- We're stuck, can't advance safely.
- Paxos will help us fix this soon.
- Lab in two parts: viewservice first, then KVS.
- Sanity check: why don't we use 2PC for this lab?
- Would 2PC make sense in a KVS?
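The view-service promotion rules above can be sketched as a small state machine; the field names and view tuple shape are illustrative, not the lab's actual interface:

```python
# Sketch of view advancement: promote on failure, but only after the
# current primary has acked the current view.

class ViewService:
    def __init__(self, primary, backup):
        self.view = (1, primary, backup)    # (ViewId, Primary, Backup)
        self.acked = False                  # has the primary acked this view?

    def ack(self, server, viewid):
        if server == self.view[1] and viewid == self.view[0]:
            self.acked = True

    def server_failed(self, server, idle=None):
        viewid, primary, backup = self.view
        if not self.acked:
            return False                    # stuck: can't advance safely
        if server == primary:
            # failed primary -> backup promoted, idle server becomes backup
            self.view = (viewid + 1, backup, idle)
        elif server == backup:
            # failed backup -> idle server becomes new backup
            self.view = (viewid + 1, primary, idle)
        self.acked = False                  # new view needs a fresh ack
        return True
```

Note how the last assertion in the usage below is exactly the "stuck" case from the notes: the new primary hasn't acked, so the view can't advance.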