Lecture 06: Distributed Transactions
Sinfonia: a new paradigm for building scalable distributed systems
by Aguilera, Merchant, Shah, Veitch, and Karamanolis
Why this paper?
- We've been looking at transactions.
- This is more of the same, but focused on real, practical systems building.
- Interesting because this is the first claim I've seen positioning
transactional KVS as a building block for large-scale systems.
- Something I've convinced myself is a great idea recently.
- Q: What should we be looking for in such a system?
- Q: How does this requirement change it versus an app-facing system?
What is this thing?
- A big shared storage system where small bits of data can be manipulated
with minitransactions (atomic bundles of reads/writes to memory locations).
- [figure 1]
- Transactional updates across multiple nodes.
- More of a memory-like abstraction than prior papers.
- Each memory node exposes linear address space.
- Data accessed through special pointers (memory node, offset).
Intro
- Q: Why are they hard on message passing?
- Wild west, no forced structure.
- Often giving people choice is a disadvantage.
- Separate state, computation, protocols.
- Can protect precious state.
Q: Why is group membership hard? Split brain.
Uses:
- File systems
- Lock managers
- Group communication services
- DSM
Database systems lack the performance, they claim.
- Let's come back to that after eval.
Coupling
- What's this about?
- Is an application that manipulates memory words tightly coupled to the
memory subsystem?
- Ask inventors of DSM.
- Good luck detangling a program that modifies shared state.
- Another example: running shared state programs on NUMA.
- Small change in constant factors wrecks performance, and it's near
  impossible to fix an implementation because of implicit communication
  all over the place.
- Apps do get tightly coupled to storage (e.g. SQL DBs), but which would you
rather port?
Idea: big address space, minitransactions
- Minitransactions are like compare-and-swap on steroids.
- Batches updates.
- Can be executed within the commit protocol.
- i.e. no need to execute reads/writes ahead of time before commit.
- no need for extra begin call.
- Replication can happen in parallel with 2PC.
- Q: Practically, how does this end up differing from Thor?
- Have to read in one txn.
- Perhaps use cached state.
- Compare/write in another and check for freshness.
- These guys claim two round trips per tx is a feat; how many did Thor
have?
Built an NFS server and group communication.
Assumptions and Goals
Design
- Principles
- Reduce operation coupling to obtain scalability.
- Q: What do they mean by this?
- Hard to say, but e.g. indexes, table metadata, etc.
- Q: Don't we just expect the app to have to build all that, though?
- Q: What is the right set of built-in primitives for such a system?
- Make components reliable before scaling them.
- This makes scaling easier.
- It misses opportunities, though.
- Often, if lower-level knows semantics of higher-level then it can take
shortcuts.
- e.g. If all data operations are associative and commutative then
enforcing message order, recovery replay order, replication order,
etc may not matter.
- This all increases coupling, though, so this makes sense for their goals.
Minitransactions
- ACID: all or nothing (atomicity); one valid state to another, i.e. data is
  not corrupted, they say (consistency); serializable (isolation); retained
  with high probability under some failure model (durability).
- Actions: read/write/compare
- 2PC
- Q: Application nodes are the coordinator!?!?!
- Why is this ok? JT said it's bad in Thor lecture
- "last action does not affect"
- What they are really saying here:
- Usually the coordinator can call abort at any point in 2PC up until it
sends the first commit message.
- If the coordinator gives up that right and its commit/abort decision is
solely a function of the actions it sends to the participants, then the
participants themselves can determine the commit/abort decision.
- Hence, the coordinator isn't really needed except to submit the action
list in Phase 1.
- Once it's rolling, if the participants lose touch with the coordinator,
they can still process the transaction to commit/abort.
- Effectively, the coordinator loses its vote in this scheme.
- Once it submits the request to the participants it's stuck with the
outcome.
Algorithm:
- On each memory node specified in a minitransaction:
- TryReadLock/TryWriteLock everything in the compare, read, and write sets;
  vote abort if any lock is unavailable.
- Compare everything in the compare set, abort if mismatch.
- Read everything in the read set.
- Write everything in the write set.
- Release all locks.
Kind of like Thor with the guts of concurrency control exposed.
Q: Why don't they need clock assist to totally order transactions like
Thor?
Q: Why are the locks less problematic than in Argus?
Micro Examples [Figure 2]
- Swap
- Compare-and-swap
- Q: Why is this useful, powerful?
- Q: Multiword CAS?
- Q: LL/SC?
- Q: Problem with CAS?
- Atomically read data
- Acquire a lease - more on these in a minute
- Acquire multiple leases
- Change data while lease held
Validate cached items (think Thor)
FS metadata mods: atomic.
- This is nice: rename is often broken.
Account src{};
Account dst{};

// Read both balances in one atomic minitransaction.
Minitransaction t1{};
t1.read(sjcBranch, cheriton, sizeof(Account), &src);
t1.read(slcBranch, stutsman, sizeof(Account), &dst);
if (!t1.exec_and_commit()) throw "GRUU!";

// Transfer: compare against the values we read (abort if either account
// changed underneath us), then write both updated balances atomically.
Minitransaction t2{};
t2.cmp(sjcBranch, cheriton, sizeof(Account), &src);
t2.cmp(slcBranch, stutsman, sizeof(Account), &dst);
auto newSrc = src;
newSrc.balance -= 1000000000;
auto newDst = dst;
newDst.balance += 1000000000;
t2.write(sjcBranch, cheriton, sizeof(Account), &newSrc);
t2.write(slcBranch, stutsman, sizeof(Account), &newDst);
if (!t2.exec_and_commit()) throw "NOOO!";
Aside: Leases
Things to mention:
- What is distributed leasing/locking all about?
- Mutual exclusion or ownership.
- Distributed locking doesn't usually work due to failures.
- Especially true when clients can take out locks.
- Less control.
- Higher churn.
- Don't want to make assumptions about stable storage, etc.
- Leases: basically locks with timeouts.
- Must refresh lease to maintain ownership.
- Radio silence lets another take over safely.
- Popular since it's easy to think about:
- Don't need to think about active-active protocols like Paxos.
- Assumes bounded clock drift.
- Does not assume bounded clock skew, typically.
- Lease revocation - just ask for it back.
- What's the worst case with leases?
- Can often speculatively assume that lease will expire and perform steps to
transfer ownership, but can't correctly pass off resource until lease is
definitely expired.
- Tricky bit in leases: must make sure everything has safely stopped when lease
expires.
- This always seems to make people nervous.
- NTP or operator reset clock?
- Intel clocks aren't guaranteed monotonic...
- Be careful out there.
Caching
- App does caching itself, can use compare to validate
- App knows better what to cache, when to evict, etc.
- e.g. LRU doesn't always work well, scans, etc.
- Q: Prefer this or Thor?
- Thor doesn't have app-specific caching policy.
- Thor pushes invalidations, Sinfonia cache would have to pull.
- Pull too often, wasted work, too rarely, aborts.
Fault-tolerance
Implementation
- Q: What do they mean by "logical participant"?
- Things don't get stuck (for long) on crash.
Try locks - why do this?
- Blocking is expensive for in-memory operations.
- Avoids deadlock.
- in-doubt: undecided tids
- forced-abort: tids forced to abort by recovery
- decided: tids and outcomes
Coordinator crashes
- Recovery coordinator occasionally looks at in-doubt lists.
- Polls participants for long running transactions.
- If all participants logged commit in redo log, then commit.
- Else abort by writing tid to forced-abort list.
- Safe even if the original coordinator is still running, or another
  recovery coordinator is.
- Q: Why must the recovery coordinator abort txns instead of committing
  them if it didn't find any abort messages?
- It knows participant list from redo log, but it doesn't know which
items are involved on the other participants, and there isn't an easy
way to figure it out. Not indexed, apparently.
Participant crashes
- Block until restart if not replicated
- Redo log replay
- How do we decide which transactions in the redo log to replay?
- Must only replay committed txns!
- But decided list is flushed async for performance.
- Contact participants to figure out missing entries.
- Finish up just like with recovery coordinator.
Filesystem
Group Communication
Broadcast channel
- Join, leave, send to all
- Everyone sees every message in a total order.
- If you refuse to "see" a message you need to leave.
Q: What is wrong with the circular queue? [Figure]
Fix: separate data and queue push to minimize contention. [Figure 10]
- Also, allows multi-push.
- Also, eliminates moving large data across the network if members are
  colocated on memory nodes.
Q: Why are multiple memory nodes needed for this?
Q: Why not just have a single leader sequence messages?
Performance
Related Work
- Comparison with Thor at bottom of page
Random Questions
- 3.2: node locality? data striping? transparently mapped?
- 3.6: these are really hard, hand-wavy explanations