CS6963 Distributed Systems

Lecture 06 Distributed Transactions

Sinfonia: a new paradigm for building scalable distributed systems by Aguilera, Merchant, Shah, Veitch, and Karamanolis

  • Why this paper?

    • We've been looking at transactions.
    • More on transactions, but focused on real, practical systems building.
    • Interesting because this is the first claim I've seen positioning transactional KVS as a building block for large-scale systems.
      • Something I've convinced myself is a great idea recently.
      • Q: What should we be looking for in such a system?
      • Q: How does this requirement change it versus an app-facing system?
  • What is this thing?

    • A big shared storage system where small bits of data can be manipulated with minitransactions (atomic bundles of reads/writes to memory locations).
    • [figure 1]
    • Transactional updates across multiple nodes.
    • More of a memory-like abstraction than prior papers.
      • Each memory node exposes linear address space.
      • Data accessed through special pointers (memory node, offset).

Intro

  • Q: Why are they hard on message passing?
    • Wild west, no forced structure.
    • Often giving people choice is a disadvantage.
      • e.g. Linux vs OS X
    • Separate state, computation, protocols.
      • Can protect precious state.
  • Q: Why is group membership hard? Split brain.

  • Uses:

    • File systems
    • Lock managers
    • Group communication services
  • DSM

  • Database systems lack the performance

    • Let's come back to that after eval.
  • Coupling

    • What's this about?
    • Is an application that manipulates memory words tightly coupled to the memory subsystem?
      • Ask inventors of DSM.
      • Good luck detangling a program that modifies shared state.
      • Another example: running shared state programs on NUMA.
        • A small change in constant factors breaks perf, and it's near impossible to fix the implementation because of implicit communication all over the place.
    • Apps do get tightly coupled to storage (e.g. SQL DBs), but which would you rather port?
  • Idea: big address space, minitransactions

    • Minitransactions are like compare-and-swap on steroids.
    • Batches updates.
    • Can be executed within the commit protocol.
      • i.e. no need to execute reads/writes ahead of time before commit.
      • no need for extra begin call.
    • Replication can happen in parallel with 2PC.
    • Q: Practically, how does this end up differing from Thor?
      • Have to read in one txn.
      • Perhaps use cached state.
      • Compare/write in another and check for freshness.
      • These guys claim two round trips per tx is a feat; how many did Thor have?
  • Built an NFS server and group communication.

Assumptions and Goals

  • Q: Why a single datacenter assumption? What does this change?

    • 2PC on WAN? Count on 200 ms commit times.
    • Since locks are held between Phase 1 and Phase 2, the hottest record can only be updated about 10 times per second.
  • Network partitions []

  • Infrastructure apps

    • Lock manager
    • Cluster file system
      • Q: How would this change how you'd build FDS?
    • Group communication services
    • Distributed name services

Design

  • Principles
    • Reduce operation coupling to obtain scalability.
      • Q: What do they mean by this?
        • Hard to say, but e.g. indexes, table metadata, etc.
      • Q: Don't we just expect the app to have to build all that, though?
      • Q: What is the right set of built-in primitives for such a system?
    • Make components reliable before scaling them.
      • This makes scaling easier.
      • It misses opportunities, though.
      • Often, if lower-level knows semantics of higher-level then it can take shortcuts.
        • e.g. If all data operations are associative and commutative then enforcing message order, recovery replay order, replication order, etc may not matter.
      • This all increases coupling, though, so this makes sense for their goals.

Minitransactions

  • ACID: atomicity = all or nothing; consistency = one valid state to another (they say data is not corrupted); isolation = serializable; durability = retained with high probability under some failure model
  • Actions: read/write/compare
  • 2PC
    • Q: Application nodes are the coordinator!?!?!
      • Why is this OK? JT said it was bad in the Thor lecture.
  • "last action does not affect"
    • What they are really saying here:
    • Usually the coordinator can call abort at any point in 2PC up until it sends the first commit message.
    • If the coordinator gives up that right and its commit/abort decision is solely a function of the actions it sends to the participants, then the participants themselves can determine the commit/abort decision.
    • Hence, the coordinator isn't really needed except to submit the action list in Phase 1.
    • Once it's rolling, if the participants lose touch with the coordinator, they can still process the transaction to commit/abort.
    • Effectively, the coordinator loses its vote in this scheme.
    • Once it submits the request to the participants it's stuck with the outcome.
  • Algorithm (see the sketch after this list):

    • On each memory node specified in a minitransaction:
    • Compare everything in the compare set, abort if mismatch.
    • TryReadLock/TryWriteLock everything in the read and write sets.
    • Read everything in the read set.
    • Write everything in the write set.
    • Release all locks.
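
Roughly what the participant side looks like, as a single-process toy. All the names here (Item, Participant, tryLock, prepare, finish) are mine, not the paper's; a real memory node locks (offset, length) ranges and logs to disk or NVRAM before voting.

#include <cstddef>
#include <map>
#include <set>
#include <vector>

struct Item { std::size_t addr; int value; };      // one word per address, for simplicity

struct Participant {
    std::map<std::size_t, int> mem;                // this node's linear address space
    std::set<std::size_t> locked;                  // addresses currently locked

    bool tryLock(std::size_t a) { return locked.insert(a).second; }

    // Phase 1: try-lock every location touched, evaluate compares under the locks,
    // stage the reads. Vote commit only if all locks were acquired and every compare
    // matched; locks stay held until finish() delivers the outcome.
    bool prepare(const std::vector<Item>& cmp, const std::vector<Item>& rd,
                 const std::vector<Item>& wr, std::vector<Item>& readResults) {
        for (const auto& i : cmp) if (!tryLock(i.addr)) return false;        // abort, don't block
        for (const auto& i : rd)  if (!tryLock(i.addr)) return false;
        for (const auto& i : wr)  if (!tryLock(i.addr)) return false;
        for (const auto& i : cmp) if (mem[i.addr] != i.value) return false;  // compare mismatch
        for (const auto& i : rd)  readResults.push_back({i.addr, mem[i.addr]});
        return true;
    }

    // Phase 2: apply the write set on commit, then release the locks either way.
    // (A real participant releases only this transaction's locks, not everything.)
    void finish(bool commit, const std::vector<Item>& wr) {
        if (commit) for (const auto& i : wr) mem[i.addr] = i.value;
        locked.clear();
    }
};
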
  • Kind of like Thor with guts of concurrency control exposed.

  • Q: Why don't they need clock assist to totally order transactions like Thor?

  • Q: Why are the locks less problematic than in Argus?

  • Micro Examples [Figure 2]

    • Swap
    • Compare-and-swap
      • Q: Why is this useful, powerful?
      • Q: Multiword CAS?
      • Q: LL/SC?
      • Q: Problem with CAS?
    • Atomically read data
    • Acquire a lease - more on these in a minute
    • Acquire multiple leases
      • Q: Why is this useful?
    • Change data while lease held
  • Validate cached items (think Thor)

    • Check getattr freshness.
  • FS metadata mods: atomic.

    • This is nice: rename is often broken.
// The two-minitransaction pattern from above: read the accounts in one minitransaction,
// then compare (to check freshness) and write in a second one.
Account src{};
Account dst{};

Minitransaction t1{};

// t1: atomically read both accounts, possibly from two different memory nodes.
t1.read(sjcBranch, cheriton, sizeof(Account), &src);
t1.read(slcBranch, stutsman, sizeof(Account), &dst);

if (!t1.exec_and_commit()) throw "GRUU!";

Minitransaction t2{};

// t2: commit the transfer only if neither account changed since t1 read it;
// on a compare mismatch, re-read and retry.
t2.cmp(sjcBranch, cheriton, sizeof(Account), &src);
t2.cmp(slcBranch, stutsman, sizeof(Account), &dst);

auto newSrc = src;
newSrc.balance -= 1000000000;
auto newDst = dst;
newDst.balance += 1000000000;

t2.write(sjcBranch, cheriton, sizeof(Account), &newSrc);
t2.write(slcBranch, stutsman, sizeof(Account), &newDst);

if (!t2.exec_and_commit()) throw "NOOO!";

Aside: Leases

Things to mention:

  • What is distributed leasing/locking all about?
    • Mutual exclusion or ownership.
  • Distributed locking doesn't usually work due to failures.
    • Especially true when clients can take out locks.
    • Less control.
    • Higher churn.
    • Don't want to make assumptions about stable storage, etc.
  • Leases: basically locks with timeouts (sketch at the end of this aside).
    • Must refresh lease to maintain ownership.
    • Radio silence lets another take over safely.
  • Popular since it's easy to think about:
    • Don't need to think about active-active protocols like Paxos.
  • Assumes bounded clock drift.
  • Does not assume bounded clock skew, typically.
  • Lease revocation - just ask for it back.
  • What's the worst case with leases?
    • Can often speculatively assume that lease will expire and perform steps to transfer ownership, but can't correctly pass off resource until lease is definitely expired.
  • Tricky bit in leases: must make sure everything has safely stopped when lease expires.
    • This always seems to make people nervous.
    • NTP or operator reset clock?
    • Intel clocks aren't guaranteed monotonic...
    • Be careful out there.
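
A tiny sketch of the holder-side bookkeeping. Lease, granted, stillValid, shouldRenew, and the 100 ms margin are all made up; the margin stands in for worst-case drift plus message latency.

#include <chrono>

struct Lease {
    using Clock = std::chrono::steady_clock;       // monotonic locally; the grantor's
    Clock::time_point expiry{};                    // clock still drifts relative to ours
    Clock::duration length{};
    static constexpr std::chrono::milliseconds margin{100};  // slack for drift + latency

    // Server granted or renewed the lease for 'd' starting now.
    void granted(Clock::duration d) { length = d; expiry = Clock::now() + d; }

    // Only touch the leased resource while this is true; stop well before the grantor
    // considers the lease expired. This is the "everything has safely stopped" part.
    bool stillValid() const { return Clock::now() + margin < expiry; }

    // Refresh once less than half the lease remains, so one lost renewal isn't fatal.
    bool shouldRenew() const { return expiry - Clock::now() < length / 2; }
};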

Caching

  • App does caching itself, can use compare to validate
    • App knows better what to cache, when to evict, etc.
    • e.g. LRU doesn't always work well, scans, etc.
  • Q: Prefer this or Thor?
    • Thor doesn't have app-specific caching policy.
    • Thor pushes invalidations, Sinfonia cache would have to pull.
    • Pull too often and you waste work; pull too rarely and you get aborts.

Fault-tolerance

  • Q: How does it handle failures?
    • Keep working with some.
    • Stop when too many.
    • Consistent snapshots for disasters.
  • Disk images - location of record
  • Logging - crash recovery
    • Potentially to NVRAM
  • Replication - HA
  • Backup - disaster recovery

  • Primary-copy replication

    • Similar to Lab 2
    • They'll be subject to split brain under partitions
    • This is why they mention Paxos
    • 4.10 talks about this a bit more
    • They power off old primaries via lights-out management
      • What's the chance that a partition in your main network comes with one in your control network?
      • Are they fully independent?
  • Backups

    • Describe consistent checkpoint
      • Start buffering log writes
      • Flush remaining dirty data
      • Copy off image
      • Catch up log
    • One tricky bit (4.9): all snapshots across machines need to start at the same logical point in the transaction stream (sketch after this list).
      • Two phase protocol: Phase 1 lock all addresses on all nodes, Phase 2 notes highest commit tx in all logs and shifts future writes to log only.
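
Rough control flow for that two-phase consistent point; MemNode and all of its methods are invented stand-ins (the real system drives this through its existing locking and logging machinery).

#include <cstddef>
#include <cstdint>
#include <vector>

struct MemNode {                                   // every body here is a stub
    void lockAllAddresses() {}                     // stop new minitransactions from landing
    std::uint64_t highestCommittedTid() { return 0; }
    void redirectWritesToLogOnly() {}              // image frozen; updates pile up in the log
    void unlockAllAddresses() {}
    void copyImageToBackup() {}
    void replayLogOntoBackup(std::uint64_t /*fromTid*/) {}
    void resumeWritesToImage() {}
};

void consistentBackup(std::vector<MemNode*>& nodes) {
    for (auto* n : nodes) n->lockAllAddresses();             // phase 1: quiesce everywhere
    std::vector<std::uint64_t> marks;
    for (auto* n : nodes) {                                  // phase 2: note the cut point
        marks.push_back(n->highestCommittedTid());           // and freeze the disk image
        n->redirectWritesToLogOnly();
    }
    for (auto* n : nodes) n->unlockAllAddresses();           // normal service resumes here
    for (std::size_t i = 0; i < nodes.size(); ++i) {         // off the critical path:
        nodes[i]->copyImageToBackup();                       // copy the frozen image,
        nodes[i]->replayLogOntoBackup(marks[i]);             // catch it up from the log,
        nodes[i]->resumeWritesToImage();                     // then go back to normal
    }
}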

Implementation

  • Q: What do they mean by "logical participant"?
    • Things don't get stuck (for long) on crash.
  • Try locks - why do this?

    • Blocking is expensive for in-memory operations.
    • Avoids deadlock.
  • in-doubt - undecided tids

  • forced-abort - tids forced to abort by recovery

  • decided - tids and outcomes

  • Coordinator crashes

    • Recovery coordinator occasionally looks at in-doubt lists.
    • Polls participants for long running transactions.
    • If all participants logged commit in redo log, then commit.
    • Else abort by writing tid to forced-abort list.
    • Safe even if the original coordinator is still running, or another recovery coordinator is.
    • Q: Why must the recovery coordinator abort txns instead of committing them if it didn't find any abort messages?
      • It knows participant list from redo log, but it doesn't know which items are involved on the other participants, and there isn't an easy way to figure it out. Not indexed, apparently.
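
The decision rule as a sketch. ParticipantStub, voteOrForceAbort, and apply are invented; the stub bodies stand in for RPCs and redo-log lookups.

#include <cstdint>
#include <vector>

struct ParticipantStub {
    // Returns this participant's phase-1 vote for tid. If it never voted, it records
    // tid on its forced-abort list and answers "abort", so a late commit vote for tid
    // is now impossible. (Stub; the real call is an RPC that consults the redo log.)
    bool voteOrForceAbort(std::uint64_t /*tid*/) { return false; }
    void apply(std::uint64_t /*tid*/, bool /*commit*/) {}   // phase 2: commit/abort, release locks
};

void recoverTransaction(std::uint64_t tid, std::vector<ParticipantStub*>& participants) {
    bool commit = true;
    for (auto* p : participants) commit = p->voteOrForceAbort(tid) && commit;
    // Because unvoted participants are now pinned to abort, any recovery coordinator
    // (or the original one, if it is still alive) computes this same outcome.
    for (auto* p : participants) p->apply(tid, commit);
}
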
  • Participant crashes

    • Block until restart if not replicated
    • Redo log replay
    • How do we decide which transactions in the redo log to replay?
      • Must only replay committed txns!
      • But decided list is flushed async for performance.
    • Contact participants to figure out missing entries.
      • Finish up just like with recovery coordinator.
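
Sketch of the replay decision; LogEntry, Outcome, queryPeers, and the flat maps are invented stand-ins.

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct LogEntry { std::uint64_t tid; std::vector<std::pair<std::size_t, int>> writes; };
enum class Outcome { Committed, Aborted, Unknown };

// 'decided' holds the outcomes this node managed to flush before crashing; for anything
// missing we run the same protocol as the recovery coordinator against the peers.
void replayRedoLog(const std::vector<LogEntry>& log,
                   std::map<std::uint64_t, Outcome>& decided,
                   Outcome (*queryPeers)(std::uint64_t tid),
                   std::map<std::size_t, int>& mem) {
    for (const auto& e : log) {
        auto it = decided.find(e.tid);
        Outcome o = (it != decided.end()) ? it->second : Outcome::Unknown;
        if (o == Outcome::Unknown) o = decided[e.tid] = queryPeers(e.tid);   // fill the gap
        if (o == Outcome::Committed)
            for (const auto& w : e.writes) mem[w.first] = w.second;          // redo committed txns only
    }
}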

Filesystem

  • NFSv2 protocol
  • [diagram cluster NFS processes, memory node processes]

  • Easier on Sinfonia because

    • No internode coordination
    • No need for journals (no partially applied updates)
    • Cheap cache status checks (compare items)
    • WAL perf - they mean batching/sequential IO.
  • [Figure 6]

    • Why did they partition it this way?
    • Tries to colocate inode, chaining list, and data blocks for a file.
    • Q: How does this compare to FDS? It would have pushed data blocks apart.
  • [Figure 7]

    • Q: What is ADDR_IVERSION about?
    • Promotes CAS to LL/SC.
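
Roughly what the iversion trick buys, written against the same sketchy Minitransaction interface as the transfer example earlier; Inode, memNode, inodeAddr, and the exact meaning of ADDR_IVERSION are my guesses.

Inode cached;            // filled in by an earlier read minitransaction and cached locally

Inode updated = cached;
updated.iversion += 1;   // every successful update bumps the version

Minitransaction t{};
// Compare only the version word instead of the whole cached inode: any intervening
// write bumped it, so this behaves like LL/SC rather than a value-based CAS (no ABA,
// and the compare item stays tiny).
t.cmp(memNode, ADDR_IVERSION, sizeof(cached.iversion), &cached.iversion);
t.write(memNode, inodeAddr, sizeof(Inode), &updated);

if (!t.exec_and_commit()) { /* cache was stale: re-read the inode and retry */ }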

Group Communication

  • Broadcast channel

    • Join, leave, send to all
    • Everyone sees every message in a total order.
    • If you refuse to "see" a message you need to leave.
  • Q: What is wrong with the circular queue? [Figure]

  • Fix: separate data and queue push to minimize contention (sketch below). [Figure 10]

    • Also, allows multi-push.
    • Also, eliminates motion of large data across network if members on memory nodes.
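
Sketch of the decoupled enqueue with that same sketchy interface; queueNode, tailAddr, slotAddr, myDataNode, myNextOffset, Descriptor, and cachedTail are all invented.

// 1) Put the payload in this member's private data region; nobody else writes there,
//    and if the member runs on that memory node the bulk data never crosses the network.
Minitransaction put{};
put.write(myDataNode, myNextOffset, msgLen, msgBuf);
if (!put.exec_and_commit()) { /* retry */ }

// 2) Append a small descriptor to the shared queue. Compare the cached tail so a
//    concurrent append forces a retry; several descriptors could be pushed in one
//    minitransaction (the multi-push mentioned above).
Descriptor d{myDataNode, myNextOffset, msgLen};
uint64_t newTail = cachedTail + 1;

Minitransaction push{};
push.cmp(queueNode, tailAddr, sizeof(cachedTail), &cachedTail);
push.write(queueNode, slotAddr(cachedTail), sizeof(d), &d);
push.write(queueNode, tailAddr, sizeof(newTail), &newTail);

if (!push.exec_and_commit()) { /* tail moved: re-read it and retry the push */ }
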
  • Q: Why are multiple memory nodes needed for this?

  • Q: Why not just have a single leader sequence messages?

Performance

  • Base performance
    • 50,000 4 byte items = 200 KB
  • Peaks at ~2,400 tx/s * 6 ops = 14,400 op/s.
    • Is this reasonable for the hardware?
      • Seems pretty awesome for a 10k RPM disk.
      • Say, 5 ms seek latency: 200 IOPS
      • High parallelism, lots of page flushes, better throughput.
        • Accumulating redo log writes.
        • Elevator algorithm on disk image
  • Peaks at ~7,000 tx/s * 6 ops = 42,000 op/s.

    • Is this reasonable for the hardware?
    • 20 ms commit latency for the in-memory case? (NVRAM emulated via RAM)
      • What's the expected network RTT here?
      • 1 ms worst case.
      • 125 B/µs for 1 GbE
      • 4 * 6 = 24 bytes, so it takes well under a µs to transmit
      • Expect 2 RTTs = 2 ms or so?
      • Something weird/slow here.
      • Perhaps still stuck on flushing disk image.
  • Optimization breakdown

    • System 1: similar to Argus
    • 50,000 4 byte items spread over 4 nodes: 50 KB each
    • No skew...
    • Up to about 2x to operate locally.
  • Scalability

    • Common, dubious measurement.
    • Take slow system on one node
    • Measure it on two
    • Divide two node perf by one node perf
    • Gives a nice 'up and to the right', hides embarrassing y-axis units
      • Is Sinfonia guilty, or are the absolute results good?
    • What is this measuring?
      • Minitransaction spread of 2.
      • But no real contention.
      • It should scale unless unexpected coordination.
    • Each machine has 4 MB of state
      • About 1 GB of data total.
      • Lower perf than a modern flash drive.
      • Total system cost ~200-500k. Good deal?
      • Do we believe the results will hold for massive data?
      • How would the numbers look if we popped flash drives in?
    • [Figure 15] Highlights need to keep spread low
    • Scaling out lowers perf here
    • Spread load by placing transactions as whole units, not by spreading each transaction across nodes
  • Contention

    • [Figure 16]
    • Good example of 'optimism' breaking down
    • Speculative incs fail frequently
    • General 2PL works better
    • Could bake in an inc primitive, but that breaks the coupling argument
      • Also raises: why stop at inc?

Related Work

  • Comparison with Thor at bottom of page

Random Questions

  • 3.2: node locality? data striping? transparently mapped?
  • 3.6: these are really hard, hand-wavy explanations