CS6963 Distributed Systems

Lecture 15 Spanner

Spanner: Google's Globally-Distributed Database
Corbett et al, OSDI 2012

  • First: Imagine Lab 3, now imagine we want to read from any replica, and we want to do it without invoking Paxos.

    • What do we give up?
    • External consistency.
    • How could we get that back? Need a way to sync reads to writes without writing any state on reads.
    • Idea: tag writes with timestamp and keep version history for each key.
    • When a read comes in, tag it with a timestamp, read the last write before that timestamp, and return it (see the sketch after this list).
    • Problems: what if not all writes are applied yet at the replica?
    • What if the key space is split across groups (Lab 4)? Then how do we order writes if they aren't logged in a single Paxos log?
    • Welcome to a really challenging paper.
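
Here's a minimal Go sketch of that multi-version idea (hypothetical types and names, not the lab's or Spanner's actual interface): each write is tagged with a timestamp and kept in a per-key version history, and a read at time t returns the last write at or before t.

```go
package mvkv

import "sort"

// version is one timestamped write to a key.
type version struct {
	ts    uint64 // timestamp the Paxos leader assigned to this write
	value string
}

// Store keeps a full version history per key so reads can be answered
// "as of" any past timestamp.
type Store struct {
	data map[string][]version // versions sorted by ts, ascending
}

func NewStore() *Store {
	return &Store{data: map[string][]version{}}
}

// Write records a new value for key at timestamp ts.
func (s *Store) Write(key, value string, ts uint64) {
	vs := append(s.data[key], version{ts: ts, value: value})
	sort.Slice(vs, func(i, j int) bool { return vs[i].ts < vs[j].ts })
	s.data[key] = vs
}

// ReadAt returns the value of key as of timestamp ts: the last write with
// timestamp <= ts. ok is false if key had no value yet at ts.
// Note this alone isn't enough: the replica must also be sure it has
// *seen* every write <= ts, which is exactly the problem the bullets
// above point out.
func (s *Store) ReadAt(key string, ts uint64) (value string, ok bool) {
	vs := s.data[key]
	i := sort.Search(len(vs), func(i int) bool { return vs[i].ts > ts })
	if i == 0 {
		return "", false
	}
	return vs[i-1].value, true
}
```
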
  • Why this paper?

    • Modern, high performance?, driven by real-world needs
    • Sophisticated use of paxos
    • Tackles consistency + performance (will be a big theme)
    • Lab 4 is a (hugely) simplified version of Spanner
  • What are the big ideas?

    • Shard management w/ paxos replication
    • Distributed transactions
    • High performance despite synchronous WAN replication
    • Consistency despite sharding (this is the real focus)
    • Fast reads by asking only the nearest replica
    • Clever use of time for consistency
  • This is a dense paper!

    • I've tried to boil down some of the ideas to simpler form.
    • We'll mostly ignore massive parts of it.
  • Idea: sharding

    • We've seen this before in Sinfonia, Thor, Facebook
    • A serious problem is managing configuration changes
    • Spanner has a more convincing design for this than FDS
  • Simplified sharding outline (lab 4):

    • Replica groups, paxos-replicated
      • Paxos log in each replica group
    • Master, paxos-replicated
      • Assigns shards to groups
      • Numbered configurations
    • If master moves a shard, groups eventually see new config
    • "start handoff Num=7" op in both groups' paxos logs
      • Though perhaps not at the same time
    • dst can't finish handoff until it has copies of shard data at majority
      • and can't wait long for possibly-dead minority
      • minority must catch up, so perhaps put shard data in paxos log (!)
    • "end handoff Num=7" op in both groups' logs
  • Q: What if a Put is concurrent w/ handoff?

    • Client sees new config, sends Put to new group before handoff starts?
      • View id can detect this.
    • Client has stale view and sends it to old group after handoff?
      • View id can detect this.
    • Arrives at either during handoff?
      • Could even use view change for this.
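
Concretely, "view id can detect this" might be a config-number check at the top of the Put handler; continuing the sketch above, with a made-up keyToShard:

```go
type PutArgs struct {
	Key, Value string
	Num        int // config number the client believes is current
}

type PutReply struct {
	Err string // "OK" or "ErrWrongGroup"
}

// keyToShard is a stand-in shard function: 10 shards, hashed by first byte.
func keyToShard(key string) int {
	if len(key) == 0 {
		return 0
	}
	return int(key[0]) % 10
}

// Put rejects requests sent under the wrong configuration or during a
// handoff; the client must re-fetch the config from the master and retry.
func (g *Group) Put(args *PutArgs, reply *PutReply) {
	if args.Num != g.config || g.inFlight[keyToShard(args.Key)] {
		reply.Err = "ErrWrongGroup"
		return
	}
	// ...otherwise replicate the Put through this group's Paxos log...
	reply.Err = "OK"
}
```
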
  • Q: What if a failure during handoff?

    • e.g. old group thinks shard is handed off but new group fails before it thinks so.
    • HA through Paxos prevents this.
  • Q: Can two groups think they are serving a shard?

    • Yes, but in different views.
  • Q: Could old group still serve shard if can't hear master?

  • Idea: wide-area synchronous replication

    • Goal: survive single-site disasters
    • Goal: replica near customers
    • Goal: don't lose any updates
  • Considered impractical until a few years ago

    • Paxos too expensive, so maybe primary/backup?
      • Why is Paxos needed to solve this? Could we use fast view changes/leases instead?
      • Then you'd still need an algorithm for 'catch-up'.
      • Paxos better than view changes over WAN since less sensitive to hair-trigger reconfigurations.
    • If primary waits for ACK from backup
      • 50ms network will limit throughput and cause palpable delay
      • esp if app has to do multiple reads at 50ms each
    • If primary does not wait, it will reply to client before durable
    • Danger of split brain; can't make network reliable
  • What's changed?

    • Other site may be only 5 ms away -- San Francisco / Los Angeles
    • Faster/cheaper WAN
    • Apps written to tolerate delays
      • May make many slow read requests
      • But issue them in parallel
      • Maybe time out quickly and try elsewhere, or issue redundant gets (see the sketch after this list)
    • Huge # of concurrent clients lets you get hi thruput despite high delay
      • Run their requests in parallel
    • People appreciate paxos more and have streamlined variants
      • Fewer msgs
        • Page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare
        • Paper's scheme a little more involved b/c they must ensure there's at most one leader
    • Read at any replica
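
A small Go sketch of the "issue slow reads in parallel, time out quickly, redundant gets" point above (fetch is a stand-in for a real replica RPC, not anything from the paper):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// readAny sends the same read to several replicas at once and returns the
// first successful reply; a timeout keeps one slow or dead replica from
// stalling the whole request.
func readAny(replicas []string, key string, timeout time.Duration) (string, error) {
	type result struct {
		val string
		err error
	}
	ch := make(chan result, len(replicas))
	for _, r := range replicas {
		go func(addr string) {
			val, err := fetch(addr, key) // hypothetical RPC to one replica
			ch <- result{val, err}
		}(r)
	}
	deadline := time.After(timeout)
	for i := 0; i < len(replicas); i++ {
		select {
		case res := <-ch:
			if res.err == nil {
				return res.val, nil
			}
		case <-deadline:
			return "", errors.New("timed out waiting for any replica")
		}
	}
	return "", errors.New("all replicas failed")
}

// fetch stands in for a real RPC; here it just fakes a reply.
func fetch(addr, key string) (string, error) {
	time.Sleep(10 * time.Millisecond) // pretend WAN latency
	return "value-of-" + key + "@" + addr, nil
}

func main() {
	v, err := readAny([]string{"sfo", "lax", "nyc"}, "photo:123", 100*time.Millisecond)
	fmt.Println(v, err)
}
```
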
  • Actual performance?

    • Table 3
      • Pretend it's just measuring paxos for writes, and read-at-any-replica for reads
      • Latency
        • Why doesn't write latency go up w/ more replicas?
          • Can hide overlapped requests in commit delay.
        • Why does std dev of latency go down w/ more replicas?
          • Eliminates stragglers.
        • r/o a lot faster since not a paxos agreement + use closest replica
          • Need tricks to make this linearizable, though.
      • Throughput
        • Why does read throughput go up w/ # replicas?
        • Why doesn't write throughput go up?
        • Does write thruput seem to be going down?
      • What can we conclude from Table 3?
        • Is the system fast? slow?
      • How fast do your paxoses run?
        • 10 ms per agreement? with local communication and no disk
        • Spanner paxos might wait for disk write (might not)
        • Why is waiting for the disk unlikely to be important?
    • Figure 5
      • npaxos=5, all leaders in same zone
      • Why does killing a non-leader in each group have no effect?
      • For killing all the leaders ("leader-hard")
        • Why flat for a few seconds?
        • What causes it to start going up?
        • Why does it take 5 to 10 seconds to recover?
        • Why is slope higher until it rejoins?
  • Spanner reads from any paxos replica

    • read does not involve a paxos agreement
    • just reads the data directly from replica's k/v DB
    • maybe 100x faster -- same room rather than cross-country
  • Q: could we write to just one replica?

  • Q: is reading from any replica correct?

  • Q: In Lab 3, does your implementation provide linearizability (external consistency)?

    • Yes, if you replicate Get ops in the log.
    • What if you don't? What goes wrong?
    • [Show read in the past.]
    • Effectively sequential consistency/serializability, but no real-time constraint on reads in this case.
    • i.e. reads may not return writes that completed before the read was issued.
    • This sets up the motivation for TrueTime...
  • Example of problem:

    • Photo sharing site
    • I have photos
    • I have an ACL (access control list) saying who can see my photos
    • I take my mom out of my ACL, then upload new photo
    • Really it's web front ends doing these client reads/writes
      1. W1: I write ACL on group G1 (bare majority), then
      2. W2: I add image on G2 (bare majority), then
      3. Mom reads image - may get old data from lagging G2 replica
      4. Mom reads ACL - may get new data from G1
  • This system is not acting like a single server!

    • There was not really any point at which the image was present but the ACL hadn't been updated
  • This problem is caused by a combination of

    • Partitioning - replica groups operate independently
    • Cutting corners for performance - read from any replica
  • How can we fix this?

    1. Make reads see latest data
      • e.g. full paxos for reads expensive!
    2. Make reads see consistent data
      • Data as it existed at some previous point in time
      • i.e. before #1, between #1 and #2, or after #2
      • But not with order inverted.
      • This turns out to be much cheaper
      • Spanner does this
  • Here's a super-simplification of spanner's consistency story for r/o clients

    • "snapshot" or "lock-free" reads
    • Assume for now that all the clocks agree
    • Server (paxos leader) tags each write with the time at which it occurred
    • KVS stores multiple values for each key, each with a different time
    • Reading client picks a time t
      • For each read, ask relevant replica to do the read at time t
    • How does a replica read a key at time t?
      • Return the stored value with highest time <= t
    • But wait, the replica may be behind
      • That is, there may be a write at time < t, but replica hasn't seen it
        • Paxos replication can lag arbitrarily.
      • So replica must somehow be sure it has seen all writes <= t
      • Idea: has it seen any operation from time > t?
        • If yes, and paxos group always agrees on ops in time order, it's enough to check/wait for an op with time > t
        • That is what spanner does on reads (4.1.3)
    • What time should a reading client pick?
      • Using current time may force lagging replicas to wait
      • So perhaps a little in the past
      • Client may miss latest updates
      • But at least it will see consistent snapshot
      • In our example, won't see new image w/o also seeing ACL update
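
Continuing the versioned-store sketch from earlier (same hypothetical mvkv package, plus one new field recording how far the replica has applied its timestamp-ordered Paxos log), a snapshot read might look like this:

```go
package mvkv

import (
	"sync/atomic"
	"time"
)

// Replica wraps the versioned Store from the earlier sketch with the one
// extra piece of state snapshot reads need: the timestamp of the newest
// Paxos log entry this replica has applied.
type Replica struct {
	store        *Store
	maxAppliedTS uint64
}

// applied is called by the apply loop after each log entry.
func (r *Replica) applied(ts uint64) {
	atomic.StoreUint64(&r.maxAppliedTS, ts)
}

// SnapshotRead serves a read at time t with no Paxos agreement: it waits
// until the replica has applied some op with timestamp > t (so, since the
// group agrees on ops in timestamp order, it cannot be missing any write
// <= t), then answers from its local version history.
func (r *Replica) SnapshotRead(key string, t uint64) (string, bool) {
	for atomic.LoadUint64(&r.maxAppliedTS) <= t {
		time.Sleep(time.Millisecond) // lagging replica: wait to catch up
	}
	return r.store.ReadAt(key, t)
}
```
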
  • How does that fix our ACL/image example?

    1. W1: I write ACL, G1 assigns it time=10, then
    2. W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
    3. mom picks a time, for example t=14
    4. mom reads ACL at t=14 from a lagging G1 replica; if it hasn't seen paxos agreements up through t=14, it knows to wait, so it will return W1
    5. mom reads image from G2 at t=14; the image may have been written on that replica, but G2 knows not to return it since the image's time is 15. Other choices of t work too.
  • Q: Is it reasonable to assume that different computers' clocks agree?

    • Why might they not agree?
  • Q: What may go wrong if servers' clocks don't agree?

    • A performance problem: reading client may pick time in the future, forcing reading replicas to wait to "catch up"
    • A correctness problem:
      • Again, the ACL/image example
      • G1 and G2 disagree about what time it is (timestamps flip!)
        1. W1: I write ACL on G1 - stamped with time=15
        2. W2: I add image on G2 - stamped with time=10
      • Now a client read at t=14 will see image but not ACL update
  • Q: Why doesn't spanner just ensure that the clocks are all correct?

    • After all, it has all those master GPS / atomic clocks?
    • Drift, network delay, jitter...
  • TrueTime (section 3)

    • There is an actual "absolute" time t_abs
      • but server clocks are typically off by some unknown amount
      • TrueTime can bound the error
    • So now() yields an interval: [earliest,latest]
    • Earliest and latest are ordinary scalar times
      • Perhaps microseconds since Jan 1 1970
    • t_abs is highly likely to be between earliest and latest
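
The TrueTime API boils down to a now() that returns an interval rather than a scalar. A Go rendering of the shape of that API (the epsilon source here is invented for illustration; real TrueTime derives it from time-master uncertainty plus drift since the last sync):

```go
package truetime

import "time"

// Interval bounds the actual absolute time: t_abs is (with very high
// probability) somewhere in [Earliest, Latest].
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock mimics the shape of TrueTime's TT.now(). Epsilon is this
// machine's current error bound.
type Clock struct {
	Epsilon time.Duration
}

func (c *Clock) Now() Interval {
	local := time.Now()
	return Interval{
		Earliest: local.Add(-c.Epsilon),
		Latest:   local.Add(c.Epsilon),
	}
}

// After reports whether t has definitely passed (TT.after(t) in the paper).
func (c *Clock) After(t time.Time) bool {
	return c.Now().Earliest.After(t)
}

// Before reports whether t has definitely not arrived yet (TT.before(t)).
func (c *Clock) Before(t time.Time) bool {
	return c.Now().Latest.Before(t)
}
```
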
  • Q: How does TrueTime choose the interval?

    • Uncertainty of time sources.
    • Expected clock drift.
  • Q: Why are GPS time receivers able to avoid this problem?

    • Do they actually avoid it?
    • What about the "atomic clocks"?
  • Spanner assigns each write a scalar time

    • Might not be the actual absolute time
    • But is chosen to ensure consistency
  • The danger:

    • W1 at G1, G1's interval is [20,30]
      • Is any time in that interval OK?
    • Then W2 at G2, G2's interval is [11,21]
      • Is any time in that interval OK?
    • If they are not careful, might get s1=25 s2=15
    • Example is worst case drift between the two clocks.
  • So what we want is:

    • If W2 starts after W1 finishes, then s2 > s1
    • Simplified "external consistency invariant" from 4.1.2
    • Causes snapshot reads to see data consistent w/ true order of W1, W2
  • How does spanner assign times to writes? (much simplified, see 4.1.2)

    • A write request arrives at paxos leader
    • s will be the write's timestamp
    • Leader sets s to TrueTime now().latest
      • This is "Start" in 4.1.2
    • Then leader delays until s < now().earliest
      • i.e. until s is guaranteed to be in the past (compared to absolute time)
      • this is "commit wait" in 4.1.2
    • Then leader runs paxos to cause the write to happen
    • Then leader replies to client
    • Ramification: timestamps must be strictly increasing across all writes in a paxos log; consequence: these commit-wait delays are on the fast path! (see the sketch below)
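
A sketch of the leader's write path, continuing the truetime sketch above in the same file (Leader and replicate are hypothetical stand-ins; replicate represents running the timestamped write through Paxos):

```go
// Leader is a hypothetical Paxos leader for one replica group.
type Leader struct {
	clock *Clock
}

// replicate stands in for running the timestamped write through Paxos.
func (l *Leader) replicate(key, value string, s time.Time) { /* ... */ }

// WriteTimestamped follows the simplified rules above: pick
// s = now().latest ("Start"), wait out the uncertainty ("commit wait"),
// then replicate the write and reply to the client.
func (l *Leader) WriteTimestamped(key, value string) time.Time {
	s := l.clock.Now().Latest // "Start": s >= timestamp of any completed write

	// "Commit wait": block until TT.after(s), i.e. s < now().earliest, so s is
	// guaranteed to be in the past and any later write will pick a larger s.
	for !l.clock.After(s) {
		time.Sleep(time.Millisecond)
	}

	l.replicate(key, value, s) // then run Paxos to make the write happen
	return s                   // then reply to the client with timestamp s
}
```
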
  • Does this work for our example?

    • W1 at G1, TrueTime says [20,30]
      • s1 = 30
      • Commit wait until TrueTime says [31,41]
      • Reply to client
    • W2 at G2, TrueTime must now say >= [21,31]
      • (otherwise TrueTime is broken)
      • s2 = 31
      • Commit wait until TrueTime says [32,43]
      • Reply to client
    • It does work for this example:
      • The client observed that W1 finished before W2 started,
      • And indeed s2 > s1
      • Even though G2's TrueTime clock was slow by the most it could be
      • So if my mom sees W2, she is guaranteed to also see W1

After this point be selective.

  • Why the "Start" rule?

    • i.e. why choose the time at the end of the TrueTime interval?
    • Previous writers waited only until their timestamps were barely < t_abs
    • New writer must choose s greater than any completed write
    • t_abs might be as high as now().latest for prior writes
    • So s = now().latest
  • Why the "Commit Wait" rule?

    • Ensures that s < t_abs
    • i.e. ensures that s is really in the past before committing.
    • Otherwise write might complete with an s in the future
      • and would let Start rule give too low an s to a subsequent write
  • Q: Why commit wait; why not immediately write value with chosen time?

    • Indirectly forces subsequent write to have high enough s
      • the system has no other way to communicate minimum acceptable next s for writes in different replica groups
    • Waiting forces writes that some external agent is serializing to have monotonically increasing timestamps
    • w/o wait, our example goes back to s1=30 s2=21
    • You could imagine explicit schemes to communicate last write's TS to the next write
  • Q: how long is the commit wait?

  • A large TrueTime uncertainty requires a long commit wait

    • so Spanner authors are interested in accurate low-uncertainty time
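
A rough way to see how long the wait is, writing epsilon for the instantaneous TrueTime error bound, so now() returns roughly [t_abs - epsilon, t_abs + epsilon] (this back-of-envelope is mine, not the paper's):

```latex
\text{Start: } s = \text{now().latest} \approx t_{\mathrm{abs}} + \epsilon,
\qquad
\text{commit wait ends when now().earliest} > s,
\;\text{i.e. } t'_{\mathrm{abs}} - \epsilon > t_{\mathrm{abs}} + \epsilon
\;\Longrightarrow\;
t'_{\mathrm{abs}} - t_{\mathrm{abs}} > 2\epsilon
```

So the commit wait is roughly twice the instantaneous uncertainty.
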
  • Let's step back

    • Why did we get into all this timestamp stuff?
      • Our replicas were 100s or 1000s of miles apart (for locality/fault tol)
      • We wanted fast reads from a local replica (no full paxos)
      • Our data was partitioned over many replica groups w/ separate clocks
      • We wanted consistency for reads:
        • If W1 then W2, reads never see W2 without also seeing W1
    • It's complex but it makes sense as a high-performance evolution of Lab 3/4
  • Why is this timestamp technique interesting?

    • We want to enforce order - things that happened in some order in real time are ordered the same way by the distributed system
      • "external consistency"
    • The naive approach requires a central agent, or lots of communication
    • Spanner does the synchronization implicitly via time
      • time can be a form of communication
      • e.g. we agree in advance to meet for dinner at 6:00pm
  • There's a lot of additional complexity in the paper

    • Transactions, two phase commit, two phase locking, schema change, query language, etc.