Lecture 15 Spanner
Spanner: Google's Globally-Distributed Database
Corbett et al., OSDI 2012
First: Imagine Lab 3, now imagine we want to read from any replica, and we
want to do it without invoking Paxos.
- What do we give up?
- External consistency.
- How could we get that back? Need a way to sync reads to writes without
writing any state on reads.
- Idea: tag writes with timestamp and keep version history for each key.
- When a read comes in, tag it with a timestamp, read the last write before
that timestamp, and return it.
- Problems: what if not all writes are applied yet at the replica?
- What if the key space is split across groups (Lab 4)? Then how do we order
writes if they aren't logged in a single Paxos log?
- Welcome to a really challenging paper.
Why this paper?
- Modern, high performance(?), driven by real-world needs
- Sophisticated use of paxos
- Tackles consistency + performance (will be a big theme)
- Lab 4 is a (hugely) simplified version of Spanner
What are the big ideas?
- Shard management w/ paxos replication
- Distributed transactions
- High performance despite synchronous WAN replication
- Consistency despite sharding (this is the real focus)
- Fast reads by asking only the nearest replica
- Clever use of time for consistency
This is a dense paper!
- I've tried to boil down some of the ideas to simpler form.
- We'll mostly ignore massive parts of it.
Idea: sharding
- We've seen this before in Sinfonia, Thor, Facebook
- A serious problem is managing configuration changes
- Spanner has a more convincing design for this than FDS
Simplified sharding outline (lab 4):
- Replica groups, paxos-replicated
- Paxos log in each replica group
- Master, paxos-replicated
- Assigns shards to groups
- Numbered configurations
- If master moves a shard, groups eventually see new config
- "start handoff Num=7" op in both groups' paxos logs
- Though perhaps not at the same time
- dst can't finish handoff until it has copies of shard data at majority
- and can't wait long for possibly-dead minority
- minority must catch up, so perhaps put shard data in paxos log (!)
- "end handoff Num=7" op in both groups' logs
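A lab-4-style implementation might log handoff ops along these lines; this is
a sketch in Go (the labs' language), and the op names and fields
(StartHandoff, EndHandoff, Data) are invented for illustration, not from the
paper:

  // Hypothetical ops carried in both groups' paxos logs during a handoff.
  type StartHandoff struct {
      Num   int // new configuration number, e.g. 7
      Shard int // which shard is moving
  }

  type EndHandoff struct {
      Num   int
      Shard int
      // The shard's data rides in the log itself (the "(!)" above), so a
      // lagging minority can catch up by replaying its paxos log alone,
      // at the cost of large log entries.
      Data map[string]string
  }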
Q: What if a Put is concurrent w/ handoff?
- Client sees new config, sends Put to new group before handoff starts?
- Client has stale view and sends it to old group after handoff?
- Arrives at either during handoff?
- Could even use view change for this.
Q: What if a failure during handoff?
- e.g. old group thinks shard is handed off but new group fails before it
thinks so.
- HA through Paxos prevents this.
Q: Can two groups think they are serving a shard?
- Yes, but in different views.
Q: Could old group still serve shard if can't hear master?
Idea: wide-area synchronous replication
- Goal: survive single-site disasters
- Goal: replica near customers
- Goal: don't lose any updates
Considered impractical until a few years ago
- Paxos too expensive, so maybe primary/backup?
- Why is Paxos needed to solve this? Could we use fast view changes/leases?
- Then you'd still need an algorithm for catch-up.
- Paxos better than view changes over WAN since less sensitive to
hair-trigger reconfigurations.
- If primary waits for ACK from backup
- 50ms network will limit throughput and cause palpable delay
- esp if app has to do multiple reads at 50ms each
- If primary does not wait, it will reply to client before durable
- Danger of split brain; can't make network reliable
What's changed?
- Other site may be only 5 ms away -- San Francisco / Los Angeles
- Faster/cheaper WAN
- Apps written to tolerate delays
- May make many slow read requests
- But issue them in parallel
- Maybe time out quickly and try elsewhere, or redundant gets
- Huge # of concurrent clients lets you get hi thruput despite high delay
- Run their requests in parallel (see the sketch after this list)
- People appreciate paxos more and have streamlined variants
- Fewer msgs
- Page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare
- Paper's scheme a little more involved b/c they must ensure there's at
most one leader
- Read at any replica
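A sketch of that parallelism point (plain Go, nothing Spanner-specific;
parallelGets and get are invented names): issue one goroutine per key so total
latency is roughly one WAN round trip, not len(keys) round trips.

  // parallelGets fetches all keys concurrently; get is whatever RPC stub
  // reads one key from a (possibly distant) replica.
  func parallelGets(keys []string, get func(string) string) map[string]string {
      type kv struct{ k, v string }
      ch := make(chan kv, len(keys))
      for _, k := range keys {
          go func(k string) { ch <- kv{k, get(k)} }(k) // one in-flight RPC per key
      }
      out := make(map[string]string, len(keys))
      for range keys {
          r := <-ch
          out[r.k] = r.v
      }
      return out
  }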
Actual performance?
- Table 3
- Pretend just measuring paxos for writes, read at any replica for reads
- Latency
- Why doesn't write latency go up w/ more replicas?
- Can hide overlapped requests in commit delay.
- Why does std dev of latency go down w/ more replicas?
- r/o a lot faster since not a paxos agreement + use closest replica
- Need tricks to make this linearizable, though.
- Throughput
- Why does read throughput go up w/ # replicas?
- Why doesn't write throughput go up?
- Does write thruput seem to be going down?
- What can we conclude from Table 3?
- Is the system fast? slow?
- How fast do your paxoses run?
- 10 ms per agreement? with local communication and no disk
- Spanner paxos might wait for disk write (might not)
- Why is waiting for the disk unlikely to be important?
- Figure 5
- npaxos=5, all leaders in same zone
- Why does killing a non-leader in each group have no effect?
- For killing all the leaders ("leader-hard")
- Why flat for a few seconds?
- What causes it to start going up?
- Why does it take 5 to 10 seconds to recover?
- Why is slope higher until it rejoins?
Spanner reads from any paxos replica
- read does not involve a paxos agreement
- just reads the data directly from replica's k/v DB
- maybe 100x faster -- same room rather than cross-country
Q: could we write to just one replica?
Q: is reading from any replica correct?
Q: In Lab 3, does your implementation provide linearizability (external
consistency)?
- Yes, if you replicate Get ops in the log.
- What if you don't? What goes wrong?
- [Show read in the past.]
- Effectively sequential consistency/serializability, but no real-time
constraint on reads in this case.
- i.e. reads may not return writes that completed before the read was issued.
- This sets up the motivation for TrueTime...
Example of problem:
- Photo sharing site
- I have photos
- I have an ACL (access control list) saying who can see my photos
- I take my mom out of my ACL, then upload new photo
- Really it's web front ends doing these client reads/writes
- W1: I write ACL on group G1 (bare majority), then
- W2: I add image on G2 (bare majority), then
- Mom reads image - may get old data from lagging G2 replica
- Mom reads ACL - may get new data from G1
This system is not acting like a single server!
- There was not really any point at which the image was present but the ACL
hadn't been updated
This problem is caused by a combination of
- Partitioning - replica groups operate independently
- Cutting corners for performance - read from any replica
How can we fix this?
- Make reads see latest data
- e.g. full paxos for reads -- expensive!
- Make reads see consistent data
- Data as it existed at some previous point in time
- i.e. before #1, between #1 and #2, or after #2
- But not with order inverted.
- This turns out to be much cheaper
- Spanner does this
Here's a super-simplification of spanner's consistency story for r/o clients
- "snapshot" or "lock-free" reads
- Assume for now that all the clocks agree
- Server (paxos leader) tags each write with the time at which it occurred
- KVS stores multiple values for each key, each with a different time
- Reading client picks a time t
- For each read, ask relevant replica to do the read at time t
- How does a replica read a key at time t?
- Return the stored value with highest time <= t
- But wait, the replica may be behind
- That is, there may be a write at time < t, but replica hasn't seen it
- Paxos replication can lag arbitrarily.
- So replica must somehow be sure it has seen all writes <= t
- Idea: has it seen any operation from time > t?
- If yes, and paxos group always agrees on ops in time order,
it's enough to check/wait for an op with time > t
- That is what spanner does on reads (4.1.3)
- What time should a reading client pick?
- Using current time may force lagging replicas to wait
- So perhaps a little in the past
- Client may miss latest updates
- But at least it will see consistent snapshot
- In our example, won't see new image w/o also seeing ACL update
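A minimal Go sketch of this multi-version store with snapshot reads; the type
and method names are mine, not Spanner's, and applying the paxos log is
abstracted into appliedT plus a condition variable:

  package spanner

  import "sync"

  type version struct {
      t     int64  // write timestamp assigned by the paxos leader
      value string
  }

  type Replica struct {
      mu      sync.Mutex
      applied *sync.Cond // signaled each time an op is applied from the log
      // appliedT is the time of the latest applied paxos op; since the
      // group agrees on ops in time order, appliedT > t means every write
      // with time <= t has been applied at this replica.
      appliedT int64
      store    map[string][]version // versions in ascending time order
  }

  func NewReplica() *Replica {
      r := &Replica{store: make(map[string][]version)}
      r.applied = sync.NewCond(&r.mu)
      return r
  }

  // ReadAtTime returns key's value as of time t (a "snapshot read").
  // It waits until this replica has seen an op with time > t, so it
  // can't miss a write with time <= t that a lagging log hasn't
  // delivered yet.
  func (r *Replica) ReadAtTime(key string, t int64) (string, bool) {
      r.mu.Lock()
      defer r.mu.Unlock()
      for r.appliedT <= t {
          r.applied.Wait()
      }
      val, ok := "", false
      for _, v := range r.store[key] {
          if v.t > t {
              break
          }
          val, ok = v.value, true // keep highest version with v.t <= t
      }
      return val, ok
  }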
How does that fix our ACL/image example?
- W1: I write ACL, G1 assigns it time=10, then
- W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
- Mom picks a time, for example t=14
- Mom reads ACL at t=14 from lagging G1 replica
  if it hasn't seen paxos agreements up through t=14, it knows to wait
  so it will return W1
- Mom reads image from G2 at t=14
  image may have been written on that replica
  but it will know not to return it, since the image's time is 15
- Other choices of t work too
Q: Is it reasonable to assume that different computers' clocks agree?
- Why might they not agree?
Q: What may go wrong if servers' clocks don't agree?
- A performance problem: reading client may pick time in the
future, forcing reading replicas to wait to "catch up"
- A correctness problem:
- Again, the ACL/image example
- G1 and G2 disagree about what time it is (timestamps flip!)
- W1: I write ACL on G1 - stamped with time=15
- W2: I add image on G2 - stamped with time=10
- Now a client read at t=14 will see image but not ACL update
Q: Why doesn't spanner just ensure that the clocks are all correct?
- After all, it has all those master GPS / atomic clocks?
- Drift, network delay, jitter...
TrueTime (section 3)
- There is an actual "absolute" time t_abs
- but server clocks are typically off by some unknown amount
- TrueTime can bound the error
- So now() yields an interval: [earliest,latest]
- Earliest and latest are ordinary scalar times
- Perhaps microseconds since Jan 1 1970
- t_abs is highly likely to be between earliest and latest
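In code TrueTime's interface is tiny; here is a Go sketch where the interval
is the paper's idea, but the fixed uncertainty eps is a made-up placeholder
for the real bound computed from time-source error plus drift:

  import "time"

  // Interval brackets true absolute time: with very high probability,
  // Earliest <= t_abs <= Latest.
  type Interval struct {
      Earliest, Latest int64 // microseconds since Jan 1 1970
  }

  // Now widens the local clock by an assumed uncertainty eps.
  func Now() Interval {
      const eps = 7000 // microseconds; invented stand-in, not the real bound
      t := time.Now().UnixMicro()
      return Interval{Earliest: t - eps, Latest: t + eps}
  }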
Q: How does TrueTime choose the interval?
- Uncertainty of time sources.
- Expected clock drift.
Q: Why are GPS time receivers able to avoid this problem?
- Do they actually avoid it?
- What about the "atomic clocks"?
Spanner assigns each write a scalar time
- Might not be the actual absolute time
- But is chosen to ensure consistency
The danger:
- W1 at G1, G1's interval is [20,30]
- Is any time in that interval OK?
- Then W2 at G2, G2's interval is [11,21]
- Is any time in that interval OK?
- If they are not careful, might get s1=25 s2=15
- Example is worst case drift between the two clocks.
So what we want is:
- If W2 starts after W1 finishes, then s2 > s1
- Simplified "external consistency invariant" from 4.1.2
- Causes snapshot reads to see data consistent w/ true order of W1, W2
How does spanner assign times to writes? (much simplified, see 4.1.2)
- A write request arrives at paxos leader
- s will be the write's timestamp
- Leader sets s to TrueTime now().latest
- Then leader delays until s < now().earliest
- i.e. until s is guaranteed to be in the past (compared to absolute time)
- this is "commit wait" in 4.1.2
- Then leader runs paxos to cause the write to happen
- Then leader replies to client
- Ramification: need strictly increasing timestamps on all writes in a paxos
log; consequence: these commit delays are on the write fast path!
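Continuing in the same file as the TrueTime sketch above, the start and
commit-wait rules fit in a few lines; the polling loop is illustrative only,
not how Spanner actually waits:

  // AssignAndCommitWait implements the simplified 4.1.2 rules: pick
  // s = now().latest, then block until s is certainly in the past.
  func AssignAndCommitWait() int64 {
      s := Now().Latest // "start" rule
      for Now().Earliest <= s { // "commit wait"
          time.Sleep(100 * time.Microsecond)
      }
      // only now run paxos for the write and reply to the client
      return s
  }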
Does this work for our example?
- W1 at G1, TrueTime says [20,30]
- s1 = 30
- Commit wait until TrueTime says [31,41]
- Reply to client
- W2 at G2, TrueTime must now say >= [21,31]
- (otherwise TrueTime is broken)
- s2 = 31
- Commit wait until TrueTime says [32,42]
- Reply to client
- It does work for this example:
- The client observed that W1 finished before W2 started,
- And indeed s2 > s1
- Even though G2's TrueTime clock was slow by the most it could be
- So if my mom sees W2, she is guaranteed to also see W1
Why the "Start" rule?
- i.e. why choose the time at the end of the TrueTime interval?
- Previous writers waited only until their timestamps were barely < t_abs
- New writer must choose s greater than any completed write
- t_abs might be as high as now().latest for prior writes
- So s = now().latest
Why the "Commit Wait" rule?
- Ensures that s < t_abs
- i.e. ensures that s is really in the past before committing.
- Otherwise write might complete with an s in the future
- and would let Start rule give too low an s to a subsequent write
Q: Why commit wait; why not immediately write value with chosen time?
- Indirectly forces subsequent write to have high enough s
- the system has no other way to communicate minimum acceptable next s
for writes in different replica groups
- Waiting forces writes that some external agent is serializing
to have monotonically increasing timestamps
- w/o wait, our example goes back to s1=30 s2=21
- You could imagine explicit schemes to communicate last write's TS to the
next write
Q: How long is the commit wait?
- A large TrueTime uncertainty requires a long commit wait
  (roughly twice the uncertainty: from s = now().latest until
  now().earliest > s)
- So Spanner authors are interested in accurate low-uncertainty time
Let's step back
- Why did we get into all this timestamp stuff?
- Our replicas were 100s or 1000s of miles apart (for locality/fault tol)
- We wanted fast reads from a local replica (no full paxos)
- Our data was partitioned over many replica groups w/ separate clocks
- We wanted consistency for reads:
- If W1 then W2, a read must never see W2 but not W1
- It's complex but it makes sense as a high-performance evolution of Lab 3/4
Why is this timestamp technique interesting?
- We want to enforce order - things that happened in some order in real time
are ordered the same way by the distributed system
- The naive approach requires a central agent, or lots of communication
- Spanner does the synchronization implicitly via time
- time can be a form of communication
- e.g. we agree in advance to meet for dinner at 6:00pm
There's a lot of additional complexity in the paper
- Transactions, two phase commit, two phase locking, schema change, query
language, etc.