Lecture 15 Spanner
Spanner: Google's Globally-Distributed Database
Corbett et al., OSDI 2012
First: Imagine Lab 3, now imagine we want to read from any replica, and we
want to do it without invoking Paxos.
- What do we give up?
- External consistency.
- How could we get that back? Need a way to sync reads to writes without
writing any state on reads.
- Idea: tag writes with timestamp and keep version history for each key.
- When a read comes in, tag it with a timestamp, read the last write before
that timestamp, and return it.
- Problems: what if not all writes are applied yet at the replica?
- What if the key space is split across groups (Lab 4)? Then how do we order
writes if they aren't logged in a single Paxos log?
- Welcome to a really challenging paper.
Why this paper?
- Modern, high performance(?), driven by real-world needs
- Sophisticated use of paxos
- Tackles consistency + performance (will be a big theme)
- Lab 4 is a (hugely) simplified version of Spanner
What are the big ideas?
- Shard management w/ paxos replication
- Distributed transactions
- High performance despite synchronous WAN replication
- Consistency despite sharding (this is the real focus)
- Fast reads by asking only the nearest replica
- Clever use of time for consistency
This is a dense paper!
- I've tried to boil down some of the ideas to simpler form.
- We'll mostly ignore massive parts of it.
Idea: sharding
- We've seen this before in Sinfonia, Thor, Facebook
- A serious problem is managing configuration changes
- Spanner has a more convincing design for this than FDS
Simplified sharding outline (lab 4):
- Replica groups, paxos-replicated
- Paxos log in each replica group
- Master, paxos-replicated
- Assigns shards to groups
- Numbered configurations
- If master moves a shard, groups eventually see new config
- "start handoff Num=7" op in both groups' paxos logs
- Though perhaps not at the same time
- dst can't finish handoff until it has copies of shard data at majority
- and can't wait long for possibly-dead minority
- minority must catch up, so perhaps put shard data in paxos log (!)
- "end handoff Num=7" op in both groups' logs
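A lab-4-style implementation might log handoff ops along these lines; this is
a sketch in Go (the labs' language), and the op names and fields
(StartHandoff, EndHandoff, Data) are invented for illustration, not from the
paper:

  // Hypothetical ops carried in both groups' paxos logs during a handoff.
  type StartHandoff struct {
      Num   int // new configuration number, e.g. 7
      Shard int // which shard is moving
  }

  type EndHandoff struct {
      Num   int
      Shard int
      // The shard's data rides in the log itself (the "(!)" above), so a
      // lagging minority can catch up by replaying its paxos log alone,
      // at the cost of large log entries.
      Data map[string]string
  }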
Q: What if a Put is concurrent w/ handoff?
- Client sees new config, sends Put to new group before handoff starts?
- Client has stale view and sends it to old group after handoff?
- Arrives at either during handoff?
- Could even use view change for this.
Q: What if a failure during handoff?
- e.g. old group thinks shard is handed off but new group fails before it
thinks so.
- HA through Paxos prevents this.
Q: Can two groups think they are serving a shard?
- Yes, but in different views.
Q: Could old group still serve shard if can't hear master?
Idea: wide-area synchronous replication
- Goal: survive single-site disasters
- Goal: replica near customers
- Goal: don't lose any updates
Considered impractical until a few years ago
- Paxos too expensive, so maybe primary/backup?
- Why is Paxos needed to solve this? Could we use fast view changes/leases?
- Then you'd still need an algorithm for catch-up.
- Paxos better than view changes over WAN since less sensitive to
hair-trigger reconfigurations.
- If primary waits for ACK from backup
- 50ms network will limit throughput and cause palpable delay
- esp if app has to do multiple reads at 50ms each
- If primary does not wait, it will reply to client before durable
- Danger of split brain; can't make network reliable
What's changed?
- Other site may be only 5 ms away -- San Francisco / Los Angeles
- Faster/cheaper WAN
- Apps written to tolerate delays
- May make many slow read requests
- But issue them in parallel
- Maybe time out quickly and try elsewhere, or redundant gets
- Huge # of concurrent clients lets you get hi thruput despite high delay
- Run their requests in parallel (see the sketch after this list)
- People appreciate paxos more and have streamlined variants
- Fewer msgs
- Page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare
- Paper's scheme a little more involved b/c they must ensure there's at
most one leader
- Read at any replica
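A sketch of that parallelism point (plain Go, nothing Spanner-specific;
parallelGets and get are invented names): issue one goroutine per key so total
latency is roughly one WAN round trip, not len(keys) round trips.

  // parallelGets fetches all keys concurrently; get is whatever RPC stub
  // reads one key from a (possibly distant) replica.
  func parallelGets(keys []string, get func(string) string) map[string]string {
      type kv struct{ k, v string }
      ch := make(chan kv, len(keys))
      for _, k := range keys {
          go func(k string) { ch <- kv{k, get(k)} }(k) // one in-flight RPC per key
      }
      out := make(map[string]string, len(keys))
      for range keys {
          r := <-ch
          out[r.k] = r.v
      }
      return out
  }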
Actual performance?
- Table 3
- Pretend just measuring paxos for writes, read at any replica for reads
- Latency
- Why doesn't write latency go up w/ more replicas?
- Can hide overlapped requests in commit delay.
- Why does std dev of latency go down w/ more replicas?
- r/o a lot faster since not a paxos agreement + use closest replica
- Need tricks to make this linearizable, though.
- Throughput
- Why does read throughput go up w/ # replicas?
- Why doesn't write throughput go up?
- Does write thruput seem to be going down?
- What can we conclude from Table 3?
- Is the system fast? slow?
- How fast do your paxoses run?
- 10 ms per agreement? with local communication and no disk
- Spanner paxos might wait for disk write (might not)
- Why is waiting for the disk unlikely to be important?
- Figure 5
- npaxos=5, all leaders in same zone
- Why does killing a non-leader in each group have no effect?
- For killing all the leaders ("leader-hard")
- Why flat for a few seconds?
- What causes it to start going up?
- Why does it take 5 to 10 seconds to recover?
- Why is slope higher until it rejoins?
Spanner reads from any paxos replica
- read does not involve a paxos agreement
- just reads the data directly from replica's k/v DB
- maybe 100x faster -- same room rather than cross-country
Q: could we write to just one replica?
Q: is reading from any replica correct?
Q: In Lab 3, does your implementation provide linearizability (external
consistency)?
- Yes, if you replicate Get ops in the log.
- What if you don't? What goes wrong?
- [Show read in the past.]
- Effectively sequential consistency/serializability, but no real-time
constraint on reads in this case.
- i.e. reads may not return writes that completed before the read was issued.
- This sets up the motivation for TrueTime...
Example of problem:
- Photo sharing site
- I have photos
- I have an ACL (access control list) saying who can see my photos
- I take my mom out of my ACL, then upload new photo
- Really it's web front ends doing these client reads/writes
- W1: I write ACL on group G1 (bare majority), then
- W2: I add image on G2 (bare majority), then
- Mom reads image - may get old data from lagging G2 replica
- Mom reads ACL - may get new data from G1
This system is not acting like a single server!
- There was not really any point at which the image was present but the ACL
hadn't been updated
This problem is caused by a combination of
- Partitioning - replica groups operate independently
- Cutting corners for performance - read from any replica
How can we fix this?
- Make reads see latest data
- e.g. full paxos for reads -- expensive!
- Make reads see consistent data
- Data as it existed at some previous point in time
- i.e. before #1, between #1 and #2, or after #2
- But not with order inverted.
- This turns out to be much cheaper
- Spanner does this
Here's a super-simplification of spanner's consistency story for r/o clients
- "snapshot" or "lock-free" reads
- Assume for now that all the clocks agree
- Server (paxos leader) tags each write with the time at which it occurred
- KVS stores multiple values for each key, each with a different time
- Reading client picks a time t
- For each read, ask relevant replica to do the read at time t
- How does a replica read a key at time t?
- Return the stored value with highest time <= t
- But wait, the replica may be behind
- That is, there may be a write at time < t, but replica hasn't seen it
- Paxos replication can lag arbitrarily.
- So replica must somehow be sure it has seen all writes <= t
- Idea: has it seen any operation from time > t?
- If yes, and paxos group always agrees on ops in time order,
it's enough to check/wait for an op with time > t
- That is what spanner does on reads (4.1.3)
- What time should a reading client pick?
- Using current time may force lagging replicas to wait
- So perhaps a little in the past
- Client may miss latest updates
- But at least it will see consistent snapshot
- In our example, won't see new image w/o also seeing ACL update
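A minimal Go sketch of this multi-version store with snapshot reads; the type
and method names are mine, not Spanner's, and applying the paxos log is
abstracted into appliedT plus a condition variable:

  package spanner

  import "sync"

  type version struct {
      t     int64  // write timestamp assigned by the paxos leader
      value string
  }

  type Replica struct {
      mu      sync.Mutex
      applied *sync.Cond // signaled each time an op is applied from the log
      // appliedT is the time of the latest applied paxos op; since the
      // group agrees on ops in time order, appliedT > t means every write
      // with time <= t has been applied at this replica.
      appliedT int64
      store    map[string][]version // versions in ascending time order
  }

  func NewReplica() *Replica {
      r := &Replica{store: make(map[string][]version)}
      r.applied = sync.NewCond(&r.mu)
      return r
  }

  // ReadAtTime returns key's value as of time t (a "snapshot read").
  // It waits until this replica has seen an op with time > t, so it
  // can't miss a write with time <= t that a lagging log hasn't
  // delivered yet.
  func (r *Replica) ReadAtTime(key string, t int64) (string, bool) {
      r.mu.Lock()
      defer r.mu.Unlock()
      for r.appliedT <= t {
          r.applied.Wait()
      }
      val, ok := "", false
      for _, v := range r.store[key] {
          if v.t > t {
              break
          }
          val, ok = v.value, true // keep highest version with v.t <= t
      }
      return val, ok
  }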
How does that fix our ACL/image example?
- W1: I write ACL, G1 assigns it time=10, then
- W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
- Mom picks a time, for example t=14
- Mom reads ACL at t=14 from lagging G1 replica
  if it hasn't seen paxos agreements up through t=14, it knows to wait
  so it will return W1
- Mom reads image from G2 at t=14
  image may have been written on that replica
  but it will know not to return it, since the image's time is 15
- Other choices of t work too
Q: Is it reasonable to assume that different computers' clocks agree?
- Why might they not agree?
Q: What may go wrong if servers' clocks don't agree?
- A performance problem: reading client may pick time in the
future, forcing reading replicas to wait to "catch up"
- A correctness problem:
- Again, the ACL/image example
- G1 and G2 disagree about what time it is (timestamps flip!)
- W1: I write ACL on G1 - stamped with time=15
- W2: I add image on G2 - stamped with time=10
- Now a client read at t=14 will see image but not ACL update
Q: Why doesn't spanner just ensure that the clocks are all correct?
- After all, it has all those master GPS / atomic clocks?
- Drift, network delay, jitter...
TrueTime (section 3)
- There is an actual "absolute" time t_abs
- but server clocks are typically off by some unknown amount
- TrueTime can bound the error
- So now() yields an interval: [earliest,latest]
- Earliest and latest are ordinary scalar times
- Perhaps microseconds since Jan 1 1970
- t_abs is highly likely to be between earliest and latest
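In code TrueTime's interface is tiny; here is a Go sketch where the interval
is the paper's idea, but the fixed uncertainty eps is a made-up placeholder
for the real bound computed from time-source error plus drift:

  import "time"

  // Interval brackets true absolute time: with very high probability,
  // Earliest <= t_abs <= Latest.
  type Interval struct {
      Earliest, Latest int64 // microseconds since Jan 1 1970
  }

  // Now widens the local clock by an assumed uncertainty eps.
  func Now() Interval {
      const eps = 7000 // microseconds; invented stand-in, not the real bound
      t := time.Now().UnixMicro()
      return Interval{Earliest: t - eps, Latest: t + eps}
  }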
Q: How does TrueTime choose the interval?
- Uncertainty of time sources.
- Expected clock drift.
Q: Why are GPS time receivers able to avoid this problem?
- Do they actually avoid it?
- What about the "atomic clocks"?
Spanner assigns each write a scalar time
- Might not be the actual absolute time
- But is chosen to ensure consistency
The danger:
- W1 at G1, G1's interval is [20,30]
- Is any time in that interval OK?
- Then W2 at G2, G2's interval is [11,21]
- Is any time in that interval OK?
- If they are not careful, might get s1=25 s2=15
- Example is worst case drift between the two clocks.
So what we want is:
- If W2 starts after W1 finishes, then s2 > s1
- Simplified "external consistency invariant" from 4.1.2
- Causes snapshot reads to see data consistent w/ true order of W1, W2
How does spanner assign times to writes? (much simplified, see 4.1.2)
- A write request arrives at paxos leader
- s will be the write's timestamp
- Leader sets s to TrueTime now().latest
- Then leader delays until s < now().earliest
- i.e. until s is guaranteed to be in the past (compared to absolute time)
- this is "commit wait" in 4.1.2
- Then leader runs paxos to cause the write to happen
- Then leader replies to client
- Ramification: need strictly increasing timestamps on all writes in a paxos
log; consequence: these commit delays are on the write fast path!
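Continuing in the same file as the TrueTime sketch above, the start and
commit-wait rules fit in a few lines; the polling loop is illustrative only,
not how Spanner actually waits:

  // AssignAndCommitWait implements the simplified 4.1.2 rules: pick
  // s = now().latest, then block until s is certainly in the past.
  func AssignAndCommitWait() int64 {
      s := Now().Latest // "start" rule
      for Now().Earliest <= s { // "commit wait"
          time.Sleep(100 * time.Microsecond)
      }
      // only now run paxos for the write and reply to the client
      return s
  }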
Does this work for our example?
- W1 at G1, TrueTime says [20,30]
- s1 = 30
- Commit wait until TrueTime says [31,41]
- Reply to client
- W2 at G2, TrueTime must now say >= [21,31]
- (otherwise TrueTime is broken)
- s2 = 31
- Commit wait until TrueTime says [32,42]
- Reply to client
- It does work for this example:
- The client observed that W1 finished before W2 started,
- And indeed s2 > s1
- Even though G2's TrueTime clock was slow by the most it could be
- So if my mom sees W2, she is guaranteed to also see W1
Why the "Start" rule?
- i.e. why choose the time at the end of the TrueTime interval?
- Previous writers waited only until their timestamps were barely < t_abs
- New writer must choose s greater than any completed write
- t_abs might be as high as now().latest for prior writes
- So s = now().latest
Why the "Commit Wait" rule?
- Ensures that s < t_abs
- i.e. ensures that s is really in the past before committing.
- Otherwise write might complete with an s in the future
- and would let Start rule give too low an s to a subsequent write
Q: Why commit wait; why not immediately write value with chosen time?
- Indirectly forces subsequent write to have high enough s
- the system has no other way to communicate minimum acceptable next s
for writes in different replica groups
- Waiting forces writes that some external agent is serializing
to have monotonically increasing timestamps
- w/o wait, our example goes back to s1=30 s2=21
- You could imagine explicit schemes to communicate last write's TS to the
next write
Q: How long is the commit wait?
- A large TrueTime uncertainty requires a long commit wait
  (roughly twice the uncertainty: from s = now().latest until
  now().earliest > s)
- So Spanner authors are interested in accurate low-uncertainty time
Let's step back
- Why did we get into all this timestamp stuff?
- Our replicas were 100s or 1000s of miles apart (for locality/fault tol)
- We wanted fast reads from a local replica (no full paxos)
- Our data was partitioned over many replica groups w/ separate clocks
- We wanted consistency for reads:
- If W1 then W2, a read must never see W2 but not W1
- It's complex but it makes sense as a high-performance evolution of Lab 3/4
Why is this timestamp technique interesting?
- We want to enforce order - things that happened in some order in real time
are ordered the same way by the distributed system
- The naive approach requires a central agent, or lots of communication
- Spanner does the synchronization implicitly via time
- time can be a form of communication
- e.g. we agree in advance to meet for dinner at 6:00pm
There's a lot of additional complexity in the paper
- Transactions, two phase commit, two phase locking, schema change, query
language, etc.