CS6963 Distributed Systems

Lecture 20: Byzantine Fault Tolerance and PBFT

Practical Byzantine Fault Tolerance
Castro and Liskov
SOSP '99

  • Why this paper?

    • Byzantine Fault Tolerance!
    • Area of a fair amount of research, but less deployment.
    • Important to know that this problem can even be solved: seems a bit hard to believe that it works!
  • What is Byzantine Fault Tolerance?

  • What is a Byzantine behavior?

    • Anything that doesn't follow our protocol.
    • Malicious code/nodes.
    • Buggy code.
    • Faulty networks that deliver corrupted packets.
    • Disks that corrupt data, duplicate data, lose data, fabricate data.
    • Nodes that impersonate others or join the cluster without permission.
    • Nodes that operate when they shouldn't (e.g., clock drift outside of tolerance).
    • Nodes that try to serve operations on a partition even after the partition was handed off to someone else.
    • Really wicked bad stuff: any arbitrary behavior.
    • Subject to one restriction - will come to this in a minute.
  • Primary/backup

    • Replicate state; with f + 1 replicas we stay alive despite up to f failures (only one replica needs to survive).
  • Paxos

    • Idea: replicate a state machine, stay 'available' so long as there are <= f failures in a cluster of 2f + 1.
    • Lose up to f, others can still make decisions.
    • When others come back, they catch up.
  • What does 3f + 1 get us then?

    • SMR that tolerates up to f malicious or arbitrarily faulty nodes.
  • Key restriction: assume independent node failures for BFT!

  • Q: Is this a big assumption?

  • Costlier than you might expect under malicious attack; requires:

    • Different implementations.
    • Different operating systems.
    • Different root passwords.
    • Different administrators!!!
  • Why? Otherwise, given one compromise, an attacker may be able to amplify it: if f + 1 nodes share the same vulnerability, then we are already toast.

  • Another key consideration: in Paxos, a pair of partitioned nodes may not be able to communicate directly.

  • We'll have the same situation here: [drawn 4 nodes in a 1, 1, 2 split].

  • Should keep operating.

  • But what if A communicates through B to C and D and B lies about A's messages!

  • Can B use this to amplify its power?

  • i.e., we could have tolerated a faulty B, but now A is faulty by proxy.

  • Solution: authenticate messages with crypto.

  • Two goals: prevent spoofing and replays.

  • Spoofing can kill us in the above situation.

  • How might a replay attack hurt us?

  • Example: an attacker could resend an old, correctly signed request (e.g., 'transfer $100') and have it execute a second time. Signatures alone don't stop this, since a replayed message still verifies; we also need freshness (timestamps or sequence numbers), sketched below.
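  • A minimal sketch of one standard defense, assuming (as PBFT does for client requests) a monotonically increasing per-client timestamp; the Replica/accept names here are invented for illustration:

```go
package main

import "fmt"

// Hypothetical sketch: a replica remembers the highest request
// timestamp accepted from each client and drops anything at or
// below it, so a replayed (re-sent) old request is ignored.
type Replica struct {
	lastT map[string]int64 // client -> highest timestamp seen
}

func (r *Replica) accept(client string, t int64) bool {
	if t <= r.lastT[client] {
		return false // replay or retransmission: drop
	}
	r.lastT[client] = t
	return true
}

func main() {
	r := &Replica{lastT: map[string]int64{}}
	fmt.Println(r.accept("alice", 100)) // true: fresh request
	fmt.Println(r.accept("alice", 100)) // false: exact replay
	fmt.Println(r.accept("alice", 99))  // false: older replay
	fmt.Println(r.accept("alice", 101)) // true: newer request
}
```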

  • Use public-key signatures, message authentication codes, and message digests.

  • Public-key crypto/signatures.

    • Each node has a public key and a private key.
    • Public key is known by all other nodes.
    • Private key is kept secret (at least if the node is non-Byzantine).
    • Exposing a private key counts as one of the f failures.
    • Node i can use its private key to sign a message m, producing ⟨m⟩σi (the paper's notation for m signed by i).
    • Any node with i's public key can verify that i generated the message.
    • Or at least that someone with access to i's private key did.
    • This fixes spoofing (see the sketch below).
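    • A small stand-in sketch of the sign/verify flow, using Go's crypto/ed25519 for concreteness (the paper's actual signature scheme differs); the message contents are invented:

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Each node generates a keypair; the public key is distributed
	// to all other nodes, the private key stays secret.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	msg := []byte("view=7 seq=42 op=WRITE")

	// Node i signs with its private key...
	sig := ed25519.Sign(priv, msg)

	// ...and any node holding i's public key can verify the
	// signature. A forged or tampered message fails verification,
	// which defeats spoofing (e.g., B forging A's messages).
	fmt.Println("valid:", ed25519.Verify(pub, msg, sig))

	msg[0] ^= 0xff // tamper with the message
	fmt.Println("tampered:", ed25519.Verify(pub, msg, sig))
}
```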
  • XXX Can skip this if it isn't used elsewhere.

  • Message digest/hash: one-way function.

    • Easy to compute given plaintext.
    • Extremely hard to generate a plaintext for a given hash/digest.
    • Here: use it as a short summary of the message (a quick sketch follows).
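    • A quick sketch in Go, with SHA-256 standing in for the paper's digest function and an invented message:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// A digest is a fixed-size, one-way summary of a message: signing
	// or comparing the short digest stands in for the full body.
	msg := []byte("op=WRITE path=/a/b data=...")
	d := sha256.Sum256(msg)
	fmt.Printf("digest:  %x\n", d)

	// Changing even one byte yields a completely different digest,
	// and finding another message with a given digest is infeasible.
	msg[0] = 'O'
	fmt.Printf("digest': %x\n", sha256.Sum256(msg))
}
```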
  • Implements SMR just like Paxos/Raft.

    • [XXX: Diagram here...]
    • Serial replicated log that represents a form of virtual time.
    • Once entries are decided they are fed to a deterministic state machine.
    • Here the state machine will be an NFS server (a toy apply-loop sketch follows this list).
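    • A toy sketch of the SMR apply loop; all names (Op, StateMachine, Apply) are invented, and real entries would be NFS operations decided by the protocol:

```go
package main

import "fmt"

// Op is one decided log entry; the log index acts as virtual time.
type Op struct {
	Seq int    // sequence number assigned by the protocol
	Cmd string // deterministic command, e.g. an NFS operation
}

// StateMachine is any deterministic application; in the paper it is
// an NFS server, here just a toy counter.
type StateMachine struct {
	applied int // highest sequence number executed so far
	state   int
}

// Apply feeds decided entries to the state machine strictly in
// sequence order; gaps must wait until the protocol fills them.
func (sm *StateMachine) Apply(op Op) {
	if op.Seq != sm.applied+1 {
		return // out of order: buffer or wait in a real system
	}
	sm.applied = op.Seq
	sm.state++ // deterministic transition
	fmt.Printf("executed seq=%d cmd=%q state=%d\n", op.Seq, op.Cmd, sm.state)
}

func main() {
	sm := &StateMachine{}
	for _, op := range []Op{{1, "CREATE"}, {2, "WRITE"}, {3, "READ"}} {
		sm.Apply(op)
	}
}
```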
  • XXX: Spend plenty of time here and really diagram this!!!

  • [Section 3, Page 3, two paragraphs before Section 4].

  • Why 3f + 1? Why this number? What does this buy us?

  • First, must be able to proceed after contacting 2f + 1 replicas.

    • Why? Have to stay available if f have failed.
    • Imagine f = 1: then 3f + 1 = 4 and 2f + 1 = 3.
  • Of the 2f + 1 responses, must they all be from non-faulty nodes? NO!

  • What if a non-faulty node is slower than a faulty one?

  • Of the 2f + 1 we can count on being up, up to f of them could still be faulty!

  • How can we make the right decision if some of the responses are 'true' and others are 'lies'?

  • Idea: need to make sure the number of 'true' responses outnumber the 'lies'.

  • If we have 2f + 1 responses and f are lies, then we know we have f + 1 that are ok.

  • Result: with 3f + 1 servers, we get responses from 2f + 1; up to f of those could be faulty, but we are still guaranteed f + 1 matching responses, which indicate the correct outcome.

  • Another way to think about it:

    • In P/B, everyone was forced to agree, so if even 1 of the f + 1 is up, then we're ok.
    • In Paxos, up to f could have a different/no value, so we need f + 1 to be sure.
    • In PBFT, up to f could lie about the value, and the liars might be among the first responses we get back. We need f + 1 true values to outvote them. So: f possible lies, f + 1 good responses to outvote them, and another f to account for good nodes that are merely slower than the potential bad ones: 3f + 1 total. (The arithmetic is sketched in code after this list.)
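    • The arithmetic above as a tiny runnable sketch; the quorums helper is purely illustrative, not anything from the paper:

```go
package main

import "fmt"

// Quorum arithmetic for BFT with n = 3f + 1 replicas: a quorum of
// 2f + 1 responses is always reachable even with f nodes down, and
// any such quorum contains at least f + 1 non-faulty (matching)
// responses, which outvote up to f lies.
func quorums(f int) (n, quorum, matching int) {
	n = 3*f + 1      // total replicas
	quorum = 2*f + 1 // responses we wait for
	matching = f + 1 // identical responses needed to trust a result
	return
}

func main() {
	for f := 1; f <= 3; f++ {
		n, q, m := quorums(f)
		fmt.Printf("f=%d: n=%d, wait for %d responses, need %d matching\n",
			f, n, q, m)
	}
}
```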
  • The Algorithm

    • Sequence of Views (similar to Lab 2/Lab 4) numbered consecutively.
    • One replica is Primary, others are Backups.
    • Primary of view v is replica v mod |R|.
  1. Client sends request to primary.
  2. Primary sends request to all backups.
  3. Replicas execute the request and send the reply to the client.
  4. Client waits for f + 1 responses with the same result (client-side counting is sketched after the Q&A below).
  • What if Client gets f faulty responses that all agree?
    • Then it will have to wait for f + 1 more (for a total of 2f + 1 responses).
  • What if Client gets f + 1 faulty responses that all agree?
    • Cannot happen: that would violate the assumption of at most f faulty processes.
  • What if Client gets f 'faulty' responses and 1 correct one and they all agree?
    • Then the f responses weren't really faulty.
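  • A sketch of the client-side counting from step 4 and the Q&A above; Reply and await are invented names, and real replies would be authenticated so each replica gets at most one vote per request:

```go
package main

import "fmt"

// Reply is a (replica, result) pair as seen by the client.
type Reply struct {
	Replica int
	Result  string
}

// await collects replies and returns the first result backed by
// f + 1 distinct replicas: f + 1 matching replies must include at
// least one non-faulty replica, so the result is correct.
func await(replies <-chan Reply, f int) string {
	votes := map[string]map[int]bool{} // result -> set of replicas
	for r := range replies {
		if votes[r.Result] == nil {
			votes[r.Result] = map[int]bool{}
		}
		votes[r.Result][r.Replica] = true
		if len(votes[r.Result]) >= f+1 {
			return r.Result
		}
	}
	panic("not enough matching replies")
}

func main() {
	f := 1
	ch := make(chan Reply, 4)
	// One lying replica (0) and two honest ones (1, 2): the lie
	// never reaches f + 1 = 2 votes, the truth does.
	for _, r := range []Reply{{0, "evil"}, {1, "ok"}, {2, "ok"}} {
		ch <- r
	}
	close(ch)
	fmt.Println("accepted:", await(ch, f))
}
```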