Synchronization and Concurrency Control in Distributed Systems (2/14/2000)
============================================================================

Clock synchronization:

  In general, you cannot perfectly synchronize distributed clocks
    - no instantaneous communication
    - individual clocks have "skew" (phase distortions)

  Options:
    - approximate clock synchronization
        - depends on latency of communication
    - logical clocks (Lamport ordering)

Lamport ordering:

  What you get is a partial ordering of events

  happens-before relationship (->):

    1. If A and B are events in the same process, and A occurs before B,
       then A->B is true.
    2. If A is the event of a message being sent by one process, and B
       is the event of the message being received by a second process,
       then A->B is true.

       Note: a message cannot be received before it is sent

  happens-before is transitive (A->B, B->C implies A->C)

  If neither A->B nor B->A, then A and B are called "concurrent".

  We've used this ordering idea before!
    - DSM (consistency models)
    - Fault tolerance (consistent checkpoints)
    - File systems/databases (versioning) - consistency

==============================================================================

Distributed mutual exclusion:

  In CS5460, we talked some about uniprocessor mutual exclusion:
    - critical sections
    - arbitrary thread scheduling
    - three-phase proof of mutual exclusion
    - semaphores, locks, condition variables, barriers, monitors, ...
    - readers/writers, bounded buffers, dining philosophers (lawyers), ...

  How does this extend into the distributed world?
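
  Aside (a minimal sketch, not from the lecture): the completely
  distributed algorithm below stamps its requests with logical time, so
  it may help to see the two happens-before rules as code.  The class
  name LamportClock and its method names are made up for illustration.
  Keep in mind that A->B implies clock(A) < clock(B), but not the other
  way around: concurrent events still end up with some numeric order.

```python
class LamportClock:
    """Logical clock for one process (illustrative names, not from the notes)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: successive events in one process get increasing times."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """Rule 2: a receive is ordered after the matching send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
a = p.tick()       # local event on process P
s = p.send()       # P sends a message, stamped with s
q.tick()           # unrelated local event on process Q
r = q.receive(s)   # Q receives the message: r is pushed past s
```

  Here a -> send -> receive holds by rules 1 and 2 plus transitivity,
  and the timestamps a < s < r agree with that order.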
  Possibilities:

    * centralized manager
        + trivial to implement
        + easy to queue requests (fairness)
        + fast if low contention (always two hops)
        - performance problems - bottleneck
        - no cheap re-use (caching)
        - single point of failure

    * completely distributed algorithm (from book)
        * if you want to enter CS, broadcast request w/ logical time
        * if you receive request:
            * if in CS, wait to reply until exit
            * if not in and not interested, reply "OK"
            * if trying to enter, compare time stamps (earlier request
              wins)
        + no central bottleneck
        - lousy performance - too many messages
        - group membership management
        - lots of points of failure
        - no re-use
        - fault intolerant

    * token-based schemes (stupid one from book or Munin's)
        * pass around token representing right to particular CS
        * keep track of "probable token holder"
        * request token when trying to enter CS, forward request as
          needed
        + good performance
        + no bottlenecks
        + automatic reuse
        + few messages (Tarjan-style amortization)
        - somewhat more complicated to implement
        - token recovery

============================================================================

Elections:

  Simple example: to determine that the coordinator (or token) is gone,
  and select a new one.

  Possibilities:
    * bully algorithm (highest numbered process wins)
    * ring algorithm
    * democratic election (first to notice and begin wins, use pids to
      break ties)

  Basic idea: need an "incarnation number" to detect case where failed
  process recovers (or network departitions).

=============================================================================

ANNOUNCEMENTS:

  - Project proposals due by Friday afternoon
      * 5-10 pages
          - Abstract
          - Introduction (problem statement, motivation, overview)
          - Specific proposed work (incl.
            proposed tests and evaluation)
          - Schedule (with *specific* milestones)

  - Project proposal presentations next Monday
      * 10 minutes per individual group
          - Similar format to proposal document
          - Specific example uses (combination of motivation and
            proposed evaluation)
      * Expect suggestions and comments from the audience!

============================================================================

Distributed transactions:

  Basic properties: ACID
      Atomic
      Consistent
      Isolated (serializable)
      Durable

  Nested Transactions

=============================================================================

Implementing Distributed Transactions

PROBLEM ONE: How to make changes isolated and atomic?

  Issue: any changes made by a transaction should only be visible to
  that transaction or any nested subtransaction.

  Intentions logs (aka write-ahead logs):
    - modify files (records) in place
    - record a `change record' in a log on stable storage whenever you
      modify data
    - change record includes old and new values

  Example (from book):

    Execution:               Log after each statement
    ----------               ------------------------
    x = y = 0;
    BEGIN_TRANSACTION
      x = x + 1;             x: {0,1}
      y = y + 2;             x: {0,1}, y: {0,2}
      x = y * y;             x: {0,1}, y: {0,2}, x: {1,4}
    END_TRANSACTION

  If transaction commits:
    - write `COMMIT' record to log
    - if not already done, propagate changes to `real' data

  If transaction aborts:
    - need to roll back changes
    - start from the end of the log, work your way backwards
    - apply inverse of logged changes

  Log can also be used for crash recovery
    - undo uncommitted transactions

  Alternative implementation: shadow blocks (shadow resources)

PROBLEM TWO: How do we ensure atomicity across machines?

  Issue: No obvious single operation that demarcates `yes/no' decision,
  a la the log write in a single-node database.

  Conventional solution: two-phase commit (Gray 1978)

  Select one COORDINATOR and N COHORTS (subordinates)

  PHASE ONE:

    Coordinator:                            Cohort(s):
    ------------                            ----------
    1. Write PREPARE record in log
    2. Multicast PREPARE message to
       all cohorts ------------------->
                                            3. Write READY record in log
    5. Collect replies <---------------     4. Reply OK to coordinator

    ------------------------------------------------

  PHASE TWO:

    6. Write COMMITTED record in log (****)
    7. Multicast COMMIT message to
       all cohorts ------------------->
                                            8. Write COMMIT record in log
                                            9. Commit changes
    11. Collect replies <-------------      10. Reply OK to coordinator

  The point marked (****) is the ATOMIC COMMIT POINT.  If any system
  crashes before this point, the transaction aborts.  If any system
  crashes after this point, it will complete (eventually).

  Note: two-phase commit is known to perform poorly if crashes are
  really frequent, because cohorts that have replied READY must block
  until the coordinator recovers.  Three-phase commit is used in this
  case.

=============================================================================

Optimistic versus Pessimistic Concurrency Control

  Pessimistic:
    - idea: ensure no conflicts occur
    - lock-based concurrency control
    - deadlocks are a real problem

  Optimistic:
    - idea: assume no conflicts, and act accordingly
    - concurrency control based on time stamps
    - detect conflicts at commit time -- abort conflicted transactions

=============================================================================

Some details

Pessimistic (locking) -- transactions acquire locks before using a
resource.  If a transaction completes, the new versions of the protected
data overwrite the older versions, and the locks are all released.

  Need to guarantee serializability.

  Two-phase locking:
    * Divide execution into GROWING PHASE and SHRINKING PHASE.
    * During growing phase, process acquires all of the locks it will
      require (cannot modify protected data).  If it cannot acquire a
      lock, it releases all locks, delays, and starts over.
    * During shrinking phase, process can modify protected data and
      release locks

  Variant: strict two-phase locking
    * system acquires locks as side-effect of accessing data
    * processes modify local copies of protected data
    * at end of transaction, local copies overwrite saved ones (via
      intentions log and two-phase commit) and all locks are released
    * always serializable
    * eliminates "cascading aborts"

  Issue: Granularity of locking (size, r/w, ...)

Optimistic concurrency control:

  Idea: Individual processes don't worry about potential concurrency
  (serializability problems) -- just barrel ahead and let things sort
  themselves out later (politician's ideal solution).

  Implementation:
    * Keep track of data a process reads.  If it is changed by a
      different process before this one commits, abort transaction when
      it tries to commit.  Assumes private copies of data.

    + Maximum concurrency -- assumes conflicts are rare
    + Deadlock free
    - Potential for lots of wasted work, especially as workload
      increases
    - Cascaded rollbacks (another way to state above problem)

Timestamps: variant of optimistic concurrency control

    * Every transaction gets a logical timestamp when it starts
    * Maintain read and write timestamps with each data item (file),
      denoting logical timestamp of last *committed* transaction to
      read/write it
    * If a read or write is attempted, compare transaction's timestamp
      with timestamp of file:
        * If file is older, everything is ok.
        * If file is younger, serializability error - abort transaction
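
  The timestamp rule above can be sketched roughly as follows.  This is
  a simplified illustration, not the book's code: the names Item, Abort,
  read and write are hypothetical, and each access is treated as if its
  transaction committed immediately, so the item's timestamps are
  updated at access time rather than at commit.

```python
class Abort(Exception):
    """Raised on a serializability error (hypothetical name)."""


class Item:
    """A data item (file) carrying read and write timestamps."""

    def __init__(self, value=0):
        self.value = value
        self.read_ts = 0    # timestamp of the youngest reader so far
        self.write_ts = 0   # timestamp of the youngest writer so far


def read(item, ts):
    """Read on behalf of the transaction with logical timestamp ts."""
    # The item was written by a younger transaction: the item is "younger".
    if item.write_ts > ts:
        raise Abort(f"T{ts} read item written by T{item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item, ts, value):
    """Write on behalf of the transaction with logical timestamp ts."""
    # A younger transaction has already read or written the item.
    if item.read_ts > ts or item.write_ts > ts:
        raise Abort(f"T{ts} wrote item already used by a younger transaction")
    item.value = value
    item.write_ts = ts


x = Item()
write(x, ts=2, value=10)    # T2 writes x: item is older, everything is ok
try:
    read(x, ts=1)           # T1 is older than the last writer of x
    outcome = "ok"
except Abort:
    outcome = "aborted"     # item is younger: serializability error
later = read(x, ts=3)       # T3 is younger than the writer, so this is fine
```

  A real implementation would also have to undo whatever the aborted
  transaction already did, e.g. via the intentions log described earlier.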