Overview of Distributed Shared Memory (4/29/2000)
============================================================================

We've talked about a lot of related subjects:

  - concurrency control
  - consistency
  - clustered services
  - synchronization

DSM brings many of these topics together under one roof.

===========================================================================

DISTRIBUTED SHARED MEMORY

Definition: DISTRIBUTED SHARED MEMORY allows shared memory parallel
  programs to be run on distributed memory multiprocessors, e.g.,
  networks of workstations or PCs.

There are a number of important questions to consider:

  - What memory CONSISTENCY MODEL does the DSM system support?
  - What COHERENCY PROTOCOLS does the DSM system support?
  - What is the granularity at which consistency is maintained?
  - How is shared data located?
  - How are modifications to shared data detected?
  - What kind of synchronization primitives are supported?

We'll explore these questions through a series of case studies.

Definition: A MEMORY CONSISTENCY MODEL defines a set of rules between
  the software that runs on top of a shared memory system and the
  hardware or software that implements the shared memory.

  From the PROGRAMMER'S POINT OF VIEW, it affects how easy it is to
  write correct and efficient shared memory programs.

  From the DSM IMPLEMENTER'S POINT OF VIEW, it determines the type of
  performance optimizations that can be exploited.

===========================================================================

IVY: First DSM system, developed by Li and Hudak in the mid-to-late 80's.

  Basic idea: implement conventional multiprocessor cache coherency
              protocol in software (page faults and messages)

  Consistency model: sequential consistency
  Granularity      : virtual memory pages
  Data location    : distributed directory
  Modifications    : detected via page faults
  Synchronization  : not integrated (locks)

  SEQUENTIAL CONSISTENCY: All nodes (processes) see the results of
    every memory operation (load or store) performed by any processor
    in the same order.

    --> highly related to notion of SERIALIZABLE transactions
    --> basically, provides uniprocessor behavior on multiprocessor

    Most common implementation: MESI protocol
      (a la bus-based multiprocessors)
         M: modified (exclusive+dirty)
         E: exclusive
         S: shared
         I: invalid

  Ivy directory management:
    - distributed directory (no central site knows everything)
    - "probable owner" (collapse prob owner chains: Tarjan amortization)
    - "copyset": set of nodes with a copy of a piece of data
    - last writer is considered the "owner", and has the current copyset

  Overview of Ivy operation:
    - single writer or multiple readers
    - use page-level memory protection to enforce consistency

    Scenario One: read access to page not on local node
      - page fault (non-local pages are mapped INVALID)
      - directory lookup --> follow probable owner chain
      - if owner has EXCLUSIVE copy, it downgrades state to SHARED
        (remaps to read-only mode)
      - owner sends copy to faulting process
      - local node maps page in read-only mode after loading data

    Scenario Two: write to shared page
      - page fault (protection violation)
      - directory lookup --> follow probable owner chain
      - current owner sends ownership to writing node
      - new owner INVALIDATES all remote copies
      - remaps page to read-write mode

  Issues:
    - works well for coarse-grained programs w/ little sharing
    - cannot handle FALSE SHARING ("sharing" of different objects
      within the same consistency unit (page))
    - performance lousy for fine-grained programs, or those with
      significant amounts of true or false sharing
    - sequential consistency requires large amounts of communication
      to be implemented (Lipton and Sandberg)
    - ping-pong effect

===========================================================================

MUNIN: First release consistent DSM system, developed by Carter in the
       early 90's.

  Consistency model: release consistency ("eager")
                     - note: multiple COHERENCY PROTOCOLS
  Granularity      : cross between variables and virtual memory pages
  Data location    : distributed directory
  Modifications    : detected via page faults
  Synchronization  : integrated (locks, barriers, and condition variables)

  OBSERVATION (Dubois et al): Programmers use synchronization operations
    (e.g., locks) to control the order of events, even on uniprocessors,
    because they cannot control when processes are scheduled. They
    proposed WEAK CONSISTENCY.

        Process A            Process B
        ---------            ---------
        Mutex_Begin;         Mutex_Begin;
        x = x + 1;           x = x + 100;
        y = x + 2;           y = x + 200;
        z = x + y;           z = x + y;
        Mutex_End;           Mutex_End;

  RELEASE CONSISTENCY (Intuitive): All modifications to shared data made
    within a critical section must be made visible to other processors
    that might access the data prior to completion of the critical
    section.

  RELEASE CONSISTENCY (Formal): A shared memory system is release
    consistent iff it obeys the following rules (Gharachorloo, DASH):

    1. Before any ordinary read or write access to shared data may be
       performed, all previous acquires done by the process must have
       completed successfully.
    2. Before a release operation may be performed, all previous
       ordinary reads and writes done by the process must have been
       performed.
    3. Acquire and release operations must be performed in
       "processor consistency" order.

  Overview of Munin operation:
    - objects have sharing types associated with them
      (e.g., read_only, migratory, read-write, write-shared, ...)
    - WRITE-SHARED is the most interesting protocol:
        - multiple writer, multiple reader
        - changes are queued until end of critical section
        - writes to shared data do not block the writing process
        - delayed update queue to determine what must be sent
        - twins and diffs used to extract changed bytes
        - updates to the same node are combined into fewer messages

  Directory management:
    - mostly Ivy-like
    - distributed copyset for write-shared
        - non-owners can respond to read requests
        - need to propagate updates to your "children"

  Issues:
    - significantly reduced amount of communication compared to Ivy
    - almost as fast as hand-coded message passing
    - eliminates ping-pong effect
    - long delay at release point
    - could combine synchronization and data movement
    - some programming burden

  DASH: RC exploited by pipelining write invalidations

===========================================================================

TREADMARKS: First lazy release consistent DSM system, developed by
            Keleher in the early-to-mid 90's.

  Consistency model: release consistency ("lazy")
  Granularity      : virtual memory pages
  Data location    : distributed directory
  Modifications    : detected via page faults
  Synchronization  : integrated (locks, barriers, and condition variables)

  LAZY RELEASE CONSISTENCY: Variant of RELEASE CONSISTENCY that says
    that the results of the ordinary reads and writes in the critical
    section can be lazily propagated. They do not need to arrive until
    the corresponding "ACQUIRE" operation.
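
  Aside (sketch): the "twins and diffs" idea from Munin's write-shared
  protocol, which also underlies the diffs that TreadMarks stores, can be
  illustrated roughly as below. This is only a sketch in Python under
  stated assumptions, not code from either system: the function names
  (make_twin, make_diff, apply_diff), the 16-byte "page", and the
  (offset, bytes) diff encoding are all made up for the example.

```python
from typing import List, Tuple

PAGE_SIZE = 16  # assumption: a tiny "page" so the example is easy to follow

Diff = List[Tuple[int, bytes]]  # hypothetical encoding: (offset, changed bytes)


def make_twin(page: bytearray) -> bytes:
    """Snapshot the page before the first write in a critical section."""
    return bytes(page)


def make_diff(twin: bytes, page: bytearray) -> Diff:
    """Compare the page with its twin and collect the runs that changed."""
    diff: Diff = []
    i, n = 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            diff.append((start, bytes(page[start:i])))
        else:
            i += 1
    return diff


def apply_diff(page: bytearray, diff: Diff) -> None:
    """Patch only the changed byte runs into another copy of the page."""
    for offset, data in diff:
        page[offset:offset + len(data)] = data


# Two writers modify disjoint parts of the same page (false sharing).
original = bytearray(PAGE_SIZE)

page_a = bytearray(original)
twin_a = make_twin(page_a)
page_a[0:2] = b"\x01\x02"          # writer A touches bytes 0-1

page_b = bytearray(original)
twin_b = make_twin(page_b)
page_b[8:10] = b"\x09\x0a"         # writer B touches bytes 8-9

# At release time each writer sends only its diff, not the whole page.
diff_a = make_diff(twin_a, page_a)
diff_b = make_diff(twin_b, page_b)

# A third copy merges both diffs; neither update overwrites the other.
merged = bytearray(original)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
```

  Because each writer ships only the bytes it changed, two nodes writing
  disjoint parts of one page can merge their updates, which is why a
  multiple-writer protocol avoids the ping-pong effect seen in Ivy.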

  Overview of TreadMarks operation:
    - write-invalidate based
    - stored diffs
    - used vector timestamps to determine when to apply new diffs at
      release points

  Issues:
    - showed significant bandwidth reduction compared to Munin for some
      applications
    - significantly increased DSM system complexity and memory overhead
    - garbage collection
    - almost as fast as hardware DSM in one study
    - probably the state of the art

===========================================================================

MIDWAY: First entry consistent DSM system, developed by Zekauskas in the
        early-to-mid 90's.

  Consistency model: entry consistency
  Granularity      : objects
  Data location    : distributed directory
  Modifications    : managed as part of object method invocation
  Synchronization  : ???

  ENTRY CONSISTENCY: Use information about object invocations to
    determine (i) when to guarantee consistency and (ii) what data
    might change.

  Overview of Midway operation:
    - does NOT use vm system (page faults) to detect modifications
    - on invocation, get most up-to-date copy
    - upon exit, calculate changes

  Issues:
    - requires strict object-oriented programming style
    - does not allow partial sharing of objects
    - cannot program easily for some algorithms (e.g., matrix
      computations)

===========================================================================

SHASTA: Supports unmodified parallel binaries, developed by DEC in the
        late 90's.

  Consistency model: sequential consistency
  Granularity      : words
  Data location    : distributed directory
  Modifications    : detected via annotated code (binary rewriting)
  Synchronization  : unintegrated (locks, barriers, and condition
                     variables)

  Overview of Shasta operation:
    - used binary rewriting
    - every read/write instruction that could be the first access to a
      piece of shared data changed to do an explicit check
    - otherwise, standard write-invalidate coherency protocol

  Issues:
    - was able to run Oracle Parallel Server (!!!), albeit dead slow
    - performance is terrible, but maybe it can be improved

===========================================================================

ADMINISTRIVIA:

  * Get groups to sign up for the presentation slots:
      - 5-10 minutes per group member (more for solo projects)
      - Expected outline of the talk:
          1. Problem statement and motivation (2-3 slides)
          2. Proposed solution (1-2 slides)
          3. Important implementation details (0-? slides)
          4. Experimental setup (1-2 slides)
          5. Results (1-6 slides)
          6. Analysis and conclusions (1-3 slides)
          7. Future work (optional, 0-1 slides)

===========================================================================

WEB CACHING

Contrast how caching works in the following scenarios:

  * web client and web server (only)
      - client caches HTML, gif, jpg files in local filesystem
      - different clients on the same network do not benefit from
        inter-client access locality
      - server can control what content is cacheable (data might not be
        cacheable if it is dynamically generated, user-specific, or
        something that the server wants to track)

  * web client, client-side proxy server, and web server
      - clients make requests to proxy server
      - if proxy has requested data, it is returned directly
      - otherwise, proxy requests and caches data, then returns it
      - inter-client locality exploited
      - can amortize large RAM cache in proxy across many clients
      - proxy can become a bottleneck

  * web client, cooperative client caching, and web server
      - clients first check private cache
      - if content not in private cache, see if any other "local"
        client has a copy
        ==> how? ICP? Bloom filters? Broadcasts?
      - if no local client has it, request it from server and cache

  * web client, cooperative proxy caches, and web server
      - similar to the above, but the proxy is the one that checks with
        its peers (or parents)
      - Akamai is somewhat like this

Key issue in cooperative caching:

  Q: How do peers convey what they have cached?
  A: Cao et al propose a clever scheme based on "Bloom filters". Peers
     exchange bitmaps representing whether or not any cached object
     hashes to a particular bit position. No false negatives, but false
     positives are possible. By choosing the right hash functions and
     bitmap sizes, you can do a good job of avoiding false positives.

Other issues:
  - ad tracking -> servers WANT to see (some) hits!
  - dynamic content (cgi scripts, perl scripts, java, JavaScript)
  - copyright issues
  - scalability: how many levels of hierarchy, where should the servers
    be placed, how many servers per set of clients?
    ==> recent work from UWash indicates that metropolitan-scale proxies
        are a good compromise between increasing hit rates and keeping
        the implementation achievable
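
Aside (sketch): returning to the Bloom filter summaries mentioned under
"Key issue in cooperative caching", the idea can be illustrated roughly
as below. This is a minimal sketch in Python, not the scheme from Cao et
al: the class name BloomFilter, the method names, the 1024-bit bitmap,
the four hash functions, and the use of salted SHA-256 from hashlib are
all assumptions picked for the example, as are the example.com URLs.

```python
import hashlib


class BloomFilter:
    """Hypothetical cache summary: a bitmap plus several hash functions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Assumption: derive each hash function by salting SHA-256 with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        """Set every bit position that the key hashes to."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# A proxy summarizes the URLs it has cached.
cached_urls = [
    "http://example.com/index.html",
    "http://example.com/logo.gif",
]
summary = BloomFilter()
for url in cached_urls:
    summary.add(url)

# Peers exchange only the bitmap, not the list of URLs.
peer_view = BloomFilter()
peer_view.bits = bytearray(summary.bits)

# A peer consults the bitmap before asking the proxy for an object.
hits = [peer_view.might_contain(url) for url in cached_urls]
```

A peer that gets False can skip the proxy and go to the origin server;
a True answer still has to be confirmed with a real request, since it
may be a false positive. Larger bitmaps lower the false positive rate at
the cost of bigger summaries to exchange.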