Lecture Notes CS/EE 3810 Chapter 9: Multiprocessors (Lectures 24, 25, 26)

Multiprocessors can generally be classified into four forms, based on their instruction and data streams. Most multiprocessors today are of the form MIMD: each CPU executes its own instruction stream on its own data, hence, multiple-instruction-multiple-data. Multiprocessors are primarily distinguished by their memory organization. The first organization is known as a symmetric multiprocessor (SMP) or uniform memory access (UMA) architecture. All CPUs are connected to a single set of memory chips; every CPU has an identical view of memory and equal memory access times. If many CPUs share a single centralized memory, that memory can become a bottleneck. Hence, for scalable designs, a distributed memory organization is used. When a CPU has a miss in its cache, it hopefully finds the data in its local memory; if the data is not found there, memory associated with a remote node has to be accessed. This is a non-uniform memory access (NUMA) architecture, as memory latency is a function of the physical location of the memory.

A system is said to be cache coherent if it fulfills two conditions: (i) write propagation: a write by one process is eventually visible to other processes, and (ii) write serialization: every process sees two writes to the same memory location in the same order. Cache coherence protocols are either based on snooping mechanisms (where every cache monitors the requests made by other caches and updates its own state) or directory-based mechanisms (where a centralized directory keeps track of how a memory block is being shared and all requests are sent to this directory). Protocols are also classified based on whether a write causes other cached copies to be invalidated or updated. The latter is more bandwidth-intensive, while the former can impose a longer latency on a read.

Consider an example of a snooping-based cache coherence protocol. A single centralized memory is used and a bus connects all CPUs to this memory. When a request is put out on the bus, every cache monitors the request and takes the required steps. At the outset, memory location X is not in any of the caches. When P1 attempts to read X, a cache miss is encountered and a read request is put on the bus. Every cache checks to see if it has a local copy of X. Since no cache has a copy, the copy in memory is up-to-date; memory responds with X and Cache-1 stores the block in "Shared" state. When P2 tries to read X, the read request is put on the bus; Cache-1 snoops the bus, checks its tags, and realizes that it has a valid copy of that block. In some protocols, memory always responds with the block if it has a valid copy (this is what we'll assume for the rest of this discussion). In other protocols, cache-to-cache sharing is encouraged because cache latencies are lower than memory latencies; this requires us to designate one of the sharers as the "owner" so that only that node responds with a copy. In this case, memory responds and Cache-2 stores the block in "Shared" state. When P1 attempts to write X, it discovers a miss because Cache-1 has the block in shared state, which only grants read permission. Hence, a request is placed on the bus asking for exclusive access. Cache-2 realizes that it has a copy of X and marks that block as invalid. Cache-1 marks the block as exclusive and proceeds with the write. There is no data block transfer, as Cache-1 already has a valid copy of the block (this is simply a permission upgrade request).
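The states used in this walkthrough (Invalid, Shared, Exclusive) form a simple MSI-style snooping protocol. The per-block state machine in each cache can be sketched as below; this is only an illustration of the transitions described in this walkthrough (continued below), with event names and the bus-request messages invented for the example, not the exact protocol on the slides.

/* Minimal sketch of per-block state transitions in an MSI-style snooping
 * protocol. State and event names are illustrative, not from the slides. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;   /* per-cache, per-block */
typedef enum { CPU_READ, CPU_WRITE,                  /* requests from own CPU */
               BUS_READ, BUS_UPGRADE } Event;        /* requests snooped on bus */

/* Returns the next state; "putting a request on the bus" is modeled as a print. */
State next_state(State s, Event e) {
    switch (e) {
    case CPU_READ:
        if (s == INVALID) { printf("bus: read miss\n"); return SHARED; }
        return s;                          /* SHARED or EXCLUSIVE: hit, no bus traffic */
    case CPU_WRITE:
        if (s == INVALID)  { printf("bus: read-exclusive\n"); return EXCLUSIVE; }
        if (s == SHARED)   { printf("bus: upgrade (no data transfer)\n"); return EXCLUSIVE; }
        return EXCLUSIVE;                  /* already EXCLUSIVE: hit, no bus traffic */
    case BUS_READ:                         /* another cache is reading this block */
        if (s == EXCLUSIVE) { printf("supply data, write back to memory\n"); return SHARED; }
        return s;
    case BUS_UPGRADE:                      /* another cache wants to write */
        if (s == EXCLUSIVE) printf("supply data\n");
        return INVALID;                    /* invalidate the local copy, if any */
    }
    return s;
}

Note that the upgrade from Shared to Exclusive carries no data block, matching the permission-upgrade request described above.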
On subsequent writes to X by P1, no bus traffic is generated, as Cache-1 already has the block in exclusive state. When P2 attempts to read X, it broadcasts its request; Cache-1 supplies a valid copy and downgrades its state from Exclusive to Shared. At this point, a writeback to memory also happens, as memory is responsible for supplying valid data if the block is cached by others in shared state. If multiple processes issue writes at the same time, they arbitrate for the bus and one of them ends up accessing the bus first. Thus, the order in which writes appear on the bus determines the order in which all processes see those writes (they all see writes in the same order).

Obviously, bus-based systems have limited scalability -- the bus is a centralized resource and will not work well if 100 processors regularly compete for it. Hence, larger scale multiprocessors employ directory-based protocols. Consider a distributed-memory system. For every "block" (say, 64 bytes) in memory, some state is maintained in an adjoining directory. The directory itself is quite large and is usually implemented in DRAM -- hence, the memory and directory are usually looked up in parallel. The directory keeps track of how a block is being shared within the entire multiprocessor. Every cache miss is now sent to the directory through a network and the directory takes the necessary actions. A cache can no longer be expected to update its own state, as misses are not broadcast to everyone. If multiple processors attempt a write at the same time, the order is determined by the order in which those requests arrive at the directory.

Consider an example. Assume that physical memory location X is stored on the second node (B). When A tries to read X and has a cache miss, the request is sent to the second node. Memory and directory are looked up, and the block is returned to A. The directory records the fact that A has the block in shared (read-only) state. The same happens for reads by B and C. When A attempts to write X, it has a cache miss (since it does not have write permission), and it sends the request to the directory. The directory responds with the permission and sends messages to B and C, letting them know that they must invalidate their copies. Note that the directory has to maintain a bit per processor (for every block) to keep track of whether that processor has the block in shared state (some optimizations are possible here). When C attempts a write to X, the request is sent to the directory; the directory forwards the request to A, since A has the latest copy, and A is responsible for sending the latest copy of the block to C. When B attempts a read of X, the request is again forwarded, this time to C, and C responds with the data. At this point, the data can also be written back to memory. See the lecture slide that summarizes all the actions taken on each event.

Locks are a basic primitive in every parallel program. They ensure that two threads access shared variables in a coordinated manner. The example on the slides shows a potential error in a bank transaction if two threads access account information without coordinating through locks: the two reads happen first and the two writes happen later, so the second write overwrites the result produced by the first write and one deposit is lost. If the code for the two threads is surrounded by lock_acquire and lock_release instructions, the deposits will work correctly. In order to construct a lock, the hardware must provide a basic atomic read-modify-write operation.
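Returning to the directory protocol described above: for every memory block, the directory keeps a state and one bit per processor. The sketch below is a hypothetical illustration of that bookkeeping (the names DirEntry, dir_read_miss, and dir_write_miss are invented for this example); the actual messages sent to sharers and owners are only indicated in comments.

/* Illustrative directory entry for one memory block; names are invented. */
#include <stdint.h>

#define NUM_NODES 64

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => node i has a copy of the block */
    int      owner;     /* valid when state == EXCLUSIVE_ST            */
} DirEntry;

/* Read miss from 'node': record the new sharer; if some node holds a dirty
 * copy, that node must be asked to supply the data (and downgrade to shared). */
void dir_read_miss(DirEntry *d, int node) {
    if (d->state == EXCLUSIVE_ST) {
        /* forward the request to d->owner; the owner supplies the latest
         * copy and the data is also written back to memory */
        d->sharers = (1ULL << d->owner);
    }
    d->sharers |= (1ULL << node);
    d->state = SHARED_ST;
}

/* Write miss (or upgrade) from 'node': invalidate all other sharers and make
 * 'node' the exclusive owner. */
void dir_write_miss(DirEntry *d, int node) {
    for (int i = 0; i < NUM_NODES; i++)
        if (i != node && (d->sharers & (1ULL << i))) {
            /* send an invalidation message to node i */
        }
    d->sharers = (1ULL << node);
    d->owner   = node;
    d->state   = EXCLUSIVE_ST;
}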
One example of such a read-modify-write primitive is the atomic exchange operation, which swaps the contents of a register and a memory location without any other operation intervening. If the register initially contains a one, this operation is known as "test and set". The code on the slide ensures that a process can enter the critical section (CS) only if it finds a zero (meaning the lock is free) in the memory location. The process keeps spinning, attempting test-and-sets, until it finds a zero in the memory location. The lock is released by writing a zero into the memory location (a sketch of such a spin lock appears below).

Until now, we have examined coherence, which requires two conditions to be met: write propagation and write serialization (to a single memory location). The consistency model defines the ordering of writes and reads to different memory locations. The hardware is designed to exhibit a specific consistency model, and the programmer must understand it in order to write correct programs. The consistency model that is easiest for the programmer to understand is sequential consistency (SC). A multiprocessor is said to be SC if the results of the execution are as if each process completed each of its memory operations atomically and in program order, and the operations of different processes are interleaved in some arbitrary fashion. Thus, there are two main constraints that need to be fulfilled -- program order and atomicity. Consider the parallel program example on the slides. If the programmer assumed the SC model while writing the programs, this is the behavior he/she would expect: the program would implement mutual exclusion (both processes cannot enter the CS at the same time). This will indeed happen if every instruction in every program completes entirely before we move on to the next instruction in that program. However, to improve performance, we often introduce optimizations. For example, we may use an out-of-order processor. In the example, such a processor can execute the if-condition before it writes the one, since the if and the write do not have any RAW/WAR/WAW dependence (the two instructions refer to different locations). If this happens, both processes could end up in the critical section at the same time (see the sketch below). A similar result would occur even with an in-order processor if it uses a write buffer: the write is placed in the write buffer and the processor moves on to the next instruction even though the rest of the world has not seen the write. The bottom line is this: SC requires program order, write serialization, and that everyone sees an update before that value can be read. This makes programming very intuitive, but the hardware very slow. To work around this problem, relaxed consistency models have been designed -- they make programming a little harder, but greatly boost performance.

Programs can be written with either a shared-memory programming model or a message-passing programming model. In a message-passing model, each thread accesses its own (disjoint) set of physical memory locations, and to communicate data between threads, explicit messages have to be sent. In a shared-memory model, each thread can access any physical memory location, so data can be communicated between threads if one thread writes to a specific location and the other thread reads from that location. Either programming model works easily on an SMP (UMA) multiprocessor. On a distributed-memory multiprocessor, again, a message-passing model works fine.
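Returning to the lock construction at the start of this discussion: the test-and-set spin lock can be sketched with C11 atomics, where atomic_exchange plays the role of the hardware's atomic swap. This is a minimal sketch, not the code on the slides, and the deposit() example (with its balance variable) is hypothetical.

#include <stdatomic.h>

atomic_int lock_var = 0;      /* 0: free, 1: held */
int balance = 0;              /* shared account balance (hypothetical example) */

void lock_acquire(atomic_int *lock) {
    /* Atomically write a 1 and get the old value; keep spinning (test-and-set)
     * until the old value is 0, i.e., until the lock is found free. */
    while (atomic_exchange(lock, 1) == 1)
        ;
}

void lock_release(atomic_int *lock) {
    atomic_store(lock, 0);    /* release: write a zero into the lock location */
}

/* Two threads calling deposit() concurrently now update the balance correctly:
 * the read and the write of 'balance' happen inside the critical section. */
void deposit(int amount) {
    lock_acquire(&lock_var);
    balance = balance + amount;   /* critical section */
    lock_release(&lock_var);
}

With the lock in place, one thread's read and write of the balance cannot be separated by the other thread's accesses, which avoids the bank-transaction error described earlier.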
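The mutual-exclusion example discussed under sequential consistency is typically written with two flag variables; the version below is an illustrative reconstruction, not necessarily the exact code on the slides.

/* Illustrative two-process flag protocol; under SC, at most one process can
 * enter the critical section. */
int flag1 = 0, flag2 = 0;     /* shared, initially zero */

void process_P1(void) {
    flag1 = 1;                /* announce intent */
    if (flag2 == 0) {
        /* critical section */
    }
}

void process_P2(void) {
    flag2 = 1;                /* announce intent */
    if (flag1 == 0) {
        /* critical section */
    }
}

/* If the hardware performs the read of the other flag before the preceding
 * write (out-of-order execution or a write buffer), both reads can return 0
 * and both processes enter the critical section -- an SC violation. */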
For a shared-memory program to work on a distributed-memory multiprocessor, there has to be a mechanism that allows a thread to write to remote memory locations, so that threads can exchange data by simply writing to and reading from the same memory location. One option is a hardware mechanism that recognizes that the physical memory address being written to is in a different node and forwards the write to that node. The other option is a software layer -- the OS recognizes that the write needs to be sent to a different node, and messages are exchanged between the operating systems to effect the write. The difference between these two implementations is the usual cost-performance trade-off. In essence, the message-passing programming model expects the programmer to implement cache coherence at the software level by explicitly exchanging data through send and receive messages, while in a shared-memory programming model, the programmer expects cache coherence to be implemented by the layer below (either hardware or software).

To better understand the differences between the two programming models, consider how the single-thread program on the slide is written with shared memory and with message passing. The single-thread program walks through a 2D array and re-computes the value of each element by averaging the values of neighboring elements. The program stops when the values converge (the difference between old and new values is less than a pre-specified threshold).

In the shared-memory program, the array is created in a physical memory space that is visible to every thread. A number of parallel threads are created, all of which execute the function Solve(). Each thread has a different pid, which it uses to determine the subset of rows it will operate on. Each thread needs the new values computed by a neighboring CPU for its border rows; these are automatically propagated by the underlying cache coherence system -- when one CPU writes to the border rows, the neighboring CPU receives the latest value on its next read. Each thread computes its own diff, and all diffs are added together at the end of the iteration to determine if the computation has converged. Each thread obtains a lock before updating the global diff counter. Because of the underlying cache coherence mechanism, the latest value of the diff counter is visible to every thread. When a thread reaches a barrier, it has to wait there until all threads reach that barrier. Barriers are used between every pair of reads and writes to "diff" so that every thread gets to see the value of diff before it gets updated again. The barrier at the start of the while-loop makes sure that every thread is done resetting diff to zero before individual threads start incrementing diff after executing the for-loops. (A sketch of this shared-memory version appears below.)

With the message-passing model, each thread creates its own local physical memory space that only it can access. Each thread also creates a local copy of the border rows -- when the neighboring CPU computes new values for these border rows, explicit messages are sent to keep the local copies up-to-date. At the start of each iteration, every thread sends its border rows to its neighboring CPUs and likewise receives their border rows. After going through the averaging step, mydiff values are sent to the thread with pid 0. The thread with pid 0 computes the global diff value and sends it back to each thread so they know if the algorithm has converged or not.
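Below is a sketch of the shared-memory version described above, written with pthreads. The structure (Solve(), a per-thread mydiff, the lock around the global diff, and the three barriers) follows the description, but the grid size, thread count, and convergence test are assumptions rather than the exact code on the slides.

/* Shared-memory sketch of Solve() using pthreads; sizes and the convergence
 * test are illustrative. The barrier is assumed to be initialized with
 * pthread_barrier_init(&bar, NULL, NTHREADS) before the threads are created. */
#include <pthread.h>

#define N        1024                   /* grid dimension (assumed)           */
#define NTHREADS 4
#define TOL      0.001f

float grid[N + 2][N + 2];               /* shared: visible to every thread    */
float diff;                             /* shared convergence counter         */
int   done = 0;
pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t bar;

void *Solve(void *arg) {
    int pid   = *(int *)arg;            /* each thread gets a different pid   */
    int rows  = N / NTHREADS;           /* each thread owns a band of rows    */
    int first = 1 + pid * rows, last = first + rows - 1;

    while (!done) {
        if (pid == 0) diff = 0.0f;
        pthread_barrier_wait(&bar);     /* diff is reset before anyone adds to it */

        float mydiff = 0.0f;
        for (int i = first; i <= last; i++)
            for (int j = 1; j <= N; j++) {
                float old = grid[i][j];
                /* border rows written by neighboring threads are propagated
                 * automatically by the cache coherence mechanism             */
                grid[i][j] = 0.2f * (grid[i][j] + grid[i-1][j] + grid[i+1][j]
                                                + grid[i][j-1] + grid[i][j+1]);
                mydiff += (grid[i][j] > old) ? grid[i][j] - old
                                             : old - grid[i][j];
            }

        pthread_mutex_lock(&diff_lock); /* lock before updating the global diff */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);

        pthread_barrier_wait(&bar);     /* all additions to diff are complete */
        if (pid == 0 && diff / (N * N) < TOL) done = 1;
        pthread_barrier_wait(&bar);     /* the convergence decision is visible
                                           to all threads before re-checking  */
    }
    return 0;
}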
No "barriers" are required -- sends and receives are enough to make sure that all processes have advanced the appropriate amount. In this example, non-blocking sends and blocking receives are used -- a thread can execute a send and move on to the next instruction even if the corresponding receive has not been executed -- but a thread waits until a receive has completed before moving on to the next instruction. Note that the three versions of the program on the slides can all yield different final results. For the single-thread model, for each averaging step, the newly computed values of the top and left neighbor are used, while the old values of the bottom and right neighbor are used. For the message-passing model, old values are used for all but the left neighbor. For the shared-memory model, it is hard to predict if old or new values are employed for the top and bottom neighbors (it depends on the relative speeds of the threads on neighboring CPUs).