Lecture Notes CS/EE 3810

Consider an unpipelined processor implementation. The first instruction is fetched, decoded, executed, and it finally produces a result. When all circuits have stabilized, a clock edge is provided so that the result can be stored in a latch (register). At that point, the processor can start working on the second instruction, and so on. If it takes time t1 to complete one instruction, the clock cycle time is t1+t0 (where t0 is the time taken to store a result in the latch), the clock speed is 1/(t1+t0), and a new instruction enters/leaves the processor every cycle, i.e., every t1+t0.

Next, we'll try to speed up the processor with pipelining. The execution of the instruction is broken up into three equal stages: instruction fetch and decode; register read and ALU compute; register write. These stages are labeled A, B, and C on the slide. Stage A begins by operating on instruction 1. After all circuits have stabilized, a clock edge is provided and the result is stored in a latch. This result remains in the latch until the next clock edge arrives, so the latch essentially maintains a constant input to the second stage B until then. The whole operation takes 3 cycles, where the time between clock edges equals (t1/3)+t0. While instruction 1 is in stage B, instruction 2 can be operated on by stage A. This introduces parallelism within the processor. In the steady state, stage C operates on instruction i, stage B on instruction i+1, and stage A on instruction i+2. Hence, up to 3 instructions can be simultaneously processed.

How does this result in better performance? A new instruction can enter/leave the pipeline every clock cycle. Since the clock speed has nearly tripled, the pipelined implementation can be expected to have nearly thrice the throughput. Note that this has happened in spite of the fact that a single instruction now takes longer to execute (t1 + 3*t0) than in the unpipelined case (t1 + t0).

The "magical" improvement with pipelining came about because we assumed that there was no dependence between the operations that happened in parallel. What if there was a dependence? Consider two successive instructions: R1 <-- R2 + R3; R5 <-- R1 + R4. Instruction 1 writes the value into R1 during stage C. Hence, only at the end of stage C are we guaranteed to have the new value residing in the register file. Meanwhile, instruction 2 must read the value of R1 at the start of stage B so it can complete the addition in stage B. If you observe the orange boxes on the slide, you'll notice that the read of R1 (in stage B of instr 2) happens before the write of R1 (in stage C of instr 1), meaning that instruction 2 ends up reading some old value of R1. Because of this dependence, instruction 2 cannot begin in the second cycle; it must begin in the third cycle. Hence, if we have two back-to-back dependent instructions, we can expect one stall cycle (a cycle where no result is produced). Therefore, the time gap between two independent instructions leaving the pipeline is (t1/3 + t0) and the time gap between dependent instructions leaving the pipeline is (2*t1/3 + 2*t0). If t0 is relatively small, we can safely conclude that the pipelined implementation is faster than the unpipelined implementation, where the time gap between successive instructions (whether dependent or not) is always t1+t0.
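To make the timing arithmetic above concrete, here is a minimal Python sketch; the values of t1 and t0 are made-up numbers chosen only for illustration.

    # Hypothetical numbers, chosen only to illustrate the formulas above.
    t1 = 900.0   # combinational logic delay for one whole instruction (ps)
    t0 = 50.0    # latch (register) overhead (ps)

    unpipelined_cycle = t1 + t0        # one instruction completes per cycle of this length
    pipelined_cycle   = t1 / 3 + t0    # each stage does t1/3 of work plus one latch delay

    print("unpipelined cycle time:", unpipelined_cycle, "ps")
    print("pipelined cycle time:  ", pipelined_cycle, "ps")
    print("latency of one instruction in the pipeline:", t1 + 3 * t0, "ps")

    # Gap between completions: one cycle if independent, two cycles if a
    # back-to-back dependence forces one stall.
    print("gap, independent instructions:", pipelined_cycle, "ps")
    print("gap, dependent instructions:  ", 2 * pipelined_cycle, "ps")
    print("throughput speedup (no dependences):", unpipelined_cycle / pipelined_cycle)

With these particular numbers the speedup is about 2.7X rather than 3X, because the latch overhead t0 is now paid in every one of the shorter cycles.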
Consider the following effects of pipelining -- let's compare an unpipelined design against a 3-stage pipeline. As a result of pipelining, the time taken to execute an instruction (in picoseconds) increases -- this is because the instruction has to navigate more latches. The time taken to execute an instruction (in cycles) also increases -- a single instruction finishes in one cycle in the unpipelined case, but takes three cycles in the pipelined case. On the other hand, the cycle time in the pipelined case is nearly 3X shorter. A new instruction rolls out of the processor every cycle in both cases (assuming no stalls), so a large number of instructions can be executed with an average CPI of 1. Execution time is CPI x #instructions x cycle time. Note that pipelining is able to achieve a nearly 3X speedup because it reduces cycle time by roughly 3X. The source of this improvement is the higher parallelism in pipelining (the ability to process multiple different instructions at the same time). A reminder of the metrics: cycle time is expressed in seconds, clock speed is expressed in Hertz (which is just 1/seconds), and throughput is expressed as instructions per second (or billions of instructions per second).

Next, we will design a pipeline that is more complex than the 3-stage example we used previously. The slides show a 5-stage pipeline. The first stage (Instruction Memory) brings in an instruction from some storage (either cache or memory); the instruction is then stored in the latch. In the second stage, the instruction is decoded and the appropriate input register values are read from the register file and stored in the latch. In the third stage, the ALU operates on these register values and produces a result. If the instruction is simply doing ALU arithmetic, it can skip the fourth stage. If the instruction is a load or store operation, the third stage is used for computing the memory address, and this memory address is then used in the fourth stage to actually fetch/store a value from/to memory. In the fifth stage, the result of the ALU computation or the load instruction is finally stored into the register file.

Now consider a series of complications with this pipeline implementation. We will assume that a branch instruction completes in the second stage -- in other words, the second stage reads register values, makes a comparison, figures out which way the branch is going, and has time to update the PC with the new target. This means we can start fetching from the correct location in cycle 3. But what about the instruction that was fetched in cycle 2? We will later examine multiple ways to handle this instruction, but keep in mind for now that branches pose this problem. If it takes even longer to resolve a branch, more cycles will go by without us knowing what instruction to fetch. These are known as control hazards.

The second potential complication with the above pipeline is the conflict for storage. In cycle 4 in the figure, instruction 1 is trying to access data storage, while instruction 4 is trying to access instruction storage. If the processor had a single unified storage for both (for example, a unified cache with a single access port), the two operations could not happen simultaneously. This would give rise to stalls; in fact, performance would be slowed down by a factor of 2. If the processor has separate storage for instructions and data, a new instruction can enter/leave the pipeline every cycle.
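To make the cycle-by-cycle picture concrete, here is a small Python sketch that prints which stage each instruction occupies in each cycle. The stage labels IM, REG, ALU, DM, and WB are just shorthand for the five stages described above; the sketch is an illustration of the timing, not of the actual control logic.

    # Print pipeline occupancy for a few instructions, assuming no stalls.
    stages = ["IM", "REG", "ALU", "DM", "WB"]
    num_instructions = 5

    for cycle in range(1, num_instructions + len(stages)):
        row = []
        for instr in range(1, num_instructions + 1):
            stage_index = cycle - instr      # instruction i enters IM in cycle i
            if 0 <= stage_index < len(stages):
                row.append(f"i{instr}:{stages[stage_index]}")
        print(f"cycle {cycle}: " + "  ".join(row))

    # In cycle 4, i1 is in DM while i4 is in IM -- with a single memory port,
    # one of them would have to wait. In cycle 5, i1 writes the register file
    # while i4 reads it.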
If there is a unified storage, 3 instructions enter the pipeline in the first 3 cycles, and then there are 3 stall cycles when no instructions can enter the pipeline because of the conflict for storage. This is known as a structural hazard and can usually be dealt with easily by throwing more hardware at the problem -- for example, separate caches for instructions and data. Another example of a structural hazard is posed by the register file. In cycle 5 in the figure, instruction 1 is attempting to write to the register file while instruction 4 is attempting to read from the register file. By providing multiple read and write ports, we can do away with the resource conflict. For the rest of this discussion, we'll make a slightly different assumption: reads and writes each take half a clock cycle, with writes completed in the first half of the cycle and reads completed in the second half, thereby posing no conflict. (In modern-day processors, this is not a reasonable assumption, as reads and writes can often take more than one full cycle.)

Finally, we'll examine the third type of hazard: the data hazard. As we saw before, we may have to introduce gaps between successive instructions if there is a dependence. Assume the first (producer) instruction writes its result into register $2 during cycle 5 (based on our assumption above, the write completes half-way through the cycle). All the subsequent instructions read $2 from the register file. The instruction that begins in the 2nd cycle reads $2 during cycle 3, so it clearly receives some old value. Hence, for the pipeline to work correctly, there have to be two cycles of inactivity between the first and second instructions. These two cycles of inactivity (or bubbles) are created by forcing the second instruction to stall in the second stage; the second stage is provisioned with a decode unit that knows how long to hold back the instruction. The second instruction finally proceeds after its third attempt at reading the register file.

We can reduce these stalls with a technique called bypassing/forwarding/short-circuiting. Even though the result of an operation is written into the register file in the fifth stage, the value is often known earlier. For an integer add, the value being written is known as early as the end of stage 3. The result sitting in the latch can be forwarded back to the ALU, and a multiplexor placed before the ALU selects the right input -- either the value in the latch after stage 2 (whatever was read from the register file) or the value in the latch after stage 3 (whatever was produced by the previous instruction). We can extend the concept further and have the multiplexor also select the value in the latch after stage 4 (whatever was produced by the previous-to-previous instruction). Clearly, we must also provide some control bits to the multiplexor so it can make the appropriate selection of inputs. This "forwarding" mechanism "bypasses" the register file and allows dependent instructions to execute in back-to-back cycles.

Of course, bypassing does not eliminate all stalls. For example, if the first instruction is a load, the result is produced only at the end of stage 4. Since a dependent instruction needs its input at the start of stage 3, it is impossible to forward the result of the load to the ALU in time.
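The following Python sketch illustrates the bypass selection and the load-use stall check. The field names (dst, src1, src2, is_load) are made up for illustration; real hardware does this with comparators on the pipeline latches and the control bits on the multiplexor mentioned above.

    # Pick the freshest value for an ALU input, mimicking the bypass multiplexor.
    def alu_operand(src_reg, regfile, latch_after_stage3, latch_after_stage4):
        # Prefer the youngest producer: the instruction one ahead of us
        # (its result sits in the latch after stage 3) ...
        if latch_after_stage3 is not None and latch_after_stage3["dst"] == src_reg:
            return latch_after_stage3["value"]
        # ... then the instruction two ahead (latch after stage 4) ...
        if latch_after_stage4 is not None and latch_after_stage4["dst"] == src_reg:
            return latch_after_stage4["value"]
        # ... otherwise, whatever stage 2 read from the register file.
        return regfile[src_reg]

    # A load's value appears only at the end of stage 4, so an immediately
    # following dependent instruction cannot receive it by forwarding in time.
    def must_stall(producer, consumer):
        return producer["is_load"] and producer["dst"] in (consumer["src1"], consumer["src2"])

    # Example: "lw $2, 0($4)" followed immediately by "add $5, $2, $3".
    load_instr = {"dst": 2, "is_load": True}
    add_instr = {"src1": 2, "src2": 3}
    print(must_stall(load_instr, add_instr))   # True -> one bubble is needed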
Hence, to detect if we need stalls between dependent instructions, we have to identify when the first instruction produces its result and when the second instruction needs that particular result.

Control Hazards

On a branch, we have the following options: (i) don't do anything until we figure out which way the branch is going; (ii) assume the branch is either taken or not taken and start fetching down the predicted path. If there is a mis-predict, we again have two options: (a) squash the mis-fetched instruction, or (b) let the mis-fetched instruction complete -- the completed instruction is said to be in the branch delay slot. The compiler is responsible for placing an instruction in the branch delay slot such that it does useful work most of the time and never causes wrong results.

The slides show two ways in which the delay slot can be filled with useful instructions. In the first example, an instruction before the branch can be moved after the branch -- we always need its result, and luckily it is independent of the branch condition. In the second example, the instruction before the branch cannot be moved after the branch because the branch condition depends on it. Hence, we try to move some later instruction into that slot. Somehow (perhaps with profiling), we figured out that the branch is often taken -- so we place an instruction from the taken path into the delay slot. This instruction happens to write to $t4. If it turns out that the branch is not taken, we go the other way; if that path tries to read the value of $t4, it ends up getting an incorrect value. Hence, we can move a later instruction into the delay slot only if it writes to a register that is not live (in other words, if we end up going the other way, at least we didn't destroy relevant state).

Instead of a default prediction of taken or not-taken, a branch predictor can dynamically attempt to capture the common-case behavior of a branch and predict its outcome. Such dynamic branch prediction can have very high accuracies, especially for regular branch behavior such as that in loops. In order to make this prediction at run-time, we maintain a "cache" of recent branch outcomes. For example, if I know which way a branch went the last time it was in the pipeline, I can predict that it will have the same behavior and get a pretty high prediction accuracy.

Consider the following simple branch predictor. We maintain 1024 1-bit values. The last 10 bits of the branch PC are used to index into one of these 1-bit values. The 1-bit value stores which way the branch went last time (1 for taken and 0 for not-taken), and that is used as the prediction. If multiple branches map to the same 1-bit value, there will be interference -- a larger table of 1-bit values will probably have a higher prediction accuracy.

By using 2 bits per entry in the branch predictor, we can capture more than just what happened the last time we hit this branch -- it helps us capture the common branch outcome in recent history. Each entry of the branch predictor stores a value between 0 and 3. Every taken branch increments the value (never going above 3) and every not-taken branch decrements the value (never going below 0). If the value is 0 or 1, the next branch is predicted as not-taken, else it is predicted taken. The advantage of a 2-bit saturating counter is that it helps mitigate the effects of noise.
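Here is a minimal Python sketch of the 2-bit saturating-counter predictor just described, with 1024 entries indexed by the last 10 bits of the branch PC. The initial counter value of 1 is an arbitrary choice made for illustration; the notes do not specify one.

    TABLE_SIZE = 1024
    counters = [1] * TABLE_SIZE          # each entry holds a value in 0..3

    def predict(pc):
        index = pc & (TABLE_SIZE - 1)    # last 10 bits of the branch PC
        return counters[index] >= 2      # 2 or 3 -> taken, 0 or 1 -> not-taken

    def update(pc, taken):
        index = pc & (TABLE_SIZE - 1)
        if taken:
            counters[index] = min(3, counters[index] + 1)   # saturate at 3
        else:
            counters[index] = max(0, counters[index] - 1)   # saturate at 0

    # A loop branch that is taken 9 times and then falls through once:
    mispredicts = 0
    for taken in [True] * 9 + [False]:
        if predict(0x400) != taken:
            mispredicts += 1
        update(0x400, taken)
    print("mispredicts:", mispredicts)   # 2 with this initial state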
If a branch consistently goes one way, an occasional outcome that goes the other way will not flip the prediction (this helps reduce the number of mispredicts for a loop). Similarly, if two branches map to the same entry in the predictor, the effects of interference will be reduced. We can also employ more than 2 bits per entry, but most designs have deemed it cost-effective to employ only 2 bits.

If the branch is predicted not-taken, fetch continues at PC+4. If the branch is predicted taken, we must also compute what the branch target is so that the PC can be updated. Hence, in addition to a branch-direction predictor, it is also necessary to have a branch target predictor. Typically, it keeps track of where the branch went the last time, and that is good enough most of the time.

If a pipeline has no stalls, a new instruction leaves the pipeline every cycle to yield a CPI of 1. In a realistic pipeline, performance (CPI) can be expressed as 1 + stalls-per-instruction.
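As a quick worked example of this relation, here is a small Python sketch; the instruction count, cycle time, and stall rate below are made-up numbers used only for illustration.

    # Execution time = CPI x #instructions x cycle time, with CPI = 1 + stalls-per-instruction.
    instructions     = 1_000_000
    cycle_time_ns    = 0.5      # hypothetical cycle time in nanoseconds
    stalls_per_instr = 0.3      # hypothetical average stall cycles per instruction

    cpi = 1 + stalls_per_instr
    execution_time_ns = cpi * instructions * cycle_time_ns
    print("CPI:", cpi)                                        # 1.3
    print("execution time:", execution_time_ns / 1e6, "ms")   # 0.65 ms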