## 250P: Computer Systems Architecture

# Lecture 5: Advanced Pipelines 

Anton Burtsev
January, 2019

## Hazards

- Structural hazards
- Data hazards
- Control hazards


## Control Hazards

- Simple techniques to handle control hazard stalls:
  - For every branch, introduce a stall cycle (note: on average, every 6th instruction is a branch!)
  - Assume the branch is not taken and start fetching the next instruction; if the branch is taken, hardware must cancel the effect of the wrong-path instructions
  - Predict the next PC and fetch that instruction; if the prediction is wrong, cancel the effect of the wrong-path instructions
  - Fetch the next instruction (branch delay slot) and execute it anyway; if the instruction turns out to be on the correct path, useful work was done; if it turns out to be on the wrong path, hopefully program state is not lost
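To get a feel for what these techniques cost, the following minimal sketch computes the average cycles per instruction (CPI) for a pipeline that would otherwise complete one instruction per cycle. The 1-in-6 branch frequency comes from the list above; the one-cycle penalty for the "assume not taken" case and the 60% taken rate are illustrative assumptions, not measurements.

```python
def cpi_with_branches(branch_fraction, penalty_cycles, penalized_fraction=1.0):
    """CPI of an otherwise ideal (CPI = 1) pipeline with branch penalties.

    branch_fraction:    fraction of instructions that are branches
    penalty_cycles:     cycles lost each time a branch is penalized
    penalized_fraction: fraction of branches that actually pay the penalty
    """
    return 1.0 + branch_fraction * penalized_fraction * penalty_cycles


branch_fraction = 1 / 6  # every 6th instruction is a branch on average

# Technique 1: stall one cycle on every branch.
always_stall = cpi_with_branches(branch_fraction, penalty_cycles=1)

# Technique 2: assume not taken, so only taken branches lose a cycle.
# The 0.6 taken rate is a made-up number for illustration only.
assume_not_taken = cpi_with_branches(
    branch_fraction, penalty_cycles=1, penalized_fraction=0.6
)

print(f"always stall:     CPI = {always_stall:.3f}")
print(f"assume not taken: CPI = {assume_not_taken:.3f}")
```

Under these assumptions, stalling on every branch costs about 17% extra cycles, while assuming not taken costs only the share of branches that turn out to be taken.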


## Branch delay slot

Figure: three ways to fill the branch delay slot: (a) from before the branch, (b) from the branch target, (c) from the fall-through path.


## Multicycle Instructions



## Effects of Multicycle Instructions

- Potentially multiple writes to the register file in a cycle
- Frequent RAW hazards
- WAW hazards (WAR hazards not possible)
- Imprecise exceptions because of out-of-order (o-o-o) instruction completion

Note: the "width" of the processor can also be increased so that it handles multiple instructions at the same time, for example by fetching two instructions, reading registers for both, executing both, and so on.
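The effects listed above can be made concrete with a small timing model. This is a sketch under simplifying assumptions: one instruction issues per cycle in program order, each spends one cycle in fetch and one in decode, then a number of execute cycles given by a latency table, and writes the register file in the following cycle. The opcode names and latencies are hypothetical.

```python
from collections import defaultdict


def analyze(program, latency):
    """program: list of (name, op, dest_reg), issued in order, one per cycle.
    latency: dict mapping op -> number of execute cycles (hypothetical values).
    Returns (writeback_cycle, port_conflicts, out_of_order_pairs, waw_pairs)."""
    writeback = {}
    for issue_cycle, (name, op, _dest) in enumerate(program, start=1):
        # Fetch in issue_cycle, decode in the next cycle, then execute,
        # then write the register file one cycle after execution ends.
        writeback[name] = issue_cycle + 2 + latency[op]

    # More than one write-back in the same cycle needs extra ports or a stall.
    by_cycle = defaultdict(list)
    for name, cycle in writeback.items():
        by_cycle[cycle].append(name)
    port_conflicts = {c: names for c, names in by_cycle.items() if len(names) > 1}

    out_of_order, waw = [], []
    for i, (early, _, early_dest) in enumerate(program):
        for late, _, late_dest in program[i + 1:]:
            if writeback[late] < writeback[early]:
                # The later instruction completes first.
                out_of_order.append((early, late))
                if early_dest == late_dest:
                    # The older write would land last and clobber the newer value.
                    waw.append((early, late))
    return writeback, port_conflicts, out_of_order, waw


latency = {"FMUL": 4, "FADD": 2, "ADD": 1}  # hypothetical execute latencies
program = [
    ("I1", "FMUL", "F2"),
    ("I2", "FADD", "F2"),
    ("I3", "ADD", "R1"),
]
writeback, port_conflicts, out_of_order, waw = analyze(program, latency)
print("write-back cycles:", writeback)
print("same-cycle writes:", port_conflicts)
print("out-of-order completion:", out_of_order)
print("WAW hazards:", waw)
```

In this example the long multiply finishes after both younger instructions, which produces out-of-order completion, a WAW hazard on F2, and two writes competing for the register file in the same cycle.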

## Precise Exceptions

- On an exception:
  - the PC of the instruction where the program must resume must be saved
  - all instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own)
  - temporary program state not in memory (in other words, registers) has to be stored in memory
  - there are potential problems if a later instruction has already modified memory or registers
- A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and, of course, correctness)


## Dealing with these Effects

- Multiple writes to the register file: increase the number of ports, or stall one of the writers during ID, or stall one of the writers during WB (the stall will propagate)
- WAW hazards: detect the hazard during ID and stall the later instruction
- Imprecise exceptions: buffer the results if they complete early, or save more pipeline state so that you can return to exactly the state you left


## Slowdowns from Stalls

- Perfect pipelining with no hazards $\rightarrow$ an instruction completes every cycle (total cycles $\approx$ number of instructions) $\rightarrow$ speedup $=$ increase in clock speed $=$ number of pipeline stages
- With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
- Total cycles $=$ number of instructions $+$ stall cycles
- Slowdown because of stalls: performance drops to $1 / (1 + \text{stall cycles per instruction})$ of the ideal pipelined performance
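The relations above are easy to check numerically. The sketch below uses made-up instruction and stall counts, and the function names are only illustrative.

```python
def total_cycles(num_instructions, stall_cycles):
    """Total cycles = number of instructions + stall cycles."""
    return num_instructions + stall_cycles


def stall_factor(stall_cycles_per_instr):
    """Fraction of ideal pipelined performance that remains with stalls."""
    return 1.0 / (1.0 + stall_cycles_per_instr)


def effective_speedup(num_stages, stall_cycles_per_instr):
    """Ideal speedup (number of stages) scaled down by the stall factor."""
    return num_stages * stall_factor(stall_cycles_per_instr)


# Hypothetical workload: 1000 instructions with 250 stall cycles in total.
n, stalls = 1000, 250
per_instr = stalls / n

print("total cycles:", total_cycles(n, stalls))
print("stall factor:", stall_factor(per_instr))
print("speedup of a 5-stage pipeline:", effective_speedup(5, per_instr))
```

With 0.25 stall cycles per instruction, a 5-stage pipeline keeps 80% of its ideal speedup, so it runs about 4 times faster than the unpipelined design instead of 5 times.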


## Pipelining Limits





| Pipeline | Gap between independent instructions | Gap between dependent instructions |
| :---: | :---: | :---: |
| Unpipelined | $T + T_{ovh}$ | $T + T_{ovh}$ |
| 3 stages | $T/3 + T_{ovh}$ | $T + 3T_{ovh}$ |
| 6 stages | $T/6 + T_{ovh}$ | $T + 6T_{ovh}$ |

Assume a dependence in which the final result of the first instruction is required before the second instruction can start.
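The trade-off can be tabulated with a few lines of code. This sketch reads T as the total logic delay of the unpipelined design and Tovh as the per-stage latch overhead, which is an interpretation of the symbols rather than a definition given on the slide. The numeric values are arbitrary.

```python
def gaps(T, Tovh, stages):
    """Return (independent_gap, dependent_gap) for a pipeline with `stages` stages.

    Independent instructions can start one cycle apart: T/stages + Tovh.
    A dependent instruction waits for the producer to pass through every
    stage: stages * (T/stages + Tovh) = T + stages * Tovh.
    """
    cycle_time = T / stages + Tovh
    return cycle_time, stages * cycle_time


T, Tovh = 6.0, 0.5  # arbitrary time units, for illustration only
for stages in (1, 3, 6):
    indep, dep = gaps(T, Tovh, stages)
    print(f"{stages} stage(s): independent gap = {indep}, dependent gap = {dep}")
```

Deeper pipelines shrink the gap between independent instructions but lengthen the gap between dependent ones, because the latch overhead is paid once per stage.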

## Problem 1

- For the following code sequence, show how the instructions flow through the pipeline:
  1. ADD R3 $\leftarrow$ R1, R2
  2. LD R7 $\leftarrow$ 8[R6]
  3. ST R9 $\rightarrow$ 4[R8]
  4. BEZ R4, [R5]


## Pipeline Summary

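| Instruction | RR | ALU | DM | RW |
| :---: | :---: | :---: | :---: | :---: |
| ALU operation on R1, R2 | Rd R1, R2 | R1+R2 | -- | Wr result |
| Branch on R1 to [R5] | Rd R1, R5 | Compare, Set PC | -- | -- |
| LD R6 $\leftarrow$ 8[R3] | Rd R3 | R3+8 | Get data | Wr R6 |
| ST R6 $\rightarrow$ 8[R3] | Rd R3, R6 | R3+8 | Wr data | -- |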

## Problem 2

- Convert this C code into equivalent RISC assembly instructions

$$
\mathrm{a}[\mathrm{i}]=\mathrm{b}[\mathrm{i}]+\mathrm{c}[\mathrm{i}] ;
$$

One possible translation:

| Instruction | Comment |
| :--- | :--- |
| LD R2, [R1] | R1 has the address of variable i |
| MUL R3, R2, 8 | byte offset of element i from the start of an array (8-byte elements) |
| ADD R7, R3, R4 | R4 has the address of a[0] |
| ADD R8, R3, R5 | R5 has the address of b[0] |
| ADD R9, R3, R6 | R6 has the address of c[0] |
| LD R10, [R8] | load b[i] |
| LD R11, [R9] | load c[i] |
| ADD R12, R11, R10 | sum is in R12 |
| ST R12, [R7] | store the result in a[i] |
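One way to check a translation like this is to run it on a tiny register-machine model. The interpreter below is a sketch written for this purpose, not a real ISA simulator: memory is a dictionary keyed by byte address, and the array base addresses and data values are invented for the test.

```python
def run(program, regs, mem):
    """Execute a list of instruction tuples, updating regs and mem in place."""
    for op, *args in program:
        if op == "LD":            # LD dest, [addr_reg]
            dest, addr_reg = args
            regs[dest] = mem[regs[addr_reg]]
        elif op == "ST":          # ST src, [addr_reg]
            src, addr_reg = args
            mem[regs[addr_reg]] = regs[src]
        elif op == "ADD":         # ADD dest, src1, src2
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        elif op == "MUL":         # MUL dest, src, immediate
            dest, a, imm = args
            regs[dest] = regs[a] * imm
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, mem


program = [
    ("LD", "R2", "R1"),
    ("MUL", "R3", "R2", 8),
    ("ADD", "R7", "R3", "R4"),
    ("ADD", "R8", "R3", "R5"),
    ("ADD", "R9", "R3", "R6"),
    ("LD", "R10", "R8"),
    ("LD", "R11", "R9"),
    ("ADD", "R12", "R11", "R10"),
    ("ST", "R12", "R7"),
]

# Invented layout: i lives at address 0; a, b, c start at 1000, 2000, 3000.
regs = {"R1": 0, "R4": 1000, "R5": 2000, "R6": 3000}
mem = {0: 2, 2016: 30, 3016: 12}  # i = 2, b[2] = 30, c[2] = 12

run(program, regs, mem)
print("a[2] =", mem[1016])
```

With i = 2 the offset is 16 bytes, so the loads read addresses 2016 and 3016 and the store writes their sum to address 1016.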


## Problem 3

- Show the instruction occupying each stage in each cycle (no bypassing) if I1 is R1 + R2 $\rightarrow$ R3, I2 is R3 + R4 $\rightarrow$ R5, and I3 is R7 + R8 $\rightarrow$ R9

| CYC-1 | CYC-2 | CYC-3 | CYC-4 | CYC-5 | CYC-6 | CYC-7 | CYC-8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | IF | IF | IF | IF | IF | IF | IF |
| D/R | D/R | D/R | D/R | D/R | D/R | D/R | D/R |
| ALU | ALU | ALU | ALU | ALU | ALU | ALU | ALU |
| DM | DM | DM | DM | DM | DM | DM | DM |
| RW | RW | RW | RW | RW | RW | RW | RW |

Solution, assuming a register written in the first half of a cycle can be read in the second half of the same cycle. I2 waits in D/R until I1 writes R3 in CYC-5, which also holds I3 in IF. I4 and I5 are the instructions that follow, assumed independent; "--" marks a bubble.

| Stage | CYC-1 | CYC-2 | CYC-3 | CYC-4 | CYC-5 | CYC-6 | CYC-7 | CYC-8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | I1 | I2 | I3 | I3 | I3 | I4 | I5 |  |
| D/R |  | I1 | I2 | I2 | I2 | I3 | I4 | I5 |
| ALU |  |  | I1 | -- | -- | I2 | I3 | I4 |
| DM |  |  |  | I1 | -- | -- | I2 | I3 |
| RW |  |  |  |  | I1 | -- | -- | I2 |
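The stall behaviour in this problem can be reproduced with a short simulation. This is a sketch under stated assumptions: a single-issue, in-order 5-stage pipeline with no bypassing, where a consumer may read a register in D/R in the same cycle its producer is in RW (write in the first half of the cycle, read in the second half). The function names are illustrative.

```python
STAGES = ["IF", "D/R", "ALU", "DM", "RW"]


def schedule_no_bypass(instrs):
    """instrs: list of (name, dest_reg, source_regs) in program order.
    Returns {name: {stage: (first_cycle, last_cycle)}}."""
    timing = {}
    prev = None
    for name, _dest, srcs in instrs:
        if prev is None:
            if_first, dr_first = 1, 2
        else:
            # Enter IF when the previous instruction moves on to D/R, and
            # enter D/R only after the previous instruction has left it.
            if_first = timing[prev]["D/R"][0]
            dr_first = max(if_first + 1, timing[prev]["D/R"][1] + 1)
        # Without bypassing, wait in D/R until every producer reaches RW.
        dr_last = dr_first
        for p_name, p_dest, _ in instrs:
            if p_name == name:
                break
            if p_dest in srcs:
                dr_last = max(dr_last, timing[p_name]["RW"][0])
        timing[name] = {
            "IF": (if_first, dr_first - 1),
            "D/R": (dr_first, dr_last),
            "ALU": (dr_last + 1, dr_last + 1),
            "DM": (dr_last + 2, dr_last + 2),
            "RW": (dr_last + 3, dr_last + 3),
        }
        prev = name
    return timing


def occupancy(timing, num_cycles):
    """Turn per-instruction timing into a stage-by-cycle grid of names."""
    grid = {stage: [""] * num_cycles for stage in STAGES}
    for name, stages in timing.items():
        for stage, (first, last) in stages.items():
            for cycle in range(first, min(last, num_cycles) + 1):
                grid[stage][cycle - 1] = name
    return grid


program = [
    ("I1", "R3", ("R1", "R2")),
    ("I2", "R5", ("R3", "R4")),
    ("I3", "R9", ("R7", "R8")),
]
grid = occupancy(schedule_no_bypass(program), num_cycles=8)
for stage in STAGES:
    print(stage.ljust(4), " ".join((cell or "--").ljust(2) for cell in grid[stage]))
```

Only I1 to I3 are modelled here, so the grid leaves later slots empty where following instructions would appear.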

## Bypassing: 5-Stage Pipeline

Figure: pipeline timing diagram for the 5-stage pipeline with bypassing; the horizontal axis is time in clock cycles.


## Problem 4

- Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1 + R2 $\rightarrow$ R3, I2 is R3 + R4 $\rightarrow$ R5, and I3 is R3 + R8 $\rightarrow$ R9. Identify the input latch for each input operand.

| CYC-1 | CYC-2 | CYC-3 | CYC-4 | CYC-5 | CYC-6 | CYC-7 | CYC-8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | IF | IF | IF | IF | IF | IF | IF |
| D/R | D/R | D/R | D/R | D/R | D/R | D/R | D/R |
| ALU | ALU | ALU | ALU | ALU | ALU | ALU | ALU |
| DM | DM | DM | DM | DM | DM | DM | DM |
| RW | RW | RW | RW | RW | RW | RW | RW |

Solution: with bypassing, the dependent instructions I2 and I3 do not stall, so each instruction advances one stage per cycle. I4 and I5 are the instructions that follow, assumed not to stall.

| Stage | CYC-1 | CYC-2 | CYC-3 | CYC-4 | CYC-5 | CYC-6 | CYC-7 | CYC-8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | I1 | I2 | I3 | I4 | I5 |  |  |  |
| D/R |  | I1 | I2 | I3 | I4 | I5 |  |  |
| ALU |  |  | I1 | I2 | I3 | I4 | I5 |  |
| DM |  |  |  | I1 | I2 | I3 | I4 | I5 |
| RW |  |  |  |  | I1 | I2 | I3 | I4 |
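For the second half of the question, the source of each ALU operand depends only on how far back its producer is. The sketch below assumes ALU-only instructions issued back to back with no stalls, and a register file that can be written and read in the same cycle. The latch names "ALU/DM" and "DM/RW" are descriptive labels chosen here (the latch after the ALU stage and the latch after the DM stage); the lecture's own latch numbering may differ.

```python
def operand_sources(instrs):
    """instrs: list of (name, dest_reg, source_regs) in program order.
    Returns {name: {source_reg: where the ALU gets that operand from}}."""
    sources = {}
    for k, (name, _dest, srcs) in enumerate(instrs):
        sources[name] = {}
        for reg in srcs:
            # Most recent earlier instruction that writes this register.
            producer = next(
                (j for j in range(k - 1, -1, -1) if instrs[j][1] == reg), None
            )
            distance = None if producer is None else k - producer
            if distance == 1:
                # Producer is in DM now; its result sits in the latch after ALU.
                where = f"ALU/DM latch (result of {instrs[producer][0]})"
            elif distance == 2:
                # Producer is in RW now; its result sits in the latch after DM.
                where = f"DM/RW latch (result of {instrs[producer][0]})"
            else:
                # No producer, or the value was written by the time of the read.
                where = "register file"
            sources[name][reg] = where
    return sources


program = [
    ("I1", "R3", ("R1", "R2")),
    ("I2", "R5", ("R3", "R4")),
    ("I3", "R9", ("R3", "R8")),
]
sources = operand_sources(program)
for name, operands in sources.items():
    for reg, where in operands.items():
        print(f"{name}: {reg} <- {where}")
```

I2 enters the ALU one cycle behind I1, so it takes R3 from the latch just after the ALU, while I3 is two cycles behind and takes R3 from the latch just after DM. All other operands come from the register file.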

Thank you!

