# **BRANCH PREDICTORS**

Mahdi Nazm Bojnordi

Assistant Professor

School of Computing

University of Utah

CS/ECE 6810: Computer Architecture



#### Announcements

- Homework 3 release: Sept. 25<sup>th</sup>

#### This lecture

- Dynamic branch prediction
- Counter based branch predictor
- Correlating branch predictor
- Global vs. local branch predictors

# **Big Picture: Why Branch Prediction?**

- Problem: performance is mainly limited by the number of instructions fetched per second
- Solution: deeper and wider frontend
- Challenge: handling branch instructions



# **Big Picture: How to Predict Branch?**

- Static prediction (based on direction or profile)
  - Always not-taken
    - Target = next PC
  - Always taken
    - Target = unknown
- Dynamic prediction
  - Special hardware using PC



#### **Recall: Dynamic Branch Prediction**

Hardware unit capable of learning at runtime

1. Prediction logic
   - Direction (taken or not-taken)
   - Target address (where to fetch next)
2. Outcome validation and training
   - Outcome is computed regardless of prediction
3. Recovery from misprediction
   - Nullify the effect of instructions on the wrong path

#### **Branch Prediction**

- Goal: avoiding stall cycles caused by branches
- Solution: static or dynamic branch predictor
  1. Prediction
  2. Validation and training
  3. Recovery from misprediction
- Performance is influenced by the frequency of branches (b), prediction accuracy (a), and misprediction cost (c)

$$\text{Speedup} = \frac{\text{Old Time}}{\text{New Time}} = \frac{CPI_{old}}{CPI_{new}} = \frac{1+bc}{1+(1-a)bc}$$
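The two CPI terms in the speedup expression can be read as follows. This is a sketch that assumes a base CPI of 1, with $c$ stall cycles paid on every branch when there is no predictor and only on mispredicted branches when there is one:

$$CPI_{old} = 1 + b\,c, \qquad CPI_{new} = 1 + (1-a)\,b\,c$$

With perfect prediction ($a = 1$) the second expression reduces to 1, and with $a = 0$ it equals $CPI_{old}$, so the speedup is 1.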

#### Problem

- A pipelined processor requires 3 stall cycles to compute the outcome of every branch before fetching the next instruction; due to perfect forwarding/bypassing, no stall cycles are required for data/structural hazards; every 5<sup>th</sup> instruction is a branch.
  - Compute the speedup gained from a branch predictor with 90% accuracy

Speedup = $(1 + 0.2 \times 3) / (1 + 0.1 \times 0.2 \times 3) = 1.6 / 1.06 \approx 1.51$
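The same calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `branch_speedup` is made up for illustration, and it assumes the base CPI of 1 implied by the speedup formula.

```python
def branch_speedup(b: float, a: float, c: float) -> float:
    """Speedup = (1 + b*c) / (1 + (1 - a)*b*c), assuming a base CPI of 1."""
    cpi_old = 1 + b * c              # every branch pays c stall cycles
    cpi_new = 1 + (1 - a) * b * c    # only mispredicted branches pay c
    return cpi_old / cpi_new

# Values from the problem: every 5th instruction is a branch (b = 0.2),
# 3 stall cycles per branch (c = 3), 90% prediction accuracy (a = 0.9).
speedup = branch_speedup(b=0.2, a=0.9, c=3)
print(f"{speedup:.2f}")
```

Changing `a` shows how quickly the benefit grows as accuracy approaches 100%.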

#### One-bit branch predictor




- Two-bit branch predictor
  - Increment if taken
  - Decrement if untaken
  - One misprediction on loop exit
- Accuracy = 28/30 ≈ 0.93
- How to improve?
  - 3-bit predictor?
- Problem?
  - A single predictor shared among many branches
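The difference between one-bit and two-bit predictors can be sketched in Python. The branch trace here is an assumption, since the slide's loop figure is not reproduced: a loop branch that is taken 14 times and then falls through, with the loop executed twice, which gives 30 branch executions. The initial states (predict taken, counter at 3) are also assumed.

```python
def run_one_bit(outcomes, state=True):
    """One-bit predictor: predict whatever the branch did last time."""
    correct = 0
    for taken in outcomes:
        correct += (state == taken)
        state = taken
    return correct

def run_two_bit(outcomes, counter=3):
    """Two-bit saturating counter: predict taken when counter >= 2."""
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        # Increment if taken, decrement if untaken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct

# Assumed trace: taken 14 times, not taken on loop exit; the loop runs twice.
trace = ([True] * 14 + [False]) * 2
print("one-bit:", run_one_bit(trace), "of", len(trace))
print("two-bit:", run_two_bit(trace), "of", len(trace))
```

On this assumed trace the two-bit counter mispredicts only the two loop exits, while the one-bit predictor also mispredicts the first iteration when the loop is re-entered.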



- How to assign a branch to each counter?
  - Decode History Table (DHT)
    - Reduced hardware, with aliasing
    - $Cost = n2^{b}$ bits
  - Branch History Table (BHT)
    - Precisely tracks branches
    - Most significant bits are used as tags
    - (+) No aliasing
    - (-) Missing entries
  - Combined BHT and DHT
    - BHT is used on a hit
    - DHT is used/updated on a miss
    - DHT typically has more entries than BHT
    - $Cost = (a-b+2n)2^{b}$ bits

Figure: program code containing branch-1, branch-2, and branch-3.
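A rough Python sketch of the table lookup and the two cost expressions follows. The symbol meanings are assumptions read off the formulas: $n$ is the counter width in bits, $b$ is the number of PC bits used as the table index, and $a$ is the PC width, so a tag is $a-b$ bits. The helper names, the dictionary used for the BHT, and the example addresses are hypothetical, and the PC is treated as a plain instruction index.

```python
def split_pc(pc, b):
    """Low b bits of the PC select a table entry; the remaining bits form the tag."""
    return pc & ((1 << b) - 1), pc >> b

def dht_cost_bits(n, b):
    """Untagged table: 2^b counters of n bits each."""
    return n * 2**b

def combined_cost_bits(a, b, n):
    """Tagged BHT entry (a - b tag bits + n counter bits) plus an n-bit DHT counter."""
    return (a - b + 2 * n) * 2**b

def predict(pc, b, bht, dht):
    """Use the tagged BHT on a hit, fall back to the untagged DHT on a miss."""
    index, tag = split_pc(pc, b)
    entry = bht.get(index)               # entry is (tag, counter) or None
    if entry is not None and entry[0] == tag:
        return entry[1] >= 2, "BHT"
    return dht[index] >= 2, "DHT"

# Hypothetical example: 4 index bits, two branches whose PCs share index 3.
B = 4
dht = [0] * (1 << B)                     # untagged counters, all strongly not-taken
bht = {3: (0x1, 3)}                      # index 3 holds tag 0x1, strongly taken
hit = predict(0x13, B, bht, dht)         # tag matches, so the BHT answers
miss = predict(0x23, B, bht, dht)        # same index, different tag: DHT answers
print(hit, miss)
print(dht_cost_bits(2, 10), combined_cost_bits(32, 10, 2))
```

In an untagged table the two PCs would silently share one counter (aliasing); the tag is what lets the BHT tell them apart, at the cost of extra bits per entry.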



Executed branches of a program stream may be correlated

```
while (1) {
    if (x == 0)     // branch-1
        y = 0;
    ...
    if (y == 0)     // branch-2
        x = 1;
}
```

```
while:  BNEQ R1, R0, skp1    ; branch-1: skip if x != 0
        ADDI R2, R0, #0      ; y = 0
skp1:   ...
        BNEQ R2, R0, skp2    ; branch-2: skip if y != 0
        ADDI R1, R0, #1      ; x = 1
skp2:   J while
```

- Global History Register (GHR): an r-bit shift register that maintains outcome history
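The shift-register behavior of the GHR can be sketched as below. The function name `shift_in`, the encoding (1 = taken, 0 = not taken, newest outcome in the lowest bit), and the outcome sequence are assumptions for illustration.

```python
def shift_in(ghr: int, taken: bool, r: int) -> int:
    """Shift the newest outcome into an r-bit history (1 = taken, 0 = not taken)."""
    return ((ghr << 1) | int(taken)) & ((1 << r) - 1)

ghr = 0
for taken in [True, False, True, True]:   # made-up outcome sequence
    ghr = shift_in(ghr, taken, r=2)
print(f"{ghr:02b}")   # only the two most recent outcomes remain
```

With such a register, the outcome of branch-1 is still visible when branch-2 is predicted, which is what lets a predictor exploit the correlation between them.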





#### **Global Branch Predictor**

- GHR is merged with PC bits to choose a counter
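One way to picture the merge is the following sketch. It assumes the merge is a concatenation of low PC bits with the GHR (XOR, as in gshare, is another common choice), that the counters are two-bit saturating counters starting at weakly not-taken, and that the PC is a plain instruction index. The class name and the alternating test pattern are hypothetical.

```python
class GlobalPredictor:
    def __init__(self, pc_bits, r):
        self.pc_bits, self.r = pc_bits, r
        self.ghr = 0                                  # r-bit global history
        self.counters = [1] * (1 << (pc_bits + r))    # two-bit counters, weakly not-taken

    def _index(self, pc):
        low_pc = pc & ((1 << self.pc_bits) - 1)
        return (low_pc << self.r) | self.ghr          # concatenation (assumed merge)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.r) - 1)

predictor = GlobalPredictor(pc_bits=2, r=2)
outcomes = [True, False] * 10          # made-up alternating branch at PC 0
correct = 0
for taken in outcomes:
    correct += (predictor.predict(0) == taken)
    predictor.update(0, taken)
print(correct, "of", len(outcomes))
```

After a short warm-up, each history value selects its own counter, so the alternating pattern, which a single two-bit counter predicts poorly, is learned.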






#### **Local Branch Predictor**

- One history register per branch (instead of a single shared GHR)
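A per-branch history scheme might look like the sketch below. Several simplifications are assumed: unbounded dictionaries stand in for fixed-size hardware tables, counters are selected by the pair (PC, local history), and unseen counters start at weakly not-taken. The class name, branch addresses, and outcome patterns are made up.

```python
class LocalPredictor:
    def __init__(self, r):
        self.r = r
        self.history = {}     # pc -> r-bit history of that branch only
        self.counters = {}    # (pc, history) -> two-bit saturating counter

    def predict(self, pc):
        h = self.history.get(pc, 0)
        return self.counters.get((pc, h), 1) >= 2

    def update(self, pc, taken):
        h = self.history.get(pc, 0)
        c = self.counters.get((pc, h), 1)
        self.counters[(pc, h)] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history[pc] = ((h << 1) | int(taken)) & ((1 << self.r) - 1)

lp = LocalPredictor(r=2)
A, B = 0x10, 0x20                 # hypothetical branch addresses
score = {A: 0, B: 0}
for i in range(10):
    # Branch A alternates taken/not-taken; branch B is always taken.
    for pc, taken in ((A, i % 2 == 0), (B, True)):
        score[pc] += (lp.predict(pc) == taken)
        lp.update(pc, taken)
print(score[A], score[B])
```

Because each branch keeps its own history, the interleaved execution of B does not disturb the pattern recorded for A.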



#### **Tournament Branch Predictor**

- A local predictor may work well for some applications, while a global predictor works better for others
  - Include both and identify/use the best one for each branch
- Two-bit saturating counters
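The selection step can be sketched with a per-branch two-bit saturating counter. The details are assumptions rather than something stated on the slide: a counter value of 2 or more selects the global predictor, unseen branches start at 1 (weakly local), and the counter is trained only when the two predictors disagree. The class name `Chooser` and the branch address are hypothetical.

```python
class Chooser:
    """Per-branch two-bit saturating counter: >= 2 selects the global predictor."""

    def __init__(self):
        self.counters = {}            # pc -> counter; unseen branches start at 1 (assumed)

    def select(self, pc, local_pred, global_pred):
        return global_pred if self.counters.get(pc, 1) >= 2 else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return                    # no information about which one is better
        c = self.counters.get(pc, 1)
        # Move toward whichever predictor was right.
        self.counters[pc] = min(c + 1, 3) if global_pred == taken else max(c - 1, 0)

chooser = Chooser()
pc = 0x40                             # hypothetical branch address
first = chooser.select(pc, local_pred=False, global_pred=True)   # starts on local
for _ in range(2):                    # global is right twice while local is wrong
    chooser.update(pc, local_pred=False, global_pred=True, taken=True)
later = chooser.select(pc, local_pred=False, global_pred=True)
print(first, later)
```

After the global predictor has been right twice where the local one was wrong, the chooser switches to the global prediction for that branch.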

#### **Branch Prediction Summary**

- Dedicated predictor per branch
  - Program counter is used for assigning predictors to branches
- Capturing correlation among branches
  - Shift register is used to track history
- Predicting branch direction is not enough
  - Which instruction should be fetched if the branch is taken?
- Storing the target instruction can eliminate fetching
  - Extra hardware is required

#### Branch Target Buffer

Store tags and target addresses for each branch
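A direct-mapped buffer of tags and targets can be sketched as follows. The organization is an assumption (direct-mapped, low PC bits as index, remaining bits as tag, PC treated as a plain instruction index), and the class name, method names, and addresses are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, index_bits):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (tag, target) or None

    def _split(self, pc):
        return pc & ((1 << self.index_bits) - 1), pc >> self.index_bits

    def lookup(self, pc):
        """Return the stored target on a tag match, otherwise None."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def insert(self, pc, target):
        """Record the target of a resolved branch, replacing any older entry."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

btb = BranchTargetBuffer(index_bits=4)
btb.insert(0x13, 0x80)            # branch at 0x13 jumps to 0x80
print(btb.lookup(0x13))           # hit: stored target
print(btb.lookup(0x23))           # same index, different tag: miss
```

On a miss the fetch stage has no stored target and would continue with the next sequential PC until the branch is resolved.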




