Branch Prediction


### Branch Prediction

J. Nelson Amaral

Why Branch Prediction?
• Roughly one of every 5-7 instructions in a program is a branch
• Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units.

Baer p. 129

Anatomy of a Branch Predictor


• Event Source: the execution of the program
• Predictive information can be:
• encoded in the instruction code
• a bit indicates the most likely outcome
• the forward/backward direction of the branch
• obtained from profiling information

Baer p. 130

Anatomy of a Branch Predictor (cont.)


• Event Selection: when to predict?
• Simple solution: compute the prediction for every instruction (even non-branches)
• Only use the result of the prediction for branches

Baer p. 130

Anatomy of a Branch Predictor (cont.)


• Prediction Indexing:
• Use part of the PC to index prediction tables:
• history of outcome of previous branches at this PC
• history of execution path leading to this PC

Baer p. 130

Anatomy of a Branch Predictor (cont.)
• Predictor Mechanism:
• Static (example):
• forward: always not taken
• backward: always taken
• Dynamic:
• Finite State Machine predictor: saturating counters
• Markov predictor: correlation


Baer p. 131

Anatomy of a Branch Predictor (cont.)
• Feedback and Recovery:
• Use real outcome to reinforce prediction
• Must recover from mispredictions

Feedback

Baer p. 131

Control Flow Statistics

A 4-way superscalar has to predict a branch, on average, every other cycle.

Baer p. 131

Interbranch Distances

• 40% of the time there are 1 or 0 cycles between predictions
• Branch resolution takes about 10 cycles
• If the prediction is wrong, up to 40 wrong instructions are in flight by the time the resolution occurs.

Simulation for a 4-way out-of-order architecture

Baer p. 131

Static Predictions

Always Taken OR Always Not Taken

Baer p. 132

Static Predictions
• Early studies indicated that 2/3 of branches are taken
• but 30% of those branches were unconditional!
• For conditional branches there appears to be no preferred direction.

Always Taken

Baer p. 132

Alternative Static Predictions: Forward Always Not Taken, Backward Always Taken
• Accuracy improvements are barely noticeable.
• Static prediction based on profiling is slightly better.
• Static branch-not-taken has no implementation cost in the pipeline.

Baer p. 132

Dynamic Predictors
• Prediction of a given branch changes with the execution of the program.
• Simple: a finite-state machine encodes the outcome of a few recent executions of the branch.
• Elaborate: not only earlier branch outcomes, but other correlated parts of the program are considered.

Baer p. 132

When to predict?
• Static prediction: at the Instruction Decode stage
• Know that the instruction is a branch
• Dynamic prediction: at the Instruction Fetch stage
• Calculate prediction for every instruction, even non-branch ones.

Baer p. 133

What to Predict?
• Branch Direction: Is the branch taken or not?
• Branch Target: Address of next instruction for a taken branch

Baer p. 133

Predicting Direction
• Where do we find the prediction?
• How do we encode the prediction?

Look at the recent past: what was the direction the last time this same branch was executed?

A single bit encodes the prediction; the prediction bit is set at prediction time.

Baer p. 133

Prediction Hysteresis
• Look at the last two resolutions
• Two wrong predictions are necessary to change the prediction
• Motivated by wrong predictions at the end of inner loops.

Baer p. 133

2-Bit Saturating Counter
• strongly taken (sT): the last two instances were taken
• weakly taken (wT): the last instance was taken but the previous was not
• weakly not taken (wNT): the last instance was not taken but the previous was taken
• strongly not taken (sNT): the last two instances were not taken

Baer p. 134

2-Bit Saturating Counter (Example)

    for (i=0; i < m; i++)
        for (j=0; j < n; j++)
            begin S1; S2; …; Sk end;

[The slide shows the flow graph of the nested loop: test m ≤ 0, i ← 0, test n ≥ 0, j ← 0, body S1; S2; …; Sk, j ← j+1, back edge taken while j < n, then i ← i+1 and back edge while i < m.]

Prediction of the inner-loop branch with a 1-bit predictor:

| i | j | Pred | Outc |
|---|---|------|------|
| 0 | 0 | NT   | T    |
| 0 | 1 | T    | T    |
| … | … | …    | …    |
| 0 | n | T    | NT   |
| 1 | 0 | NT   | T    |
| 1 | 1 | T    | T    |

2 × m mispredictions

Baer p. 134

2-Bit Saturating Counter (Example)

    for (i=0; i < m; i++)
        for (j=0; j < n; j++)
            begin S1; S2; …; Sk end;

[Same flow graph as on the previous slide.]

| i | j | 1-bit Pred | Outc | 2-bit State | 2-bit Pred | Outc |
|---|---|------------|------|-------------|------------|------|
| 0 | 0 | NT         | T    | wNT         | NT         | T    |
| 0 | 1 | T          | T    | sT          | T          | T    |
| … | … | …          | …    | …           | …          | …    |
| 0 | n | T          | NT   | sT          | T          | NT   |
| 1 | 0 | NT         | T    | wT          | T          | T    |
| 1 | 1 | T          | T    | sT          | T          | T    |

m + 1 mispredictions

Baer p. 134

Accuracy of Branch Prediction
• Includes unconditional branches
• Predictions are associated with branches after each branch’s first execution
• Average of 26 traces (IBM 370, DEC PDP-11, CDC 6400)
• Average of 32 traces (MIPS R2000, Sun SPARC, DEC VAX, Motorola 68000)
• 3-bit counters yield only minor improvements
• Fixed prediction, determined by the first execution of the branch.

Baer p. 135

Where to Store the Prediction
• A 32-bit address space → 2³⁰ word-aligned instruction addresses; need one (or two) bits for each.
• Storing prediction bits with the instructions: would need to modify the code every 5 instructions.
• Use a cache (Branch Prediction Buffer – BPB): many more bits for tags than for predictions.
• Solution: ditch the tags.

Baer p. 136

Pattern History Table (PHT)
• Use selected bits from the PC to index (or hash into) the PHT.
• Each entry of the PHT stores the state of a finite state machine associated with a branch.
• Aliasing: multiple branches may index the same PHT entry.

Baer p. 136

Accuracy of Bimodal Predictor (based on PHT)

Based on 10 SPEC89 traces.

Baer p. 137

Where Is the Predictor Stored?

Either in a separate PHT or embedded in the instruction cache:
• MIPS R10000: 512 counters
• Alpha 21264: 1 counter per instruction (2K counters)
• Sun UltraSPARC: 2 counters per cache line (2K counters)
• IBM PowerPC 620: 512 counters
• AMD K5: 1 counter per cache line (1K counters)
• Intel Pentium: combines the PHT with the Branch Target Buffer (512 entries)

Baer p. 137

Feedback and Recovery


Baer p. 137

Feedback: Bimodal Predictor
• Feedback: update the 2-bit counter for the executing branch
• When is the update done?
• When the actual direction is found (EX stage): other predictions of the same branch have already been made.
• When the branch commits: even more predictions have been made.
• Speculatively, when the prediction is made: this only reinforces the prediction in a bimodal predictor.

EX-time vs. commit-time updating makes little difference in performance.

Baer p. 137

Textbook typo (p. 137): choice for the timing of the “update”.

Local vs. Global Predictor
• Local:
• Only use history of the branch to be predicted
• Global:
• Use history of other branches that precede the branch to be predicted.

Baer p. 138

Motivation for Global Prediction
• Example from the SPEC program eqntott:

    if (aa == 2)      /* b1 */
        aa = 0;
    if (bb == 2)      /* b2 */
        bb = 0;
    if (aa != bb) {   /* b3 */
        ….
    }

If b1 and b2 are taken, then b3 is not taken.

Baer p. 138

Correlator Predictor
• Two-level predictor.
• History Register: a 1 is inserted from the right when a branch is taken (a 0 otherwise); shifted-out bits are lost.

Baer p. 139

Update Problem in the Correlator Predictor
• PHT is updated non-speculatively at commit stage.
• What is the problem with non-speculative updates of the global register?

Baer p. 139

Updating the Global Register in the Correlator Predictor

    if (aa == 2)      /* b1 */
        aa = 0;
    if (bb == 2)      /* b2 */
        bb = 0;
    if (aa != bb) {   /* b3 */
        ….
    }

Branches b1 and b2 are not included in the prediction of branch b3!

Baer p. 139

Updating the Global Register in the Correlator Predictor (cont.)
• Mispredictions and cache misses affect the commit time of earlier branches.
• Two consecutive predictions of a branch b may use different ancestors of b, even if the path leading to b is the same.

Baer p. 139

Solution to the Update Problem in the Correlator Predictor
• Update Global Register speculatively when prediction is made.
• New problem:
• Need a repair mechanism
• All bits after a misprediction are from branches in the wrong path.

Baer p. 139

GR Checkpointing
• Decode Stage: checkpoint the current GR into a FIFO queue
• Commit Stage: let H be the head of the queue; the corresponding checkpointed GR is H.
• On an incorrect prediction: shift the actual branch outcome into H and make it the new GR.

Baer p. 144

Optimization to GR Checkpointing

Put into the queue a GR that has the corrected bit shifted into it.

Baer p. 144

Issues with Correlator Predictor
• For small PHTs
• Performance is worse than local predictors
• It does not use the location of the branch in the program for the prediction
• May introduce excessive aliasing
• Solution to the aliasing problem:
• Reintroduce the PC in the indexing of PHT

Baer p. 140

gshare Predictor

A common hash is an XOR function.

Baer p. 141

Accuracy and Use of gshare
• Almost perfect for SPEC FP95
• 0.83 accuracy for SPEC INT95
• 0.65 for the program go
• Used in the Sun UltraSPARC, IBM Power4, and AMD K5.

Baer p. 141

Example

[The slide again shows the flow graph of the nested loop from the 2-bit counter example.]

• Assume n=4:
• bimodal mispredicts 1 out of 5 times
• global mispredicts from 0 to 5 times depending on other branches in the loop
• This branch has a fixed pattern: “4 taken, 1 not taken”
• How can this pattern be learned? Remember the history of individual branches.
• We need predictors more attuned to the locality of individual branches.

Baer p. 142

global-set predictor
• First Level: A global shift register for correlations
• Second Level: A set of multiple PHTs to prevent aliasing
• expensive in terms of storage
• must use few PHTs to be viable

Baer p. 142/143

set-global predictor
• A set of branch history registers (forming a Branch History Table, BHT)
• A single global PHT

Baer p. 143

set-set predictor
• A set of branch history registers (BHT)
• A set of PHTs

Baer p. 143

Predicting the Branch Target
• When is the target of a branch computed?
• In a superscalar architecture (e.g., the IA-32 Intel P6), only after several pipeline stages.
• What is the point of predicting direction early if we don’t know where the branch goes?
• Need to also predict the branch target address.

Baer p. 145

Branch Target Buffer (BTB)
• A cachelike storage that records branch addresses and associated targets
• If there is a hit in BTB for branch predicted taken:
• PC ← Target in BTB for branch

Baer p. 146

Integrated BTB-PHT
• BTB needs much more space than the PHT
• # of entries is limited by BTB.
• BTB must be accessed in a single cycle

Baer p. 146

Decoupled BTB-PHT
• Parallel BTB and PHT access
• If the PHT says ‘taken’ and there is a hit in the BTB, then PC ← Address in BTB

Baer p. 146

Decoupled BTB-PHT
• For space efficiency:
• Only taken branches are added to BTB
• They are added at the backend when the outcome is known.

IBM PowerPC 620: 256-entry, 2-way set-associative BTB; 2K-counter PHT

Baer p. 146

Integrating the BTB with the Branch History Table (BHT)
• Most likely, it is not the same bit field of the PC that is used to index the BTB+BHT and to select the PHT.
• Intel P6: 4-bit local history, 512 BTB entries, number of PHTs not published.
• What happens on a BTB miss? “Backward taken, forward not taken” prediction.
• The history of all branches needs to be recorded in the BTB+BHT
• Taken and not-taken branches need to be included

Baer p. 147

Two Instances of Mispredictions
• Direction of branch b is mispredicted
• Recovery only when b is at the head of the reorder buffer
• lots of instructions to be nullified
• BTB miss for branch b (direction is correctly predicted taken)
• Cannot fetch instructions until target is computed
• only affects the filling of the front end

Baer p. 147

misfetch
• The branch is correctly predicted taken and there is a hit in the BTB, but the fetched target is wrong
• caused by indirect jumps
• more common in object-oriented languages
• can modify a BTB entry after two misfetches
• need a counter with each BTB entry

Intel Pentium M: has an indirect branch predictor that associates targets with global history registers.

Baer p. 148

CMPUT 229 Flashback: Procedure Call Instructions
• Procedure call: jump and link
• Address of following instruction put in \$ra
• Procedure return: jump register
• Copies \$ra to program counter
• Can also be used for computed jumps
• e.g., for case/switch statements

jal ProcedureLabel

jr \$ra

Chapter 2 — Instructions: Language of the Computer — 53

P-H p. 113

Pat.-Hen. pp. 136-138

and A-26/A-29

C code:

    int fact (int n)
    {
        if (n < 1)
            return 1;
        else
            return n * fact(n-1);
    }

Example: fact(3), called from

    0x1000 4000  jal fact
    0x1000 4004  ….

Processor state before the call: \$a0 = 3, \$sp = 0x1000 2000 (\$v0, \$t0, and \$ra not yet set).

MIPS assembly:

    fact:
        sub  \$sp, \$sp, 8     # make room in stack for 2 more items
        sw   \$ra, 4(\$sp)     # save the return address
        sw   \$a0, 0(\$sp)     # save the argument n
        slt  \$t0, \$a0, 1     # if (\$a0 < 1) then \$t0 ← 1 else \$t0 ← 0
        beq  \$t0, \$zero, L1  # if n ≥ 1, go to L1
        add  \$v0, \$zero, 1   # return 1
        add  \$sp, \$sp, 8     # pop two items from the stack
        jr   \$ra             # return to the caller
    L1: sub  \$a0, \$a0, 1     # subtract 1 from argument
        jal  fact            # call fact(n-1)
        lw   \$a0, 0(\$sp)     # just returned from jal: restore n
        lw   \$ra, 4(\$sp)     # restore the return address
        add  \$sp, \$sp, 8     # pop two items from the stack
        mul  \$v0, \$a0, \$v0   # return n*fact(n-1)
        jr   \$ra             # return to the caller

After executing the jal at 0x1000 4000: \$ra = 0x1000 4004, \$a0 = 3, \$sp = 0x1000 2000. (Code as on the previous slide.)

At the deepest point of the recursion the stack holds the saved (\$ra, n) pairs (0x1000 4004, 3), (0x1000 6FEC, 2), (0x1000 6FEC, 1), (0x1000 6FEC, 0); after unwinding, \$v0 = 6 and \$t0 = 1. (Code as on the previous slide.)


Call/Return Mechanisms

The return address is known at the time of each call!

    foo(….)
    {
        0x10001000  jal bar
        0x10001004  …
        0x10001800  jal bar
        0x10001804  …
        0x10001CE4  jal bar
        0x10001CE8  …
        ...
    }

    bar(….)
    {
        0x1000F0E0  jal baz
        0x1000F0E4  …
        ...
        jr \$ra
    }

    baz(….)
    {
        ...
        jr \$ra
    }

How to predict the next instruction to be executed after the return? We know that the branch is always taken.

Baer p. 150

Return Address Stack

Push the return address onto a stack at the function call; pop the address from the stack at the return.

The stack is a circular FIFO: on overflow, the oldest entry is overwritten and the corresponding return gets a wrong address.

What is the best strategy to handle FIFO overflow?

Baer p. 150

Speculative calls and returns

Function calls and returns may be executed in the predicted path of a branch:

    foo(….)
    {
        0x10000FFC  beq … target
        0x10001000  jal bar
        0x10001004  …
    target:
        0x10001800  jal baz
        0x10001804  …
        0x10001CE4  jal bar
        0x10001CE8  …
        ...
    }

    bar(….)
    {
        0x1000F0E0  bne … next
        0x1000F0E4  jr \$ra
        ...
    next:
        ….
    }

Need a recovery mechanism for the return stack: if a single path is followed, save the pointer to the top of the stack on a branch prediction and restore it in case of misprediction.

Baer p. 150

Return Stacks
• MIPS R10000: 1-entry return stack
• DEC Alpha 21164: 12-entry return stack
• Intel Pentium III: 16-entry return stack

Baer p. 151

A different way of doing things…

Don’t know which way to go?

“Some people go both ways.”

(Scarecrow, The Wizard of Oz)

Baer p. 151

IBM System 360/91
• Upon decoding a branch:
• fetch, decode, and enqueue both the taken and the not taken paths into separate buffers
• Upon branch resolution:
• one buffer becomes the execution path

Baer p. 151

In a restricted version … (Intel P6, MIPS R10000)
• The branch is predicted taken and there is a BTB hit.
• The fall-through instructions in the instruction cache line are saved in a resume buffer.
• On a misprediction, fetch restarts from the resume buffer.

Baer p. 151

Loop Detector
• A separate loop predictor detects loop patterns:
• TTTTTTTNTTTTTTTNTTTTTTTNTTTTTTTNTT….
• Uses a separate counter for each recognized loop

Intel Pentium M

Baer p. 151

Sophisticated Predictors
• Tension: branch correlation (global information) vs. individual branch patterns (local information)
• neutral aliasing: between branches biased the same way
• destructive aliasing: between branches with opposite bias
• bias bit: the PHT predicts whether the direction agrees with the bias bit, so two strongly but oppositely biased branches that alias do not destroy each other’s predictions.

Baer p. 152

skewed predictor
• Goal: reduce aliasing
• Use three PHTs
• different hashing function for each PHT
• Take majority vote

Baer p. 153

hybrid (or combining) predictor
• Two different prediction strategies
• Tournament predictor: predicts which strategy should be used

Baer p. 156