ILP and Dynamic Execution: Branch Prediction & Multiple Issue - PowerPoint PPT Presentation

Ilp and dynamic execution branch prediction multiple issue
1 / 37

  • Uploaded on
  • Presentation posted in: General

ILP and Dynamic Execution: Branch Prediction & Multiple Issue. Original by Prof. David A. Patterson. Review Tomasulo. Reservations stations: implicit register renaming to larger set of registers + buffering source operands Prevents registers as bottleneck

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

ILP and Dynamic Execution: Branch Prediction & Multiple Issue

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Ilp and dynamic execution branch prediction multiple issue

ILP and Dynamic Execution: Branch Prediction & Multiple Issue

Original by

Prof. David A. Patterson

Review tomasulo

Review Tomasulo

  • Reservations stations: implicit register renaming to larger set of registers + buffering source operands

    • Prevents registers as bottleneck

    • Avoids WAR, WAW hazards of Scoreboard

    • Allows loop unrolling in HW

  • Not limited to basic blocks (integer units gets ahead, beyond branches)

  • Lasting Contributions

    • Dynamic scheduling

    • Register renaming

  • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Tomasulo algorithm and branch prediction

Tomasulo Algorithm and Branch Prediction

  • 360/91 predicted branches, but did not speculate: pipeline stopped until the branch was resolved

    • No speculation; only instructions that can complete

  • Speculation with Reorder Buffer allows execution past branch, and then discard if branch fails

    • just need to hold instructions in buffer until branch can commit

Case for branch prediction when issue n instructions per clock cycle

Case for Branch Prediction when Issue N instructions per clock cycle

  • Branches will arrive up to n times faster in an n-issue processor

  • Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

7 branch prediction schemes

7 Branch Prediction Schemes

  • 1-bit Branch-Prediction Buffer

  • 2-bit Branch-Prediction Buffer

  • Correlating Branch Prediction (two-level) Buffer

  • Tournament Branch Predictor

  • Branch Target Buffer

  • Integrated Instruction Fetch Units

  • Return Address Predictors

Dynamic branch prediction

Dynamic Branch Prediction

  • Performance = ƒ(accuracy, cost of misprediction)

  • Branch History Table (BHT): Lower bits of PC address index table of 1-bit values

    • Says whether or not branch taken last time

    • No address check (saves HW, but may not be right branch)

  • Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):

    • End of loop case, when it exits instead of looping as before

    • First time through loop on next time through code, when it predicts exit instead of looping

    • Only 80% accuracy even if loop 90% of the time

Dynamic branch prediction jim smith 1981

Dynamic Branch Prediction(Jim Smith, 1981)

  • Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 3.7, p. 249)

  • Red: stop, not taken

  • Green: go, taken

  • Adds hysteresis to decision making process



Predict Taken

Predict Taken





Predict Not


Predict Not




Prediction accuracy of a 4096 entry 2 bit prediction buffer vs an infinite buffer

Prediction Accuracy of a 4096-entry 2-bit Prediction Buffer vs. an Infinite Buffer

Bht accuracy

BHT Accuracy

  • Mispredict because either:

    • Wrong guess for that branch

    • Got branch history of wrong branch when index the table

  • 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

  • For SPEC92,4096 about as good as infinite table

Improve prediction strategy by correlating branches

Improve Prediction Strategy By Correlating Branches

  • Consider the worst case for the 2-bit predictor

    if (aa==2) then aa=0;

    if (bb==2) then bb=0;

    if (aa != bb) then whatever

    • single level predictors can never get this case

  • Correlating or 2-level predictors

    • Correlation = what happened on the last branch

      • Note that the last correlator branch may not always be the same

    • Predictor = which way to go

      • 4 possibilities: which way the last one went chooses the prediction

        • (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)

if the first 2 fail then the 3rd will always be taken

From 柯皓仁, 交通大學

Correlating branches

Correlating Branches

  • Hypothesis: recently executed branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

  • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table

  • In general, (m,n) predictor means record last m branches to select between 2m history tables each with n-bit counters

    • Old 2-bit BHT is then a (0,2) predictor

From 柯皓仁, 交通大學

Example of correlating branch predictors

if (d==0)

d = 1;

if (d==1)

BNEZR1, L1 ;branch b1 (d!=0)

DAAIUR1, R0, #1;d==0, so d=1

L1:DAAIU R3, R1, #-1

BNEZR3, L2;branch b2 (d!=1)


Example of Correlating Branch Predictors

From 柯皓仁, 交通大學

A problem for 1 bit predictor without correlators

A Problem for 1-bit Predictor without Correlators

From 柯皓仁, 交通大學

Example of correlating branch predictors 1 1 cont

Example of Correlating Branch Predictors (1,1) (Cont.)

From 柯皓仁, 交通大學

In general m n bht prediction buffer

In general: (m,n) BHT (prediction buffer)

  • p bits of buffer index = 2p entries of BHT

  • Use last m branches = global branch history

  • Use n bit predictor

  • Total bits for the (m, n) BHT prediction buffer:

From 柯皓仁, 交通大學

2 2 predictor implementation

(2,2) Predictor Implementation

4 banks = each with 32 2-bit predictor entries



From 柯皓仁, 交通大學

Accuracy of different schemes

Accuracy of Different Schemes

Tournament predictors

Tournament Predictors

  • Adaptively combine local and global predictors

    • Multiple predictors

      • One based on global information: Results of recently executed m branches

      • One based on local information: Results of past executions of the current branch instruction

    • Selector to choose which predictors to use

  • Advantage

    • Ability to select the right predictor for the right branch

From 柯皓仁, 交通大學

Misprediction rate comparison

Misprediction Rate Comparison

Branch target buffer btb

Branch Target Buffer (BTB)

  • To reduce the branch penalty to 0

    • Need to know what the address is by the end of IF

    • But the instruction is not even decoded yet

    • So use the instruction address rather than wait for decode

      • If prediction works then penalty goes to 0!

  • BTB Idea -- Cache to store taken branches (no need to store untaken)

    • Match tag is instruction address  compare with current PC

    • Data field is the predicted PC

  • May want to add predictor field

    • To avoid the mispredict twice on every loop phenomenon

    • Adds complexity since we now have to track untaken branches as well

Need address at same time as prediction

Branch PC

Predicted PC

Need Address at Same Time as Prediction

  • Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)

    • Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, p. 262)

PC of instruction




prediction state


Yes: instruction is branch and use predicted PC as next PC

No: branch not

predicted, proceed normally

(Next PC = PC+4)

Integrated instruction fetch units

Integrated Instruction Fetch Units

  • Integrated branch prediction

    • The branch predictor becomes part of the instruction fetch unit and is constantly predicting branches

  • Instruction prefetch

    • To deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead

  • Instruction memory access and buffering

Return address predictor

Return Address Predictor

  • Indirect jump – jumps whose destination address varies at run time

    • indirect procedure call, select or case, procedure return

    • SPEC89 benchmarks: 85% of indirect jumps are procedure returns

  • Accuracy of BTB for procedure returns are low

    • if procedure is called from many places, and the calls from one place are not clustered in time

  • Use a small buffer of return addresses operating as a stack

    • Cache the most recent return addresses

    • Push a return address at a call, and pop one off at a return

    • If the cache is sufficient large (max call depth)  prefect

From 柯皓仁, 交通大學

Dynamic branch prediction summary

Dynamic Branch Prediction Summary

  • Branch History Table: 2 bits for loop accuracy

  • Correlation: Recently executed branches correlated with next branch.

    • Either different branches

    • Or different executions of same branches

  • Tournament Predictor: more resources to competitive solutions and pick between them

  • Branch Target Buffer: include branch address & prediction

  • Return address stack for prediction of indirect jump

Getting cpi 1 issuing multiple instructions cycle

Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Vector Processing

    • Explicit coding of independent loops as operations on large vectors of numbers

    • Multimedia instructions being added to many processors

  • Superscalar

    • varying number instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)

    • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4

  • (Very) Long Instruction Words (V)LIW:

    • fixed number of instructions (4-16) scheduled by the compiler; put operations into wide templates

    • Intel Architecture-64 (IA-64) 64-bit address

      • Renamed: “Explicitly Parallel Instruction Computer (EPIC)”

  • Anticipated success of multiple instructions lead to Instructions Per Clock cycle (IPC) vs. CPI

Getting cpi 1 issuing multiple instructions cycle1

Getting CPI < 1: IssuingMultiple Instructions/Cycle

  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything

    – Fetch 64-bits/clock cycle; Int on left, FP on right


    Int. instructionIFIDEXMEMWB

    FP ADD instructionIFIDEXEXEXWB

    Int. instructionIFIDEXMEMWB

    FP ADD instructionIFIDEXEXEXWB

    Int. instructionIFIDEXMEMWB

    FP ADD instructionIFIDEXEXEXWB

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:

    • Exactly 50% FP operations AND No hazards

Multiple issue issues

Multiple Issue Issues

  • Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock

    • If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue

    • 0 to N instruction issues per clock cycle, for N-issue

  • Performing issue checks in 1 cycle could limit clock cycle time

    • Issue stage usually split and pipelined

    • 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued

    • higher branch penalties => prediction accuracy important

Studies of the limitations of ilp

Studies of The Limitations of ILP

  • Perfect hardware model

    • Rename as much as you need

      • Infinite registers

      • No WAW or WAR

    • Branch prediction is perfect

      • Never happen in reality

    • Jump prediction (even computed such as return) is perfect

      • Similarly unreal

    • Perfect memory disambiguation

      • Almost perfect is not too hard in practice

    • Issue an unlimited number of instructions at once & no restriction on types of instructions issued

    • One-cycle latency

What a perfect processor must do

What A Perfect Processor Must Do?

  • Look arbitrary far ahead to find a set of instructions to issue, predicting all branches perfectly

  • Rename all register uses to avoid WAW and WAR hazards

  • Determine whether there are any dependences among the instructions in the issue packet; if so, rename accordingly

  • Determine if any memory dependences exist among the issuing instructions and hand them appropriately

  • Provide enough replicated function units to allow all the ready instructions to issue

How many instructions would issue on the perfect machine every cycle

How many instructions would issue on the perfect machine every cycle?

Window size

Window Size

  • The set of instructions that is examined for simultaneous execution is called the window

  • The window size will be determined by the cost of determining whether n issuing instructions have any register dependences among them

    • In theory, This cost will be O(n2), refer to p.243, why?

  • Each instruction in the window must be kept in processor

  • Window size is limited by the required storage, the comparisons, and a limited issue rate

Effects of reducing the window size

Effects of Reducing the Window Size

Effects of realistic branch and jump prediction

Effects of Realistic Branch and Jump Prediction

  • Perfect

  • Tournament-based

    • Uses a correlating 2 bit and non-correlating 2 bit plus a selector to choose between the two

    • Prediction buffer has 8K (13 address bits from the branch)

    • 3 entries per slot - non-correlating, correlating, select

  • Standard 2 bit

    • 512 (9 address bits) entries

    • Plus 16 entry buffer to predict RETURNS

  • Static

    • Based on profile - predict either T or NT but it stays fixed

  • None

The effect of branch prediction schemes

The Effect of Branch-Prediction Schemes

Effects of finite registers

Effects of Finite Registers

Models for memory alias analysis

Models for Memory Alias Analysis

  • Perfect

  • Global/Stack Perfect

    • Perfect prediction for global and stack references and assume all heap references conflict

    • The best compiler-based analysis schemes can do

  • Inspection

    • Pointers to different allocation areas (such as the global area and the stack area) are assumed no conflict

    • Addresses using same register with different offsets

  • None

    • All memory references are assumed to conflict

Effects of memory aliasing

Effects of Memory Aliasing

  • Login