ILP and Dynamic Execution: Branch Prediction & Multiple Issue

Original by

Prof. David A. Patterson

Review Tomasulo

  • Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands

    • Prevents registers from becoming a bottleneck

    • Avoids WAR, WAW hazards of Scoreboard

    • Allows loop unrolling in HW

  • Not limited to basic blocks (the integer unit gets ahead, beyond branches)

  • Lasting Contributions

    • Dynamic scheduling

    • Register renaming

  • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
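The implicit renaming a reservation station performs can be sketched in a few lines of Python (an illustrative model, not the 360/91's tag hardware): every destination is mapped to a fresh tag, so WAR and WAW hazards on architectural registers disappear while true RAW dependences survive.

```python
def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural registers.
    Returns renamed tuples using a fresh tag for every destination."""
    latest = {}          # architectural reg -> tag of its newest producer
    tag = 0
    renamed = []
    for dest, s1, s2 in instructions:
        r1 = latest.get(s1, s1)   # read newest producer (RAW preserved)
        r2 = latest.get(s2, s2)
        latest[dest] = f"t{tag}"  # fresh tag removes WAR/WAW on dest
        renamed.append((latest[dest], r1, r2))
        tag += 1
    return renamed

# F0 is written twice (WAW) and read in between (potential WAR); after
# renaming the two writes target distinct tags t0 and t2.
code = [("F0", "F2", "F4"), ("F6", "F0", "F8"), ("F0", "F10", "F12")]
print(rename(code))
```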

Tomasulo Algorithm and Branch Prediction

  • 360/91 predicted branches, but did not speculate: pipeline stopped until the branch was resolved

    • No speculation; only instructions that can complete

  • Speculation with Reorder Buffer allows execution past branch, and then discard if branch fails

    • just need to hold instructions in buffer until branch can commit

Case for Branch Prediction When Issuing N Instructions per Clock Cycle

  • Branches will arrive up to n times faster in an n-issue processor

  • Amdahl’s Law => the relative impact of control stalls is larger with the lower potential CPI of an n-issue processor
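A back-of-the-envelope model makes the Amdahl's Law point concrete (the 20% branch frequency, 10% misprediction rate, and 3-cycle penalty below are illustrative assumptions, not numbers from the slides): the same absolute stall cost is a far larger fraction of execution time on a low-CPI machine.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    # stall cycles per instruction added by branch mispredictions
    return base_cpi + branch_freq * mispredict_rate * penalty

scalar = effective_cpi(1.00, 0.20, 0.10, 3)  # 1.06 -> ~6% slowdown
wide   = effective_cpi(0.25, 0.20, 0.10, 3)  # 0.31 -> ~24% slowdown
print(scalar, wide)
```

The 0.06 stall CPI is identical in both cases, but it costs the 4-issue machine four times as much relative performance.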

7 Branch Prediction Schemes

  • 1-bit Branch-Prediction Buffer

  • 2-bit Branch-Prediction Buffer

  • Correlating Branch Prediction (two-level) Buffer

  • Tournament Branch Predictor

  • Branch Target Buffer

  • Integrated Instruction Fetch Units

  • Return Address Predictors

Dynamic Branch Prediction

  • Performance = ƒ(accuracy, cost of misprediction)

  • Branch History Table (BHT): Lower bits of PC address index table of 1-bit values

    • Says whether or not branch taken last time

    • No address check (saves HW, but may not be right branch)

  • Problem: in a loop, a 1-bit BHT causes 2 mispredictions per loop execution (avg is 9 iterations before exit):

    • End of loop case, when it exits instead of looping as before

    • First time through loop on next time through code, when it predicts exit instead of looping

    • Only 80% accuracy even if the branch is taken 90% of the time
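The 80% figure is easy to check with a small simulation (a sketch; the initial prediction is an assumption). For a 10-iteration loop, the 1-bit predictor mispredicts both the exit and the re-entry, so it gets only 8 of every 10 outcomes right even though the branch is taken 90% of the time.

```python
def simulate_1bit(outcomes):
    """1-bit predictor: always predict the last outcome."""
    pred = True   # assume the predictor starts out predicting taken
    miss = 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
        pred = taken   # 1-bit: flip to whatever just happened
    return miss

# A 10-iteration loop branch: taken 9 times, then not taken, repeated.
outcomes = ([True] * 9 + [False]) * 100
miss = simulate_1bit(outcomes)
acc = 1 - miss / len(outcomes)
print(acc)   # ~0.80: mispredicts at loop exit and at loop re-entry
```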

Dynamic Branch Prediction (Jim Smith, 1981)

  • Solution: 2-bit scheme where the prediction changes only after two consecutive mispredictions (Figure 3.7, p. 249)

  • Red: stop, not taken

  • Green: go, taken

  • Adds hysteresis to decision making process



[State diagram: two Predict Taken states and two Predict Not Taken states; a taken outcome moves the state toward Predict Taken, a not-taken outcome toward Predict Not Taken, so only two consecutive mispredictions flip the prediction]

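The 2-bit scheme is a saturating counter: states 0–1 predict not taken, states 2–3 predict taken. A minimal sketch (the strongly-taken initial state is an assumption) shows the hysteresis on the same loop branch as before: only the exit mispredicts, not the re-entry, halving the misprediction count.

```python
class TwoBit:
    """2-bit saturating counter (Smith predictor)."""
    def __init__(self, state=3):          # start strongly taken (assumption)
        self.state = state
    def predict(self):
        return self.state >= 2            # 2-3: taken, 0-1: not taken
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBit()
misses = 0
for taken in ([True] * 9 + [False]) * 100:   # 10-iteration loop branch
    if p.predict() != taken:
        misses += 1
    p.update(taken)
print(misses)   # 100: one miss per loop execution, vs. ~2 for 1-bit
```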
BHT Accuracy vs. an Infinite Buffer

  • Mispredict because either:

    • Wrong guess for that branch

    • Got branch history of wrong branch when index the table

  • With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

  • For SPEC92, a 4096-entry table is about as good as an infinite table

Improve Prediction Strategy by Correlating Branches

  • Consider the worst case for the 2-bit predictor

    if (aa==2) aa=0;

    if (bb==2) bb=0;

    if (aa != bb) whatever;

    • A single-level predictor can never capture this case

  • Correlating or 2-level predictors

    • Correlation = what happened on the last branch

      • Note that the last correlator branch may not always be the same

    • Predictor = which way to go

      • 4 possibilities: which way the last one went chooses the prediction

        • (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)

In the compiled code, if the first two branches are not taken (so aa and bb are both set to 0), the third branch will always be taken

From Hao-Ren Ke (柯皓仁), National Chiao Tung University

Correlating Branches

  • Hypothesis: recently executed branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

  • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table

  • In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters

    • Old 2-bit BHT is then a (0,2) predictor

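The (m,n) organization above can be sketched directly (a model under stated assumptions: sizes, the weakly-taken initial counters, and the class name are illustrative). The last m global outcomes pick one of 2^m tables; low PC bits then index an n-bit counter within it.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select among 2**m
    tables of n-bit saturating counters, each indexed by low PC bits."""
    def __init__(self, m, n, index_bits):
        self.m = m
        self.history = 0                       # last m outcomes as bits
        self.max_count = (1 << n) - 1
        self.index_mask = (1 << index_bits) - 1
        # every counter starts weakly taken (assumption)
        self.tables = [[(1 << n) // 2] * (1 << index_bits)
                       for _ in range(1 << m)]

    def predict(self, pc):
        ctr = self.tables[self.history][pc & self.index_mask]
        return ctr > self.max_count // 2

    def update(self, pc, taken):
        t = self.tables[self.history]
        i = pc & self.index_mask
        t[i] = min(self.max_count, t[i] + 1) if taken else max(0, t[i] - 1)
        # shift the new outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

p = CorrelatingPredictor(m=2, n=2, index_bits=4)
print(p.predict(0x40))   # True: counters start at the weakly-taken value
```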

Example of Correlating Branch Predictors

if (d == 0)
    d = 1;
if (d == 1)
    ...

    BNEZ   R1, L1       ;branch b1 (d != 0)
    DADDIU R1, R0, #1   ;d == 0, so d = 1
L1: DADDIU R3, R1, #-1
    BNEZ   R3, L2       ;branch b2 (d != 1)

A Problem for a 1-bit Predictor without Correlators


Example of Correlating Branch Predictors (1,1) (Cont.)


In General: (m,n) BHT (Prediction Buffer)

  • p bits of buffer index = 2^p entries of BHT

  • Use last m branches = global branch history

  • Use n bit predictor

  • Total bits for the (m, n) BHT prediction buffer: 2^m × n × 2^p

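Plugging numbers into the formula (a quick sketch; the 12-bit index is just the 4096-entry table used earlier in these slides):

```python
def bht_bits(m, n, p):
    # 2**m tables x 2**p entries each x n bits per counter
    return (2 ** m) * n * (2 ** p)

print(bht_bits(0, 2, 12))   # plain 4096-entry 2-bit BHT: 8192 bits
print(bht_bits(2, 2, 12))   # (2,2) over the same index: 32768 bits
```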

(2,2) Predictor Implementation

4 banks, each with 32 2-bit predictor entries




Accuracy of Different Schemes

Tournament Predictors

  • Adaptively combine local and global predictors

    • Multiple predictors

      • One based on global information: Results of recently executed m branches

      • One based on local information: Results of past executions of the current branch instruction

    • A selector to choose which predictor to use

  • Advantage

    • Ability to select the right predictor for the right branch

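A minimal sketch of the selection mechanism (the component predictors here are fixed stand-ins for a global and a local predictor, and the 2-bit selector with its initial value is an assumption): the selector saturates toward whichever component has been right more often recently.

```python
class Tournament:
    """2-bit selector choosing between two component predictors."""
    def __init__(self):
        self.selector = 2      # >= 2 means trust predictor A (assumption)
        self.a = True          # stand-in for a global predictor's output
        self.b = False         # stand-in for a local predictor's output

    def predict(self):
        return self.a if self.selector >= 2 else self.b

    def update(self, taken):
        a_ok, b_ok = (self.a == taken), (self.b == taken)
        if a_ok and not b_ok:
            self.selector = min(3, self.selector + 1)
        elif b_ok and not a_ok:
            self.selector = max(0, self.selector - 1)
        # both right or both wrong: selector unchanged

t = Tournament()
for _ in range(2):             # B keeps being right, A keeps missing
    t.update(False)
print(t.predict())             # False: selector has swung to predictor B
```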

Misprediction Rate Comparison

Branch Target Buffer (BTB)

  • To reduce the branch penalty to 0

    • Need to know what the address is by the end of IF

    • But the instruction is not even decoded yet

    • So use the instruction address rather than wait for decode

      • If prediction works then penalty goes to 0!

  • BTB Idea -- Cache to store taken branches (no need to store untaken)

    • Match tag is the instruction address → compare with the current PC

    • Data field is the predicted PC

  • May want to add predictor field

    • To avoid the mispredict twice on every loop phenomenon

    • Adds complexity since we now have to track untaken branches as well
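The BTB idea can be sketched as a small map from branch PC to predicted target (a simplified model: a real BTB is a fixed-size set-associative cache, and storing only taken branches is the policy from the slide):

```python
class BTB:
    """Toy branch target buffer: full-PC tag match, taken branches only."""
    def __init__(self):
        self.entries = {}                 # branch PC -> predicted target PC

    def next_pc(self, pc):
        # hit: instruction is a known taken branch; redirect fetch at IF
        # miss: not a predicted branch; fall through to PC + 4
        return self.entries.get(pc, pc + 4)

    def record_taken(self, pc, target):   # insert/refresh on a taken branch
        self.entries[pc] = target

btb = BTB()
btb.record_taken(0x1000, 0x2000)
print(hex(btb.next_pc(0x1000)))   # 0x2000: predicted target, 0 penalty
print(hex(btb.next_pc(0x1004)))   # 0x1008: miss, sequential fetch
```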

Need Address at Same Time as Prediction

  • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)

    • Note: must check for a branch match now, since we can’t use the wrong branch’s address (Figure 3.19, p. 262)

[BTB diagram: the PC of the fetched instruction is compared against the stored branch PCs; each entry holds a predicted PC and prediction state. Yes: the instruction is a branch, so use the predicted PC as the next PC. No: the branch is not predicted, so proceed normally (next PC = PC + 4)]

Integrated Instruction Fetch Units

  • Integrated branch prediction

    • The branch predictor becomes part of the instruction fetch unit and is constantly predicting branches

  • Instruction prefetch

    • To deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead

  • Instruction memory access and buffering

Return Address Predictor

  • Indirect jump – jumps whose destination address varies at run time

    • indirect procedure call, select or case, procedure return

    • SPEC89 benchmarks: 85% of indirect jumps are procedure returns

  • Accuracy of a BTB for procedure returns is low

    • if procedure is called from many places, and the calls from one place are not clustered in time

  • Use a small buffer of return addresses operating as a stack

    • Cache the most recent return addresses

    • Push a return address at a call, and pop one off at a return

    • If the cache is sufficiently large (≥ max call depth) → perfect

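The push-at-call / pop-at-return behavior above can be sketched as a bounded stack (the size and the overflow policy of dropping the deepest frame are assumptions):

```python
class ReturnStack:
    """Return-address stack: perfect while call depth <= size."""
    def __init__(self, size=16):
        self.size = size
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.size:
            self.stack.pop(0)            # overflow: lose the deepest frame
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnStack(size=2)
ras.call(0x100)
ras.call(0x200)
print(hex(ras.predict_return()))   # 0x200: most recent call returns first
print(hex(ras.predict_return()))   # 0x100
```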

Dynamic Branch Prediction Summary

  • Branch History Table: 2 bits for loop accuracy

  • Correlation: Recently executed branches correlated with next branch.

    • Either different branches

    • Or different executions of same branches

  • Tournament Predictor: give more resources to competing predictors and pick between them

  • Branch Target Buffer: include branch address & prediction

  • Return address stack for prediction of indirect jumps

Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Vector Processing

    • Explicit coding of independent loops as operations on large vectors of numbers

    • Multimedia instructions being added to many processors

  • Superscalar

    • varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)

    • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4

  • (Very) Long Instruction Words (V)LIW:

    • fixed number of instructions (4-16) scheduled by the compiler; put operations into wide templates

    • Intel Architecture-64 (IA-64) 64-bit address

      • Renamed: “Explicitly Parallel Instruction Computer (EPIC)”

  • Anticipated success of multiple issue led to measuring Instructions Per Clock cycle (IPC) instead of CPI

Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything

    – Fetch 64-bits/clock cycle; Int on left, FP on right

    Type                  Pipe Stages
    Int. instruction      IF  ID  EX  MEM WB
    FP ADD instruction    IF  ID  EX  EX  EX  WB
    Int. instruction          IF  ID  EX  MEM WB
    FP ADD instruction        IF  ID  EX  EX  EX  WB
    Int. instruction              IF  ID  EX  MEM WB
    FP ADD instruction            IF  ID  EX  EX  EX  WB

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:

    • Exactly 50% FP operations AND No hazards

Multiple Issue Issues

  • Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock

    • If an instruction causes a structural hazard, or a data hazard due either to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue

    • 0 to N instruction issues per clock cycle, for N-issue

  • Performing issue checks in 1 cycle could limit clock cycle time

    • Issue stage usually split and pipelined

    • 1st stage decides how many instructions from within this packet can issue; 2nd stage examines hazards between the selected instructions and those already issued

    • Higher branch penalties => prediction accuracy becomes more important
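The first-stage issue check described above can be sketched as a simple in-order scan (the (dest, src1, src2) instruction format is an assumption): walk the packet and stop at the first instruction with a RAW or WAW hazard against an earlier instruction in the same packet.

```python
def issuable(packet):
    """Return how many leading instructions of the packet can issue."""
    written = set()      # destinations of earlier instructions in packet
    count = 0
    for dest, s1, s2 in packet:
        if s1 in written or s2 in written or dest in written:
            break        # RAW or WAW hazard within the packet: stop here
        written.add(dest)
        count += 1
    return count

packet = [("R1", "R2", "R3"),
          ("R4", "R5", "R6"),
          ("R7", "R1", "R8")]   # reads R1, written by the first instruction
print(issuable(packet))         # 2: the third instruction must wait a cycle
```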

Studies of the Limitations of ILP

  • Perfect hardware model

    • Rename as much as you need

      • Infinite registers

      • No WAW or WAR

    • Branch prediction is perfect

      • Never happens in reality

    • Jump prediction (even computed such as return) is perfect

      • Similarly unreal

    • Perfect memory disambiguation

      • Almost perfect is not too hard in practice

    • Issue an unlimited number of instructions at once & no restriction on types of instructions issued

    • One-cycle latency

What Must a Perfect Processor Do?

  • Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly

  • Rename all register uses to avoid WAW and WAR hazards

  • Determine whether there are any dependences among the instructions in the issue packet; if so, rename accordingly

  • Determine if any memory dependences exist among the issuing instructions and handle them appropriately

  • Provide enough replicated function units to allow all the ready instructions to issue

Window Size

  • The set of instructions that is examined for simultaneous execution is called the window

  • The window size will be determined by the cost of determining whether n issuing instructions have any register dependences among them

    • In theory, this cost is O(n²): each instruction’s source registers must be compared against the destinations of all earlier instructions in the window (see p. 243)

  • Each instruction in the window must be kept in processor

  • Window size is limited by the required storage, the comparisons, and a limited issue rate
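The quadratic cost is easy to make concrete (a sketch assuming two source operands per instruction): the i-th instruction needs 2·i source-vs-destination comparisons against the i earlier instructions, so the total is n(n−1).

```python
def comparisons(n):
    # instruction i (0-based) compares its 2 sources against the
    # destinations of the i earlier instructions in the window
    return sum(2 * i for i in range(n))    # = n * (n - 1)

for n in (2, 4, 8, 16):
    print(n, comparisons(n))   # 2, 12, 56, 240: quadratic growth
```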

Effects of Realistic Branch and Jump Prediction

  • Perfect

  • Tournament-based

    • Uses a correlating 2-bit predictor and a non-correlating 2-bit predictor, plus a selector to choose between the two

    • Prediction buffer has 8K entries (13 address bits from the branch)

    • 3 entries per slot - non-correlating, correlating, select

  • Standard 2 bit

    • 512 (9 address bits) entries

    • Plus 16 entry buffer to predict RETURNS

  • Static

    • Based on profile - predict either T or NT but it stays fixed

  • None

Models for Memory Alias Analysis

  • Perfect

  • Global/Stack Perfect

    • Perfect prediction for global and stack references and assume all heap references conflict

    • The best that compiler-based analysis schemes can do

  • Inspection

    • Pointers to different allocation areas (such as the global area and the stack area) are assumed not to conflict

    • References using the same base register with different offsets are assumed not to conflict

  • None

    • All memory references are assumed to conflict
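The "inspection" model in the list above can be sketched as a simple predicate (the (area, base_register, offset) representation of a memory reference is an assumption made for illustration):

```python
def may_conflict(ref_a, ref_b):
    """Inspection-style alias check on (area, base_reg, offset) refs."""
    area_a, base_a, off_a = ref_a
    area_b, base_b, off_b = ref_b
    if area_a != area_b:                       # e.g. stack vs. global area
        return False
    if base_a == base_b and off_a != off_b:    # same base, different offset
        return False
    return True                                # otherwise assume a conflict

print(may_conflict(("stack", "R29", 8), ("stack", "R29", 16)))  # False
print(may_conflict(("stack", "R29", 8), ("stack", "R4", 8)))    # True
```

The "perfect" and "none" models are the two extremes of this predicate: always returning the exact answer, or always returning True.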