ILP and Dynamic Execution: Branch Prediction & Multiple Issue


ILP and Dynamic Execution: Branch Prediction & Multiple Issue

Original by Prof. David A. Patterson


Review Tomasulo

  • Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands

    • Prevents registers from becoming a bottleneck

    • Avoids WAR, WAW hazards of Scoreboard

    • Allows loop unrolling in HW

  • Not limited to basic blocks (the integer unit gets ahead, past branches)

  • Lasting Contributions

    • Dynamic scheduling

    • Register renaming

  • 360/91 descendants include Pentium III, PowerPC 604, MIPS R10000, HP PA-8000, Alpha 21264


Tomasulo Algorithm and Branch Prediction

  • 360/91 predicted branches, but did not speculate: pipeline stopped until the branch was resolved

    • No speculation: only instructions certain to execute are allowed to complete

  • Speculation with Reorder Buffer allows execution past branch, and then discard if branch fails

    • just need to hold instructions in buffer until branch can commit


Case for Branch Prediction when Issue N instructions per clock cycle

  • Branches will arrive up to n times faster in an n-issue processor

  • Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor
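The Amdahl's Law point can be made concrete with a back-of-the-envelope calculation; the branch frequency, misprediction rate, and penalty below are illustrative assumptions, not figures from the slides.

```python
# Back-of-the-envelope with assumed numbers: 20% branches, 10% mispredicted,
# 3-cycle penalty.  The same stall term hurts the low-CPI wide-issue
# machine far more in relative terms.
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    return base_cpi + branch_freq * mispredict_rate * penalty

print(effective_cpi(1.0, 0.2, 0.1, 3))    # scalar:  ~1.06  (about 6% slower)
print(effective_cpi(0.25, 0.2, 0.1, 3))   # 4-issue: ~0.31  (about 24% slower)
```

The absolute stall term (0.06 CPI) is identical in both cases; only the base CPI it is added to changes, which is exactly the Amdahl's Law effect the slide describes.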


7 Branch Prediction Schemes

  • 1-bit Branch-Prediction Buffer

  • 2-bit Branch-Prediction Buffer

  • Correlating Branch Prediction (two-level) Buffer

  • Tournament Branch Predictor

  • Branch Target Buffer

  • Integrated Instruction Fetch Units

  • Return Address Predictors


Dynamic Branch Prediction

  • Performance = ƒ(accuracy, cost of misprediction)

  • Branch History Table (BHT): lower bits of the PC index a table of 1-bit values

    • Says whether or not branch taken last time

    • No address check (saves HW, but may not be right branch)

  • Problem: in a loop, a 1-bit BHT causes 2 mispredictions per loop execution (even when the loop averages 9 iterations before exit):

    • End-of-loop case, when it exits instead of looping as before

    • First iteration the next time through the loop, when it predicts exit instead of looping

    • Only 80% accuracy even if the branch is taken 90% of the time
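The two mispredictions per loop pass can be seen in a small simulation; this is a sketch, not from the slides, and the initial predict-taken state is an assumption.

```python
# Sketch: simulate a 1-bit predictor on a loop branch that is taken
# 9 times and then falls through, repeated twice.
def simulate_1bit(outcomes):
    """Return the misprediction count of a single 1-bit predictor."""
    prediction = True       # assumed initial state: predict taken
    mispredicts = 0
    for taken in outcomes:
        if prediction != taken:
            mispredicts += 1
        prediction = taken  # 1-bit: always remember the last outcome
    return mispredicts

one_pass = [True] * 9 + [False]       # taken 9 times, then loop exit
print(simulate_1bit(one_pass * 2))    # -> 3: exit, then re-entry, then exit
```

In steady state the predictor misses twice per 10-iteration pass (at exit and at re-entry), giving the 80% accuracy quoted above.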


Dynamic Branch Prediction(Jim Smith, 1981)

  • Solution: 2-bit scheme that changes the prediction only after two consecutive mispredictions: (Figure 3.7, p. 249)

  • Red: stop, not taken

  • Green: go, taken

  • Adds hysteresis to the decision-making process

[State diagram: four states, two labeled Predict Taken and two labeled Predict Not Taken. A taken outcome (T) moves one state toward strongly taken; a not-taken outcome (NT) moves one state toward strongly not taken, so the prediction flips only after two consecutive mispredictions.]
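The 2-bit scheme can be sketched as a saturating counter; this is a minimal model, assuming states 0-1 predict not taken and states 2-3 predict taken.

```python
# Sketch of the 2-bit saturating counter (Smith, 1981).  The prediction
# changes only after two consecutive mispredictions.
def simulate_2bit(outcomes, state=3):
    """Count mispredictions; state 0-1 = predict not taken, 2-3 = taken."""
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

one_pass = [True] * 9 + [False]       # loop taken 9 times, then exits
print(simulate_2bit(one_pass * 2))    # -> 2: only the exit misses each pass
```

Compared with the 1-bit predictor, the hysteresis removes the re-entry misprediction, so the loop case costs one miss per pass instead of two.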


Prediction Accuracy of a 4096-entry 2-bit Prediction Buffer vs. an Infinite Buffer


BHT Accuracy

  • Mispredict because either:

    • Wrong guess for that branch

    • Got the history of the wrong branch when indexing the table (no address check)

  • With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

  • For SPEC92, a 4096-entry table is about as good as an infinite table


Improve Prediction Strategy By Correlating Branches

  • Consider the worst case for the 2-bit predictor

    if (aa == 2) aa = 0;

    if (bb == 2) bb = 0;

    if (aa != bb) { /* whatever */ }

    • Single-level predictors can never capture this case

  • Correlating or 2-level predictors

    • Correlation = what happened on the last branch

      • Note that the most recent (correlator) branch may not always be the same branch

    • Predictor = which way to go

      • 4 possibilities: which way the last one went chooses the prediction

        • (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)

If the first two conditions both hold (so aa and bb are both set to 0), then aa != bb is false: the third branch's outcome is completely determined by the first two

From 柯皓仁, National Chiao Tung University


Correlating Branches

  • Hypothesis: recently executed branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

  • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table

  • In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters

    • The old 2-bit BHT is then a (0,2) predictor



Example of Correlating Branch Predictors

if (d == 0)
    d = 1;
if (d == 1)
    ...

    BNEZ   R1, L1         ; branch b1 (taken if d != 0)
    DADDIU R1, R0, #1     ; d == 0, so set d = 1
L1: DADDIU R3, R1, #-1
    BNEZ   R3, L2         ; branch b2 (taken if d != 1)
    ...
L2:


A Problem for 1-bit Predictor without Correlators



Example of Correlating Branch Predictors (1,1) (Cont.)

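The d-example can be simulated to show why correlation matters. The encoding below (d alternating between 2 and 0, branches b1 and b2 taken when d != 0 and d != 1 respectively) is an assumption consistent with the assembly on the earlier slide.

```python
# Sketch: 1-bit predictors per branch vs. a (1,1) correlating predictor
# on the d = 2, 0, 2, 0, ... sequence.
def outcomes():
    seq = []
    for d in [2, 0] * 5:
        b1 = d != 0          # BNEZ R1: taken when d != 0
        if not b1:
            d = 1            # fall-through path sets d = 1
        b2 = d != 1          # BNEZ R3: taken when d != 1
        seq += [("b1", b1), ("b2", b2)]
    return seq

def run_1bit(seq):
    pred, miss = {"b1": False, "b2": False}, 0
    for br, taken in seq:
        miss += pred[br] != taken
        pred[br] = taken
    return miss

def run_1_1(seq):
    # table[branch][last outcome] -> 1-bit prediction
    table, last, miss = {"b1": [False, False], "b2": [False, False]}, 0, 0
    for br, taken in seq:
        miss += table[br][last] != taken
        table[br][last] = taken
        last = int(taken)
    return miss

seq = outcomes()
print(run_1bit(seq), run_1_1(seq))   # every branch missed vs. 2 warm-up misses
```

Because the branches strictly alternate, the plain 1-bit predictors are always one step behind and miss all 20 branches; the (1,1) predictor keys each prediction on the previous branch's outcome and is perfect after warm-up.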


In general: (m,n) BHT (prediction buffer)

  • p bits of buffer index => 2^p entries in the BHT

  • Use the last m branches = global branch history

  • Use an n-bit predictor per entry

  • Total bits for the (m, n) BHT prediction buffer: 2^m × n × 2^p

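Plugging numbers into the size formula above, using the configurations that appear elsewhere in these slides:

```python
# Total storage for an (m, n) predictor with p index bits:
# 2**m history-selected banks, each with 2**p n-bit entries.
def bht_bits(m, n, p):
    return (2 ** m) * n * (2 ** p)

print(bht_bits(2, 2, 5))    # (2,2) with p = 5 -> 256 bits (4 banks of 32 entries)
print(bht_bits(0, 2, 12))   # plain 4096-entry 2-bit BHT -> 8192 bits
```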


(2,2) Predictor Implementation

4 banks, each with 32 2-bit predictor entries (p = 5, m = 2, n = 2; a 5-to-32 decoder selects the entry within each bank)


Accuracy of Different Schemes


Tournament Predictors

  • Adaptively combine local and global predictors

    • Multiple predictors

      • One based on global information: Results of recently executed m branches

      • One based on local information: Results of past executions of the current branch instruction

    • Selector to choose which predictor to use

  • Advantage

    • Ability to select the right predictor for the right branch

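A minimal sketch of the selector, assuming a 2-bit saturating chooser that moves toward whichever predictor was right; the concrete policy is an assumption, and real tournament predictors (such as the Alpha 21264's) differ in detail.

```python
# Sketch: a 2-bit selector choosing between a global and a local predictor.
class Tournament:
    def __init__(self):
        self.selector = 2   # >= 2: trust global, < 2: trust local

    def predict(self, global_pred, local_pred):
        return global_pred if self.selector >= 2 else local_pred

    def update(self, global_pred, local_pred, taken):
        # Move the selector only when exactly one predictor was right.
        if global_pred == taken and local_pred != taken:
            self.selector = min(self.selector + 1, 3)
        elif local_pred == taken and global_pred != taken:
            self.selector = max(self.selector - 1, 0)

t = Tournament()
print(t.predict(True, False))    # initially trusts the global predictor -> True
t.update(global_pred=False, local_pred=True, taken=True)
t.update(global_pred=False, local_pred=True, taken=True)
print(t.predict(True, False))    # after two local wins, trusts local -> False
```

The component predictors themselves (a global history table and a per-branch local table) update independently; the selector only arbitrates between them, which is what gives the "right predictor for the right branch" advantage.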


Misprediction Rate Comparison


Branch Target Buffer (BTB)

  • To reduce the branch penalty to 0

    • Need to know what the address is by the end of IF

    • But the instruction is not even decoded yet

    • So use the instruction address rather than wait for decode

      • If prediction works then penalty goes to 0!

  • BTB Idea -- Cache to store taken branches (no need to store untaken)

    • Match tag is the branch instruction's address, compared with the current PC

    • Data field is the predicted PC

  • May want to add predictor field

    • To avoid the mispredict twice on every loop phenomenon

    • Adds complexity since we now have to track untaken branches as well


Need Address at Same Time as Prediction

  • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)

    • Note: must check for a branch match now, since can’t use the wrong branch address (Figure 3.19, p. 262)

[Figure 3.19: the PC of the instruction being fetched indexes a table of (branch PC, predicted PC) pairs plus extra prediction state bits; a tag comparison (=?) against the current PC decides. Hit: the instruction is a branch, so use the predicted PC as the next PC. Miss: branch not predicted, so proceed normally (next PC = PC + 4).]
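A BTB lookup can be sketched as a small tagged cache; the size and the direct-mapped indexing are assumptions for illustration.

```python
# Sketch: a direct-mapped BTB holding only taken branches, tagged by the
# branch instruction's PC.  A miss means "not a predicted-taken branch",
# so fetch falls through to PC + 4.
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # each slot: (branch_pc, predicted_pc)

    def lookup(self, pc):
        slot = self.table[pc % self.entries]
        if slot and slot[0] == pc:      # tag match: known taken branch
            return slot[1]              # predicted next PC
        return pc + 4                   # miss: proceed normally

    def record_taken(self, pc, target):
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
print(hex(btb.lookup(0x40)))     # miss: falls through to 0x44
btb.record_taken(0x40, 0x100)
print(hex(btb.lookup(0x40)))     # hit: predicted target 0x100
```

The tag compare is the "must check for a branch match" step above: two different PCs can index the same slot, and without the tag the fetch unit would redirect to the wrong branch's target.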


Integrated Instruction Fetch Units

  • Integrated branch prediction

    • The branch predictor becomes part of the instruction fetch unit and is constantly predicting branches

  • Instruction prefetch

    • To deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead

  • Instruction memory access and buffering


Return Address Predictor

  • Indirect jump – jumps whose destination address varies at run time

    • indirect procedure call, select or case, procedure return

    • SPEC89 benchmarks: 85% of indirect jumps are procedure returns

  • Accuracy of a BTB for procedure returns is low

    • if procedure is called from many places, and the calls from one place are not clustered in time

  • Use a small buffer of return addresses operating as a stack

    • Cache the most recent return addresses

    • Push a return address at a call, and pop one off at a return

    • If the cache is sufficiently large (≥ max call depth) => perfect

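The push-on-call, pop-on-return behavior can be sketched as follows; the depth and the drop-oldest overflow policy are assumptions.

```python
# Sketch: a small return-address stack.  Push on call, pop on return;
# overflow drops the oldest entry, which is why accuracy depends on
# the maximum call depth.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: lose the oldest address
        self.stack.append(return_pc)

    def on_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)                       # overflows: 0x100 is lost
print(ras.on_return() == 0x300)          # True: innermost return predicted
print(ras.on_return() == 0x200)          # True
print(ras.on_return())                   # None: 0x100 was pushed out
```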


Dynamic Branch Prediction Summary

  • Branch History Table: 2 bits for loop accuracy

  • Correlation: Recently executed branches correlated with next branch.

    • Either different branches

    • Or different executions of same branches

  • Tournament Predictor: devote more resources to competing predictors and pick between them

  • Branch Target Buffer: include branch address & prediction

  • Return address stack for prediction of indirect jump


Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Vector Processing

    • Explicit coding of independent loops as operations on large vectors of numbers

    • Multimedia instructions being added to many processors

  • Superscalar

    • varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)

    • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4

  • (Very) Long Instruction Words (V)LIW:

    • fixed number of instructions (4-16) scheduled by the compiler; put operations into wide templates

    • Intel Architecture-64 (IA-64) 64-bit address

      • Renamed: “Explicitly Parallel Instruction Computer (EPIC)”

  • The anticipated success of multiple issue led to measuring Instructions Per Clock cycle (IPC) instead of CPI


Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything

    – Fetch 64 bits/clock cycle; Int on left, FP on right

    Type                 Pipe stages
    Int. instruction     IF  ID  EX  MEM WB
    FP ADD instruction   IF  ID  EX  EX  EX  WB

    (the pattern repeats, with each successive instruction pair starting one cycle later)

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:

    • Exactly 50% FP operations AND no hazards


Multiple Issue Issues

  • Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock

    • If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue

    • 0 to N instruction issues per clock cycle, for N-issue

  • Performing issue checks in 1 cycle could limit clock cycle time

    • Issue stage usually split and pipelined

    • 1st stage decides how many instructions from the packet can issue; 2nd stage examines hazards among the selected instructions and those already issued

    • higher branch penalties => prediction accuracy important


Studies of The Limitations of ILP

  • Perfect hardware model

    • Rename as much as you need

      • Infinite registers

      • No WAW or WAR

    • Branch prediction is perfect

      • Never happen in reality

    • Jump prediction (even of computed jumps, such as returns) is perfect

      • Similarly unreal

    • Perfect memory disambiguation

      • Almost perfect is not too hard in practice

    • Issue an unlimited number of instructions at once & no restriction on types of instructions issued

    • One-cycle latency


What Must a Perfect Processor Do?

  • Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly

  • Rename all register uses to avoid WAW and WAR hazards

  • Determine whether there are any dependences among the instructions in the issue packet; if so, rename accordingly

  • Determine if any memory dependences exist among the issuing instructions and handle them appropriately

  • Provide enough replicated function units to allow all the ready instructions to issue


How many instructions would issue on the perfect machine every cycle?


Window Size

  • The set of instructions that is examined for simultaneous execution is called the window

  • The window size will be determined by the cost of determining whether n issuing instructions have any register dependences among them

    • In theory, this cost will be O(n²) (refer to p. 243); why? Each instruction's registers must be compared against those of every other instruction in the window

  • Each instruction in the window must be kept in processor

  • Window size is limited by the required storage, the comparisons, and a limited issue rate
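The O(n²) growth comes from comparing every instruction's registers against every earlier instruction in the window; here is a sketch, where the three-register instruction format is an assumption.

```python
# Sketch: pairwise register-dependence check over an issue window.
# Each instruction is (dest, src1, src2); instruction j depends on an
# earlier i if j reads i's destination (RAW) or rewrites it (WAW).
def count_comparisons_and_deps(window):
    comparisons = deps = 0
    for j in range(len(window)):
        for i in range(j):                       # every earlier instruction
            di, dj = window[i][0], window[j][0]
            comparisons += 1
            if di in window[j][1:] or di == dj:  # RAW or WAW on di
                deps += 1
    return comparisons, deps

# add r1,r2,r3 ; add r4,r1,r5 ; add r6,r7,r8  (hypothetical window)
window = [("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r6", "r7", "r8")]
print(count_comparisons_and_deps(window))        # (3, 1)
```

For a window of n instructions the inner loop performs n(n-1)/2 comparisons, which is the quadratic cost that limits practical window sizes.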


Effects of Reducing the Window Size


Effects of Realistic Branch and Jump Prediction

  • Perfect

  • Tournament-based

    • Uses a correlating 2-bit predictor and a non-correlating 2-bit predictor, plus a selector to choose between the two

    • Prediction buffer has 8K (13 address bits from the branch)

    • 3 entries per slot - non-correlating, correlating, select

  • Standard 2-bit

    • 512 (9 address bits) entries

    • Plus 16 entry buffer to predict RETURNS

  • Static

    • Based on profile - predict either T or NT but it stays fixed

  • None


The Effect of Branch-Prediction Schemes


Effects of Finite Registers


Models for Memory Alias Analysis

  • Perfect

  • Global/Stack Perfect

    • Perfect prediction for global and stack references and assume all heap references conflict

    • The best compiler-based analysis schemes can do

  • Inspection

    • Pointers into different allocation areas (such as the global area and the stack area) are assumed not to conflict

    • Addresses formed from the same base register with different offsets are assumed not to conflict

  • None

    • All memory references are assumed to conflict


Effects of Memory Aliasing

