
Advanced Computer Architecture
5MD00 / 5Z033
ILP architectures

Henk Corporaal

www.ics.ele.tue.nl/~heco/courses/aca

[email protected]

TU Eindhoven

2007



Topics

  • Introduction

  • Hazards

    • Data dependences

    • Control dependences

  • Branch prediction

  • Dependences limit ILP: scheduling

  • Out-Of-Order execution: Hardware speculation

  • Multiple issue

  • How much ILP is there?

ACA H.Corporaal


Introduction

ILP = instruction-level parallelism:

  • multiple operations (or instructions) can be executed in parallel

Needed:

  • sufficient resources
  • parallel scheduling
    • hardware solution
    • software solution
  • the application should contain ILP


Hazards

  • Three types of hazards:
    • structural
    • data dependence
    • control dependence
  • Hazards cause scheduling problems


Data Dependences

  • RaW: read after write (true dependence)
  • WaR: write after read (anti dependence)
  • WaW: write after write (output dependence)
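The three dependence types above can be detected mechanically from the register sets of an instruction pair. A minimal sketch (instruction format and names are illustrative, not from the slides):

```python
def classify(first, second):
    """Return the dependence types between an earlier and a later instruction.

    Each instruction is (destination_register, set_of_source_registers).
    """
    hazards = []
    dst1, srcs1 = first
    dst2, srcs2 = second
    if dst1 in srcs2:
        hazards.append("RaW")   # true dependence: second reads what first wrote
    if dst2 in srcs1:
        hazards.append("WaR")   # anti dependence: second overwrites a source of first
    if dst1 == dst2:
        hazards.append("WaW")   # output dependence: both write the same register
    return hazards

# add r1, r2, r3  followed by  sub r2, r1, r4
print(classify(("r1", {"r2", "r3"}), ("r2", {"r1", "r4"})))
```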


Control Dependences

C input code:

if (a > b) { r = a % b; }
else       { r = b % a; }
y = a*b;

CFG (control flow graph):

  1:  sub t1, a, b
      bgz t1, 2, 3

  2:  rem r, a, b          3:  rem r, b, a
      goto 4                    goto 4

  4:  mul y, a, b
      ...

How real are control dependences?


Branch Prediction


Branch Prediction: Motivation

  • High branch penalties in pipelined processors
  • With on average one out of five instructions being a branch, the maximum ILP is five
  • The situation is even worse for multiple-issue processors, because we need to provide an instruction stream of n instructions per cycle
  • Idea: predict the outcome of branches based on their history and execute instructions speculatively


5 Branch Prediction Schemes

  • 1-bit branch prediction buffer
  • 2-bit branch prediction buffer
  • correlating branch prediction buffer
  • branch target buffer
  • return address predictors

  + a way to get rid of those malicious branches


1-bit Branch Prediction Buffer

  • 1-bit branch prediction buffer, or branch history table (BHT)
  • The buffer is like a cache without tags
  • Does not help for the simple MIPS pipeline, because the target address is calculated in the same stage as the branch condition

Figure: the lower bits of the PC index the BHT, whose entries each hold a single prediction bit.
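The behavior of a single 1-bit entry can be simulated in a few lines (an illustrative sketch, not from the slides). For a loop branch, the exit and the re-entry of each subsequent run are both mispredicted:

```python
def simulate_1bit(outcomes, init=True):
    """Count mispredictions of a single 1-bit predictor entry.

    outcomes: sequence of booleans (True = branch taken).
    """
    pred = init
    misses = 0
    for taken in outcomes:
        if pred != taken:
            misses += 1
        pred = taken          # 1-bit scheme: remember only the last outcome
    return misses

# A loop branch for a 4-iteration loop, executed twice: T T T N  T T T N.
# Run 1 mispredicts the exit; run 2 mispredicts both re-entry and exit.
print(simulate_1bit([True]*3 + [False] + [True]*3 + [False]))
```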


Branch Prediction Buffer: 1-bit prediction

The branch address (lower k bits) indexes a table of 2^k one-bit prediction entries.

  • Problems:
  • Aliasing: the lower k bits of different branch instructions can be the same
    • Solution: use tags (the buffer becomes a cache); however, this is very expensive
  • Loops are predicted wrong twice (on exit and on re-entry)
    • Solution: use an n-bit saturating counter for prediction:
      • taken if counter ≥ 2^(n-1)
      • not taken if counter < 2^(n-1)
    • A 2-bit saturating counter predicts a loop wrong only once


2-bit Branch Prediction Buffer

  • Solution: 2-bit scheme where the prediction is changed only if mispredicted twice
  • Can be implemented as a saturating counter:

Figure: four-state machine with two "Predict Taken" and two "Predict Not Taken" states. A taken branch (T) moves the state toward strongly taken, a not-taken branch (NT) toward strongly not taken; a single misprediction only weakens the prediction, while two in a row flip it.
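The saturating-counter view of this state machine makes the loop behavior easy to check in simulation (an illustrative sketch): with counter states 0..3, where 2 and 3 predict taken, the same two loop runs that cost a 1-bit predictor three mispredictions cost only one per run here.

```python
def simulate_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # saturate at 0 and 3; taken counts up, not taken counts down
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# Same loop branch as before: 4 iterations, executed twice (T T T N  T T T N).
# Only the loop exit is mispredicted, once per run.
print(simulate_2bit(([True]*3 + [False]) * 2))
```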


Correlating Branches

  • Fragment from the SPEC92 benchmark eqntott:

    C code:                     Assembly:
    if (aa==2)                      subi R3,R1,#2
        aa = 0;                 b1: bnez R3,L1
    if (bb==2)                      add  R1,R0,R0
        bb = 0;                 L1: subi R3,R2,#2
    if (aa!=bb) {..}            b2: bnez R3,L2
                                    add  R2,R0,R0
                                L2: sub  R3,R1,R2
                                b3: beqz R3,L3

  • If b1 and b2 are both not taken, then aa == bb == 0 and b3 is certainly taken: the outcome of b3 is correlated with the outcomes of b1 and b2


Correlating Branch Predictor

Idea: the behavior of the current branch is related to the taken/not-taken history of recently executed branches

  • The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
  • A (k,n) predictor uses the behavior of the last k branches to choose from 2^k predictors, each of which is an n-bit predictor
  • (2,2) predictor: 2-bit global history, 2-bit local predictors

Figure: 4 bits from the branch address select a row of 2-bit local predictors; a 2-bit global branch-history shift register (01 = not taken, then taken) selects the column, yielding the prediction.


Branch Correlation Using Branch History

  • Two schemes (a, k, m, n):
    • PA: per-address history, a > 0
    • GA: global history, a = 0

Figure: the branch address (a bits) indexes a branch history table of 2^a entries, each holding a k-bit history; this history (k bits), together with m bits of the branch address, indexes a pattern history table of 2^k x 2^m n-bit saturating up/down counters, which deliver the prediction.

Table size (usually n = 2): #bits = k * 2^a + 2^k * 2^m * n

Variant: gshare (Scott McFarling '93): a GA scheme which XORs PC address bits with the branch history bits.
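A gshare-style predictor is small enough to sketch directly (illustrative sizes; 2-bit counters as above, PC XORed with a k-bit global history register). For a branch that strictly alternates taken/not-taken, the history register disambiguates the two situations, so after warm-up the predictions are perfect:

```python
class Gshare:
    """Sketch of a gshare predictor: PC XOR global history indexes 2-bit counters."""

    def __init__(self, k=10):
        self.mask = (1 << k) - 1
        self.history = 0                  # k-bit global branch history
        self.table = [1] * (1 << k)       # 2-bit counters, start weakly not taken

    def predict(self, pc):
        return self.table[(pc ^ self.history) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask

g = Gshare()
late_misses = 0
for i in range(100):
    taken = (i % 2 == 0)                  # strictly alternating branch
    if g.predict(0x4CC) != taken and i >= 80:
        late_misses += 1                  # count only after warm-up
    g.update(0x4CC, taken)
print(late_misses)
```

A plain 2-bit counter without history would keep mispredicting this alternating pattern; gshare learns it because the two history values select two different table entries.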


Accuracy, taking the best combination of parameters:

Figure: branch prediction accuracy (%) versus predictor size (64 bytes to 64 KB) for bimodal, GAs, and PAs predictors. Accuracy ranges from about 89% to 98%; the best configurations are GA(0,11,5,2) at about 98% and PA(10,6,4,2) at about 97%.


Accuracy of Different Branch Predictors

Figure: misprediction rates (0% to 18%) for three predictors: a 4096-entry 2-bit BHT, an unlimited-entries 2-bit BHT, and a 1024-entry (2,2) correlating BHT.


BHT Accuracy

  • Mispredictions occur because either:
    • the guess for that branch was wrong, or
    • the history of a different branch was used when indexing the table
  • 4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  • For SPEC92, 4096 entries are about as good as an infinite table
  • Real programs + OS behave more like gcc


Branch Target Buffer

  • The branch condition is not enough!
  • Branch Target Buffer (BTB): stores a tag (the branch PC) and the target address (PC if taken)

Figure: the fetch PC is compared (=?) against the tags in the BTB. Yes: the instruction is a branch; use the predicted PC as the next PC if the branch is predicted taken (the branch prediction is often kept in a separate table). No: the instruction is not a branch; proceed normally.
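The fetch-time decision can be sketched as a tagged lookup (illustrative model: a dict keyed by the full PC stands in for the tag match, with the direction prediction kept in a separate table as on the slide):

```python
btb = {}          # branch PC -> predicted target address (tag + target)
pred_taken = {}   # separate direction-prediction table

def next_pc(pc):
    """Predicted next PC during fetch, before the branch resolves."""
    if pc in btb and pred_taken.get(pc, False):
        return btb[pc]        # BTB hit and predicted taken: redirect fetch
    return pc + 4             # miss or predicted not taken: fall through

btb[0x40] = 0x100             # a branch at 0x40 targeting 0x100 (made-up addresses)
pred_taken[0x40] = True
print(hex(next_pc(0x40)))     # redirected to the target
print(hex(next_pc(0x44)))     # not in the BTB: sequential fetch
```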


Instruction Fetch Stage

Figure: the PC addresses the instruction memory and the BTB in parallel; if the BTB hits and the branch is predicted taken, the target address becomes the next PC, otherwise PC+4. Not shown: the hardware needed when the prediction was wrong.


Special Case: Return Addresses

  • Register-indirect branches: hard to predict the target address
    • MIPS instruction: jr r31   ; PC = r31
    • useful for:
      • implementing switch/case statements
      • FORTRAN computed GOTOs
      • procedure returns (mainly)
  • SPEC89: 85% of such branches are procedure returns
  • Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries give a very high hit rate
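A return-address stack is just a small bounded stack pushed on calls and popped on returns. A minimal sketch (depth and addresses are illustrative; `deque(maxlen=...)` silently drops the oldest entry on overflow, as a hardware circular buffer would):

```python
from collections import deque

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = deque(maxlen=depth)   # oldest entries fall off when full

    def call(self, return_address):
        self.stack.append(return_address)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x400)   # main calls f; return address 0x400 (made-up)
ras.call(0x520)   # f calls g
print(hex(ras.predict_return()))   # g returns to f
print(hex(ras.predict_return()))   # f returns to main
```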


Dynamic Branch Prediction Summary

  • Prediction is an important part of scalar execution
  • Branch history table: 2 bits for loop accuracy
  • Correlation: recently executed branches are correlated with the next branch
    • either different branches
    • or different executions of the same branch
  • Branch target buffer: includes the branch target address (& prediction)
  • Return address stack for prediction of indirect jumps


Or: Avoid branches!


Predicated Instructions

  • Avoid branch prediction by turning branches into conditional or predicated instructions
  • If the condition is false, the instruction neither stores its result nor causes an exception
    • The extended ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
    • IA-64/Itanium: conditional execution of any instruction
  • Examples:

    if (R1==0) R2 = R3;      CMOVZ  R2,R3,R1

    if (R1 < R2)             SLT    R9,R1,R2
        R3 = R1;             CMOVNZ R3,R1,R9
    else                     CMOVZ  R3,R2,R9
        R3 = R2;
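The conditional-move semantics can be mimicked in software to show how the second example becomes branch-free (an illustrative sketch; `cmovz`/`cmovnz` here model the instruction semantics, they are not real APIs):

```python
def cmovz(dest, src, cond):
    """CMOVZ semantics: move src into dest only if cond == 0."""
    return src if cond == 0 else dest

def cmovnz(dest, src, cond):
    """CMOVNZ semantics: move src into dest only if cond != 0."""
    return src if cond != 0 else dest

def branchless_min(R1, R2):
    """if (R1 < R2) R3 = R1; else R3 = R2;  -- without a branch."""
    R9 = 1 if R1 < R2 else 0       # SLT    R9,R1,R2
    R3 = 0
    R3 = cmovnz(R3, R1, R9)        # CMOVNZ R3,R1,R9
    R3 = cmovz(R3, R2, R9)         # CMOVZ  R3,R2,R9
    return R3

print(branchless_min(3, 7), branchless_min(7, 3))
```

Exactly one of the two conditional moves takes effect, so no prediction is needed; the cost is that both moves occupy issue slots.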


Dynamic Scheduling


Dynamic Scheduling Principle

  • What we examined so far is static scheduling
    • the compiler reorders instructions so as to avoid hazards and reduce stalls
  • Dynamic scheduling: hardware rearranges instruction execution to reduce stalls
  • Example:

    DIV.D F0,F2,F4    ; takes 24 cycles and
                      ; is not pipelined
    ADD.D F10,F0,F8
    SUB.D F12,F8,F14  ; cannot continue (in order), even though
                      ; it does not depend on anything
  • Key idea: allow instructions behind a stall to proceed
  • The book describes the Tomasulo algorithm, but we describe the general idea
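The gain on this example can be sketched with a tiny timing model (illustrative assumptions: all operands live on entry are ready at cycle 1, results forwarded when an instruction finishes, one issue per cycle; the 24-cycle DIV.D latency is from the slide, the 4-cycle ADD.D/SUB.D latency is made up for the sketch):

```python
def issue_cycles(program, in_order):
    """Cycle in which each instruction starts executing.

    program: list of (name, dest, sources, latency). An instruction may start
    once its sources are ready; with in-order issue it additionally may not
    start before the instruction ahead of it has started.
    """
    available = {}        # register -> cycle its value is ready
    start = {}
    prev = 0
    for name, dest, srcs, lat in program:
        t = max([available.get(s, 1) for s in srcs] + [1])
        if in_order:
            t = max(t, prev + 1)   # blocked behind a stalled predecessor
        prev = t
        start[name] = t
        available[dest] = t + lat
    return start

program = [
    ("DIV.D F0,F2,F4",   "F0",  ["F2", "F4"],  24),  # long-latency divide
    ("ADD.D F10,F0,F8",  "F10", ["F0", "F8"],   4),  # true dependence on F0
    ("SUB.D F12,F8,F14", "F12", ["F8", "F14"],  4),  # independent of both
]

print(issue_cycles(program, in_order=True)["SUB.D F12,F8,F14"])   # stuck behind the stall
print(issue_cycles(program, in_order=False)["SUB.D F12,F8,F14"])  # proceeds immediately
```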


Advantages of Dynamic Scheduling

  • Handles cases where dependences are unknown at compile time
    • e.g., because they may involve a memory reference
  • Simplifies the compiler
  • Allows code compiled for one pipeline (or no particular pipeline) to run efficiently on a different pipeline
  • Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling


Superscalar Concept

Figure: superscalar datapath. Instructions flow from the instruction memory through the instruction cache and decoder into reservation stations, which feed a branch unit, ALU-1, ALU-2, a logic & shift unit, and load and store units; the load/store units exchange addresses and data with the data cache (backed by data memory), and results are committed through a reorder buffer to the register file.


Superscalar Issues

  • How to fetch multiple instructions in time (across basic block boundaries)?
  • Predicting branches
  • Non-blocking memory system
  • Tuning #resources (FUs, ports, entries, etc.)
  • Handling dependencies
  • How to support precise interrupts?
  • How to recover from a mispredicted branch path?
  • For the latter two issues we need to look at sequential look-ahead and architectural state
    • Ref: Johnson 91


Example of Superscalar Processor Execution

  • Superscalar processor organization:
    • simple pipeline: IF, EX, WB
    • fetches 2 instructions each cycle
    • 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
    • instruction window (buffer between IF and EX stage) of size 2
    • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

    Cycle              1    2    3    4    5    6    7
    L.D   F6,32(R2)    IF   EX   WB
    L.D   F2,48(R3)    IF   EX   WB
    MUL.D F0,F2,F4          IF   EX   EX   EX   EX   WB
    SUB.D F8,F2,F6          IF   EX   EX   WB
    DIV.D F10,F0,F6              IF                  EX
    ADD.D F6,F8,F2               IF        EX   EX   WB
    MUL.D F12,F2,F4                   IF             ?

  • In cycle 4, DIV.D and ADD.D stall because of data dependences (on F0 and F8), and MUL.D F12 cannot be fetched because the window is full
  • In cycle 6, MUL.D F12 cannot execute: structural hazard on the single FP multiplier


Register Renaming

  • A technique to eliminate anti- (WaR) and output (WaW) dependencies
  • Can be implemented
    • by the compiler
      • advantage: low cost
      • disadvantage: "old" codes perform poorly
    • in hardware
      • advantage: binary compatibility
      • disadvantage: extra hardware needed
  • We describe the general idea


Register Renaming (cont'd)

  • There is a physical register file larger than the logical register file
  • A mapping table associates logical registers with physical registers
  • When an instruction is decoded:
    • its physical source registers are obtained from the mapping table
    • its physical destination register is obtained from a free list
    • the mapping table is updated

    before:  add r3,r3,4          after:  add R2,R1,4

    mapping table:                mapping table:
      r0 -> R8                      r0 -> R8
      r1 -> R7                      r1 -> R7
      r2 -> R5                      r2 -> R5
      r3 -> R1                      r3 -> R2
      r4 -> R9                      r4 -> R9

    free list: R2 R6              free list: R6


Eliminating False Dependencies

  • How register renaming eliminates false dependencies:
  • Before:
      addi r1, r2, 1
      addi r2, r0, 0
      addi r1, r0, 1
  • After (free list: R7, R8, R9):
      addi R7, R5, 1
      addi R8, R0, 0
      addi R9, R0, 1


Limitations of Multiple-Issue Processors

  • Available ILP is limited (we're not programming with parallelism in mind)
  • Hardware cost
    • adding more functional units is easy
    • more memory ports and register ports are needed
    • dependency checking needs O(n^2) comparisons
  • Limitations of VLIW processors
    • loop unrolling increases code size
    • unfilled slots waste bits
    • a cache miss stalls the pipeline
      • research topic: scheduling loads
    • binary incompatibility (not EPIC)


Measuring Available ILP: How?

  • Using an existing compiler
  • Using trace analysis
    • track all the real data dependencies (RaWs) of instructions from the issue window
      • register dependences
      • memory dependences
    • check for correct branch prediction
      • if the prediction is correct, continue
      • if wrong, flush the schedule and start in the next cycle


Trace Analysis

Program:

    for i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:

    set r1,0
    set r2,3
    set r3,&A
Loop:
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    add r1,r5,3

Trace:

    set r1,0
    set r2,3
    set r3,&A
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    add r1,r5,3

How parallel can this code be executed?


Trace Analysis (cont'd)

Parallel trace:

    cycle 1:  set r1,0       set r2,3       set r3,&A
    cycle 2:  st r1,0(r3)    add r1,r1,1    add r3,r3,4
    cycle 3:  st r1,0(r3)    add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    cycle 4:  st r1,0(r3)    add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    cycle 5:  brne r1,r2,Loop
    cycle 6:  add r1,r5,3

Max ILP = 4 (cycles 3 and 4 issue four operations)

Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Is this the maximum?
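The parallel trace can be checked mechanically (a sketch: each operation is (destination, sources); we verify that every source value was produced in an earlier cycle, then compute the numbers on the slide; r5 is assumed live on entry):

```python
# the slide's 6-cycle parallel schedule, one list of (dest, srcs) per cycle
schedule = [
    [("r1", []), ("r2", []), ("r3", [])],
    [(None, ["r1", "r3"]), ("r1", ["r1"]), ("r3", ["r3"])],
    [(None, ["r1", "r3"]), ("r1", ["r1"]), ("r3", ["r3"]), (None, ["r1", "r2"])],
    [(None, ["r1", "r3"]), ("r1", ["r1"]), ("r3", ["r3"]), (None, ["r1", "r2"])],
    [(None, ["r1", "r2"])],
    [("r1", ["r5"])],
]

produced = {"r5": 0}                 # r5 is live on entry (assumption)
for cycle, ops in enumerate(schedule, start=1):
    for dest, srcs in ops:           # reads see values from earlier cycles only
        assert all(produced.get(s, 0) < cycle for s in srcs), (cycle, srcs)
    for dest, srcs in ops:           # writes become visible after the cycle
        if dest:
            produced[dest] = cycle

n_ops = sum(len(c) for c in schedule)
max_ilp = max(len(c) for c in schedule)
print(n_ops, len(schedule), max_ilp, round(n_ops / len(schedule), 1))
```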


Ideal Processor

Assumptions for an ideal/perfect processor:

  1. Register renaming: infinite number of virtual registers => all register WaW & WaR hazards avoided
  2. Branch and jump prediction: perfect => all program instructions available for execution
  3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal

Also:

  • unlimited number of instructions issued per cycle (unlimited resources)
  • unlimited instruction window
  • perfect caches
  • 1-cycle latency for all instructions (including FP *, /)

Programs were compiled using the MIPS compiler with maximum optimization level.


Upper Limit to ILP: Ideal Processor

Figure: IPC achieved by the ideal processor: integer programs reach 18 - 60, FP programs 75 - 150.


Window Size and Branch Impact

  • Change from an infinite window to examining 2000 instructions and issuing at most 64 instructions per cycle

Figure: IPC for branch predictors of decreasing quality (perfect, tournament, BHT(512), profile-based, no prediction): FP drops to 15 - 45, integer to 6 - 12.


Impact of Limited Renaming Registers

  • Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than the tournament predictor)

Figure: IPC for the number of renaming registers going from infinite down through 256, 128, 64, and 32: FP 11 - 45, integer 5 - 15.


Memory Address Alias Impact

  • Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers

Figure: IPC for alias analysis ranging from perfect, via global/stack perfect and inspection, to none: FP 4 - 45 (Fortran, no heap), integer 4 - 9.


Window Size Impact

  • Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows

Figure: IPC versus window size: FP 8 - 45, integer 6 - 12.


How to Exceed the ILP Limits of this Study?

  • WaR and WaW hazards through memory: the study eliminated WaW and WaR hazards through register renaming, but not for memory operands
  • Unnecessary dependences
    • the compiler did not unroll loops, so there is an iteration-variable dependence
  • Overcoming the data-flow limit: value prediction, i.e., predicting values and speculating on the prediction
    • address value prediction and speculation predicts addresses and speculates by reordering loads and stores; it could provide better aliasing analysis


Workstation Microprocessors, 3/2001

  • Max issue: 4 instructions (many CPUs)
  • Max rename registers: 128 (Pentium 4)
  • Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
  • Max window size (OoO): 126 instructions (Pentium 4)
  • Max pipeline: 22/24 stages (Pentium 4)

Source: Microprocessor Report, www.MPRonline.com


SPEC 2000 Performance, 3/2001 (Source: Microprocessor Report, www.MPRonline.com)

Figure: SPEC 2000 results across workstation processors; relative performance factors annotated on the chart range from 1.2X to 3.8X.


Conclusions

  • 1985-2002: >1000X performance growth (55% per year)
  • Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit instruction-level parallelism and (real) Moore's Law to get 1.55X/year
    • caches, (super)pipelining, superscalar, branch prediction, out-of-order execution, trace cache
  • After 2002: slowdown (about 20% per year)


Conclusions (cont'd)

  • ILP limits: to make performance progress in the future, do we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by compiler/hardware?
  • Other problems:
    • processor-memory performance gap
    • VLSI scaling problems (wiring)
    • energy / leakage problems
  • However: other forms of parallelism come to the rescue

