Presentation Transcript
Advanced Computer Architecture 5MD00 / 5Z033: ILP architectures

Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
[email protected]
TU Eindhoven
2007

Topics
  • Introduction
  • Hazards
    • Data dependences
    • Control dependences
  • Branch prediction
  • Dependences limit ILP: scheduling
  • Out-Of-Order execution: Hardware speculation
  • Multiple issue
  • How much ILP is there?

Introduction

ILP = Instruction level parallelism

  • multiple operations (or instructions) can be executed in parallel

Needed:

  • Sufficient resources
  • Parallel scheduling
    • Hardware solution
    • Software solution
  • Application should contain ILP

Hazards
  • Three types of hazards
    • Structural
    • Data dependence
    • Control dependence
  • Hazards cause scheduling problems

Data dependences
  • RaW: read after write (true dependence)
  • WaR: write after read (anti dependence)
  • WaW: write after write (output dependence)
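
A minimal sketch (my own illustration, not from the slides) of the three dependence types, written as C assignments on register-like variables; the labels I1..I4 are only for reference:

/* three kinds of data dependences between "instructions" I1..I4 */
#include <stdio.h>

int main(void) {
    int r1, r2 = 10, r3 = 3, r4;

    r1 = r2 + r3;   /* I1: writes r1                                */
    r4 = r1 * 2;    /* I2: RaW on r1 (reads the value I1 produced)  */
    r2 = 0;         /* I3: WaR on r2 (I1 read r2, I3 overwrites it) */
    r4 = r3 - 1;    /* I4: WaW on r4 (I2 and I4 both write r4)      */

    printf("%d %d %d %d\n", r1, r2, r3, r4);
    return 0;
}

Only the RaW is a true dependence; the WaR and WaW are false dependences that register renaming (later in these slides) can remove.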

Control Dependences

C input code:

    if (a > b) { r = a % b; }
    else       { r = b % a; }
    y = a*b;

CFG (control flow graph), with numbered basic blocks:

    1:  sub t1, a, b
        bgz t1, 2, 3        ; continue with block 2 or block 3

    2:  rem r, a, b         3:  rem r, b, a
        goto 4                  goto 4

    4:  mul y,a,b
        ...

How real are control dependences?

Branch Prediction

Branch Prediction: Motivation
  • High branch penalties in pipelined processors:
  • With on average one out of five instructions being a branch, the maximum ILP is five
  • Situation even worse for multiple-issue processors, because we need to provide an instruction stream of n instructions per cycle.
  • Idea: predict the outcome of branches based on their history and execute instructions speculatively

5 Branch Prediction Schemes
  • 1-bit Branch Prediction Buffer
  • 2-bit Branch Prediction Buffer
  • Correlating Branch Prediction Buffer
  • Branch Target Buffer
  • Return Address Predictors

+ A way to get rid of those troublesome branches

1-bit Branch Prediction Buffer
  • 1-bit branch prediction buffer or branch history table:
  • Buffer is like a cache without tags
  • Does not help for the simple MIPS pipeline, because the target address is calculated in the same stage as the branch condition

[Figure: the low-order bits of the PC (10…..10 101 00) index the BHT, a table of single prediction bits (0 / 1).]
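
A minimal sketch (my own, not from the slides) of such a 1-bit branch history table in C; the table size (2^10 entries) and the index function (low PC bits of a word-aligned PC) are illustrative assumptions:

/* 1-bit branch history table, indexed by low PC bits, no tags */
#include <stdint.h>
#include <stdio.h>

#define K 10
static uint8_t bht[1 << K];                       /* 1 = predict taken */

static uint32_t idx(uint32_t pc)           { return (pc >> 2) & ((1u << K) - 1); }
static int  predict(uint32_t pc)           { return bht[idx(pc)]; }
static void update(uint32_t pc, int taken) { bht[idx(pc)] = (uint8_t)taken; }

int main(void) {
    /* a loop branch: taken three times, then falls through */
    uint32_t pc = 0x400100;
    int outcome[] = {1, 1, 1, 0}, miss = 0;
    for (int i = 0; i < 4; i++) {
        if (predict(pc) != outcome[i]) miss++;
        update(pc, outcome[i]);
    }
    printf("mispredictions: %d\n", miss);         /* 2: cold start + loop exit */
    return 0;
}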

Branch Prediction Buffer: 1 bit prediction

[Figure: K bits of the branch address index a table of 2^K one-bit prediction entries.]

  • Problems:
  • Aliasing: the lower K bits of different branch instructions could be the same
    • Solution: use tags (the buffer then becomes a cache); however, this is very expensive
  • Loops are predicted wrong twice
    • Solution: use an n-bit saturating counter for the prediction
      • taken if counter >= 2^(n-1)
      • not-taken if counter < 2^(n-1)
    • A 2-bit saturating counter predicts a loop wrong only once

2-bit Branch Prediction Buffer
  • Solution: 2-bit scheme where the prediction is changed only if mispredicted twice
  • Can be implemented as a saturating counter:

[State diagram: four states, two "Predict Taken" and two "Predict Not Taken". A taken outcome (T) moves one step toward strongly taken, a not-taken outcome (NT) moves one step toward strongly not taken, so only two consecutive mispredictions flip the prediction.]
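
A minimal sketch (my own, not from the slides) of one such 2-bit saturating counter; the loop outcomes in main are an illustrative pattern:

/* one 2-bit saturating counter: 0,1 predict not-taken; 2,3 predict taken */
#include <stdio.h>

static int counter = 0;                          /* 0 = strongly not-taken */

static int predict(void) { return counter >= 2; }

static void update(int taken) {
    if (taken && counter < 3)  counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* three executions of a loop that is taken 3 times, then exits */
    int outcome[] = {1,1,1,0, 1,1,1,0, 1,1,1,0}, miss = 0;
    for (int i = 0; i < 12; i++) {
        if (predict() != outcome[i]) miss++;
        update(outcome[i]);
    }
    /* 5 mispredictions: warm-up plus one per loop exit; a 1-bit
       predictor would mispredict twice per loop execution instead */
    printf("mispredictions: %d\n", miss);
    return 0;
}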

Correlating Branches
  • Fragment from SPEC92 benchmark eqntott:

C code:

    if (aa==2) aa = 0;
    if (bb==2) bb = 0;
    if (aa!=bb) {..}

Assembly (aa in R1, bb in R2):

        subi R3,R1,#2
    b1: bnez R3,L1
        add  R1,R0,R0
    L1: subi R3,R2,#2
    b2: bnez R3,L2
        add  R2,R0,R0
    L2: sub  R3,R1,R2
    b3: beqz R3,L3

  • The outcome of branch b3 is correlated with those of b1 and b2: if b1 and b2 are both not taken (aa and bb are both cleared to 0), then b3 will not be taken (aa == bb)

Correlating Branch Predictor

Idea: behavior of current branch is related to taken/not taken history of recently executed branches

    • The behavior of recent branches then selects between, say, 4 predictions for the next branch, updating just that prediction
  • (2,2) predictor: 2-bit global history, 2-bit local counters
  • A (k,n) predictor uses the behavior of the last k branches to choose from 2^k predictors, each of which is an n-bit predictor

[Figure: 4 bits from the branch address select a row of 2-bit local predictors; a 2-bit global branch-history shift register (01 = not taken, then taken) selects which of the four predictors in that row supplies the prediction.]

Branch Correlation Using Branch History
  • Two schemes (a, k, m, n)
    • PA: Per address history, a > 0
    • GA: Global history, a = 0

[Figure: a bits of the branch address index the Branch History Table (2^a entries of k history bits each); the k history bits together with m branch address bits index the Pattern History Table (2^k x 2^m entries, each an n-bit saturating up/down counter), which delivers the prediction.]

Table size (usually n = 2): #bits = k * 2^a + 2^k * 2^m * n
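
As a quick check on this formula (my arithmetic, not on the slide): the GA(0,11,5,2) configuration used on the next slide needs 11 * 2^0 + 2^11 * 2^5 * 2 = 11 + 131072 bits, i.e. about 16 Kbyte of prediction state.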

Variant: Gshare (Scott McFarling '93): GA which takes the exclusive OR (XOR) of PC address bits and branch history bits
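
A minimal sketch (my own, not from the slides) of a gshare predictor: the global history register is XORed with PC bits to index a single table of 2-bit counters. The history length and the demo branch are illustrative:

/* gshare: PHT of 2-bit counters indexed by (PC bits XOR global history) */
#include <stdint.h>
#include <stdio.h>

#define HBITS 12
#define SIZE  (1u << HBITS)

static uint8_t  pht[SIZE];          /* 2-bit saturating counters */
static uint32_t ghr;                /* global branch history     */

static uint32_t idx(uint32_t pc) { return ((pc >> 2) ^ ghr) & (SIZE - 1); }

static int predict(uint32_t pc) { return pht[idx(pc)] >= 2; }

static void update(uint32_t pc, int taken) {
    uint8_t *c = &pht[idx(pc)];
    if (taken && *c < 3)  (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghr = ((ghr << 1) | (uint32_t)(taken & 1)) & (SIZE - 1);
}

int main(void) {
    /* one branch that alternates taken / not-taken: the history
       disambiguates the two cases, so mispredictions stop after warm-up */
    int miss = 0;
    for (int i = 0; i < 1000; i++) {
        int outcome = i & 1;
        if (predict(0x40000) != outcome) miss++;
        update(0x40000, outcome);
    }
    printf("mispredictions out of 1000: %d\n", miss);
    return 0;
}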

Accuracy (taking the best combination of parameters):

[Figure: branch prediction accuracy (%) versus predictor size (64 bytes to 64K bytes) for Bimodal, GAs and PAs predictors; accuracies range from roughly 89% to 98%, with the best configurations GA(0,11,5,2) at about 98% and PA(10,6,4,2) at about 97%.]

Accuracy of Different Branch Predictors

[Figure: misprediction rates per benchmark, ranging from 0% to 18%, for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

BHT Accuracy
  • Mispredict because either:
    • Wrong guess for that branch
    • Got the branch history of the wrong branch when indexing the table
  • 4096 entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  • For SPEC92, 4096 about as good as infinite table
  • Real programs + OS more like gcc

Branch Target Buffer
  • The branch condition (taken / not taken) alone is not enough: the target address is needed as well !!
  • Branch Target Buffer (BTB): stores a tag (the branch PC) and the target address (PC if taken)

[Figure: the fetch PC (10…..10 101 00) indexes the BTB and is compared (=?) against the stored tag. Yes: the instruction is a branch; use the predicted PC as the next PC if the branch is predicted taken (the taken/not-taken prediction is often kept in a separate table). No: the instruction is not a branch; proceed normally.]
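
A minimal sketch (my own, not from the slides) of a direct-mapped BTB in C; the number of entries and the index/tag functions are illustrative, and the taken/not-taken prediction is assumed to live in a separate table as on the slide:

/* direct-mapped branch target buffer: tag (branch PC) + target (PC if taken) */
#include <stdint.h>
#include <stdio.h>

#define BTB_BITS 9
#define BTB_SIZE (1u << BTB_BITS)

struct btb_entry { uint32_t tag; uint32_t target; int valid; };
static struct btb_entry btb[BTB_SIZE];

static uint32_t index_of(uint32_t pc) { return (pc >> 2) & (BTB_SIZE - 1); }
static uint32_t tag_of(uint32_t pc)   { return pc >> (2 + BTB_BITS); }

/* returns 1 and sets *target if the fetch PC hits in the BTB */
static int btb_lookup(uint32_t pc, uint32_t *target) {
    struct btb_entry *e = &btb[index_of(pc)];
    if (e->valid && e->tag == tag_of(pc)) { *target = e->target; return 1; }
    return 0;                              /* not a (known) branch */
}

static void btb_insert(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[index_of(pc)];
    e->valid = 1; e->tag = tag_of(pc); e->target = target;
}

int main(void) {
    uint32_t next;
    btb_insert(0x400100, 0x400200);        /* a taken branch seen earlier */
    if (btb_lookup(0x400100, &next))
        printf("predicted next PC: 0x%x\n", (unsigned)next);
    return 0;
}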

Instruction Fetch Stage

Not shown: hardware needed when prediction was wrong

[Figure: fetch-stage datapath. The PC addresses the instruction memory and the BTB in parallel; the fetched instruction goes into the instruction register, and the next PC is either PC + 4 or, on a BTB hit with the branch predicted taken ("found & taken"), the BTB's target address.]

Special Case: Return Addresses
  • Register indirect branches: hard to predict target address
    • MIPS instruction: jr r31 ; PC = r31
    • useful for
      • implementing switch/case statements
      • FORTRAN computed GOTOs
      • procedure return (mainly)
  • SPEC89: 85% of such branches are procedure returns
  • Since procedure calls and returns follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries already give a very high hit rate
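
A minimal sketch (my own, not from the slides) of such a return address stack; the 8-entry size is taken from the slide's range, and wrap-around on overflow is one common policy:

/* return address stack: calls push the fall-through PC, returns pop it */
#include <stdint.h>
#include <stdio.h>

#define RAS_SIZE 8
static uint32_t ras[RAS_SIZE];
static int top;                                   /* number of pushed entries   */

static void ras_push(uint32_t return_pc) { ras[top % RAS_SIZE] = return_pc; top++; }

static uint32_t ras_pop(void) {                   /* predicted target of jr r31 */
    if (top > 0) top--;
    return ras[top % RAS_SIZE];
}

int main(void) {
    ras_push(0x400104);                           /* call site 1   */
    ras_push(0x400204);                           /* nested call   */
    printf("predicted returns: 0x%x 0x%x\n",
           (unsigned)ras_pop(), (unsigned)ras_pop());
    return 0;
}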

Dynamic Branch Prediction Summary
  • Prediction important part of scalar execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: Recently executed branches correlated with next branch
    • Either different branches
    • Or different executions of same branch
  • Branch Target Buffer: include branch target address (& prediction)
  • Return address stack for prediction of indirect jumps

Or: Avoid branches !

Predicated Instructions
  • Avoid branch prediction by turning branches into conditional or predicated instructions:
  • If false, then neither store result nor cause exception
    • Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
    • IA-64/Itanium: conditional execution of any instruction
  • Examples:

if (R1==0) R2 = R3;                CMOVZ  R2,R3,R1

if (R1 < R2)                       SLT    R9,R1,R2
    R3 = R1;                       CMOVNZ R3,R1,R9
else                               CMOVZ  R3,R2,R9
    R3 = R2;
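
The second example written as branch-free C (a sketch of my own, not from the slides); a compiler for an ISA with conditional moves can lower the ternary operator to the SLT / CMOV sequence above instead of a branch:

#include <stdio.h>

int main(void) {
    int r1 = 3, r2 = 7, r3;
    int r9 = (r1 < r2);        /* SLT    R9,R1,R2                  */
    r3 = r9 ? r1 : r2;         /* CMOVNZ R3,R1,R9 / CMOVZ R3,R2,R9 */
    printf("r3 = %d\n", r3);
    return 0;
}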

Dynamic Scheduling

Dynamic Scheduling Principle
  • What we examined so far is static scheduling
    • The compiler reorders instructions so as to avoid hazards and reduce stalls
  • Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls
  • Example:

    DIV.D F0,F2,F4     ; takes 24 cycles and is not pipelined
    ADD.D F10,F0,F8    ; must wait for F0
    SUB.D F12,F8,F14   ; cannot continue, even though it does not depend on anything

  • Key idea: allow instructions behind the stall to proceed
  • The book describes the Tomasulo algorithm, but we describe the general idea

Advantages of Dynamic Scheduling
  • Handles cases when dependences unknown at compile time
    • e.g., because they may involve a memory reference
  • It simplifies the compiler
  • Allows code compiled for one or no pipeline to run efficiently on a different pipeline
  • It enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

Superscalar Concept

[Block diagram: instructions flow from the instruction memory through the instruction cache to the decoder; decoded instructions are placed in reservation stations in front of the functional units (branch unit, ALU-1, ALU-2, logic & shift, load unit, store unit); the load/store units exchange addresses and data with the data cache, backed by the data memory; results go through a reorder buffer into the register file.]

Superscalar Issues
  • How to fetch multiple instructions in time (across basic block boundaries) ?
  • Predicting branches
  • Non-blocking memory system
  • Tune #resources (FUs, ports, entries, etc.)
  • Handling dependencies
  • How to support precise interrupts?
  • How to recover from mis-predicted branch path?
  • For the latter two issues we need to look at the sequential, look-ahead, and architectural state
    • Ref: Johnson 91

Example of Superscalar Processor Execution
  • Superscalar processor organization:
    • simple pipeline: IF, EX, WB
    • fetches 2 instructions each cycle
    • 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
    • Instruction window (buffer between IF and EX stage) is of size 2
    • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle              1   2   3   4   5   6   7

L.D   F6,32(R2)
L.D   F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

Example of Superscalar Processor Execution (cycle by cycle)

The program is now scheduled cycle by cycle on the machine described above; the table shows the pipeline stages after 7 cycles:

Cycle              1   2   3   4   5   6   7

L.D   F6,32(R2)    IF  EX  WB
L.D   F2,48(R3)    IF  EX  WB
MUL.D F0,F2,F4         IF  EX  EX  EX  EX  WB
SUB.D F8,F2,F6         IF  EX  EX  WB
DIV.D F10,F0,F6            IF              EX
ADD.D F6,F8,F2             IF      EX  EX  WB
MUL.D F12,F2,F4                    IF      ?

  • Cycle 4: DIV.D and ADD.D stall because of data dependences (waiting for F0 and F8); MUL.D F12,F2,F4 cannot be fetched because the instruction window is full
  • Cycle 6: MUL.D F12,F2,F4 cannot execute: structural hazard (the single FP multiplier is still busy)
  • Cycle 7: can MUL.D F12,F2,F4 start executing? (marked "?" above)

Register Renaming
  • A technique to eliminate anti- and output dependencies
  • Can be implemented
    • by the compiler
      • advantage: low cost
      • disadvantage: “old” codes perform poorly
    • in hardware
      • advantage: binary compatibility
      • disadvantage: extra hardware needed
  • We describe general idea

Register Renaming (cont'd)
  • there is a physical register file, larger than the logical register file
  • a mapping table associates logical registers with physical registers
  • when an instruction is decoded
    • its physical source registers are obtained from the mapping table
    • its physical destination register is obtained from a free list
    • the mapping table is updated

Example: renaming  add r3,r3,4

before:
    add r3,r3,4
    mapping table: r0->R8, r1->R7, r2->R5, r3->R1, r4->R9
    free list: R2, R6

after:
    add R2,R1,4
    mapping table: r0->R8, r1->R7, r2->R5, r3->R2, r4->R9
    free list: R6

Eliminating False Dependencies
  • How register renaming eliminates false dependencies:
  • Before:
      • addi r1, r2, 1
      • addi r2, r0, 0      (WaR on r2 with the first instruction)
      • addi r1, r0, 1      (WaW on r1 with the first instruction)
  • After renaming (free list: R7, R8, R9) the false dependences are gone:
      • addi R7, R5, 1
      • addi R8, R0, 0
      • addi R9, R0, 1
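
A small sketch (my own, not from the slides) of the renaming step that produces exactly the "after" code above; the initial mapping (r0->R0, r2->R5, the rest arbitrary) and the free list R7, R8, R9 are the assumptions needed for this example:

/* rename "addi rd, rs, imm" via a mapping table and a free list */
#include <stdio.h>

static int map[5]      = {0, 3, 5, 4, 6};   /* logical -> physical register */
static int free_list[] = {7, 8, 9};         /* available physical registers */
static int head;

static void rename_addi(int rd, int rs, int imm) {
    int ps = map[rs];                       /* read source through the map  */
    int pd = free_list[head++];             /* fresh destination register   */
    map[rd] = pd;                           /* update the map               */
    printf("addi R%d, R%d, %d\n", pd, ps, imm);
}

int main(void) {
    rename_addi(1, 2, 1);                   /* addi r1, r2, 1 -> addi R7, R5, 1 */
    rename_addi(2, 0, 0);                   /* addi r2, r0, 0 -> addi R8, R0, 0 */
    rename_addi(1, 0, 1);                   /* addi r1, r0, 1 -> addi R9, R0, 1 */
    return 0;
}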

Limitations of Multiple-Issue Processors
  • Available ILP is limited (we’re not programming with parallelism in mind)
  • Hardware cost
    • adding more functional units is easy
    • more memory ports and register ports needed
    • dependency check needs O(n²) comparisons (see the illustration after this list)
  • Limitations of VLIW processors
    • Loop unrolling increases code size
    • Unfilled slots waste bits
    • Cache miss stalls pipeline
      • Research topic: scheduling loads
    • Binary incompatibility (not EPIC)
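
A rough illustration of the quadratic growth mentioned above (my arithmetic, not from the slides): with issue width n, each instruction in an issue group must compare its source registers against the destinations of all earlier instructions in the group, roughly 2 * n(n-1)/2 = n(n-1) comparisons; going from 4-issue (12 comparisons) to 8-issue (56) more than quadruples the checking logic.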

Measuring available ILP: How?
  • Using existing compiler
  • Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
      • register dependence
      • memory dependence
    • Check for correct branch prediction
      • if prediction correct continue
      • if wrong, flush schedule and start in next cycle

Trace analysis

Program:

    For i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:

        set  r1,0
        set  r2,3
        set  r3,&A
Loop:   st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3

Trace (the executed instruction stream):

    set r1,0
    set r2,3
    set r3,&A
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    st r1,0(r3)
    add r1,r1,1
    add r3,r3,4
    brne r1,r2,Loop
    add r1,r5,3

How parallel can this code be executed?

Trace analysis

Parallel Trace

    set r1,0          set r2,3          set r3,&A
    st r1,0(r3)       add r1,r1,1       add r3,r3,4
    st r1,0(r3)       add r1,r1,1       add r3,r3,4       brne r1,r2,Loop
    st r1,0(r3)       add r1,r1,1       add r3,r3,4       brne r1,r2,Loop
    brne r1,r2,Loop
    add r1,r5,3

Max ILP = 4 (the widest row above)

Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Is this the maximum?

Ideal Processor

Assumptions for ideal/perfect processor:

1. Register renaming: infinite number of virtual registers, so all register WAW & WAR hazards are avoided

2. Branch and jump prediction: perfect, so all program instructions are available for execution

3. Memory-address alias analysis: all addresses are known; a store can be moved before a load provided the addresses are not equal

Also:

  • unlimited number of instructions issued per cycle (unlimited resources), and
  • unlimited instruction window
  • perfect caches
  • 1 cycle latency for all instructions (including FP * and /)

Programs were compiled using the MIPS compiler with maximum optimization level

Upper Limit to ILP: Ideal Processor

[Figure: IPC of the ideal processor per benchmark; integer programs: 18 - 60, FP programs: 75 - 150.]

Window Size and Branch Impact
  • Change from an infinite window to a window that examines at most 2000 instructions and issues at most 64 instructions per cycle

[Figure: IPC per benchmark for Perfect, Tournament, BHT(512), Profile and No prediction; FP programs: 15 - 45, integer programs: 6 - 12.]

Impact of Limited Renaming Registers
  • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

[Figure: IPC per benchmark for an infinite number of renaming registers and for 256, 128, 64 and 32; FP programs: 11 - 45, integer programs: 5 - 15.]

Memory Address Alias Impact
  • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

[Figure: IPC per benchmark for Perfect, Global/stack perfect, Inspection and no alias analysis; FP programs: 4 - 45 (Fortran, no heap), integer programs: 4 - 9.]

Window Size Impact
  • Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

[Figure: IPC per benchmark versus window size; FP programs: 8 - 45, integer programs: 6 - 12.]

How to Exceed ILP Limits of this Study?
  • WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not for memory operands
  • Unnecessary dependences
    • (the compiler did not unroll loops, so each iteration depends on the previous one through the loop iteration variable)
  • Overcoming the data flow limit: value prediction, predicting values and speculating on prediction
    • Address value prediction and speculation predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis

Workstation Microprocessors 3/2001
  • Max issue: 4 instructions (many CPUs)
  • Max rename registers: 128 (Pentium 4)
  • Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
  • Max window size (OoO): 126 instructions (Pentium 4)
  • Max pipeline: 22/24 stages (Pentium 4)

Source: Microprocessor Report, www.MPRonline.com

SPEC 2000 Performance 3/2001    Source: Microprocessor Report, www.MPRonline.com

[Figure: SPEC 2000 results for current workstation processors; the annotated performance ratios between processors are 1.2X, 1.5X, 1.6X, 1.7X and 3.8X.]

Conclusions
  • 1985-2002: >1000X performance (55% / y)
  • Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
    • Caches, (Super)Pipelining, Superscalar, Branch Prediction, Out-of-order execution, Trace cache
  • After 2002 slowdown (about 20%/y)

Conclusions (cont'd)
  • ILP limits: to make performance progress in the future, do we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by the compiler and hardware?
  • Other problems:
    • Processor-memory performance gap
    • VLSI scaling problems (wiring)
    • Energy / leakage problems
  • However: other forms of parallelism come to the rescue
