Advanced Computer Architecture 5MD00: Exploiting ILP with SW approaches

Henk Corporaal

www.ics.ele.tue.nl/~heco

TU Eindhoven

December 2012

Topics
  • Static branch prediction and speculation
  • Basic compiler techniques
  • Multiple issue architectures
  • Advanced compiler support techniques
    • Loop-level parallelism
    • Software pipelining
  • Hardware support for compile-time scheduling

We previously discussed dynamic branch prediction. This does not help the compiler!

Should the compiler speculate operations (i.e., move operations up across a branch) from the target path or from the fall-through path?

  • We need Static Branch Prediction

Static Branch Prediction and Speculation
  • Static branch prediction useful for code scheduling
  • Example:

    ld   r1,0(r2)
    sub  r1,r1,r3    # hazard
    beqz r1,L
    or   r4,r5,r6
    addu r10,r4,r3
L:  addu r7,r8,r9

  • If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
  • If the branch is not taken most of the time, and assuming that r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld
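A small C sketch (not from the slides) of how such static predictions reach a compiler in practice: the GCC/Clang __builtin_expect hint marks the expected branch direction, so the compiler can lay out and schedule the likely path first. The macro names are just a common convention.

  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  int f(int a, int b, int c) {
      if (unlikely(a == 0)) {   /* predicted not taken: keep this path out of line */
          b = b + c;
      }
      return b;                 /* fall-through path scheduled first */
  }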

4 Static Branch Prediction Methods
  • Always predict taken
    • Average misprediction rate for SPEC: 34% (9%-59%)
  • Backward branches predicted taken, forward branches not taken
    • In SPEC, most forward branches are taken, so always predict taken is better
  • Profiling
    • Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
    • Behavior of a branch is often biased to taken or not taken
    • Average misprediction rate for SPECint: 15% (11%-22%), SPECfp: 9% (5%-15%)
  • Can we do better? YES, use control flow restructuring to exploit correlation

Static exploitation of correlation

[Figure: two control-flow graphs. In the original CFG, block a ends in "bez t,b,c" and branches to b or c; b and c merge into d; d ends in "bez t,e,f" and branches to e or f; e and f merge into g. If there is correlation, the branch direction in block d depends on the branch in block a. Control-flow restructuring duplicates d into d and d', so each path out of a gets its own copy of the second branch.]

Basic compiler techniques
  • Dependencies limit ILP (Instruction-Level Parallelism)
    • We cannot always find enough independent operations to fill all the delay slots
    • May result in pipeline stalls
  • Scheduling to avoid stalls (= reorder instructions)
  • (Source-)code transformations to create more exploitable parallelism
    • Loop Unrolling
    • Loop Merging (Fusion)
      • see the online slide set about loop transformations!

Dependencies Limit ILP: Example
  • C loop:

    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;

MIPS assembly code:

        ; R1 = &x[1]
        ; R2 = &x[1000]+8
        ; F2 = s
  Loop: L.D   F0,0(R1)     ; F0 = x[i]
        ADD.D F4,F0,F2     ; F4 = x[i]+s
        S.D   0(R1),F4     ; x[i] = F4
        ADDI  R1,R1,8      ; R1 = &x[i+1]
        BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8

Schedule this on a MIPS Pipeline
  • FP operations are mostly multicycle
  • The pipeline must be stalled if an instruction uses the result of a not yet finished multicycle operation
  • We’ll assume the following latencies

  Producing instruction   Consuming instruction   Latency (clock cycles)
  FP ALU op               FP ALU op               3
  FP ALU op               Store double            2
  Load double             FP ALU op               1
  Load double             Store double            0

Where to Insert Stalls?
  • How would this loop be executed on the MIPS FP pipeline?

Note the inter-iteration dependence (through R1)!

  Loop: L.D   F0,0(R1)
        ADD.D F4,F0,F2
        S.D   F4,0(R1)
        ADDI  R1,R1,8
        BNE   R1,R2,Loop

What are the true (flow) dependences?

Where to Insert Stalls
  • How would this loop be executed on the MIPS FP pipeline?
  • 10 cycles per iteration

  Loop: L.D   F0,0(R1)
        stall
        ADD.D F4,F0,F2
        stall
        stall
        S.D   0(R1),F4
        ADDI  R1,R1,8
        stall
        BNE   R1,R2,Loop
        stall

Code Scheduling to Avoid Stalls
  • Can we reorder the instructions to avoid stalls?
  • Execution time reduced from 10 to 6 cycles per iteration
  • But only 3 instructions perform useful work; the rest is loop overhead. How can we avoid this overhead?

  Loop: L.D   F0,0(R1)
        ADDI  R1,R1,8
        ADD.D F4,F0,F2
        stall
        BNE   R1,R2,Loop
        S.D   -8(R1),F4    ; watch out! offset adjusted to -8 because the S.D now follows the ADDI

Loop Unrolling: increasing ILP
  • MIPS code after unrolling (4 times) and scheduling:

    Loop: L.D   F0,0(R1)
          L.D   F6,8(R1)
          L.D   F10,16(R1)
          L.D   F14,24(R1)
          ADD.D F4,F0,F2
          ADD.D F8,F6,F2
          ADD.D F12,F10,F2
          ADD.D F16,F14,F2
          S.D   0(R1),F4
          S.D   8(R1),F8
          ADDI  R1,R1,32
          S.D   -16(R1),F12
          BNE   R1,R2,Loop
          S.D   -8(R1),F16

At source level:

  for (i=1; i<=1000; i++)
      x[i] = x[i] + s;

becomes, after unrolling by 4:

  for (i=1; i<=1000; i=i+4) {
      x[i]   = x[i]   + s;
      x[i+1] = x[i+1] + s;
      x[i+2] = x[i+2] + s;
      x[i+3] = x[i+3] + s;
  }

  • Any drawbacks?
    • loop unrolling increases code size
    • more registers needed
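A minimal C sketch (not from the slides; the function name is illustrative) of one more practical point: when the trip count is not a multiple of the unroll factor, the unrolled loop needs a cleanup loop for the leftover iterations.

  /* x is indexed 1..N, as in the slides */
  void add_scalar(double *x, int N, double s) {
      int i;
      for (i = 1; i + 3 <= N; i += 4) {   /* unrolled main loop */
          x[i]   += s;
          x[i+1] += s;
          x[i+2] += s;
          x[i+3] += s;
      }
      for (; i <= N; i++)                 /* cleanup: remaining 0..3 elements */
          x[i] += s;
  }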

Multiple issue architectures

How to get CPI < 1 ?

  • Superscalar: multiple instructions issued per cycle
    • Statically scheduled
    • Dynamically scheduled (see previous lecture)
  • VLIW ?
    • single instruction issue, but multiple operations per instruction (so CPI ≥ 1, while operations per cycle > 1)
  • SIMD / Vector ?
    • single instruction issue, single operation, but multiple data elements per operation (so CPI ≥ 1, while data operations per cycle > 1)
  • Multi-threading ? (e.g. x86 Hyperthreading)
  • Multi-processor ? (e.g. x86 Multi-core)

Instruction Parallel (ILP) Processors

The name ILP is used for:

  • Multiple-Issue Processors
    • Superscalar: varying no. instructions/cycle (0 to 8), scheduled by HW (dynamic issue capability)
      • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4, etc.
    • VLIW (very long instr. word): fixed number of instructions (4-16) scheduled by the compiler (static issue capability)
      • Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
  • (Super-) pipelined processors
  • The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI

Vector processors
  • Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
    • Multimedia instructions being added to many processors
  • Different implementations:
    • real SIMD;
      • e.g. 320 separate 32-bit ALUs + RFs
    • (multiple) subword units:
      • divide a single ALU into sub ALUs
    • deeply pipelined units:
      • aiming at very high frequency;
      • with forwarding between units

Simple In-order Superscalar
  • In-order Superscalar 2-issue processor: 1 Integer & 1 FP
    • Used in first Pentium processor (also in Larrabee, but canceled!!)

    • Fetch 64 bits/clock cycle; Int on left, FP on right
    • Can only issue the 2nd instruction if the 1st instruction issues
    • More ports needed on the FP register file to execute an FP load & FP op in parallel

  Type               Pipe stages (cycles 1-8)
  Int. instruction   IF  ID  EX  MEM WB
  FP instruction     IF  ID  EX  MEM WB
  Int. instruction       IF  ID  EX  MEM WB
  FP instruction         IF  ID  EX  MEM WB
  Int. instruction           IF  ID  EX  MEM WB
  FP instruction             IF  ID  EX  MEM WB

  • 1 cycle load delay impacts the next 3 instructions !

Dynamic trace for unrolled code

Load: 1 cycle latency; ALU op: 2 cycles latency

  for (i=1; i<=1000; i++)
      a[i] = a[i] + s;

  Integer instruction    FP instruction       Cycle
  L: LD  F0,0(R1)                               1
     LD  F6,8(R1)                               2
     LD  F10,16(R1)      ADDD F4,F0,F2          3
     LD  F14,24(R1)      ADDD F8,F6,F2          4
     LD  F18,32(R1)      ADDD F12,F10,F2        5
     SD  0(R1),F4        ADDD F16,F14,F2        6
     SD  8(R1),F8        ADDD F20,F18,F2        7
     SD  16(R1),F12                             8
     ADDI R1,R1,40                              9
     SD  -16(R1),F16                           10
     BNE R1,R2,L                               11
     SD  -8(R1),F20                            12

  • 2.4 cycles per element vs. 3.5 for ordinary MIPS pipeline
  • Int and FP instructions not perfectly balanced

Superscalar – Multi-issue Issues
  • While Integer/FP split is simple for the HW, get IPC of 2 only for programs with:
    • Exactly 50% FP operations AND no hazards
  • More complex decode and issue! E.g., already for a 2-issue processor we need:
    • Issue logic: examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue (N-issue needs ~O(N²) comparisons; see the sketch after this list)
    • Register file complexity: for 2-issue superscalar: needs 4 reads and 2 writes/cycle
    • Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:

  add r1, r2, r3     ->   add p11, p4, p7
  sub r4, r1, r2     ->   sub p22, p11, p4
  lw  r1, 4(r4)      ->   lw  p23, 4(p22)
  add r5, r1, r2     ->   add p12, p23, p4

Imagine doing this transformation in a single cycle!

    • Bypassing / Result buses: Need to complete multiple instructions/cycle
      • Need multiple buses with associated matching logic at every reservation station.
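As a rough illustration of the issue-logic cost, here is a minimal C sketch (the instruction encoding and function names are assumptions, not from the slides) of the pairwise hazard checks an in-order issue stage performs; for an N-wide group this is O(N²) comparisons of register specifiers.

  typedef struct { int opcode, dst, src1, src2; } Instr;

  /* can instruction b issue in the same cycle as (older) instruction a? */
  int no_hazard(const Instr *a, const Instr *b) {
      if (b->src1 == a->dst || b->src2 == a->dst) return 0;  /* RAW */
      if (b->dst == a->dst)                       return 0;  /* WAW */
      return 1;                                   /* WAR ignored here (in-order write-back) */
  }

  /* all-pairs check over a candidate issue group of n decoded instructions */
  int can_issue_group(const Instr *ins, int n) {
      for (int i = 0; i < n; i++)
          for (int j = i + 1; j < n; j++)
              if (!no_hazard(&ins[i], &ins[j])) return 0;
      return 1;
  }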

Why not VLIW Processors?
  • Superscalar HW is expensive to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
  • Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay (loop unrolled 7 times):

  Ld/st 1          Ld/st 2          FP 1             FP 2             Int
  LD F0,0(R1)      LD F6,8(R1)
  LD F10,16(R1)    LD F14,24(R1)
  LD F18,32(R1)    LD F22,40(R1)    ADDD F4,F0,F2    ADDD F8,F6,F2
  LD F26,48(R1)                     ADDD F12,F10,F2  ADDD F16,F14,F2
                                    ADDD F20,F18,F2  ADDD F24,F22,F2
  SD 0(R1),F4      SD 8(R1),F8      ADDD F28,F26,F2
  SD 16(R1),F12    SD 24(R1),F16
  SD 32(R1),F20    SD 40(R1),F24                                      ADDI R1,R1,56
  SD -8(R1),F28                                                       BNE R1,R2,L

9/7 cycles per iteration!

Superscalar versus VLIW

VLIW advantages:

  • Much simpler to build. Potentially faster

VLIW disadvantages and proposed solutions:

  • Binary code incompatibility
    • Object code translation or emulation
    • Less strict approach (EPIC, IA-64, Itanium)
  • Increase in code size, unfilled slots are wasted bits
    • Use clever encodings, only one immediate field
    • Compress instructions in memory and decode them when they are fetched, or when put in L1 cache
  • Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled
    • Less strict approach

Use compressed instructions

[Figure: compressed instructions stored in memory flow from Memory to the L1 Instruction Cache to the CPU; decompression can be done either between memory and the L1 cache, or between the L1 cache and the CPU.]

Q: What are the pros and cons of each option?

Advanced compiler support techniques
  • Loop-level parallelism
  • Software pipelining
  • Global scheduling (across basic blocks)

Detecting Loop-Level Parallelism
  • Loop-carried dependence: a statement executed in a certain iteration is dependent on a statement executed in an earlier iteration
  • If there is no loop-carried dependence, then its iterations can be executed in parallel

  for (i=1; i<=100; i++) {
      A[i+1] = A[i] + C[i];     /* S1 */
      B[i+1] = B[i] + A[i+1];   /* S2 */
  }

[Dependence graph: S1 and S2, with loop-carried dependences S1 → S1 and S2 → S2, and an intra-iteration dependence S1 → S2.]

A loop is parallel ⇔ the corresponding dependence graph does not contain a cycle
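For contrast, a minimal sketch (not from the slides): the loop below has only an intra-iteration dependence from S1 to S2, so there is no cycle of loop-carried dependences and its iterations can run in parallel.

  for (i = 1; i <= 100; i++) {
      T[i] = A[i] + C[i];   /* S1 */
      B[i] = 2 * T[i];      /* S2: depends only on S1 of the same iteration */
  }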

Finding Dependences
  • Is there a dependence in the following loop?

  for (i=1; i<=100; i++)
      A[2*i+3] = A[2*i] + 5.0;

  • Affine expression: an expression of the form a*i + b (a, b constants, i the loop index variable)
  • Can a write to A[a*i+b] and a read from A[c*j+d] ever access the same element, i.e., does the following equation have a solution for i, j within the loop bounds?

a*i + b = c*j + d

  • GCD test: if there is a solution, then GCD(a,c) must divide d-b

Note: Because the GCD test does not take the loop bounds into account, there are cases where the GCD test says “yes, there is a solution” while in reality there isn’t
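A minimal C sketch of the GCD test under these assumptions (the function names are illustrative, not from the slides); it conservatively reports whether a dependence between A[a*i+b] (written) and A[c*j+d] (read) may exist.

  #include <stdio.h>

  /* greatest common divisor (Euclid); assumes a and c are not both zero */
  static int gcd(int a, int b) {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a < 0 ? -a : a;
  }

  /* GCD test: a solution of a*i + b = c*j + d can only exist
   * if gcd(a,c) divides (d - b); loop bounds are ignored */
  static int may_depend(int a, int b, int c, int d) {
      return (d - b) % gcd(a, c) == 0;
  }

  int main(void) {
      /* slide example: A[2*i+3] = A[2*i] + 5.0  ->  a=2, b=3, c=2, d=0 */
      printf("may depend: %d\n", may_depend(2, 3, 2, 0));   /* 2 does not divide -3 -> prints 0 */
      return 0;
  }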

Software Pipelining
  • We have already seen loop unrolling
  • Software pipelining is a related technique that consumes less code space; it interleaves instructions from different iterations
    • instructions in one iteration are often dependent on each other

[Figure: iterations 0, 1 and 2 overlapped in time; a software-pipelined iteration takes instructions from several original iterations, forming a steady-state kernel.]
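A source-level sketch of the idea (a minimal example, not from the slides) for the loop x[i] = x[i] + s: the kernel of the software-pipelined loop mixes the store, add and load of three consecutive original iterations, with a prologue to fill and an epilogue to drain the pipeline.

  void add_s_swp(double *x, int N, double s) {   /* assumes N >= 2 */
      double a, t;
      int i;
      a = x[0] + s;        /* prologue: add for iteration 0   */
      t = x[1];            /*           load for iteration 1  */
      for (i = 2; i < N; i++) {
          x[i-2] = a;      /* kernel: store for iteration i-2 */
          a = t + s;       /*         add   for iteration i-1 */
          t = x[i];        /*         load  for iteration i   */
      }
      x[N-2] = a;          /* epilogue: drain the pipeline    */
      x[N-1] = t + s;
  }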

Simple Software Pipelining Example

  L:  l.d   f0,0(r1)      # load M[i]
      add.d f4,f0,f2      # compute M[i]
      s.d   f4,0(r1)      # store M[i]
      addi  r1,r1,-8      # i = i-1
      bne   r1,r2,L

  • Software pipelined loop:

  L:  s.d   f4,16(r1)     # store M[i]
      add.d f4,f0,f2      # compute M[i-1]
      l.d   f0,0(r1)      # load M[i-2]
      addi  r1,r1,-8
      bne   r1,r2,L

  • Need hardware to avoid the WAR hazards

Global code scheduling
  • Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e., the loop body is a single basic block
  • Global code scheduling: scheduling/moving code across branches: larger scheduling scope
  • When can the assignments to B and C be moved before the test?

[Figure: A[i] = A[i] + B[i], followed by the test "A[i] == 0?"; the true path assigns to B[i], the false path executes some other code X; both paths join at an assignment to C[i].]

Which scheduling scope?

[Figure: alternative scheduling scopes illustrated on a control-flow graph: trace, superblock, hyperblock/region, and decision tree.]

Comparing scheduling scopes

                            Trace   Sup.block   Hyp.block   Dec.Tree   Region
  Multiple exc. paths       No      No          Yes         Yes        Yes
  Side-entries allowed      Yes     No          No          No         No
  Join points allowed       Yes     No          Yes         No         Yes
  Code motion down joins    Yes     No          No          No         No
  Must be if-convertible    No      No          Yes         No         No
  Tail dup. before sched.   No      Yes         No          Yes        No

Scheduling scope creation (1)

Partitioning a CFG into scheduling scopes:

[Figure: a control-flow graph with blocks A-G. Left: a trace is selected along one path. Right: tail duplication (creating D', E', G') turns the selected path into a superblock with a single entry.]

Trace Scheduling
  • Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
  • Optimize the trace as much as possible (trace compaction)
    • move operations as early as possible in the trace
    • pack the operations in as few VLIW instructions as possible
    • additional bookkeeping code may be necessary on exit points of the trace

Scheduling scope creation (2)

Partitioning a CFG into scheduling scopes:

[Figure: the same control-flow graph partitioned into a hyperblock/region and, after tail duplication (creating D', E', F', G', G''), into decision trees.]

Code movement (upwards) within regions

[Figure: an add operation is moved upwards from a source block to a destination block within a region; intermediate blocks (I) lie on the paths in between. Legend: blocks where a copy of the operation is needed, and edges where off-liveness must be checked.]

Hardware support for compile-time scheduling
  • Predication
    • (discussed already)
    • see also Itanium example
  • Deferred exceptions
  • Speculative loads

Predicated Instructions (discussed before)
  • Avoid branch prediction by turning branches into conditional or predicated instructions:

If false, then neither store result nor cause exception

    • The expanded ISAs of Alpha, MIPS, PowerPC and SPARC have a conditional move; PA-RISC can annul any following instruction
    • IA-64/Itanium: conditional execution of any instruction
  • Examples:

  if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

  if (R1 < R2)               SLT    R9,R1,R2
      R3 = R1;               CMOVNZ R3,R1,R9
  else                       CMOVZ  R3,R2,R9
      R3 = R2;
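A small C sketch (not from the slides; whether a conditional move is actually emitted depends on the compiler, optimization flags and target ISA) of source code that is a natural candidate for this kind of if-conversion:

  /* branch-free candidate for SLT + CMOVNZ/CMOVZ as in the example above */
  int select_min(int r1, int r2) {
      return (r1 < r2) ? r1 : r2;
  }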

Deferred Exceptions
      ld   r1,0(r3)     # load A
      bnez r1,L1        # test A
      ld   r1,0(r2)     # then part; load B
      j    L2
  L1: addi r1,r1,4      # else part; inc A
  L2: st   r1,0(r3)     # store A

Corresponding source:

  if (A == 0)
      A = B;
  else
      A = A + 4;

  • What if the load generates a page fault?
  • What if the load generates an “index-out-of-bounds” exception?
  • How to optimize when the then-part is usually selected? Move the load of B up, speculatively:

      ld   r1,0(r3)     # load A
      ld   r9,0(r2)     # speculative load of B
      beqz r1,L3        # test A
      addi r9,r1,4      # else part
  L3: st   r9,0(r3)     # store A
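A small C sketch (not from the slides) of why this speculation needs hardware support: in plain source code a load may be guarded precisely because it can fault, so hoisting it above the test changes behaviour on the path that never needed it.

  /* the load through p is guarded; hoisting *p above the test
   * could fault (e.g. page fault) when p is NULL or out of bounds */
  int value_or_default(const int *p, int fallback) {
      if (p != NULL)
          return *p;
      return fallback;
  }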

HW supporting Speculative Loads
  • Speculative load (sld): does not generate exceptions
  • Speculation check instruction (speck): check for exception. The exception occurs when this instruction is executed.
      ld    r1,0(r3)    # load A
      sld   r9,0(r2)    # speculative load of B
      bnez  r1,L1       # test A
      speck 0(r2)       # perform exception check
      j     L2
  L1: addi  r9,r1,4     # else part
  L2: st    r9,0(r3)    # store A

[Final slide. Figure: technology trends up to the Core i7 (3 GHz, 100 W) and the question "Next?". Trends: the number of transistors still follows Moore's law, but clock frequency and performance per core do not.]