Instruction level parallelism
This presentation is the property of its rightful owner.
Sponsored Links
1 / 39

Instruction-level Parallelism PowerPoint PPT Presentation


  • 47 Views
  • Uploaded on
  • Presentation posted in: General

Instruction-level Parallelism. Compiler Perspectives on Code Movement. dependencies are a property of code, whether or not it is a HW hazard depends on the given pipeline. Compiler must respect ( True ) Data dependencies (RAW) Instruction i produces a result used by instruction j, or

Download Presentation

Instruction-level Parallelism

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Instruction level parallelism

Instruction-level Parallelism


Compiler perspectives on code movement

Compiler Perspectives onCode Movement

  • dependencies are a property of code, whether or not it is a HW hazard depends on the given pipeline.

  • Compiler must respect (True) Data dependencies (RAW)

    • Instruction i produces a result used by instruction j, or

    • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.


Compiler perspectives on code movement1

Compiler Perspectives onCode Movement

  • Other kinds of dependence also called name (false) dependence: two instructions use same name but don’t exchange data

  • Antidependence (WAR dependence)

    • Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first

  • Output dependence (WAW dependence)

    • Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.


Control dependence

Control Dependence

  • Control Dependence

    • Example

      if (c1)

      I1;

      if (c2)

      I2;

      I1 is control dependent on c1 and I2 is control dependent on c2 but not on c1.


A sample loop

A sample loop

Loop:LDF0,0(R1);F0=array element, R1=X[]

MULDF4,F0,F2;multiply scalar in F2

SDF4, 0(R1);store result

ADDIR1,R1,8;increment pointer 8B (DW)

SEQ R3, R1, R2;R2 = &X[1001]

BNEZR3,Loop;branch R3!=zero

NOP;delayed branch slot

Where are the dependencies and stalls?

OperationLatency (stalls)

FP Mult6 (5)

LD2 (1)

Int ALU1 (0)


Instruction scheduling

Instruction Scheduling

Loop:LDF0,0(R1)

MULDF4,F0,F2

SD0(R1),F4

ADDIR1,R1,8

SEQ R3, R1, R2

BNEZR3,Loop

NOP

Number of cycle per iteration?


Instruction scheduling1

Instruction Scheduling

Loop:LDF0,0(R1)

MULDF4,F0,F2

SD0(R1),F4

ADDIR1,R1,8

SEQ R3, R1, R2

BNEZR3,Loop

NOP

Loop:LDF0,0(R1)

ADDIR1,R1,8

MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Cycles/iteration?


Loop unrolling

Loop Unrolling

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Can extract more parallelism


Loop unrolling1

Loop Unrolling

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

What is the problem here?


Loop unrolling2

Loop Unrolling

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Unnecessary instructions and redundant instructions


Loop unrolling3

Loop Unrolling

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Loop:LDF0,0(R1)

MULDF4,F0,F2

SD0(R1),F4

LDF0,8(R1)

ADDIR1,R1,16MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Hint

Still problems with scheduling?


Register renaming

Register Renaming

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Loop:LDF0,0(R1)

MULDF4,F0,F2

SD0(R1),F4

LDF10,8(R1)

ADDIR1,R1,16MULDF14,F10,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F14

Let’s schedule now


Register renaming1

Register Renaming

Loop:LDF0,0(R1)

ADDIR1,R1,8MULDF4,F0,F2

SEQ R3, R1, R2

BNEZR3,Loop

SD-8(R1),F4

Loop:LDF0,0(R1)

LDF10,8(R1)MULDF4,F0,F2

MULDF14,F10,F2ADDIR1,R1,16

SEQ R3, R1, R2

SD0(R1),F4BNEZR3,Loop

SD-8(R1),F14

Cycles/iteration?


How easy is it to determine dependences

How easy is it to determine dependences?

  • Easy to determine for registers (fixed names)

  • Hard for memory:

    • Does 100(R4) = 20(R6)?

    • From different loop iterations, does 20(R6) = 20(R6)?

  • Another Example:

    ST R5, R6

    LD R4, R3


Memory disambiguation

Memory Disambiguation

  • Problem: In many cases, it is likely but not certain that two memory instructions reference different addresses

    • Disambiguation is much harder in languages with pointers

  • Example:

    void annoy_compiler1(char *foo, char *bar) {

    foo[2] = bar[2];

    bar[3] = foo[3];

    }

    Memory references are independent unless foo = bar


Disambiguation 2

Disambiguation 2

  • Making things worse, some programs have independent memory references some of the time

  • Example:

    void annoy_compiler2(int *a, int *b) {

    int I;

    for (I = 0; I < 256; I++){

    a[I] = b[f(I)];

    }

    }

  • Conventional compiler needs to assume that any references that could be to the same location are to the same location and serialize them


Hw schemes instruction parallelism

HW Schemes: Instruction Parallelism

  • Why in HW at run time?

    • Works when can’t know dependence until run time

      • Variable latency

      • Control dependent data dependence

    • Can schedule differently every time through the code.

    • Compiler simpler

    • Code for one machine runs well on another

  • Hardware techniques to find/extract ILP

    • Tomasulo’s Algorithm for Out-of-order Execution


Tomasulo s algorithm

Tomasulo’s Algorithm

  • Developed for architecture of IBM 360/91 (1967)

    • 360/91 system’s goal was to significantly improve performance (especially floating-point) without requiring people to change their code

      • Sound familiar?

16MHz

2MB Mem

50X faster

Than SOA


Tomasulo organization

Tomasulo Organization


Tomasulo algorithm

Tomasulo Algorithm

  • Consider three input instructions

  • Common Data Bus broadcasts results to all FUs

    RS’s (FU’s), registers, etc. responsible for collecting own data off CDB

  • Load and Store Queues treated as FUs as well


Reservation station components

Reservation Station Components

Op—Operation to perform in the unit (e.g., + or –)

Qj, Qk—Reservation stations producing source registers

Vj, Vk—Value of Source operands

Rj, Rk—Flags indicating when Vj, Vk are ready

Busy—Indicates reservation station is busy

Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.


Three stages of tomasulo algorithm

Three Stages of Tomasulo Algorithm

1.Issue—get instruction from FP Op Queue

If reservation station free, the scoreboard issues instr & sends operands (renames registers).

2.Execution—operate on operands (EX)

When both operands ready then execute; if not ready, watch CDB for result

3.Write result—finish execution (WB)

Write on Common Data Bus to all waiting units; mark reservation station available.


Tomasulo example

Tomasulo Example

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

Multiply takes 10 clocks, add/sub take 4


Tomasulo cycle 0

Tomasulo – cycle 0

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

ADDD F2, F8, F0

2.0

SUBD F8, F2, F0

4.0

ADDD F6, F8, F6

6.0

8.0

MULD F8, F4, F2

ADDD F4, F2, F0

1

2

3

1

2

FP adders

FP mult’s


Tomasulo cycle 1

Tomasulo – cycle 1

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

ADDD F2, F8, F0

4.0

add1

SUBD F8, F2, F0

6.0

8.0

ADDD F6, F8, F6

MULD F8, F4, F2

1

2

3

ADDD

2.0

0.0

1

2

FP adders

FP mult’s


Tomasulo cycle 2

Tomasulo – cycle 2

Op Qj Qk Vj Vk Busy

MULD add1 - - 2.0 Y

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

4.0

add1

ADDD F2, F8, F0

6.0

8.0

mult1

SUBD F8, F2, F0

ADDD F6, F8, F6

1

2

3

ADDD

2.0

0.0

1

2

MULD

add1

2.0

FP adders

FP mult’s


Tomasulo cycle 21

Tomasulo – cycle 2

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

4.0

add1

ADDD F2, F8, F0

6.0

8.0

mult1

SUBD F8, F2, F0

ADDD F6, F8, F6

1

2

3

ADDD

2.0

0.0

1

2

MULD

add1

2.0

FP adders

FP mult’s


Tomasulo cycle 3

Tomasulo – cycle 3

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

4.0

add1

6.0

add2

8.0

mult1

ADDD F2, F8, F0

SUBD F8, F2, F0

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

add1

2.0

FP adders

FP mult’s


Tomasulo cycle 4

Tomasulo – cycle 4

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

4.0

add1

6.0

add2

8.0

add3

ADDD F2, F8, F0

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

add1

2.0

SUBD

2.0

0.0

FP adders

FP mult’s


Tomasulo cycle 5

Tomasulo – cycle 5

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

2.0

-

6.0

add2

8.0

add3

ADDD F2, F8, F0

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

2.0

2.0

SUBD

2.0

0.0

FP adders

FP mult’s

2.0 (add1 result)


Tomasulo cycle 6

Tomasulo – cycle 6

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

add1

2.0

-

6.0

add2

8.0

add3

1

2

3

ADDD

add3

0.0

1

2

ADDD

mult1

6.0

MULD

2.0

2.0

SUBD

2.0

0.0

FP adders

FP mult’s


Tomasulo cycle 8

Tomasulo – cycle 8

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

add1

2.0

-

6.0

add2

2.0

-

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

2.0

2.0

SUBD

2.0

0.0

FP adders

FP mult’s

2.0 (add3 result)


Tomasulo cycle 9

Tomasulo – cycle 9

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

add1

2.0

6.0

add2

2.0

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

2.0

2.0

FP adders

FP mult’s


Tomasulo cycle 12

Tomasulo – cycle 12

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

-

2.0

6.0

add2

2.0

1

2

3

ADDD

2.0

0.0

1

2

ADDD

mult1

6.0

MULD

2.0

2.0

FP adders

FP mult’s

2.0 (add1 result)


Tomasulo cycle 15

Tomasulo – cycle 15

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

-

2.0

6.0

add2

2.0

1

2

3

1

2

ADDD

4.0

6.0

MULD

2.0

2.0

FP adders

FP mult’s

4.0 (mult1 result)


Tomasulo cycle 16

Tomasulo – cycle 16

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

-

2.0

6.0

add2

2.0

1

2

3

1

2

ADDD

4.0

6.0

FP adders

FP mult’s


Tomasulo cycle 19

Tomasulo – cycle 19

Instruction Queue

ADDDF4, F2, F0

MULDF8, F4, F2

ADDDF6, F8, F6

SUBDF8, F2, F0

ADDDF2, F8, F0

F0

F2

F4

F6

F8

0.0

2.0

2.0

10.0

-

2.0

1

2

3

1

2

ADDD

4.0

6.0

FP adders

FP mult’s

10.0 (add2 result)


Tomasulo summary

Tomasulo Summary

  • Prevents Register as bottleneck

  • Avoids WAR, WAW hazards

  • Lasting Contributions

    • Dynamic scheduling

    • Register renaming (in what way does the register name change?)

    • Load/store disambiguation


Limitations

Limitations

  • Exceptions/interrupts

    • Can’t identify a particular point in the program at which an interrupt/exception occurs

    • How do you know where to go back to after an interrupt handler completes?

    • OOO completion???

  • Interaction with pipelined ALUs

    • Reservation station couldn’t be released until instruction completes, would need many reservation stations.


  • Login