Limits of Instruction-Level Parallelism

Presentation by: Robert Duckles, CSE 520. Paper being presented: David W. Wall, Limits of Instruction-Level Parallelism, WRL Research Report, November 1993.


Presentation by: Robert Duckles

CSE 520

Paper being presented:

Limits of Instruction-Level Parallelism

David W. Wall

WRL Research Report, November 1993


What is ILP?

Instruction-level parallelism (ILP) exists among instructions that do not depend on each other and can therefore be executed in any order.

r1 := 0[r9]        r1 := 0[r9]
r2 := 17           r2 := r1 + 17
4[r3] := r6        4[r2] := r6
(has ILP)          (no ILP)

Superscalar machine: a machine that can issue multiple independent instructions in the same clock cycle.


Definition of parallelism

Parallelism = (number of instructions) / (number of cycles needed to execute them)

r1 := 0[r9]        r1 := 0[r9]
r2 := 17           r2 := r1 + 17
4[r3] := r6        4[r2] := r6
Parallelism = 3    Parallelism = 1
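As a rough illustration of this definition, the Python sketch below places each instruction in the earliest cycle allowed by its register inputs and divides the instruction count by the number of cycles used. The (writes, reads) encoding, the function name `parallelism`, the one-cycle latency, and the decision to ignore memory dependencies are all assumptions made for this example; they are not taken from the paper.

```python
def parallelism(instrs):
    """Instructions per cycle for a list of (writes, reads) register tuples.

    Assumptions: every instruction takes one cycle, only true
    (read-after-write) register dependencies are honored, and memory
    dependencies are ignored.
    """
    ready = {}       # register -> cycle in which its latest value is produced
    last_cycle = 0
    for writes, reads in instrs:
        # An instruction runs one cycle after its latest-produced input.
        cycle = 1 + max((ready.get(r, 0) for r in reads), default=0)
        for r in writes:
            ready[r] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(instrs) / last_cycle


# Left column: r1 := 0[r9]; r2 := 17; 4[r3] := r6
left = [(("r1",), ("r9",)), (("r2",), ()), ((), ("r3", "r6"))]
# Right column: r1 := 0[r9]; r2 := r1 + 17; 4[r2] := r6
right = [(("r1",), ("r9",)), (("r2",), ("r1",)), ((), ("r2", "r6"))]

print(parallelism(left))   # 3.0
print(parallelism(right))  # 1.0
```

The left sequence fits in a single cycle, while the right sequence forms a chain of three dependent instructions and needs three cycles.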


How much parallelism is there?

That depends on how hard you want to look for it.

Ways to increase ILP:

Register renaming

Branch prediction

Alias analysis

Indirect-jump prediction


Low estimate for ILP

Programs are made up of “basic blocks”: uninterrupted sequences of instructions with no branches.

On average, in typical applications, basic blocks are about 10 instructions long.

Each basic block has parallelism of around 3.


High estimate for ILP

If you look beyond a basic block, at the entire scope of a program, studies have shown that an “omniscient” scheduler can achieve parallelism greater than 1000 in some numerical applications.

“Omniscient” scheduling can be implemented by saving a trace of a program execution, and using an oracle to schedule it. The oracle knows what will happen, and thus can create a perfect execution schedule.

Practical, achievable ILP should be between 3 and 1000.


Types of dependencies

* True dependency - given the computations involved, the dependency must exist

* False dependency - dependency happens to exist as an artifact of the code generation engine. E.g., two independent values are allocated to the same register by the compiler.

r1 := 20[r4]                 r2 := r1 + r4
...                          ...
r2 := r1 + 1                 r1 := r17 - 1
(a) true data dependency     (b) anti-dependency

r1 := r2 * r3                if r17 = 0 goto L
                             ...
...                          r1 := r2 + r3
                             ...
r1 := 0[r7]                  L:
(c) output dependency        (d) control dependency
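The three data-dependency cases (a) to (c) can be told apart by comparing the registers each instruction reads and writes. The following Python sketch does that for a pair of instructions; the (writes, reads) encoding and the function name `dependencies` are hypothetical choices for this example, and control dependencies (case d) are not modeled.

```python
def dependencies(first, second):
    """Return the kinds of data dependency that keep `second` after `first`.

    Each instruction is a (writes, reads) pair of register-name tuples.
    """
    writes1, reads1 = set(first[0]), set(first[1])
    writes2, reads2 = set(second[0]), set(second[1])
    kinds = set()
    if writes1 & reads2:
        kinds.add("true")    # second reads a value that first produces
    if reads1 & writes2:
        kinds.add("anti")    # second overwrites a register first still reads
    if writes1 & writes2:
        kinds.add("output")  # both write the same register
    return kinds


# (a) r1 := 20[r4] ... r2 := r1 + 1
print(dependencies((("r1",), ("r4",)), (("r2",), ("r1",))))         # {'true'}
# (b) r2 := r1 + r4 ... r1 := r17 - 1
print(dependencies((("r2",), ("r1", "r4")), (("r1",), ("r17",))))   # {'anti'}
# (c) r1 := r2 * r3 ... r1 := 0[r7]
print(dependencies((("r1",), ("r2", "r3")), (("r1",), ("r7",))))    # {'output'}
```

Only the "true" kind reflects the computation itself; the "anti" and "output" kinds are the false dependencies described above.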


Register renaming

The compiler's register allocation algorithm can insert false dependencies by assigning unrelated values to the same register.

We can undo this damage by assigning each value to a unique register so that only true dependencies remain.

However, machines have a finite number of registers, so renaming cannot always eliminate every false dependency.
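A minimal sketch of the renaming idea, assuming an unlimited supply of fresh registers (which a real machine does not have): every written value gets a new name, and later reads are redirected to the most recent name. The `p0`, `p1`, ... naming scheme and the function name `rename` are made up for this example.

```python
from itertools import count


def rename(instrs):
    """Rename registers so that each written value has a unique register.

    `instrs` is a list of (writes, reads) register-name tuples in program
    order. Assumes an unbounded pool of fresh registers p0, p1, ...
    """
    fresh = count()
    current = {}    # original register -> most recent renamed register
    renamed = []
    for writes, reads in instrs:
        # Reads use whatever name currently holds the value they need.
        new_reads = tuple(current.get(r, r) for r in reads)
        new_writes = []
        for r in writes:
            current[r] = f"p{next(fresh)}"
            new_writes.append(current[r])
        renamed.append((tuple(new_writes), new_reads))
    return renamed


# Anti-dependency: r2 := r1 + r4 ... r1 := r17 - 1
print(rename([(("r2",), ("r1", "r4")), (("r1",), ("r17",))]))
# [(('p0',), ('r1', 'r4')), (('p1',), ('r17',))]
```

After renaming, the second instruction no longer writes a register that the first one reads, so the anti-dependency disappears. A true dependency survives, because the reader is redirected to the producer's new name.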


Alias analysis

We often have registers that point to a memory location or contain a memory offset. Can two memory pointers point to the same place in memory?

If so, there might be a dependency. We're not sure yet.

We can try to inspect pointer values at runtime to see if they point to overlapping memory.
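The runtime check can be sketched as a comparison of two address ranges. The function name `may_overlap`, the 4-byte access size, and the register values below are assumptions chosen only to illustrate the idea.

```python
def may_overlap(addr_a, size_a, addr_b, size_b):
    """True if byte ranges [addr_a, addr_a + size_a) and
    [addr_b, addr_b + size_b) share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


# Hypothetical runtime register values for a store to 4[r3]
# and a load from 0[r9].
regs = {"r3": 96, "r9": 100}
store_addr = regs["r3"] + 4
load_addr = regs["r9"] + 0

print(may_overlap(store_addr, 4, load_addr, 4))  # True: the accesses alias
print(may_overlap(100, 4, 104, 4))               # False: adjacent words
```

When the ranges overlap, the two memory accesses must keep their original order; when they do not, they can be scheduled independently.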


Limitations of branch prediction

We can correctly predict about 90% of branches by counting which direction each branch has recently taken and predicting the most common one.
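One common way to do this kind of counting is a two-bit saturating counter per branch. The slides do not say which scheme is meant, so the sketch below is an assumption; the class name `CounterPredictor`, the initial counter value, and the synthetic trace are invented for illustration.

```python
from collections import defaultdict


class CounterPredictor:
    """Two-bit saturating counter per branch address (assumed scheme)."""

    def __init__(self):
        # Counter values 0..3; 2 and 3 mean "predict taken".
        self.counters = defaultdict(lambda: 2)

    def predict(self, pc):
        return self.counters[pc] >= 2

    def update(self, pc, taken):
        c = self.counters[pc]
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


def accuracy(trace):
    """Fraction of (pc, taken) outcomes in `trace` predicted correctly."""
    predictor = CounterPredictor()
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# Synthetic loop branch: taken nine times, then not taken, repeated.
trace = [(0x40, i % 10 != 9) for i in range(100)]
print(accuracy(trace))  # 0.9
```

On this synthetic loop the predictor misses only the loop exit in each group of ten. That result describes this made-up trace only, not measured program behavior.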


Indirect-jump prediction

An indirect jump goes to an address that is not known at compile time, for example when the destination address is computed into a register at runtime.

This is often the case for "return" constructs, where the return address into the calling function is stored on the stack. In this case, we can do indirect-jump prediction.
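For returns, one plausible predictor is a small stack that mirrors calls and returns. The slides do not describe the mechanism, so the following is only a sketch under that assumption; the class name `ReturnStackPredictor`, the default depth of 16, and the addresses are hypothetical.

```python
class ReturnStackPredictor:
    """Predicts return targets with a bounded stack of return addresses."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # full: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # No prediction is available if the stack is empty.
        return self.stack.pop() if self.stack else None


predictor = ReturnStackPredictor()
predictor.on_call(0x104)   # outer call
predictor.on_call(0x208)   # nested call
print(hex(predictor.predict_return()))  # 0x208
print(hex(predictor.predict_return()))  # 0x104
```

Because calls and returns nest, the most recent call's return address is the best guess for the next return.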


Latency

Multi-cycle instructions can greatly decrease parallelism, because every dependent instruction must wait until the result is available.
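To see the effect, the sketch below gives each instruction its own latency and counts cycles until the last instruction completes. This way of counting cycles, the (writes, reads, latency) encoding, the function name, and the three-cycle load are assumptions for illustration only.

```python
def parallelism_with_latency(instrs):
    """Instructions per cycle for (writes, reads, latency) tuples.

    Cycles are numbered from 1. An instruction starting in cycle s with
    latency n occupies cycles s..s+n-1, and its result can be read from
    cycle s+n onward. Only true register dependencies are modeled.
    """
    ready = {}    # register -> first cycle in which its value can be read
    finish = 0
    for writes, reads, latency in instrs:
        start = max((ready.get(r, 1) for r in reads), default=1)
        for r in writes:
            ready[r] = start + latency
        finish = max(finish, start + latency - 1)
    return len(instrs) / finish


# r1 := 0[r9]; r2 := r1 + 17; 4[r2] := r6
unit = [(("r1",), ("r9",), 1), (("r2",), ("r1",), 1), ((), ("r2", "r6"), 1)]
slow_load = [(("r1",), ("r9",), 3), (("r2",), ("r1",), 1), ((), ("r2", "r6"), 1)]

print(parallelism_with_latency(unit))       # 1.0
print(parallelism_with_latency(slow_load))  # 0.6
```

With a three-cycle load at the head of the chain, the same three instructions take five cycles instead of three.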


Window size

The window size is the maximum number of instructions that can appear in the pending cycle list.
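A simple way to see why the window matters is to cut the trace into consecutive chunks of at most `window` instructions and schedule each chunk on its own. This "discrete window" simplification, the function name, and the (writes, reads) encoding are assumptions of this sketch rather than the model used in the slides.

```python
def parallelism_with_window(instrs, window):
    """Instructions per cycle when only `window` instructions are visible.

    The trace is split into consecutive chunks of at most `window`
    instructions; each chunk is scheduled independently with unit latency
    and true register dependencies only, and chunks run back to back.
    """
    total_cycles = 0
    for start in range(0, len(instrs), window):
        ready = {}    # register -> cycle (within the chunk) producing it
        depth = 0
        for writes, reads in instrs[start:start + window]:
            cycle = 1 + max((ready.get(r, 0) for r in reads), default=0)
            for r in writes:
                ready[r] = cycle
            depth = max(depth, cycle)
        total_cycles += depth
    return len(instrs) / total_cycles


# Four fully independent instructions, each writing a different register.
independent = [((f"r{i}",), ()) for i in range(4)]
print(parallelism_with_window(independent, 4))  # 4.0
print(parallelism_with_window(independent, 2))  # 2.0
print(parallelism_with_window(independent, 1))  # 1.0
```

Even for completely independent instructions, the achievable parallelism in this model cannot exceed the window size.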


Overall results


Conclusions: the ILP Wall

Even with “perfect” techniques, most real applications hit an ILP limit of around 20.

With reasonable, practical methods it is even worse: it is very difficult to get an ILP above 10.


Relationship to term project

Our term project is about optimization techniques for AMD64 Opteron/Athlon processors.

Maximizing ILP is essential to getting the most performance out of any processor.

Branch prediction, register renaming, etc., are all particularly relevant optimizations.
