Instruction Level Parallelism


# Instruction Level Parallelism - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Instruction Level Parallelism' - diallo


Presentation Transcript

### Instruction Level Parallelism

*(Slide diagram: the single-issue pipeline with forwarding – PC → Instruction Cache → Register Bank → MUX → ALU → Data Cache.)*

### Instruction Level Parallelism

Instruction Level Parallelism (ILP)
• Suppose we have an expression of the form x = (a+b) * (c-d)
• Assuming a, b, c & d are in registers R2–R5, this might turn into

ADD R0, R2, R3

SUB R1, R4, R5

MUL R0, R0, R1

STR R0, x
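The sequence above can be sketched in C; the temporaries mirror the registers (the register assignments in the comments are illustrative, assuming a, b, c and d live in R2–R5):

```c
#include <assert.h>

/* x = (a+b) * (c-d): the ADD and the SUB have no data dependence on
   each other, so only the MUL (and the final store) must wait. */
int eval(int a, int b, int c, int d) {
    int t0 = a + b;    /* ADD R0, R2, R3 */
    int t1 = c - d;    /* SUB R1, R4, R5 */
    return t0 * t1;    /* MUL R0, R0, R1 - depends on both */
}
```

For example, with a=1, b=2, c=5, d=3 this yields (1+2)\*(5-3) = 6.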

ILP (cont)
• The MUL has a dependence on the ADD and the SUB, and the STR has a dependence on the MUL
• However, the ADD and SUB are independent
• In theory, we could execute them in parallel, even out of order

ADD R0, R2, R3

SUB R1, R4, R5

MUL R0, R0, R1

STR R0, x

The Data Flow Graph
• We can see this more clearly if we draw the data flow graph

*(Slide diagram: the data flow graph – R2 and R3 feed the ADD, R4 and R5 feed the SUB; both results feed the MUL, whose result is stored to x.)*

As long as R2, R3, R4 & R5 are available, we can execute the ADD and the SUB immediately – in either order, or in parallel.
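One way to make the graph concrete is to compute each instruction's depth in the data flow graph: independent instructions share a depth and can issue together. This is a minimal sketch using an invented three-operand `Instr` encoding (dest, src1, src2), not anything from the slides:

```c
#include <assert.h>

typedef struct { int dest, src1, src2; } Instr;

/* Depth of instruction i in the data flow graph: 1 more than the
   deepest older instruction that produces one of its sources.
   Instructions reading only initial register values have depth 1. */
int depth(const Instr *code, int i) {
    int d = 0;
    for (int j = 0; j < i; j++)
        if (code[j].dest == code[i].src1 || code[j].dest == code[i].src2) {
            int dj = depth(code, j);
            if (dj > d) d = dj;
        }
    return d + 1;
}
```

For the ADD/SUB/MUL example, ADD and SUB both come out at depth 1 (same cycle is possible) while MUL is at depth 2.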

Amount of ILP?
• This is obviously a very simple example
• However, real programs often have quite a few independent instructions which could be executed in parallel
• The exact amount is clearly program dependent, but analysis suggests that around 4 independent instructions is not uncommon, at least in parts of a program
How to Exploit?
• We need to fetch multiple instructions per cycle – wider instruction fetch
• We need to decode multiple instructions per cycle
• But the instructions must use a common register bank – they are logically the same registers
• We need multiple ALUs for execution
• But these must also access a common data cache
Dual Issue Pipeline Structure
• Two instructions can now execute in parallel
• (Potentially) double the execution rate
• Called a ‘Superscalar’ architecture

*(Slide diagram: the dual-issue pipeline – the PC fetches two instructions, I1 and I2, from the Instruction Cache; a shared Register Bank feeds two MUX/ALU pairs, which share the Data Cache.)*

Register & Cache Access
• Note that the access rate to both the registers and the cache will be doubled
• To cope with this we may need a dual-ported register bank and a dual-ported cache
• This can be done either by duplicating the access circuitry or even by duplicating the whole register and cache structure
Selecting Instructions
• To get the doubled performance out of this structure, we need to have independent instructions
• We can have a ‘dispatch unit’ in the fetch stage which uses hardware to examine the instruction dependencies and only issue two in parallel if they are independent
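The dispatch unit's pairing decision can be sketched as a register-overlap test – an illustration using an invented three-operand `Instr` encoding (dest, src1, src2), not a description of any real pipeline's logic. It conservatively refuses all RAW, WAW and WAR overlaps:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int dest, src1, src2; } Instr;

/* Allow a and b to issue in the same cycle only if b neither reads
   a's result (RAW), nor writes the same register (WAW), nor writes
   a register that a reads (WAR). */
bool can_dual_issue(Instr a, Instr b) {
    bool raw = (b.src1 == a.dest) || (b.src2 == a.dest);
    bool waw = (b.dest == a.dest);
    bool war = (b.dest == a.src1) || (b.dest == a.src2);
    return !(raw || waw || war);
}
```

A real dispatch unit does this comparison in hardware, on the decoded register fields, in parallel with decode.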
Instruction order

MUL R0,R1,R1

ADD R0,R0,R2

MUL R4,R3,R3

ADD R4,R4,R5

• Issued in pairs as above – (MUL, ADD) then (MUL, ADD)
• We wouldn't be able to issue either pair in parallel, because each ADD depends on the MUL before it
Compiler Optimisation
• But if the compiler had examined the dependencies and produced

MUL R0,R1,R1

MUL R4,R3,R3

ADD R0,R0,R2

ADD R4,R4,R5

• We can now execute the pairs in parallel (assuming appropriate forwarding logic)
Relying on the Compiler
• If the compiler can't manage to reorder the instructions, we still need hardware to avoid issuing conflicting pairs
• But if we could rely on the compiler, we could get rid of the expensive checking logic
• This is the principle of VLIW (Very Long Instruction Word)
• The compiler must add NOPs where it cannot fill an issue slot
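The NOP-insertion idea can be sketched as a tiny 2-wide bundle packer – an illustrative model, not a real VLIW encoder; the `Instr` encoding and the NOP representation are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int dest, src1, src2; } Instr;

#define NOP (-1)  /* hypothetical encoding: dest = -1 marks a NOP */

/* True if b neither reads nor writes anything a touches. */
static bool independent(Instr a, Instr b) {
    return b.src1 != a.dest && b.src2 != a.dest &&
           b.dest != a.dest && b.dest != a.src1 && b.dest != a.src2;
}

/* Pack straight-line code into 2-slot bundles, inserting a NOP in
   slot 2 whenever the next instruction depends on slot 1.
   Returns the number of bundles, i.e. issue cycles. */
int pack(const Instr *code, int n, Instr out[][2]) {
    int bundles = 0, i = 0;
    while (i < n) {
        out[bundles][0] = code[i];
        if (i + 1 < n && independent(code[i], code[i + 1])) {
            out[bundles][1] = code[i + 1];
            i += 2;
        } else {
            out[bundles][1] = (Instr){ NOP, NOP, NOP };
            i += 1;
        }
        bundles++;
    }
    return bundles;
}
```

On a dependent MUL/ADD/MUL/ADD ordering this packer needs 3 bundles (wasting a NOP slot); on the reordered MUL/MUL/ADD/ADD sequence it needs only 2 – the code-bloat and scheduling trade-off in miniature.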
Out of Order Execution
• There are arguments against relying on the compiler
• Legacy binaries – optimum code is tied to a particular hardware configuration
• ‘Code bloat’ in VLIW – useless NOPs waste instruction memory and fetch bandwidth
• Instead, rely on hardware to re-order instructions as necessary
• Complex but effective
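A toy model of why out-of-order issue helps: each cycle, issue every pending instruction whose older producers have completed. This is a sketch only – it ignores register renaming, issue-width limits and memory, and reuses an invented three-operand `Instr` encoding:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int dest, src1, src2; } Instr;

/* Count the cycles to drain a window of up to 16 instructions when
   any instruction may issue as soon as no older, still-pending
   instruction writes one of its sources (RAW) or its own dest (WAW). */
int cycles_ooe(const Instr *code, int n) {
    bool done[16] = { false };
    int cycles = 0, remaining = n;
    while (remaining > 0) {
        bool issue[16] = { false };
        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            bool ready = true;
            for (int j = 0; j < i; j++)   /* older instructions only */
                if (!done[j] && (code[j].dest == code[i].src1 ||
                                 code[j].dest == code[i].src2 ||
                                 code[j].dest == code[i].dest))
                    ready = false;
            issue[i] = ready;
        }
        for (int i = 0; i < n; i++)
            if (issue[i]) { done[i] = true; remaining--; }
        cycles++;
    }
    return cycles;
}
```

For the earlier x = (a+b)\*(c-d) sequence, the ADD and SUB issue together, so the model drains in 2 cycles instead of the 3 a strictly serial machine would take.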
Out of Order Execution
• Processor-level OOE can only happen once the instructions have been fetched into the processor

*(Slide diagram: the dual-issue pipeline again – I1 and I2 are fetched from the Instruction Cache into the two MUX/ALU pipes, sharing the Register Bank and Data Cache.)*
Programmer Assisted ILP / Vector Instructions
• Linear algebra operations such as vector product and matrix multiplication have LOTS of parallelism
• This parallelism can be hard to detect in languages like C
• The independent instructions can be too far apart for hardware detection
• Instead, the programmer can use explicitly parallel types such as float4
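A float4-style type can be sketched with the GCC/Clang vector extension (the `float4` typedef here is our own; shading and GPU languages provide float4 natively). Each lane-wise operation states four independent operations explicitly, so the hardware needs no dependence analysis to find them:

```c
#include <assert.h>

/* GCC/Clang extension: a 16-byte vector of 4 floats. */
typedef float float4 __attribute__((vector_size(16)));

/* Dot product: the four multiplies are explicitly independent;
   only the final horizontal sum is serial. */
float dot4(float4 a, float4 b) {
    float4 p = a * b;   /* 4 lane-wise multiplies, no dependences */
    return p[0] + p[1] + p[2] + p[3];
}
```

On typical targets the compiler maps the lane-wise multiply onto a single SIMD instruction rather than four scalar ones.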
Limits of ILP
• Modern processors are up to 4-way superscalar (but rarely achieve a 4x speedup)
• Not much beyond this, because of:
• Hardware complexity
• Limited amounts of ILP in real programs
• Limited ILP is not surprising: conventional programs are written assuming a serial execution model – so what next?