# Computer Architecture Principles

Dr. Mike Frank


### Computer Architecture Principles, Dr. Mike Frank

CDA 5155, Summer 2003

Module #18, Scheduling: Basic Static Scheduling & Loop Unrolling, Introduction to Dynamic Scheduling

### Scheduling, Part I

• Basic Scheduling Concepts
• Loop Unrolling

Basic Pipeline Scheduling
• Basic idea: Reduce control & data stalls by reordering instructions to fill delay slots (e.g. after branch or load instructions), while maintaining program equivalence.
• Depends on data-dependencies within program, and pipeline latencies of various instructions.
Why these latencies?

```
LD            IF ID EX ME WB
SD            IF ID EX ME WB
ADDD          IF ID A1 A2 A3 A4 ME WB

ADDD          IF ID A1 A2 A3 A4 ME WB
(delay slot)     IF ID ...
(delay slot)        IF ID ...
SD                     IF ID EX ME WB
```

The FP add spends four cycles (A1–A4) in the FP unit, so a dependent store issued immediately afterward must wait two cycles before its ME stage can use the result.

• FP ALU op → another FP ALU op: latency = 3
• FP ALU op → store double: latency = 2
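The latency table can be read as "stall cycles a dependent instruction incurs if issued back-to-back with its producer." A minimal Python sketch of that reading (my own illustration, not from the slides; the load → FP ALU and ALU → branch entries come from the worked example on the later slides):

```python
LATENCY = {                      # (producer kind, consumer kind) -> stall cycles
    ("FP_ALU", "FP_ALU"): 3,     # from this slide
    ("FP_ALU", "STORE"):  2,     # from this slide
    ("LOAD",   "FP_ALU"): 1,     # used in the loop example below
    ("ALU",    "BRANCH"): 1,     # used in the loop example below
}

def stalls(producer, consumer, distance=1):
    """Stall cycles when the consumer issues `distance` instructions after
    the producer; each intervening instruction hides one cycle of latency."""
    return max(0, LATENCY.get((producer, consumer), 0) - (distance - 1))
```

Filling the gap with independent instructions is exactly what scheduling does: at `distance=3`, an FP ALU op followed by a store stalls zero cycles.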

Scheduling Schemes
• Point: To reduce data hazards. Two types:
• Static scheduling: (ch.4)
• Done by compiler
• Instructions reordered at compile-time to fill delay slots with useful instructions
• Problems:
• Some data dependences not known till run-time
• Program binary code tied to pipeline implementation
• Dynamic scheduling: (ch.3)
• Done by the processor
• Reorder instructions at execution time
Loop Scheduling / Unrolling Example
• Source code (with x an array of doubles):
• for(I=1000;I>0;I--) x[I]=x[I]+s;
• Simple RISC assembly:
```
Loop:  LD   F0,0(R1)    ;F0 = array element
       ADDD F4,F0,F2    ;add s (in F2)
       SD   0(R1),F4    ;store result
       SUBI R1,R1,#8    ;next pointer
       BNEZ R1,Loop     ;loop til I=0
```

(Some data dependencies shown)

Example Cont.
• Execution without scheduling:

```
                         Issued on cycle
Loop:  LD   F0,0(R1)     1
       stall             2    (load → FP ALU latency 1)
       ADDD F4,F0,F2     3
       stall             4
       stall             5    (FP ALU → store latency 2)
       SD   0(R1),F4     6
       SUBI R1,R1,#8     7
       stall             8    (ALU → branch latency 1)
       BNEZ R1,Loop      9
       stall             10   (branch delay 1)
```

• 10 cycles per iteration!
• Only the LD, ADDD, and SD do real work on the array elements.
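The cycle count above can be checked mechanically. A rough issue-cycle model (my own sketch, not the slides'; instruction kinds and the tuple format are illustrative assumptions):

```python
LATENCY = {("LOAD", "FP_ALU"): 1,    # latencies from the slides
           ("FP_ALU", "STORE"): 2,
           ("ALU", "BRANCH"): 1}
BRANCH_DELAY = 1

def cycles(schedule):
    """Issue-cycle count for an in-order schedule.
    schedule: list of (kind, dest_reg, source_regs) tuples."""
    produced = {}                        # reg -> (producer kind, issue cycle)
    clock = 0
    for kind, dest, srcs in schedule:
        clock += 1                       # at best, one issue per cycle
        for r in srcs:
            if r in produced:            # wait out the producer's latency
                pkind, pcycle = produced[r]
                clock = max(clock, pcycle + 1 + LATENCY.get((pkind, kind), 0))
        if dest is not None:
            produced[dest] = (kind, clock)
    return clock + BRANCH_DELAY          # branch delay slot goes unfilled here

UNSCHEDULED_LOOP = [
    ("LOAD",   "F0", ["R1"]),
    ("FP_ALU", "F4", ["F0", "F2"]),
    ("STORE",  None, ["F4", "R1"]),
    ("ALU",    "R1", ["R1"]),
    ("BRANCH", None, ["R1"]),
]
# cycles(UNSCHEDULED_LOOP) reproduces the 10 cycles per iteration shown above.
```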

Example with Rescheduling

```
                         Issued on cycle
Loop:  LD   F0,0(R1)     1
       SUBI R1,R1,#8     2
       ADDD F4,F0,F2     3
       stall             4
       BNEZ R1,Loop      5
       SD   8(R1),F4     6    (fills the branch delay slot)
```

• The SD's offset changes from 0 to 8 because the SUBI now executes before it.
• Note: loop execution time is reduced to only 60% of what it was originally (6 cycles vs. 10), with the same real work (LD, ADDD, SD) per iteration.

Example with Loop Unrolling
• Note:
• This is a 4-fold unroll; n-fold is possible.
• SUBI & BNEZ needed 1/4 as often as previously.
• Multiple offsets used.
• Rescheduling has not yet been done; there will still be a lot of stalls.
• But, use of different registers per unrolled iteration will ease subsequent rescheduling.

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

• 28 clock cycles = 7 cycles per element.
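Unrolling must preserve the loop's semantics while cutting overhead. A quick sketch of the transformation in plain Python (my own illustration of the `for(I=1000;I>0;I--) x[I]=x[I]+s;` source loop, using 0-based indices):

```python
def add_s(x, s):
    """The source loop: add s to every element of x."""
    for i in range(len(x)):
        x[i] += s
    return x

def add_s_unrolled(x, s):
    """Four-fold unrolled version; assumes the trip count is a multiple of 4
    (true for the slides' 1000-element array)."""
    assert len(x) % 4 == 0
    i = 0
    while i < len(x):
        x[i]     += s        # four copies of the loop body, so the
        x[i + 1] += s        # overhead work (test + pointer update)
        x[i + 2] += s        # runs only once per four elements
        x[i + 3] += s
        i += 4
    return x
```

An n-fold unroll works the same way; when the trip count is not a multiple of n, a short cleanup loop handles the remainder.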

With Unrolling & Scheduling
• Note:
• LD/SD offsets depend on whether instructions are above or below SUBI.
• No stalls! Only 14 cycles per (unrolled) iteration.
• 3.5 cycles per array element! (10/3.5 ≈ 2.9× faster than the original.)
• Note that the number of overhead cycles per array element went from 7 to ½!
• Would there be much speedup from further unrolling?

```
Loop:  LD   F0,0(R1)
       LD   F6,-8(R1)
       LD   F10,-16(R1)
       LD   F14,-24(R1)
       ADDD F4,F0,F2
       ADDD F8,F6,F2
       ADDD F12,F10,F2
       ADDD F16,F14,F2
       SD   0(R1),F4
       SD   -8(R1),F8
       SUBI R1,R1,#32
       SD   16(R1),F12     ;offset adjusted: below SUBI
       BNEZ R1,Loop
       SD   8(R1),F16      ;in branch delay slot
```

Eliminating Data Dependencies

Before (each iteration recomputes R1 with its own SUBI, creating a serial dependence chain through R1):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUBI R1,R1,#8
       LD   F6,0(R1)
       ADDD F8,F6,F2
       SD   0(R1),F8
       SUBI R1,R1,#8
       LD   F10,0(R1)
       ADDD F12,F10,F2
       SD   0(R1),F12
       SUBI R1,R1,#8
       LD   F14,0(R1)
       ADDD F16,F14,F2
       SD   0(R1),F16
       SUBI R1,R1,#8
       BNEZ R1,Loop
```

After (offsets folded into the loads and stores; a single SUBI per iteration):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```
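The payoff of folding the offsets is that the four addresses no longer form a serial chain. A small Python sketch of the two addressing patterns (mine, not the slides'; 8-byte doubles, base register value passed in):

```python
def addrs_serial(r1):
    """Before: each address is the previous SUBI's result, so no
    memory access can start before the SUBI ahead of it finishes."""
    addrs = []
    for _ in range(4):
        addrs.append(r1)
        r1 -= 8              # SUBI R1,R1,#8 between every load/store pair
    return addrs, r1

def addrs_independent(r1):
    """After: all four addresses come straight from the one base register,
    so they can be computed (and the accesses reordered) independently."""
    return [r1 - 8 * k for k in range(4)], r1 - 32
```

Both produce the same addresses and the same final pointer; only the dependence structure differs.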

Eliminating Name Dependencies

Before (registers F0 and F4 reused by every iteration, creating name dependences between iterations):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F0,-8(R1)
       ADDD F4,F0,F2
       SD   -8(R1),F4
       LD   F0,-16(R1)
       ADDD F4,F0,F2
       SD   -16(R1),F4
       LD   F0,-24(R1)
       ADDD F4,F0,F2
       SD   -24(R1),F4
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

After (register renaming: a fresh register pair per unrolled iteration):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

(Antidependences to SUBI and data dependences not shown.)
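The renaming step can be sketched as a tiny algorithm: give every write a fresh register and make every read see the most recent name. This is my own illustrative sketch (the tuple format and register names are assumptions, not from the slides):

```python
def rename(code, fresh_regs):
    """code: list of (op, dest, srcs) over architectural registers.
    Every write gets a fresh register from fresh_regs; reads are redirected
    to the latest name, so WAR/WAW (name) dependences disappear."""
    mapping = {}                                  # architectural -> current name
    fresh = list(fresh_regs)
    out = []
    for op, dest, srcs in code:
        srcs = [mapping.get(r, r) for r in srcs]  # rename sources first
        if dest is not None:
            mapping[dest] = fresh.pop(0)          # fresh destination per write
            dest = mapping[dest]
        out.append((op, dest, srcs))
    return out
```

Renaming two iterations that both write F0 and F4 yields four distinct destination registers, just as the right-hand code uses F0/F4 and F6/F8.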

Eliminating Control Dependencies
• Unrolling example, after loop replication, but before removing branches:

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F6,0(R1)
       ADDD F8,F6,F2
       SD   0(R1),F8
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F10,0(R1)
       ADDD F12,F10,F2
       SD   0(R1),F12
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F14,0(R1)
       ADDD F16,F14,F2
       SD   0(R1),F16
       SUBI R1,R1,#8
       BNEZ R1,Loop
Exit:
```

(Not all control dependencies shown.)

Relaxing Control Dependence
• Only two things must really be preserved:
• Data flow (how a given result is produced)
• Exception behavior
• Some techniques remove a control dependence from instruction execution by instead conditionally discarding the instruction's results:
• Speculation (betting on branches, to fill delay slots)
• Make instructions unconditional if no harm done
• Speculative multiple-execution
• Take both paths, invalidate results of one later
• Conditional/predicated instructions (used in IA-64).
Loop-Level Parallelism (LLP)
• Can use dependence analysis to determine whether all loop iterations may execute in parallel (e.g. on a vector machine).
• A loop-carried dependence is a dependence between loop iterations.
• If present, may sometimes prevent parallelization.
• If absent, loop can be fully parallelized.
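The two cases can be seen in miniature in Python (my own example, not the slides'): the slides' loop body `x[I]=x[I]+s` carries no dependence between iterations, so any iteration order gives the same result, whereas a body that reads the previous iteration's result does not.

```python
def independent_iters(x, s, order):
    """x[i] = x[i] + s: no loop-carried dependence, so the
    iterations can run in any order (or fully in parallel)."""
    for i in order:
        x[i] = x[i] + s
    return x

def carried_iters(y, s, order):
    """y[i] = y[i-1] + s: a loop-carried dependence, since each
    iteration reads the value the previous iteration wrote."""
    for i in order:
        y[i] = y[i - 1] + s
    return y
```

Running `independent_iters` forward and backward gives identical results; `carried_iters` does not, which is exactly why the dependence blocks parallelization.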

### Scheduling, Part II

Introduction to Dynamic Scheduling

Run-Time Data Dependencies
• Are there any data dependences in this code?

```
SW 100(R1),R6
LW R7,36(R2)
```

• Yes, but only when 100+R1 = 36+R2.
• Can’t detect this at compile time!
• Values of R1 and R2 may only be computable dynamically.
• Processor could stall the LW after effective-address calculation, if addr. matches that of a previously-issued store not yet completed.

We may also have to worry about partially overlapping locations, e.g. between a SW (store word) and a LB (load byte).
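The run-time check reduces to byte-range intersection. A minimal sketch (my own; it assumes SW writes 4 bytes and LB reads 1, and the function name is illustrative):

```python
def may_conflict(store_addr, store_size, load_addr, load_size):
    """True when the store's byte range [addr, addr+size) intersects
    the load's, so the load must wait for the store to complete."""
    return (store_addr < load_addr + load_size and
            load_addr < store_addr + store_size)

# With R1 = 0 and R2 = 64, the slide's SW 100(R1) and LW 36(R2) both
# touch bytes [100, 104), so the load may not bypass the store.
```

The range test also catches the mixed-width case: a 1-byte LB landing anywhere inside a 4-byte SW region conflicts, even though the two effective addresses differ.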

Why Out-of-Order Execution?
• If an instruction is stalled, there’s no need to stall later instructions that aren’t dependent on any of the stalled instructions.
• Example:

```
DIVD F0,F2,F4      ;long-running
ADDD F10,F0,F8     ;depends on DIVD
SUBD F12,F8,F14    ;independent of both
```
• The ADDD is stalled before execution, but the SUBD can go ahead.
Splitting Instruction Decode
• Single “Instruction Decode” stage split into 2 parts:
• Instruction Issue:
• Determine instruction type
• Check for structural hazards
• Read Operands:
• Stall instruction until no data hazards remain
• Release instruction to begin execution
• Need some sort of queue or buffer to hold instructions till their operands are ready.
• Note: Out-of-order completion makes precise exception handling difficult!
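The queue/buffer idea can be sketched in a few lines (my own illustration, not a real issue-queue design): instructions wait until their source operands are ready, and whichever is ready first starts, regardless of program order.

```python
from collections import deque

def run_when_ready(program, ready_regs):
    """program: (name, dest, srcs) tuples in program order.
    Each pass starts the first queued instruction whose operands are all
    ready, so later independent instructions can begin while earlier
    stalled ones keep waiting in the queue."""
    queue = deque(program)
    ready = set(ready_regs)
    started = []
    while queue:
        for instr in list(queue):
            name, dest, srcs = instr
            if all(r in ready for r in srcs):
                queue.remove(instr)
                ready.add(dest)          # result becomes available
                started.append(name)
                break
        else:
            break                        # nothing ready: stall here
    return started
```

With the DIVD example above: while the DIVD's result F0 is still outstanding, the dependent ADDD sits in the queue, but the independent SUBD starts immediately.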
