
Computer Architecture Principles
Dr. Mike Frank

CDA 5155, Summer 2003

Module #18, Scheduling: Basic Static Scheduling & Loop Unrolling; Introduction to Dynamic Scheduling


Scheduling, Part I

  • Basic Scheduling Concepts
  • Loop Unrolling

Basic Pipeline Scheduling
  • Basic idea: Reduce control and data stalls by reordering instructions to fill delay slots (e.g. after branch or load instructions), while maintaining program equivalence.
  • Which reorderings are legal depends on the data dependences within the program and on the pipeline latencies of the various instructions.
Why These Latencies?

Pipeline timing (stages IF ID EX ME WB; the FP adder occupies stages A1–A4):

    LD            IF ID EX ME WB
    SD               IF ID EX ME WB

    ADDD          IF ID A1 A2 A3 A4 ME WB
    (delay slot)     IF ID ...
    (delay slot)        IF ID ...
    SD                     IF ID EX ME WB

  • Load → store: latency = 0
  • Load → ALU op: latency = 1
  • FP ALU → FP ALU op: latency = 3
  • FP ALU → store: latency = 2

Here "latency" means the number of stall cycles that must separate the producing instruction from a dependent consumer.

Scheduling Schemes
  • Point: To reduce data hazards. Two types:
  • Static scheduling: (ch.4)
    • Done by compiler
    • Instructions reordered at compile-time to fill delay slots with useful instructions
    • Problems:
      • Some data dependences not known till run-time
      • Program binary code tied to pipeline implementation
  • Dynamic scheduling: (ch.3)
    • Done by the processor
    • Reorder instructions at execution time
Loop Scheduling / Unrolling Example
  • Source code (with x an array of doubles):
    • for(I=1000;I>0;I--) x[I]=x[I]+s;
  • Simple RISC assembly:
      Loop: LD   F0,0(R1)   ;F0 = array element
            ADDD F4,F0,F2   ;add s (in F2)
            SD   0(R1),F4   ;store result
            SUBI R1,R1,#8   ;next pointer
            BNEZ R1,Loop    ;loop till I = 0

(Some data dependencies shown)

Example Cont.
  • Execution without scheduling (cycle on which each instruction issues):

        Loop: LD   F0,0(R1)    1
              (stall)          2    ;load → ALU latency 1
              ADDD F4,F0,F2    3
              (stall)          4
              (stall)          5    ;FP ALU → store latency 2
              SD   0(R1),F4    6
              SUBI R1,R1,#8    7
              (stall)          8    ;ALU → branch latency 1
              BNEZ R1,Loop     9
              (stall)         10    ;branch delay 1

  • 10 cycles per iteration!
  • Only LD, ADDD, and SD are real work; SUBI and BNEZ are loop overhead, and half the cycles are stalls.
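The stall accounting above can be sketched as a small cycle counter. This is an illustrative model, not a course tool: the instruction types, the latency table, and the in-order-issue rule come from the slides, while the encoding and function name are my own assumptions.

```python
# Sketch: count issue cycles for the unscheduled loop body.
# LAT maps (producer type, consumer type) -> required stall cycles.
LAT = {("load", "fp_alu"): 1, ("fp_alu", "store"): 2, ("int_alu", "branch"): 1}

# (text, type, destination register or None, source registers)
BODY = [
    ("LD F0,0(R1)",   "load",    "F0", []),
    ("ADDD F4,F0,F2", "fp_alu",  "F4", ["F0"]),
    ("SD 0(R1),F4",   "store",   None, ["F4"]),
    ("SUBI R1,R1,#8", "int_alu", "R1", []),
    ("BNEZ R1,Loop",  "branch",  None, ["R1"]),
]

def issue_cycles(instrs, branch_delay=1):
    """In-order issue, at most one instruction per cycle, stalling on hazards."""
    produced = {}            # register -> (issue cycle of producer, producer type)
    cycle, sched = 0, []
    for text, typ, dest, srcs in instrs:
        cycle += 1           # at most one issue per cycle
        for s in srcs:
            if s in produced:
                pc, ptyp = produced[s]
                # consumer may issue only after the required stall cycles
                cycle = max(cycle, pc + 1 + LAT.get((ptyp, typ), 0))
        if dest is not None:
            produced[dest] = (cycle, typ)
        sched.append((text, cycle))
    return sched, cycle + branch_delay   # one branch-delay cycle at the end
```

Running `issue_cycles(BODY)` reproduces the issue cycles 1, 3, 6, 7, 9 and the total of 10 cycles per iteration.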

Example with Rescheduling

  Issued on cycle:

        Loop: LD   F0,0(R1)    1    ;real work
              SUBI R1,R1,#8    2    ;loop overhead
              ADDD F4,F0,F2    3    ;real work
              (stall)          4
              BNEZ R1,Loop     5    ;loop overhead
              SD   8(R1),F4    6    ;real work, in the branch delay slot

  • The SD offset becomes 8(R1) because the SUBI now decrements R1 before the store.
  • Note: Loop execution time is reduced to 6 cycles, only 60% of the original 10!

Example with Loop Unrolling
  • Note:
  • This is a 4-fold unroll; n-fold is possible.
  • SUBI & BNEZ needed 1/4 as often as previously.
  • Multiple offsets used.
  • Rescheduling has not yet been done; there will still be a lot of stalls.
  • But, use of different registers per unrolled iteration will ease subsequent rescheduling.

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          LD   F6,-8(R1)
          ADDD F8,F6,F2
          SD   -8(R1),F8
          LD   F10,-16(R1)
          ADDD F12,F10,F2
          SD   -16(R1),F12
          LD   F14,-24(R1)
          ADDD F16,F14,F2
          SD   -24(R1),F16
          SUBI R1,R1,#32
          BNEZ R1,Loop

28 clock cycles = 7 per element.
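At the source level, the 4-fold unroll corresponds to processing four array elements per loop test. A minimal sketch in Python (illustrative only; the function name and test values are mine, not the course's):

```python
def add_scalar_unrolled(x, s):
    """x[1..1000] += s, four elements per loop iteration (4-fold unroll).
    Works cleanly because 1000 is divisible by 4; otherwise a cleanup
    loop would be needed for the leftover elements."""
    i = 1000
    while i > 0:           # loop test + decrement now run once per 4 elements
        x[i]     += s      # each copy uses its own offset, like the
        x[i - 1] += s      # 0(R1), -8(R1), -16(R1), -24(R1) offsets above
        x[i - 2] += s
        x[i - 3] += s
        i -= 4             # mirrors SUBI R1,R1,#32 (4 doubles = 32 bytes)
    return x
```

One loop-overhead pair (test + decrement) is now amortized over four elements, which is exactly why SUBI and BNEZ are needed 1/4 as often.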

With Unrolling & Scheduling
  • Note:
  • LD/SD offsets depend on whether instructions are above or below SUBI.
  • No stalls! Only 14 cycles per (unrolled) iteration.
  • 3.5 cycles per array element! (10 / 3.5 ≈ 2.9× faster than the original)
  • Note that the number of overhead cycles per array element went from 7 to ½!
  • Would there be much speedup from further unrolling?

    Loop: LD   F0,0(R1)
          LD   F6,-8(R1)
          LD   F10,-16(R1)
          LD   F14,-24(R1)
          ADDD F4,F0,F2
          ADDD F8,F6,F2
          ADDD F12,F10,F2
          ADDD F16,F14,F2
          SD   0(R1),F4
          SD   -8(R1),F8
          SUBI R1,R1,#32
          SD   16(R1),F12
          BNEZ R1,Loop
          SD   8(R1),F16    ;in the branch delay slot

Eliminating Data Dependencies

Before: straightforward 4-fold replication, where each copy decrements R1, so every later copy data-depends on the preceding SUBI:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          LD   F6,0(R1)
          ADDD F8,F6,F2
          SD   0(R1),F8
          SUBI R1,R1,#8
          LD   F10,0(R1)
          ADDD F12,F10,F2
          SD   0(R1),F12
          SUBI R1,R1,#8
          LD   F14,0(R1)
          ADDD F16,F14,F2
          SD   0(R1),F16
          SUBI R1,R1,#8
          BNEZ R1,Loop

After: the intermediate SUBIs are folded into the load/store offsets, leaving a single pointer update and no data dependences on R1 between copies:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          LD   F6,-8(R1)
          ADDD F8,F6,F2
          SD   -8(R1),F8
          LD   F10,-16(R1)
          ADDD F12,F10,F2
          SD   -16(R1),F12
          LD   F14,-24(R1)
          ADDD F16,F14,F2
          SD   -24(R1),F16
          SUBI R1,R1,#32
          BNEZ R1,Loop

Eliminating Name Dependencies

Before: every copy reuses F0 and F4, creating name (anti- and output) dependences between copies:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          LD   F0,-8(R1)
          ADDD F4,F0,F2
          SD   -8(R1),F4
          LD   F0,-16(R1)
          ADDD F4,F0,F2
          SD   -16(R1),F4
          LD   F0,-24(R1)
          ADDD F4,F0,F2
          SD   -24(R1),F4
          SUBI R1,R1,#32
          BNEZ R1,Loop

After register renaming: each copy uses fresh registers, so only true data dependences remain:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          LD   F6,-8(R1)
          ADDD F8,F6,F2
          SD   -8(R1),F8
          LD   F10,-16(R1)
          ADDD F12,F10,F2
          SD   -16(R1),F12
          LD   F14,-24(R1)
          ADDD F16,F14,F2
          SD   -24(R1),F16
          SUBI R1,R1,#32
          BNEZ R1,Loop

(Antidependences on SUBI and true data dependences are not shown.)
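The renaming step can be done mechanically: walk the code in order, and whenever a destination register is written a second time, substitute a fresh register and redirect later readers to it. A hedged sketch (the tuple encoding, memory-slot names `m0`/`m1`, and register pool are my illustrative assumptions):

```python
def rename_registers(instrs, fresh_regs):
    """instrs: list of (dest, srcs) tuples; dest is None for stores/branches.
    A second write to the same architectural register gets a fresh name,
    removing output and antidependences while preserving true data flow."""
    fresh = iter(fresh_regs)
    current = {}                       # architectural name -> current name
    seen, out = set(), []
    for dest, srcs in instrs:
        srcs = tuple(current.get(s, s) for s in srcs)   # read renamed values
        new = dest
        if dest is not None:
            if dest in seen:           # reuse -> break the name dependence
                new = next(fresh)
            seen.add(dest)
            current[dest] = new
        out.append((new, srcs))
    return out

# Two copies of the loop body reusing F0/F4, as in the "before" code:
code = [("F0", ("m0",)), ("F4", ("F0", "F2")), (None, ("F4",)),
        ("F0", ("m1",)), ("F4", ("F0", "F2")), (None, ("F4",))]
renamed = rename_registers(code, ["F6", "F8", "F10", "F12"])
```

The second copy comes out using F6 and F8, matching the renamed version above.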

Eliminating Control Dependencies
  • Unrolling example, after loop replication, but before removing branches:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          BEQZ R1,Exit
          LD   F6,0(R1)
          ADDD F8,F6,F2
          SD   0(R1),F8
          SUBI R1,R1,#8
          BEQZ R1,Exit
          LD   F10,0(R1)
          ADDD F12,F10,F2
          SD   0(R1),F12
          SUBI R1,R1,#8
          BEQZ R1,Exit
          LD   F14,0(R1)
          ADDD F16,F14,F2
          SD   0(R1),F16
          SUBI R1,R1,#8
          BNEZ R1,Loop
    Exit:

(Not all control dependencies shown.)

Relaxing Control Dependence
  • Only two things must really be preserved:
    • Data flow (how a given result is produced)
    • Exception behavior
  • Some techniques remove the control dependence from instruction execution by instead conditionally ignoring the instruction's results.
    • Speculation (betting on branches, to fill delay slots)
      • Make instructions unconditional if no harm done
    • Speculative multiple-execution
      • Take both paths, invalidate results of one later
    • Conditional/predicated instructions (used in IA-64).
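Predication can be illustrated as if-conversion: the guarded operation always executes, and the predicate only selects whether its result is kept. A minimal sketch under that framing (the function names are mine, and the transformation is only safe here because the add cannot trap for these inputs):

```python
def with_branch(p, a, b):
    """Control dependence: the add executes only when p holds."""
    if p:
        return a + b
    return a

def if_converted(p, a, b):
    """Data dependence instead: the add always executes; the predicate
    merely selects which result is committed, as a predicated
    (conditional) instruction does in hardware."""
    t = a + b              # executed unconditionally
    return t if p else a   # predicate selects the committed result
```

Both functions compute the same answer; only the dependence structure differs, which is the point of the transformation.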
Loop-Level Parallelism (LLP)
  • Can use dependence analysis to determine whether all loop iterations may execute in parallel (e.g. on a vector machine).
  • A loop-carried dependence is a dependence between loop iterations.
    • If present, may sometimes prevent parallelization.
    • If absent, loop can be fully parallelized.
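The distinction shows up directly at source level: the lecture's loop touches each element independently, while a running-sum loop carries a value from one iteration to the next. A sketch with illustrative names:

```python
def no_carried_dependence(x, s):
    """Each iteration reads and writes only x[i]: iterations are
    independent, so they could run in parallel or in any order."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def carried_dependence(x):
    """x[i] uses the x[i-1] written by the previous iteration: a
    loop-carried dependence, so naive parallel or reordered execution
    would change the result."""
    for i in range(1, len(x)):
        x[i] = x[i] + x[i - 1]
    return x
```

Dependence analysis tries to prove a loop is in the first category before vectorizing or parallelizing it.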

Scheduling, Part II

Introduction to Dynamic Scheduling

Run-Time Data Dependencies
  • Are there any data dependences in this code?

      SW 100(R1),R6
      LW R7,36(R2)

  • Answer: It Depends!
    • Yes, but only when 100+R1 = 36+R2.
  • Can’t detect this at compile time!
    • Values of R1 and R2 may only be computable dynamically.
  • Processor could stall the LW after effective-address calculation, if addr. matches that of a previously-issued store not yet completed.

We may also have to worry about partially overlapping locations, e.g. a word store (SW) and a byte load (LB) that touch the same byte.
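The dynamic check amounts to an interval-overlap test on effective addresses once the register values are known; widening each access to its byte range also catches the partial-overlap case of a word store against a byte load. A sketch (function names and the 4-byte word size are illustrative assumptions):

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    """True iff the byte ranges [addr, addr+size) intersect."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

def sw_lw_conflict(r1, r2):
    """SW 100(R1) writes 4 bytes at 100+R1; LW R7,36(R2) reads 4 bytes
    at 36+R2. Only at run time, with R1 and R2 in hand, can the
    processor decide whether the load must wait for the store."""
    return ranges_overlap(100 + r1, 4, 36 + r2, 4)
```

For example, the two accesses conflict when R1 = 0 and R2 = 64 (both touch address 100), but not when R1 = R2 = 0.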

Why Out-of-Order Execution?
  • If an instruction is stalled, there’s no need to stall later instructions that aren’t dependent on any of the stalled instructions.
  • Example:

        DIVD F0,F2,F4    ;long-running
        ADDD F10,F0,F8   ;depends on the DIVD
        SUBD F12,F8,F14  ;independent of both
  • The ADDD is stalled before execution, but the SUBD can go ahead.
Splitting Instruction Decode
  • Single “Instruction Decode” stage split into 2 parts:
    • Instruction Issue
      • Determine instruction type
      • Check for structural hazards
    • Read Operands
      • Stall instruction until no data hazards
      • Read operands
      • Release instruction to begin execution
  • Need some sort of queue or buffer to hold instructions till their operands are ready.
  • Note: Out-of-order completion makes precise exception handling difficult!

(Diagram: the single Instruction Decode stage is split into Issue, followed by a queue, followed by Read Operands.)
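The split decode stage can be sketched as issuing in order into a window, out of which the oldest instruction whose operands are ready begins execution each cycle. This is an illustrative model only (the tuple encoding and latency values are assumptions), using the DIVD/ADDD/SUBD example from the previous slide:

```python
def start_order(instrs, latency, default_lat=1):
    """instrs: (name, dest, srcs) tuples, already issued in order into a
    window. Each cycle, the oldest instruction with ready operands starts
    executing; a stalled instruction does not block independent younger
    ones, giving out-of-order execution."""
    ready_at = {}                    # register -> cycle its value is ready
    window, order, clock = list(instrs), [], 0
    while window:
        for k, (name, dest, srcs) in enumerate(window):
            if all(ready_at.get(s, 0) <= clock for s in srcs):
                ready_at[dest] = clock + latency.get(name, default_lat)
                order.append(name)
                del window[k]
                break
        clock += 1                   # at most one start per cycle
    return order

prog = [("DIVD", "F0",  ("F2", "F4")),    # long-running
        ("ADDD", "F10", ("F0", "F8")),    # depends on DIVD
        ("SUBD", "F12", ("F8", "F14"))]   # independent of both
```

With a long DIVD latency, `start_order(prog, {"DIVD": 10})` starts the SUBD before the stalled ADDD; with uniform latencies the program order is preserved.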