# Computer Architecture Principles

Dr. Mike Frank


### Computer Architecture Principles, Dr. Mike Frank

CDA 5155, Summer 2003

Module #18, Scheduling: Basic Static Scheduling & Loop Unrolling, Introduction to Dynamic Scheduling

### Scheduling, Part I

• Basic Scheduling Concepts
• Loop Unrolling

Basic Pipeline Scheduling
• Basic idea: Reduce control & data stalls by reordering instructions to fill delay slots (e.g. after branch or load instructions), while maintaining program equivalence.
• Depends on data-dependencies within program, and pipeline latencies of various instructions.
Why these latencies?

```
LD            IF ID EX ME WB
SD            IF ID EX ME WB
ADDD          IF ID A1 A2 A3 A4 ME WB

ADDD          IF ID A1 A2 A3 A4 ME WB
(delay slot)     IF ID ...
(delay slot)        IF ID ...
SD                     IF ID EX ME WB
```

The FP add spends four cycles (A1–A4) in the FP unit, so a dependent store issued immediately afterward must wait two cycles before its ME stage can use the result.

• FP ALU op → another FP ALU op: latency = 3
• FP ALU op → store double: latency = 2
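The latency table can be read as "stall cycles a dependent instruction incurs if issued back-to-back with its producer." A minimal Python sketch of that reading (my own illustration, not from the slides; the load → FP ALU and ALU → branch entries come from the worked example on the later slides):

```python
LATENCY = {                      # (producer kind, consumer kind) -> stall cycles
    ("FP_ALU", "FP_ALU"): 3,     # from this slide
    ("FP_ALU", "STORE"):  2,     # from this slide
    ("LOAD",   "FP_ALU"): 1,     # used in the loop example below
    ("ALU",    "BRANCH"): 1,     # used in the loop example below
}

def stalls(producer, consumer, distance=1):
    """Stall cycles when the consumer issues `distance` instructions after
    the producer; each intervening instruction hides one cycle of latency."""
    return max(0, LATENCY.get((producer, consumer), 0) - (distance - 1))
```

Filling the gap with independent instructions is exactly what scheduling does: at `distance=3`, an FP ALU op followed by a store stalls zero cycles.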

Scheduling Schemes
• Point: To reduce data hazards. Two types:
• Static scheduling: (ch.4)
• Done by compiler
• Instructions reordered at compile-time to fill delay slots with useful instructions
• Problems:
• Some data dependences not known till run-time
• Program binary code tied to pipeline implementation
• Dynamic scheduling: (ch.3)
• Done by the processor
• Reorder instructions at execution time
Loop Scheduling / Unrolling Example
• Source code (with x an array of doubles):
• for(I=1000;I>0;I--) x[I]=x[I]+s;
• Simple RISC assembly:
```
Loop:  LD   F0,0(R1)    ;F0 = array element
       ADDD F4,F0,F2    ;add s (in F2)
       SD   0(R1),F4    ;store result
       SUBI R1,R1,#8    ;next pointer
       BNEZ R1,Loop     ;loop til I=0
```

(Some data dependencies shown)

Example Cont.
• Execution without scheduling:

```
                         Issued on cycle
Loop:  LD   F0,0(R1)     1
       stall             2    (load → FP ALU latency 1)
       ADDD F4,F0,F2     3
       stall             4
       stall             5    (FP ALU → store latency 2)
       SD   0(R1),F4     6
       SUBI R1,R1,#8     7
       stall             8    (ALU → branch latency 1)
       BNEZ R1,Loop      9
       stall             10   (branch delay 1)
```

• 10 cycles per iteration!
• Only the LD, ADDD, and SD do real work on the array elements.
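The cycle count above can be checked mechanically. A rough issue-cycle model (my own sketch, not the slides'; instruction kinds and the tuple format are illustrative assumptions):

```python
LATENCY = {("LOAD", "FP_ALU"): 1,    # latencies from the slides
           ("FP_ALU", "STORE"): 2,
           ("ALU", "BRANCH"): 1}
BRANCH_DELAY = 1

def cycles(schedule):
    """Issue-cycle count for an in-order schedule.
    schedule: list of (kind, dest_reg, source_regs) tuples."""
    produced = {}                        # reg -> (producer kind, issue cycle)
    clock = 0
    for kind, dest, srcs in schedule:
        clock += 1                       # at best, one issue per cycle
        for r in srcs:
            if r in produced:            # wait out the producer's latency
                pkind, pcycle = produced[r]
                clock = max(clock, pcycle + 1 + LATENCY.get((pkind, kind), 0))
        if dest is not None:
            produced[dest] = (kind, clock)
    return clock + BRANCH_DELAY          # branch delay slot goes unfilled here

UNSCHEDULED_LOOP = [
    ("LOAD",   "F0", ["R1"]),
    ("FP_ALU", "F4", ["F0", "F2"]),
    ("STORE",  None, ["F4", "R1"]),
    ("ALU",    "R1", ["R1"]),
    ("BRANCH", None, ["R1"]),
]
# cycles(UNSCHEDULED_LOOP) reproduces the 10 cycles per iteration shown above.
```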

Example with Rescheduling

```
                         Issued on cycle
Loop:  LD   F0,0(R1)     1
       SUBI R1,R1,#8     2
       ADDD F4,F0,F2     3
       stall             4
       BNEZ R1,Loop      5
       SD   8(R1),F4     6    (fills the branch delay slot)
```

• The SD's offset changes from 0 to 8 because the SUBI now executes before it.
• Note: loop execution time is reduced to only 60% of what it was originally (6 cycles vs. 10), with the same real work (LD, ADDD, SD) per iteration.

Example with Loop Unrolling
• Note:
• This is a 4-fold unroll; n-fold is possible.
• SUBI & BNEZ needed 1/4 as often as previously.
• Multiple offsets used.
• Rescheduling has not yet been done; there will still be a lot of stalls.
• But, use of different registers per unrolled iteration will ease subsequent rescheduling.

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

• 28 clock cycles = 7 cycles per element.
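Unrolling must preserve the loop's semantics while cutting overhead. A quick sketch of the transformation in plain Python (my own illustration of the `for(I=1000;I>0;I--) x[I]=x[I]+s;` source loop, using 0-based indices):

```python
def add_s(x, s):
    """The source loop: add s to every element of x."""
    for i in range(len(x)):
        x[i] += s
    return x

def add_s_unrolled(x, s):
    """Four-fold unrolled version; assumes the trip count is a multiple of 4
    (true for the slides' 1000-element array)."""
    assert len(x) % 4 == 0
    i = 0
    while i < len(x):
        x[i]     += s        # four copies of the loop body, so the
        x[i + 1] += s        # overhead work (test + pointer update)
        x[i + 2] += s        # runs only once per four elements
        x[i + 3] += s
        i += 4
    return x
```

An n-fold unroll works the same way; when the trip count is not a multiple of n, a short cleanup loop handles the remainder.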

With Unrolling & Scheduling
• Note:
• LD/SD offsets depend on whether instructions are above or below SUBI.
• No stalls! Only 14 cycles per (unrolled) iteration.
• 3.5 cycles per array element! (10/3.5 ≈ 2.9× faster than the original.)
• Note that the number of overhead cycles per array element went from 7 to ½!
• Would there be much speedup from further unrolling?

```
Loop:  LD   F0,0(R1)
       LD   F6,-8(R1)
       LD   F10,-16(R1)
       LD   F14,-24(R1)
       ADDD F4,F0,F2
       ADDD F8,F6,F2
       ADDD F12,F10,F2
       ADDD F16,F14,F2
       SD   0(R1),F4
       SD   -8(R1),F8
       SUBI R1,R1,#32
       SD   16(R1),F12     ;offset adjusted: below SUBI
       BNEZ R1,Loop
       SD   8(R1),F16      ;in branch delay slot
```

Eliminating Data Dependencies

Before (each iteration recomputes R1 with its own SUBI, creating a serial dependence chain through R1):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUBI R1,R1,#8
       LD   F6,0(R1)
       ADDD F8,F6,F2
       SD   0(R1),F8
       SUBI R1,R1,#8
       LD   F10,0(R1)
       ADDD F12,F10,F2
       SD   0(R1),F12
       SUBI R1,R1,#8
       LD   F14,0(R1)
       ADDD F16,F14,F2
       SD   0(R1),F16
       SUBI R1,R1,#8
       BNEZ R1,Loop
```

After (offsets folded into the loads and stores; a single SUBI per iteration):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```
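The payoff of folding the offsets is that the four addresses no longer form a serial chain. A small Python sketch of the two addressing patterns (mine, not the slides'; 8-byte doubles, base register value passed in):

```python
def addrs_serial(r1):
    """Before: each address is the previous SUBI's result, so no
    memory access can start before the SUBI ahead of it finishes."""
    addrs = []
    for _ in range(4):
        addrs.append(r1)
        r1 -= 8              # SUBI R1,R1,#8 between every load/store pair
    return addrs, r1

def addrs_independent(r1):
    """After: all four addresses come straight from the one base register,
    so they can be computed (and the accesses reordered) independently."""
    return [r1 - 8 * k for k in range(4)], r1 - 32
```

Both produce the same addresses and the same final pointer; only the dependence structure differs.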

Eliminating Name Dependencies

Before (registers F0 and F4 reused by every iteration, creating name dependences between iterations):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F0,-8(R1)
       ADDD F4,F0,F2
       SD   -8(R1),F4
       LD   F0,-16(R1)
       ADDD F4,F0,F2
       SD   -16(R1),F4
       LD   F0,-24(R1)
       ADDD F4,F0,F2
       SD   -24(R1),F4
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

After (register renaming: a fresh register pair per unrolled iteration):

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       LD   F6,-8(R1)
       ADDD F8,F6,F2
       SD   -8(R1),F8
       LD   F10,-16(R1)
       ADDD F12,F10,F2
       SD   -16(R1),F12
       LD   F14,-24(R1)
       ADDD F16,F14,F2
       SD   -24(R1),F16
       SUBI R1,R1,#32
       BNEZ R1,Loop
```

(Antidependences to SUBI and data dependences not shown.)
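The renaming step can be sketched as a tiny algorithm: give every write a fresh register and make every read see the most recent name. This is my own illustrative sketch (the tuple format and register names are assumptions, not from the slides):

```python
def rename(code, fresh_regs):
    """code: list of (op, dest, srcs) over architectural registers.
    Every write gets a fresh register from fresh_regs; reads are redirected
    to the latest name, so WAR/WAW (name) dependences disappear."""
    mapping = {}                                  # architectural -> current name
    fresh = list(fresh_regs)
    out = []
    for op, dest, srcs in code:
        srcs = [mapping.get(r, r) for r in srcs]  # rename sources first
        if dest is not None:
            mapping[dest] = fresh.pop(0)          # fresh destination per write
            dest = mapping[dest]
        out.append((op, dest, srcs))
    return out
```

Renaming two iterations that both write F0 and F4 yields four distinct destination registers, just as the right-hand code uses F0/F4 and F6/F8.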

Eliminating Control Dependencies
• Unrolling example, after loop replication, but before removing branches:

```
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F6,0(R1)
       ADDD F8,F6,F2
       SD   0(R1),F8
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F10,0(R1)
       ADDD F12,F10,F2
       SD   0(R1),F12
       SUBI R1,R1,#8
       BEQZ R1,Exit
       LD   F14,0(R1)
       ADDD F16,F14,F2
       SD   0(R1),F16
       SUBI R1,R1,#8
       BNEZ R1,Loop
Exit:
```

(Not all control dependencies shown.)

Relaxing Control Dependence
• Only two things must really be preserved:
• Data flow (how a given result is produced)
• Exception behavior
• Some techniques remove a control dependence from instruction execution by instead conditionally discarding the instruction's results:
• Speculation (betting on branches, to fill delay slots)
• Make instructions unconditional if no harm done
• Speculative multiple-execution
• Take both paths, invalidate results of one later
• Conditional/predicated instructions (used in IA-64).
Loop-Level Parallelism (LLP)
• Can use dependence analysis to determine whether all loop iterations may execute in parallel (e.g. on a vector machine).
• A loop-carried dependence is a dependence between loop iterations.
• If present, may sometimes prevent parallelization.
• If absent, loop can be fully parallelized.
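The two cases can be seen in miniature in Python (my own example, not the slides'): the slides' loop body `x[I]=x[I]+s` carries no dependence between iterations, so any iteration order gives the same result, whereas a body that reads the previous iteration's result does not.

```python
def independent_iters(x, s, order):
    """x[i] = x[i] + s: no loop-carried dependence, so the
    iterations can run in any order (or fully in parallel)."""
    for i in order:
        x[i] = x[i] + s
    return x

def carried_iters(y, s, order):
    """y[i] = y[i-1] + s: a loop-carried dependence, since each
    iteration reads the value the previous iteration wrote."""
    for i in order:
        y[i] = y[i - 1] + s
    return y
```

Running `independent_iters` forward and backward gives identical results; `carried_iters` does not, which is exactly why the dependence blocks parallelization.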

### Scheduling, Part II

Introduction to Dynamic Scheduling

Run-Time Data Dependencies
• Are there any data dependences in this code?

```
SW 100(R1),R6
LW R7,36(R2)
```

• Yes, but only when 100+R1 = 36+R2.
• Can’t detect this at compile time!
• Values of R1 and R2 may only be computable dynamically.
• Processor could stall the LW after effective-address calculation, if addr. matches that of a previously-issued store not yet completed.

We may also have to worry about partially overlapping locations, e.g. between a SW (store word) and a LB (load byte).
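The run-time check reduces to byte-range intersection. A minimal sketch (my own; it assumes SW writes 4 bytes and LB reads 1, and the function name is illustrative):

```python
def may_conflict(store_addr, store_size, load_addr, load_size):
    """True when the store's byte range [addr, addr+size) intersects
    the load's, so the load must wait for the store to complete."""
    return (store_addr < load_addr + load_size and
            load_addr < store_addr + store_size)

# With R1 = 0 and R2 = 64, the slide's SW 100(R1) and LW 36(R2) both
# touch bytes [100, 104), so the load may not bypass the store.
```

The range test also catches the mixed-width case: a 1-byte LB landing anywhere inside a 4-byte SW region conflicts, even though the two effective addresses differ.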

Why Out-of-Order Execution?
• If an instruction is stalled, there’s no need to stall later instructions that aren’t dependent on any of the stalled instructions.
• Example:

```
DIVD F0,F2,F4      ;long-running
ADDD F10,F0,F8     ;depends on DIVD
SUBD F12,F8,F14    ;independent of both
```
• The ADDD is stalled before execution, but the SUBD can go ahead.
Splitting Instruction Decode
• Single “Instruction Decode” stage split into 2 parts:
• Instruction Issue:
• Determine instruction type
• Check for structural hazards
• Read Operands:
• Stall instruction until no data hazards remain
• Release instruction to begin execution
• Need some sort of queue or buffer to hold instructions till their operands are ready.
• Note: Out-of-order completion makes precise exception handling difficult!
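The queue/buffer idea can be sketched in a few lines (my own illustration, not a real issue-queue design): instructions wait until their source operands are ready, and whichever is ready first starts, regardless of program order.

```python
from collections import deque

def run_when_ready(program, ready_regs):
    """program: (name, dest, srcs) tuples in program order.
    Each pass starts the first queued instruction whose operands are all
    ready, so later independent instructions can begin while earlier
    stalled ones keep waiting in the queue."""
    queue = deque(program)
    ready = set(ready_regs)
    started = []
    while queue:
        for instr in list(queue):
            name, dest, srcs = instr
            if all(r in ready for r in srcs):
                queue.remove(instr)
                ready.add(dest)          # result becomes available
                started.append(name)
                break
        else:
            break                        # nothing ready: stall here
    return started
```

With the DIVD example above: while the DIVD's result F0 is still outstanding, the dependent ADDD sits in the queue, but the independent SUBD starts immediately.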
