ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

Static Scheduling • Compiler performs instruction scheduling • VLIW: Very Long Instruction Word • An alternative to dynamically scheduled processors • Pack multiple operations into one instruction • Move scheduling to the compiler (software approach) • Can reduce the complexity of a hardware-based instruction scheduler • Examples: Cydrome, Multiflow, EPIC

Very Long Instruction Word (VLIW) • Rely on compilers • Simple hardware • Dependency is explicitly represented in the instructions • Instruction window, supposedly, is much larger than a hardware scheduling window • How about loop boundary? • How about function boundary? • Interprocedural optimization is generally difficult • Might lead to compatibility or performance issues if instruction latency changes • EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW

Intel Itanium ISA • Itanium Instruction “Bundle” (VLIW) • 128 bits each • Contains three Itanium instructions (aka syllables) • Template bits in each bundle specify dependencies both within a bundle and between sequential bundles • A collection of independent bundles forms a “group” (delimited by stops) • Each Itanium Instruction • Fixed-length, 41 bits long • Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU) • Contains at most three 7-bit register specifiers • Contains a 6-bit field for specifying one of the 64 one-bit qualifying predicate registers
[Figure: 128-bit bundle layout with three 41-bit instruction slots in bits 127-87, 86-46, and 45-5, and the template in bits 4-0]
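The bundle layout above can be sketched in Python. This is only an illustration under stated assumptions: the template sits in the lowest 5 bits with the three 41-bit slots above it in ascending order, and the qualifying predicate is the lowest 6 bits of an instruction. The helper names (`pack_bundle`, `unpack_bundle`, `major_opcode`, `qualifying_predicate`) are hypothetical and not part of any Itanium toolchain.

```python
SLOT_BITS = 41       # each Itanium instruction is 41 bits
TEMPLATE_BITS = 5    # template field of the bundle
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into a 128-bit int."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    bundle = template
    for i, insn in enumerate(slots):
        assert 0 <= insn <= SLOT_MASK
        bundle |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots

def major_opcode(insn):
    """Bits 40-37 of an instruction."""
    return (insn >> 37) & 0xF

def qualifying_predicate(insn):
    """Assumed position: the 6-bit predicate field in bits 5-0."""
    return insn & 0x3F

# Hypothetical instruction word: major opcode 8, qualified by p5.
example = (8 << 37) | 5
bundle = pack_bundle(0x02, [example, 0, 0])
```

The three slots plus the template account for exactly 3 x 41 + 5 = 128 bits, which is why no padding is needed.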

Encoding Instruction Bundle
    { .mii
      ld4 r28=[r8]
      add r9 = 2,r1;;
      add r30= 1,r9 }
    MI_I format: template encoded “02”
• Use “;;” as “stop bit” in assembly code to separate dependent instructions • Instructions between “;;” belong to the same “instruction group” • RAW and WAW are not allowed in the same instruction group • WAR is allowed except for a special case: writing p63 by a modulo-scheduled branch (e.g. br.ctop) after reading p63 (e.g. as qualifying predicate) by a B-type instruction • Each instruction slot can represent one (out of 5) functional unit types based on encoding (e.g. slot 0 can be M-unit or B-unit) • 12 basic templates provided, each with 2 versions depending on the stop bit (“_” marks a stop)
    Without a final stop: MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
    With a final stop: MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_
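The stop-bit rule can be illustrated with a small checker. This is a minimal sketch, assuming each instruction is reduced to a hypothetical `(dest, sources, stop)` tuple; it ignores predicates, memory dependences, and the architectural exceptions to the RAW/WAW rule.

```python
def split_groups(instructions):
    """Split (dest, sources, stop) tuples into instruction groups at each stop."""
    groups, current = [], []
    for dest, srcs, stop in instructions:
        current.append((dest, srcs))
        if stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def group_violations(group):
    """Return the RAW/WAW register conflicts found inside one instruction group."""
    written, problems = set(), []
    for dest, srcs in group:
        for s in srcs:
            if s in written:          # reads a register written earlier in the group
                problems.append(("RAW", s))
        if dest in written:           # writes a register already written in the group
            problems.append(("WAW", dest))
        if dest is not None:
            written.add(dest)
    return problems

# The bundle from the slide: ld4 r28=[r8]; add r9=2,r1 ;; add r30=1,r9
code = [("r28", ["r8"], False), ("r9", ["r1"], True), ("r30", ["r9"], False)]
groups = split_groups(code)
```

Dropping the stop after the first `add` would put the producer and consumer of `r9` in the same group, which the checker reports as a RAW conflict.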

Itanium Instruction Example
    { .mii
      add r1 = r2, r3
      sub r4 = r4, r5;;
      shr r1 = r4, r1;; }
    { .mmi
      ld8 r2 = [r1];;
      st8 [r1] = r23
      tbit p1,p2 = r4, 5 }
    { .mbb
      ld8 r45 = [r55]
      (p3)br.call b1=func1
      (p4)br.cond Label1 }
    { .mfi
      st4 [r45] = r6
      fmac f1=f2,f3
      add r3=r3, 8;; }

Itanium Register Files
[Figure: three register files. General Purpose Registers (64 bits, bits 63-0): r0-r31 static, r32-r127 stacked (rotating). FP Registers (82 bits, bits 81-0): f0-f31 static, f32-f127 rotating. Predicate Registers (1 bit): p0-p15 static, p16-p63 rotating]

Register Stack Engine • Avoid spills/fills during function call/return • Callee uses instruction alloc r1=ar.pfs, i, l, o, r upon entering a function
[Figure: register stack above the static registers r0-r31, starting at r32: locals (including inputs), covered by size of locals (sol = i+l); then outputs, up to size of frame (sof); registers above the frame, up to r127, are illegal; size of rotating (sor). Current Frame Marker (CFM), 38 bits, with fields rrb.pr, rrb.fr, rrb.gr, sor, sol, sof]

Function Call Example
    main(){ a=foo(i*i, b[i]); }
    int foo(int ii, int bb) { }
    main: alloc r32=ar.pfs,0,12,2,0
    foo:  alloc r26=ar.pfs,2,5,0,0
[Figure: GPR Caller (main): frame r32-r45, with outputs r44 = i*i and r45 = b[i]. GPR Callee (foo): the same values appear as inputs r32 = i*i and r33 = b[i], and the frame extends to r38]

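RSE: A Function Call • pfm: Previous frame marker
[Figure: caller frame with locals r32-r45 and outputs r46-r52 (sol = 14, sof = 21). After the call, the caller's outputs become the callee's r32-r38: CFM holds sol = 0, sof = 7, and PFS.pfm holds the caller's sol = 14, sof = 21]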

RSE: Alloc • alloc r32=ar.pfs,7,9,3,0 • alloc copies PFM to GR (r32)
[Figure: after alloc, the callee frame has inputs and locals in r32-r47 and outputs in r48-r50: CFM changes from sol = 0, sof = 7 to sol = 16, sof = 19, while PFS.pfm still holds sol = 14, sof = 21]

RSE: Return
[Figure: on return, CFM is restored from PFS.pfm to sol = 14, sof = 21, so the caller again sees locals r32-r45 and outputs r46-r52]
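The call, alloc, and return steps on the three RSE slides can be replayed with a toy model. This is a sketch under simplifying assumptions: only one saved frame marker is kept (real code saves ar.pfs into a general register so calls can nest), rotation is omitted, and the class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)

class RegisterStack:
    def __init__(self, sol, sof):
        self.cfm = Frame(sol, sof)   # current frame marker
        self.pfm = None              # previous frame marker (PFS.pfm)
        self.base = 0                # stack offset that r32 currently maps to

    def physical(self, reg):
        """Stack offset of logical register r<reg> (reg >= 32)."""
        return self.base + reg - 32

    def call(self):
        """br.call: save CFM; the caller's outputs become the callee's frame."""
        self.pfm = Frame(self.cfm.sol, self.cfm.sof)
        self.base += self.cfm.sol
        self.cfm = Frame(0, self.cfm.sof - self.cfm.sol)

    def alloc(self, i, l, o):
        """alloc: resize the current frame to i inputs, l locals, o outputs."""
        self.cfm = Frame(i + l, i + l + o)

    def ret(self):
        """br.ret: restore the caller's frame from the saved marker."""
        self.base -= self.pfm.sol
        self.cfm = Frame(self.pfm.sol, self.pfm.sof)

rs = RegisterStack(sol=14, sof=21)     # caller: r32-r45 locals, r46-r52 outputs
caller_first_output = rs.physical(46)
rs.call()
after_call = rs.cfm                    # Frame(sol=0, sof=7)
callee_first_input = rs.physical(32)
rs.alloc(7, 9, 3)
after_alloc = rs.cfm                   # Frame(sol=16, sof=19)
rs.ret()
after_return = rs.cfm                  # Frame(sol=14, sof=21)
```

The key point is that no data moves on a call: the caller's r46 and the callee's r32 name the same physical register, only the base of the window changes.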

Itanium Pipelines • Performance improvement due to pipeline shortening: 4% to 6% • The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2 • Inter-group latency is enforced by a scoreboard • Latency due to scheduling that failed to space instructions out • Latency due to cache misses
[Figure: Itanium and Itanium 2 pipelines, annotated “Ckt improved”, “Front-end”, “Dependency Scoreboard”, and “Stall checked here prior to EXE”]

Itanium 2 Eight-stage Pipeline
    Core: IPG ROT EXP REN REG EXE DET WB
    FP:   FP1 FP2 FP3 FP4 WB
    L2:   L2N L2I L2A L2M L2D L2C L2W

Itanium 2 Microarchitecture
[Figure: block diagram. Front end: L1 I-Cache & Fetch/Prefetch engine, I-TLB, Branch Prediction, IA-32 Decode & Control, Instruction Queue (8 bundles). Issue: 11 issue ports (B B B M M M M I I F F), Register stack engine / remapping. Registers: Branch & Predicate, 128 INT Registers, 128 FP Registers; Scoreboard, Predicate NaT, Exceptions. Execution: Branch Units, INT & MM Units, Floating Point Units, ALAT. Memory: D-TLB, Quad-port (INT) L1 PIPT Data Cache (WT); PIPT Unified L2 Cache, Quad-Port (ECC); On-chip PIPT Unified L3 Cache, Single-ported (ECC); Bus Controller (ECC)]

Control Speculation (Speculative Load) • To hide memory latency by control speculation at compile time • Elevate loads above a branch • Defer exceptions by setting NaT (GR’s 65th bit), which indicates: • Whether or not an exception has occurred • Branch to fixup code required • NaT set during ld.s, checked by chk.s
[Figure: Conventional Architectures: instr 1, instr 2, ..., br (barrier), load, use. Itanium: ld.s, instr 1, instr 2, br, chk.s, use]

Control Speculation (Hoist Uses) • The uses of speculative data can be executed speculatively • Distinguishes speculation from simple prefetch • NaT bit propagates down to the dependent instruction chain
[Figure: IA-64 code sequence with ld.s, instr 1, instr 2, br, chk.s, and the use of the loaded value]

Control Speculation (Recovery) • All computation instructions propagate NaTs to the consumers to reduce the number of checks • Cmp propagates “false” if NaT is set when writing predicates (“0” for both target predicates) • Allows a single chk on the result
    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5, recv
    sub r7 = r5,r2
    Recovery code (recv): ld8, ld8, add, ld8, br home
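The NaT propagation chain on this slide can be mimicked in a few lines of Python. This is a toy model, not the architecture's definition: a sentinel object stands for a register with its NaT bit set, and a hypothetical `deferred` set names the addresses on which a speculative load defers its exception.

```python
NAT = object()  # stands in for a register whose NaT bit is set

def ld_s(memory, addr, deferred):
    """Speculative load: defer a fault by returning NaT instead of raising."""
    if addr is NAT or addr in deferred:
        return NAT
    return memory[addr]

def add(a, b):
    """Computation propagates NaT from either source to the result."""
    return NAT if a is NAT or b is NAT else a + b

def run(memory, r9, r10, deferred=frozenset()):
    r3 = ld_s(memory, r9, deferred)       # ld8.s r3 = (r9)
    r4 = ld_s(memory, r10, deferred)      # ld8.s r4 = (r10)
    r6 = add(r3, r4)                      # add r6 = r3, r4
    r5 = ld_s(memory, r6, deferred)       # ld8.s r5 = (r6)
    recovered = False
    if r5 is NAT:                         # chk.s r5, recv
        # Recovery code: redo the chain with non-speculative loads.
        r5 = memory[memory[r9] + memory[r10]]
        recovered = True
    return r5, recovered

memory = {0: 4, 1: 6, 10: 99}
fast_path = run(memory, 0, 1)
slow_path = run(memory, 0, 1, deferred={1})
```

Because NaT flows from the first load through the add and into the last load, one check on `r5` covers all three speculative loads.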

Data Speculation (Advanced Loads) • Compiler can hoist a load prior to a preceding, possibly-conflicting store • ALAT (Advanced Load Address Table) is used for checking every store address in-between • Can be done by a superscalar machine using store coloring
[Figure: Conventional Architectures: instr 1, instr 2, ..., st8 (barrier), ld8, use. Itanium: ld8.a, instr 1, instr 2, st8, ld.c, use]

Data Speculation (load.a + chk.a) • Compiler hoists a load and its subsequent consumers prior to a preceding, possibly-conflicting store • Needs recovery code to patch up after mis-speculation
    Hoisting only the load (ld.c):
      ld8.a r3=
      instr 1
      instr 2
      st8
      ld.c
      add =r3,
    Hoisting the load and its use (chk.a):
      ld8.a r3=
      instr 1
      add =r3,
      instr 2
      st8
      chk.a
      L1:
    Recovery code: ld8 r3=, add =r3,, br L1
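The role of the ALAT in the two data-speculation slides can be sketched as follows. This is a toy model with assumptions: entries are keyed by a destination register name, addresses match only when exactly equal (real hardware compares address ranges and has limited capacity), and the class and method names are hypothetical.

```python
class ALAT:
    """Toy Advanced Load Address Table keyed by destination register."""

    def __init__(self):
        self.entries = {}  # register name -> address loaded by ld.a

    def ld_a(self, reg, addr, memory):
        """Advanced load: perform the load and remember its address."""
        self.entries[reg] = addr
        return memory[addr]

    def st(self, addr, value, memory):
        """Store: write memory and drop every entry with a matching address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, reg, addr, memory, current):
        """Check load: keep the speculated value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return current
        return memory[addr]

# No conflict: the intervening store goes to a different address.
mem1, alat1 = {100: 1, 200: 2}, ALAT()
r3 = alat1.ld_a("r3", 100, mem1)
alat1.st(200, 9, mem1)
no_conflict = alat1.ld_c("r3", 100, mem1, r3)

# Conflict: the store hits the speculatively loaded address.
mem2, alat2 = {100: 1, 200: 2}, ALAT()
r3 = alat2.ld_a("r3", 100, mem2)
alat2.st(100, 7, mem2)
conflict = alat2.ld_c("r3", 100, mem2, r3)
```

With `chk.a` instead of `ld.c`, a missing entry would branch to recovery code that also re-executes the hoisted consumers, rather than just reloading the value.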

Parallel Compare Types • Three new types of compares: • and: both target predicates set FALSE if compare is false • or: both target predicates set TRUE if compare is true • DeMorgan: if true, sets one TRUE, sets other FALSE • Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4 • Reduces critical path
[Figure: compares A, B, C, D evaluated as a serial chain versus in parallel]

Eight Queen Example: Unconditional Compares
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    ld R2=[R1]
    ld.s R4=[R3]
    ld.s R6=[R5]
    p1,p2=cmp.unc(R2==true)
    (p1)chk.s R4
    (p1)p3,p4=cmp.unc(R4==true)
    (p3)chk.s R6
    (p3)p5,p6=cmp.unc(R6==true)
    (p5) br then
    else
[Figure: 8 queens control flow over cycles 1 to 7: predicate pairs P1/P2, P3/P4, P5/P6 are resolved in sequence, leading to Then or Else]
Source: Crawford & Huck

Eight Queen Example: Parallel Compares
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]
    ld R4=[R3]
    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else
[Figure: control flow with the three compares evaluated in parallel; P1 = true leads to Then, P1 = false leads to Else]
Reduced from 7 cycles to 5
Source: Crawford & Huck

More Example of Parallel Compare • Parallel cmp.crel.and or cmp.crel.or write the same values to both predicates • Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates • Also called DeMorgan type (for complementary output)
    if (c1 && c2 && c3 && c4) r1 = r2 + r3; else r4 = r5 - r6

    Itanium Code
    cmp.eq p1,p2 = r0,r0;;
    cmp.ne.and.orcm p1,p2 = c1,r0
    cmp.ne.and.orcm p1,p2 = c2,r0
    cmp.ne.and.orcm p1,p2 = c3,r0
    cmp.ne.and.orcm p1,p2 = c4,r0
    (p1) add r1=r2,r3
    (p2) sub r4=r5,r6
[Figure: control flow testing c1, c2, c3, c4; any 0 leads to else, all 1 leads to then]
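The three parallel compare types can be written as small predicate-update functions. This is a sketch of the semantics as the slides describe them, with hypothetical function names; each function takes the current values of the two target predicates and the compare result and returns the new pair.

```python
def cmp_and(p1, p2, result):
    """and type: a false compare clears both targets; a true one leaves them alone."""
    return (p1, p2) if result else (False, False)

def cmp_or(p1, p2, result):
    """or type: a true compare sets both targets; a false one leaves them alone."""
    return (True, True) if result else (p1, p2)

def cmp_and_orcm(p1, p2, result):
    """DeMorgan type: p1 is ANDed with the result, p2 is ORed with its complement."""
    return (p1 and result, p2 or not result)

def if_all(conds):
    """Evaluate c1 && c2 && ... with compares that could all issue together."""
    p1, p2 = True, False                 # cmp.eq p1,p2 = r0,r0
    for c in conds:                      # the update order does not matter
        p1, p2 = cmp_and_orcm(p1, p2, bool(c))
    return p1, p2

then_taken = if_all([1, 1, 1, 1])   # (True, False): the add executes
else_taken = if_all([1, 0, 1, 1])   # (False, True): the sub executes
```

Since every compare can only clear `p1` or set `p2`, the final values are the same for any evaluation order, which is what lets the hardware issue the compares in the same cycle.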

Multiway Branches • Multiway branches: more than 1 branch in a single cycle • Itanium allows multiple “consecutive” B instructions in the same instruction group • Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units) • Ordering matters if branch predicates are not mutually exclusive • E.g. the BBB template enables 3 branches in one bundle
    w/o Speculation:
      ld8 r6 = (ra)
      (p1) br exit1
      ld8 r7 = (rb)
      (p3) br exit2
      ld8 r8 = (rc)
      (p5) br exit3
    Hoisting Loads (3 branch cycles):
      ld8 r6 = (ra)
      ld8.s r7 = (rb)
      ld8.s r8 = (rc)
      (p1) br exit1
      chk r7, rec1
      (p3) br exit2
      chk r8, rec2
      (p5) br exit3
    Multi-way Branches (1 branch cycle):
      ld8 r6 = (ra)
      ld8.s r7 = (rb)
      ld8.s r8 = (rc)
      (p2) chk r7, rec1
      (p4) chk r8, rec2
      (p1) br exit1
      (p3) br exit2
      (p5) br exit3
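Why ordering matters for non-exclusive predicates can be shown with a tiny model of one multiway branch. This is a sketch with a hypothetical function name; it assumes the first branch in program order whose predicate is true is the one taken.

```python
def multiway_branch(branches, fallthrough="next"):
    """Return the target of the first branch whose predicate is true."""
    for predicate, target in branches:
        if predicate:
            return target
    return fallthrough

# p3 and p5 are both true: program order decides, so exit2 is taken.
target = multiway_branch([(False, "exit1"), (True, "exit2"), (True, "exit3")])
```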

Branch and Prefetch Hints • Compiler provides hints for the branch predictor by • Completer in branch instructions, e.g. br.call.sptk • 4 completer types for static and dynamic predictions: sptk, spnt, dptk, dpnt • Explicit brp instructions • Compiler provides hints for sequential instruction prefetching • Use completer in branch instructions, e.g. br.call.sptk.many • 2 completer types: many, few • Many and few are implementation-specific • Compiler directs predictor allocation • For managing branch predictor resources • Use completer in branch instructions, e.g. br.call.sptk.many.none • 2 completer types: none, clr • none: don’t deallocate; clr: deallocate branch info

Modulo Scheduling Support • Will be discussed next • Itanium features support modulo scheduling (or software pipelining) • Full Predication • Special branch handling features • br.ctop (for for-loop with known loop count) • br.wtop (for while-loop) • Register rotation: removes loop copy overhead • No modulo variable expansion, tighter code • Predicate rotation/generation • Removes prologue & epilogue
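How a counted loop branch plus predicate rotation can remove explicit prolog and epilog code is easier to see in a simulation. The following is a simplified model of a br.ctop-style loop, not the architectural definition: it tracks only a loop count, an epilog count, and one stage predicate per pipeline stage, and the function name is hypothetical.

```python
def run_ctop_loop(iterations, stages):
    """Return the stage predicates seen on each pass through the kernel."""
    lc, ec = iterations - 1, stages          # loop count and epilog count
    preds = [True] + [False] * (stages - 1)  # only stage 1 is enabled at first
    trace = []
    while True:
        trace.append(list(preds))
        if lc > 0:                 # more source iterations to start
            lc, new_pred, taken = lc - 1, True, True
        elif ec > 1:               # draining: start no new iterations
            ec, new_pred, taken = ec - 1, False, True
        else:                      # pipeline empty: fall out of the loop
            new_pred, taken = False, False
        preds = [new_pred] + preds[:-1]      # predicate rotation
        if not taken:
            return trace

trace = run_ctop_loop(iterations=4, stages=3)
```

For 4 iterations and 3 stages the kernel runs 4 + 3 - 1 = 6 times: the first passes have only the early stages enabled (the prolog), the last passes only the late stages (the epilog), all using the same kernel code.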

List Scheduling • Build dependency graph • Assign a priority of “0” to all operations having no successors • Assign each remaining operation the sum of the priority and latency of its successor; if there is more than one successor, take the maximum • Schedule instructions based on priority
    P = Mem[A++] + C1;
    Q = P * C2;
    Y = P * C3 + (P + Q) * (P * C3);
    Mem[B++] = Y;
Latency: Mem: 1 cycle; Adder: 2 cycles; Multiplier: 2 cycles
[Figure: dependency graph with priorities: X1 (ld) 11, A1 (+) 9, M1 (x) 7, M2 (x) 5, A2 (+) 5, M3 (x) 3, A3 (+) 1, X2 (st) 0]
Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
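The priority rule can be checked against the slide's numbers with a short script. The dependence edges below are my reading of the four statements (for example, that `P * C3` is computed once as M2 and reused), so treat the graph as an assumption; the names `OPS`, `LATENCY`, and `priority` are illustrative only.

```python
LATENCY = {"mem": 1, "add": 2, "mul": 2}

# name: (functional unit, successors); ties keep this listing order
OPS = {
    "X1": ("mem", ["A1"]),              # ld Mem[A++]
    "A1": ("add", ["M1", "M2", "A2"]),  # P = ld + C1
    "M1": ("mul", ["A2"]),              # Q = P * C2
    "A2": ("add", ["M3"]),              # P + Q
    "M2": ("mul", ["M3", "A3"]),        # P * C3
    "M3": ("mul", ["A3"]),              # (P + Q) * (P * C3)
    "A3": ("add", ["X2"]),              # Y
    "X2": ("mem", []),                  # st Mem[B++]
}

def priority(op, memo=None):
    """Max over successors of (successor priority + successor latency); 0 for sinks."""
    memo = {} if memo is None else memo
    if op not in memo:
        succs = OPS[op][1]
        memo[op] = max(
            (priority(s, memo) + LATENCY[OPS[s][0]] for s in succs), default=0
        )
    return memo[op]

priorities = {op: priority(op) for op in OPS}
# sorted() is stable, so equal priorities (A2 and M2) keep their listing order.
schedule = sorted(OPS, key=lambda op: -priorities[op])
```

With these edges the script reproduces the priorities 11, 9, 7, 5, 5, 3, 1, 0 and the schedule order given on the slide.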

List Scheduling Reservation Table • LS (a heuristic) provides a near-optimal schedule • But no guarantee of optimality, especially in terms of throughput
[Figure: the dependency graph (X1 11, A1 9, M1 7, M2 5, A2 5, M3 3, A3 1, X2 0) next to its reservation table]

Scheduling • If I want to use the same schedule, what is the minimum initiation interval? • In the example, do I need to wait for 12 cycles? • If not, how do I avoid collision?

Modulo Scheduling [RauGlaeser’81] • A.k.a. “Polycyclic scheduling” or “Software pipelining” • Exploit ILP among loop iterations to maximize • Machine utilization • Throughput • Use a common schedule for the majority of iterations • Overlap execution of consecutive iterations • Constant initiation rate: Initiation Interval (II) • Minimum II (MII) generates an optimal schedule with maximum throughput • Originally developed for polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL

Modulo Scheduling: Resource Constraint • The optimal schedule is constrained by the number of available resources • Determine ResII (resource-constrained minimum initiation interval) • Successive iterations will be scheduled ResII cycles apart • ResII = max over all resources i of ⌈N(i) / C(i)⌉ • N(i) is the number of uses of resource i in a loop • C(i) is the number of available units of resource i

Resource II • Assume 3 FUs • 1 adder with 2-cycle latency • 1 mult with 2-cycle latency • 1 mem unit with 1-cycle latency • Determine MII = Resource II
[Figure: dependency graph of the loop body: X1 (ld), A1, M1, M2, A2, M3, A3, X2 (st), with constants C1, C2, C3]
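The resource bound for this loop can be computed directly. This is a minimal sketch assuming the usual definition, the maximum over resources of uses divided by available units, rounded up; the function name `res_mii` is hypothetical, and the operation counts are read off the dependency graph (3 adds, 3 multiplies, 2 memory operations).

```python
from math import ceil

def res_mii(uses, units):
    """Resource-constrained minimum II: max over resources of ceil(uses / units)."""
    return max(ceil(uses[r] / units[r]) for r in uses)

# This slide: A1, A2, A3 on 1 adder; M1, M2, M3 on 1 multiplier; X1, X2 on 1 mem unit.
loop_mii = res_mii({"add": 3, "mul": 3, "mem": 2},
                   {"add": 1, "mul": 1, "mem": 1})

# The later two-adder example: 3 adds on 2 adders, 2 multiplies on 1 multiplier.
other_mii = res_mii({"add": 3, "mul": 2}, {"add": 2, "mul": 1})
```

So a new iteration of this loop can start every 3 cycles at best, instead of waiting for the previous iteration's full list schedule to finish.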

Modulo Reservation Table (MRT)
[Figure: new schedule for 1 iteration and the corresponding MRT]
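One way to build such a table is a greedy pass over the operations in priority order. The sketch below is an illustration, not necessarily the schedule shown on the slide: it assumes one fully pipelined unit of each type (an operation occupies its unit for one cycle), uses my reading of the dependence graph, and simply delays an operation until its slot modulo II is free.

```python
LATENCY = {"mem": 1, "add": 2, "mul": 2}

# (name, unit, predecessors), listed in list-scheduling priority order
OPS = [
    ("X1", "mem", []),
    ("A1", "add", ["X1"]),
    ("M1", "mul", ["A1"]),
    ("A2", "add", ["A1", "M1"]),
    ("M2", "mul", ["A1"]),
    ("M3", "mul", ["A2", "M2"]),
    ("A3", "add", ["M2", "M3"]),
    ("X2", "mem", ["A3"]),
]

def modulo_schedule(ops, ii):
    """Greedy modulo scheduling: returns start cycles and the MRT."""
    unit_of = {name: unit for name, unit, _ in ops}
    start, mrt = {}, {}                 # mrt maps (unit, cycle mod II) -> op
    for name, unit, preds in ops:
        # Earliest cycle at which all operands are available.
        t = max((start[p] + LATENCY[unit_of[p]] for p in preds), default=0)
        tries = 0
        while (unit, t % ii) in mrt:    # slot already used in every iteration
            t += 1
            tries += 1
            if tries >= ii:
                raise ValueError(f"no free {unit} slot for {name} at II={ii}")
        start[name] = t
        mrt[(unit, t % ii)] = name
    return start, mrt

start, mrt = modulo_schedule(OPS, ii=3)
```

Each (unit, cycle mod II) cell holds at most one operation, which is exactly the condition that lets a new iteration start every II cycles without collisions.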

Modulo Scheduled Loop Prolog Kernel, steady state (MRT schedule)

Modulo Scheduled Loop Last kernel Epilog

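Another Modulo Schedule Example • Given 2 adders (1-cycle) & 1 multiplier (2-cycle) • MII = max(⌈3/2⌉, ⌈2/1⌉) = 2 • Multiplier is fully utilized
[Figure: dependency graph with inputs A, B, C, D, E and output Z: adds A1 and A2 (priority 3), multiplies M1 and M2 (priority 1), add A3 (priority 0); Modulo Reservation Table showing prolog, kernel (5x), and epilog]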

How to Perform Register Allocation? • We are overlapping multiple iterations into one schedule. • Example: iterations 1 to 5 are alive at the same time • Registers from multiple iterations are alive during a period of time
[Figure: MRT]

Modulo Variable Expansion • Analyze the lifetime of each architectural register • Unroll the loop to enable the modulo schedule • R5 needs to stay alive for 8 cycles = ⌈8/3⌉ = 3 MII (i.e. unroll 3 times) • Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2) • The cycle numbers assume WAR is allowed in the same cycle

Post MVE code Kernel (unrolled 3 times)

Register Allocation for MVE • To save registers, it might not be necessary to expand all registers • Calculate the lifetime of each register to determine if a new register is needed across iterations (the formula assumes WAR in the same instruction bundle is allowed) • # of copies = (U % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : U, where U is the unrolling degree (U = 3 here, which coincides with MII = 3) • R1 is alive for 1 cycle = ⌈1/3⌉ = 1 MII (need 1 copy) • R2 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1) • R3 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy) • R4 is alive for 3 cycles = ⌈3/3⌉ = 1 MII (need 1 copy) • R5 is alive for 8 cycles = ⌈8/3⌉ = 3 MII (need 3 copies) • R6 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1) • R7 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy) • 13 registers used, instead of 21 with the same unrolling degree
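The register count on this slide can be reproduced with a few lines of Python. This is a sketch of the simplified rule used here, under the assumption that the unrolling degree is the largest number of initiation intervals any register stays alive; the function name `mve_copies` is hypothetical.

```python
from math import ceil

def mve_copies(lifetimes, ii):
    """Return (unroll degree, copies per register) for modulo variable expansion."""
    # Number of initiation intervals each value stays alive.
    periods = {r: ceil(life / ii) for r, life in lifetimes.items()}
    unroll = max(periods.values())
    # A register needs q copies if q divides the unroll degree, else one per unrolled body.
    copies = {r: (q if unroll % q == 0 else unroll) for r, q in periods.items()}
    return unroll, copies

lifetimes = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}
unroll, copies = mve_copies(lifetimes, ii=3)
total = sum(copies.values())            # registers after selective expansion
naive = unroll * len(lifetimes)         # every register expanded in every copy
```

The divisibility test is what forces r2 and r6 up to 3 copies: 2 copies would not line up with a kernel that repeats every 3 iterations.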

MVE (reallocate registers) Kernel (unrolled 3 times) The cycle numbers assume WAR is allowed in the same cycle

Final Modulo Schedule Prolog Code (12 instruction bundles) 9 instruction bundles Epilog Code (12 instruction bundles) **Branch instruction not shown

Final Modulo Schedule (Reallocate Registers) Prolog Code (12 instruction bundles) 9 instruction bundles Epilog Code (12 instruction bundles) **Branch instruction not shown

Issues with Modulo Variable Expansion • Many architectural registers are needed • Code size gets bigger when more unrolling is needed • Alternative solution: Rotating register file • A hardware technique • Solves the problem without code duplication • Similar to register window plus renaming: keep old iteration values on the stack (Itanium calls the hardware Register Stack Engine or RSE)

Intention of Using Rotation Registers • Use exactly the same schedule (below) for all including • Kernel codes • Prolog codes • Epilog codes • The “registers” need to be re-allocated • Registers “rotate” per iteration!!! **Branch instruction not shown

Idea of Rotation Register (Original Schedule) In Intel Itanium, integer registers 32 – 127 are rotating registers

Original Code Schedule In Intel Itanium, integer registers 32 – 127 are rotating registers
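The effect of register rotation can be sketched with a small register file model. This is a toy illustration with assumptions: logical index 0 stands for the first rotating register, the rotation base is decremented once per loop iteration, and the class name is hypothetical.

```python
class RotatingRegisters:
    """Toy rotating register file: logical names shift by one per rotation."""

    def __init__(self, size):
        self.phys = [None] * size
        self.rrb = 0                    # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def read(self, logical):
        return self.phys[self._index(logical)]

    def rotate(self):
        """Done by the loop branch: what was register k is now register k+1."""
        self.rrb = (self.rrb - 1) % len(self.phys)

regs = RotatingRegisters(8)
regs.write(0, "iter0")       # iteration 0 writes the first rotating register
regs.rotate()
regs.write(0, "iter1")       # iteration 1 writes the same name, new physical register
regs.rotate()
older, newer = regs.read(2), regs.read(1)
```

Every iteration writes the same register name, yet values from earlier iterations stay reachable under higher names, so the kernel needs no unrolling and no copy instructions.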
