## ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Presentation Transcript

**ECE 4100/6100 Advanced Computer Architecture, Lecture 15: Static Scheduling Machines**

Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

**Static Scheduling**

- The compiler performs instruction scheduling
- VLIW: Very Long Instruction Word
  - An alternative to dynamically scheduled processors
  - Packs multiple operations into one instruction
- Moves scheduling to the compiler (software approach)
- Can reduce the complexity of a hardware-based instruction scheduler
- Examples: Cydrome, Multiflow, EPIC

**Very Long Instruction Word (VLIW)**

- Relies on the compiler
- Simple hardware
- Dependencies are explicitly represented in the instructions
- The instruction window is, supposedly, much larger than a hardware scheduling window
  - How about loop boundaries?
  - How about function boundaries?
  - Interprocedural optimization is generally difficult
- May lead to compatibility or performance issues if instruction latencies change
- EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW

**Intel Itanium ISA**

[Figure: bundle layout with fields labeled Instruction Slot 1, Instruction Slot 2, Instruction Slot 3, and Template.]

- Itanium instruction “bundle” (VLIW)
  - 128 bits each
  - Contains three Itanium instructions (a.k.a. syllables)
  - Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
  - A collection of independent bundles forms a “group” (delimited by stops)
- Each Itanium instruction
  - Fixed length: 41 bits
  - The left-most 4 bits (40-37) are the major opcode (e.g.
FP ld/st, INT ld/st, ALU)
  - Contains at most three 7-bit register specifiers
  - Contains a 6-bit field specifying one of the 64 one-bit qualifying predicate registers

[Figure: bit positions of the bundle fields, marked at 127, 86, 45, 5, 4, and 0.]

**Encoding Instruction Bundle**

`{ .mii  ld4 r28=[r8]  add r9=2,r1 ;;  add r30=1,r9 }`

- MI_I format, template encoded as “02”
- “;;” is the “stop bit” in assembly code; it separates dependent instructions
- Instructions between two “;;” belong to the same “instruction group”
- RAW and WAW dependencies are not allowed within an instruction group
- WAR is allowed except for a special case: writing p63 by a modulo-scheduled branch (e.g. br.ctop) after p63 has been read (e.g. as a qualifying predicate) by a B-type instruction
- Each instruction slot can represent one (out of 5) functional unit type based on the encoding (e.g. slot 0 can be an M-unit or a B-unit)
- 12 basic templates are provided, each in 2 versions depending on the stop bit
  - MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
  - MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_

**Itanium Instruction Example**

- `{ .mii  add r1 = r2, r3  sub r4 = r4, r5 ;;  shr r1 = r4, r1 ;; }`
- `{ .mmi  ld8 r2 = [r1] ;;  st8 [r1] = r23  tbit p1,p2 = r4, 5 }`
- `{ .mbb  ld8 r45 = [r55]  (p3) br.call b1 = func1  (p4) br.cond Label1 }`
- `{ .mfi  st4 [r45] = r6  fmac f1 = f2, f3  add r3 = r3, 8 ;; }`

**Itanium Register Files**

[Figure: the three register files.]

- General purpose registers: r0-r31 static, r32-r127 stacked (rotating)
- FP registers: f0-f31 static, f32-f127 rotating
- Predicate registers: p0-p15 static, p16-p63 rotating

**Register Stack Engine**

- Avoids spills/fills during function call/return
- The callee executes `alloc r1=ar.pfs, i, l, o, r` upon entering a function
- [Figure: a register frame sitting above the static registers r0-r31, starting at r32. Inputs and locals make up the size of locals (sol = i+l); the outputs complete the size of frame (sof); the size of rotating (sor) is also marked. Registers above the frame, up to r127, are illegal.]
- Current Frame Marker (CFM), 38 bits: rrb.pr, rrb.fr, rrb.gr, sor, sol, sof

**Function Call Example**

`main() { a = foo(i*i, b[i]); }` and `int foo(int ii, int bb) { }`

- [Figure: GPR of the caller (`main`), with frame r32 to r45, `i*i` in r44 and `b[i]` in r45; GPR of the callee (`foo`), with frame r32 to r38, where the same values appear as `i*i` in r32 and `b[i]` in r33.]
- Frame allocation in each function, starting with main:
`alloc r32=ar.pfs,0,12,2,0`; foo: `alloc r26=ar.pfs,2,5,0,0`

**RSE: A Function Call**

- [Figure: before the call, the caller's frame holds locals in r32-r45 and outputs in r46-r52, so CFM has sol = 14 and sof = 21, and PFS.pfm is undefined (x, x). After the call, the caller's outputs become r32-r38 of the callee, so CFM has sol = 0 and sof = 7, and PFS.pfm has sol = 14 and sof = 21.]
- pfm: previous frame marker

**RSE: Alloc**

- `alloc r32=ar.pfs,7,9,3,0`
- [Figure: after alloc, the callee's frame holds inputs and locals in r32-r47 and outputs in r48-r50, so CFM has sol = 16 and sof = 19; PFS.pfm still has sol = 14 and sof = 21.]
- alloc copies PFM to a GR (r32)

**RSE: Return**

- [Figure: on return, CFM is restored from PFS.pfm to sol = 14 and sof = 21; the caller again sees locals in r32-r45 and outputs in r46-r52.]

**Itanium Pipelines**

- [Figure: Itanium and Itanium 2 pipelines, annotated with: circuit improved; front-end; dependency scoreboard; stall checked here prior to EXE.]
- Performance improvement due to pipeline shortening: 4% to 6%
- The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2
- Inter-group latency is enforced by a scoreboard
  - Latency due to scheduling that failed to space instructions out
  - Latency due to cache misses

**Itanium 2 Eight-stage Pipeline**

- Core: IPG, ROT, EXP, REN, REG, EXE, DET, WB
- FP: FP1, FP2, FP3, FP4, WB
- L2: L2N, L2I, L2A, L2M, L2D, L2C, L2W

**Itanium 2 Microarchitecture**

[Figure: block diagram. Labeled blocks: L1 I-Cache and fetch/prefetch engine; I-TLB; IA-32 decode and control; branch prediction; instruction queue (8 bundles); 11 issue ports (B B B M M M M I I F F); register stack engine / remapping; branch and predicate registers; 128 INT registers; 128 FP registers; scoreboard, predicate, NaT, exceptions; branch units; INT and MM units; floating point units; ALAT; quad-port (INT) L1 PIPT data cache (WT); D-TLB; quad-port PIPT unified L2 cache (ECC); single-ported on-chip PIPT unified L3 cache (ECC); bus controller (ECC).]

**Control Speculation (Speculative Load)**

- Hides memory latency through control speculation at compile time
- Defers exceptions by setting NaT (a GR’s 65th bit), which indicates
  - Whether or not an exception has occurred
  - Branch to fixup code required
- NaT is set during ld.s and checked by chk.s
- [Figure: conventional architectures vs. Itanium. Conventional: instr 1, instr 2, . . ., br (a barrier), load, use. Itanium: ld.s, instr 1, instr 2, br, chk.s, use.]
- Elevates loads above a branch

**Control Speculation (Hoist Uses)**

- [Figure: IA-64 code sequence with ld.s, instr 1, instr 2, br, chk.s, and the use of the loaded value.]
- The uses of speculative data can also be executed speculatively
- This distinguishes speculation from simple prefetch
- The NaT bit propagates down the dependent instruction chain

**Control Speculation (Recovery)**

- All computation instructions propagate NaTs to their consumers, which reduces the number of checks
- Cmp propagates “false” when writing predicates if NaT is set (“0” for both target predicates)
- Code: `ld8.s r3 = (r9)`, `ld8.s r4 = (r10)`, `add r6 = r3, r4`, `ld8.s r5 = (r6)`, `p1,p2 = cmp(...)`, then `chk.s r5, recv` and `sub r7 = r5,r2`
- Recovery code (`recv`): ld8, ld8, add, ld8, br home
- Allows a single chk on the result

**Data Speculation (Advanced Loads)**

- The compiler can hoist a load above a preceding, possibly conflicting store
- The ALAT (Advanced Load Address Table) is used to check the address of every store in between
- Can be done by a superscalar machine using store coloring
- Figure, conventional architectures vs. Itanium: both columns run instr 1, instr 2, . . . (on Itanium, after an early `ld8.a`), followed by
`st8`; the conventional column then has `ld8` (the store acts as a barrier) and the use, while the Itanium column has `ld.c` and the use.

**Data Speculation (load.a + chk.a)**

- The compiler hoists a load and its subsequent consumers above a preceding, possibly conflicting store
- Recovery code must be patched in for mis-speculation
- With ld.c: `ld8.a r3=`, instr 1, instr 2, `st8`, `ld.c`, `add =r3,`
- With chk.a: `ld8.a r3=`, instr 1, `add =r3,`, instr 2, `st8`, `chk.a`, `L1:`
- Recovery code: `ld8 r3=`, `add =r3,`, `br L1`

**Parallel Compare Types**

- Three new types of compares:
  - and: both target predicates are set FALSE if the compare is false
  - or: both target predicates are set TRUE if the compare is true
  - DeMorgan: if true, sets one TRUE and the other FALSE
- Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4
- [Figure: compares A, B, C, and D; caption: reduces critical path.]

**Eight Queen Example (Unconditional Compares)**

`if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))`

- 8 queens control flow:
  - `R1=&b[j]`, `R3=&a[i+j]`, `R5=&c[i-j+7]`
  - `ld R2=[R1]`, `ld.s R4=[R3]`, `ld.s R6=[R5]`
  - `p1,p2=cmp.unc(R2==true)`
  - `(p1) chk.s R4`, `(p1) p3,p4=cmp.unc(R4==true)`
  - `(p3) chk.s R6`, `(p3) p5,p6=cmp.unc(R6==true)`
  - `(p5) br then`
  - `else`
- [Figure: control-flow graph with predicate pairs P1/P2, P3/P4, and P5/P6 leading to Then and Else; cycles labeled 1, 2, 4, 5, 6, 7.]
- Source: Crawford & Huck

**Eight Queen Example (Parallel Compares)**

`if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))`

- Code:
  - `R1=&b[j]`, `R3=&a[i+j]`, `R5=&c[i-j+7]`, `p1 <- true`
  - `ld R2=[R1]`, `ld R4=[R3]`, `ld R6=[R5]`
  - `p1,p2 <- cmp.and(R2==true)`, `p1,p2 <- cmp.and(R4==true)`, `p1,p2 <- cmp.and(R6==true)`
  - `(p1) br then`
  - `else`
- [Figure: control-flow graph in which the three compares collapse into a single test, P1 = true leading to Then and P1 = false leading to Else; cycles labeled 1, 2, 4, 5.]
- Reduced from 7 cycles to 5
- Source: Crawford & Huck

**More Example of Parallel Compare**

- Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates; this is also called the DeMorgan type (for complementary output)
- Parallel cmp.crel.and or cmp.crel.or write the same values to both predicates
- Source code: `if (c1 && c2 && c3 && c4) r1 = r2 + r3; else r4 = r5 - r6;`
- Itanium code:
  - `cmp.eq p1,p2 = r0,r0 ;;`
  - `cmp.ne.and.orcm p1,p2 = c1,r0`
  - `cmp.ne.and.orcm p1,p2 = c2,r0`
  - `cmp.ne.and.orcm p1,p2 = c3,r0`
  - `cmp.ne.and.orcm p1,p2 = c4,r0`
  - `(p1) add r1=r2,r3`
  - `(p2) sub r4=r5,r6`
- [Figure: control-flow graph testing c1, c2, c3, and c4 with then and else outcomes; cycles labeled 0, 1, 2.]

**Multiway Branches**

- Multiway branches: more than 1 branch in a single cycle
- Itanium allows multiple “consecutive” B instructions in the same instruction group
- Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
- Ordering matters if the branch predicates are not mutually exclusive
- E.g. the BBB template enables 3 branches in one bundle
- Without speculation: `ld8 r6 = (ra)`, `(p1) br exit1`, `ld8 r7 = (rb)`, `(p3) br exit2`, `ld8 r8 = (rc)`, `(p5) br exit3`
- Hoisting loads (3 branch cycles): `ld8 r6 = (ra)`, `ld8.s r7 = (rb)`, `ld8.s r8 = (rc)`, `(p1) br exit1`, `chk r7, rec1`, `(p3) br exit2`, `chk r8, rec2`, `(p5) br exit3`
- Multi-way branches (1 branch cycle): `ld8 r6 = (ra)`, `ld8.s r7 = (rb)`, `ld8.s r8 = (rc)`, `(p2) chk r7, rec1`, `(p4) chk r8, rec2`, `(p1) br exit1`, `(p3) br exit2`, `(p5) br exit3`

**Branch and Prefetch Hints**

- The compiler provides hints for the branch predictor through
  - A completer in branch instructions, e.g. br.call.sptk
    - 4 completer types for static and dynamic prediction: sptk, spnt, dptk, dpnt
  - Explicit brp instructions
- The compiler provides hints for sequential instruction prefetching
  - Use a completer in branch instructions, e.g. br.call.sptk.many
  - 2 completer types: many, few
  - The meanings of many and few are implementation-specific
- The compiler directs predictor allocation
  - For managing branch predictor resources
  - Use a completer in branch instructions, e.g.
br.call.sptk.many.none
  - 2 completer types: none, clr
  - none: don’t deallocate; clr: deallocate branch info

**Modulo Scheduling Support**

- Will be discussed next
- Itanium features that support modulo scheduling (software pipelining):
  - Full predication
  - Special branch handling features
    - br.ctop (for for-loops with a known loop count)
    - br.wtop (for while-loops)
  - Register rotation: removes loop copy overhead
    - No modulo variable expansion, tighter code
  - Predicate rotation/generation
    - Removes prologue and epilogue

**List Scheduling**

- Build the dependency graph
- Assign a priority of “0” to all operations having no successors
- Assign each remaining operation the sum of the priority and latency of its successor; with more than one successor, take the maximum
- Schedule instructions based on priority
- Example: `P = Mem[A++] + C1; Q = P * C2; Y = P * C3 + (P + Q) * (P * C3); Mem[B++] = Y;`
- Latency: Mem 1 cycle, adder 2 cycles, multiplier 2 cycles
- [Figure: dependency graph with priorities: X1 (ld) 11, A1 (+) 9, M1 (x) 7, M2 (x) 5, A2 (+) 5, M3 (x) 3, A3 (+) 1, X2 (st) 0; C1, C2, and C3 are constant inputs.]
- Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}

**List Scheduling (Reservation Table)**

- [Figure: the same dependency graph next to its reservation table.]
- LS (a heuristic) provides a near-optimal schedule
- But there is no guarantee of optimality, especially in terms of throughput

**Scheduling**

- If I want to use the same schedule, what is the minimum initiation interval?
- In the example, do I need to wait for 12 cycles?
- If not, how do I avoid collisions?

**Modulo Scheduling [RauGlaeser’81]**

- A.k.a.
“Polycyclic scheduling” or “software pipelining”
- Exploits ILP among loop iterations to maximize
  - Machine utilization
  - Throughput
- Uses a common schedule for the majority of iterations
- Overlaps execution of consecutive iterations
- Constant initiation rate: the Initiation Interval (II)
- The Minimum II (MII) yields an optimal schedule with maximum throughput
- Originally developed for polycyclic architectures (or horizontal architectures, later known as VLIW) at TRW/ESL

**Modulo Scheduling: Resource Constraint**

- The optimal schedule is constrained by the number of available resources
- Determine ResII (the resource-constrained minimum initiation interval): ResII = max over i of ⌈N(i) / C(i)⌉
- Successive iterations will be scheduled ResII cycles apart
- N(i) is the number of uses of resource i in the loop
- C(i) is the number of units of resource i

**Resource II**

- Assume 3 FUs
  - 1 adder with 2-cycle latency
  - 1 multiplier with 2-cycle latency
  - 1 memory unit with 1-cycle latency
- Determine MII = Resource II
- [Figure: the dependency graph of the list-scheduling example: X1 (ld), A1, M1, M2, A2, M3, A3, X2 (st), with constants C1, C2, and C3.]

**Modulo Reservation Table (MRT)**

- [Figure: new schedule for 1 iteration, next to the MRT.]

**Modulo Scheduled Loop**

- [Figure: prolog, followed by the kernel in steady state (MRT schedule).]
- [Figure: last kernel, followed by the epilog.]

**Another Modulo Schedule Example**

- Given 2 adders (1-cycle) and 1 multiplier (2-cycle)
- [Figure: dependency graph with inputs A, B, C, D, and E, adders A1, A2, and A3, multipliers M1 and M2, and output Z, next to the prolog, kernel, and epilog of the schedule.]
- MII = max(⌈3/2⌉, ⌈2/1⌉) = 2
- Modulo reservation table: the multiplier is fully utilized

**How to Perform Register Allocation?**

- We are overlapping multiple iterations in one schedule
- Example: iterations 1 to 5 are alive at the same time
- Registers from multiple iterations are live during the same period of time
- [Figure: MRT.]

**Modulo Variable Expansion**

- Analyze the “lifetime” of each architectural register
- Unroll the loop to enable the modulo schedule
- R5 needs to stay alive for 8 cycles = ⌈8/3⌉ = 3 MII (i.e.
unroll 3 times)
- Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)
- The cycle numbers assume WAR is allowed in the same cycle

**Post MVE code**

- [Figure: kernel, unrolled 3 times.]

**Register Allocation for MVE**

- To save registers, not every register needs to be expanded
- Calculate the lifetime of each register to determine whether a new register is needed across iterations (the formula assumes WAR within the same instruction bundle is allowed)
- # of copies = (U % n == 0) ? n : U, where n = ⌈lifetime/MII⌉ and U is the unroll degree (3 here, the same value as MII)
- R1 is alive for 1 cycle = ⌈1/3⌉ = 1 MII (needs 1 copy)
- R2 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (needs 3 copies since 3%2=1)
- R3 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (needs 1 copy)
- R4 is alive for 3 cycles = ⌈3/3⌉ = 1 MII (needs 1 copy)
- R5 is alive for 8 cycles = ⌈8/3⌉ = 3 MII (needs 3 copies)
- R6 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (needs 3 copies since 3%2=1)
- R7 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (needs 1 copy)
- 13 registers used, instead of 21 with the same unrolling degree

**MVE (reallocate registers)**

- [Figure: kernel, unrolled 3 times.]
- The cycle numbers assume WAR is allowed in the same cycle

**Final Modulo Schedule**

- [Figure: prolog code (12 instruction bundles), kernel (9 instruction bundles), epilog code (12 instruction bundles); branch instruction not shown.]

**Final Modulo Schedule (Reallocate Registers)**

- [Figure: prolog code (12 instruction bundles), kernel (9 instruction bundles), epilog code (12 instruction bundles); branch instruction not shown.]

**Issues with Modulo Variable Expansion**

- Many architectural registers are needed
- Code size grows as more unrolling is needed
- Alternative solution: a rotating register file
  - A hardware technique
  - Solves the problem without code duplication
  - Similar to register windows plus renaming: keep old iteration values on the stack (Itanium calls the hardware the Register Stack Engine, or RSE)

**Intention of Using Rotation Registers**

- Use exactly the same schedule (below) for everything, including
  - Kernel code
  - Prolog code
  - Epilog code
- The “registers” need to be re-allocated
- Registers
“rotate” per iteration!
- [Figure: the schedule; branch instruction not shown.]

**Idea of Rotation Register (Original Schedule)**

- In Intel Itanium, integer registers 32 to 127 are rotating registers

**Original Code Schedule**

- In Intel Itanium, integer registers 32 to 127 are rotating registers
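
The register rotation idea from the last slides can be illustrated with a small model. This is a minimal sketch, not Itanium's actual hardware: the class `RotatingRegisters` and its methods are hypothetical names, register numbers are offsets into the rotating region (offset 0 standing for r32), and the model assumes the rotating base is decremented once per loop iteration so that a value written under one name appears under the next higher name in the following iteration.

```python
class RotatingRegisters:
    """Toy model of a rotating register region (hypothetical, for illustration)."""

    def __init__(self, size):
        self.size = size              # number of rotating registers in the region
        self.rrb = 0                  # rotating register base, kept modulo size
        self.physical = [None] * size

    def _index(self, virtual):
        # Rename a virtual offset to a physical slot using the current base.
        return (virtual + self.rrb) % self.size

    def write(self, virtual, value):
        self.physical[self._index(virtual)] = value

    def read(self, virtual):
        return self.physical[self._index(virtual)]

    def rotate(self):
        # Assumed to happen once per iteration (e.g. at the loop-closing branch).
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingRegisters(8)
regs.write(0, "iteration 0")   # every iteration writes the same virtual name
regs.rotate()
regs.write(0, "iteration 1")
regs.rotate()
print(regs.read(2), "|", regs.read(1))
```

Because the renaming is done by the base register, the loop body can use the same register names in every iteration, which is what removes the need to unroll the kernel as modulo variable expansion does.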
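
The bundle format from the Intel Itanium ISA slide (128 bits, three 41-bit instructions, major opcode in bits 40-37) can be sketched as bit packing. The function names below are hypothetical. The sketch assumes a 5-bit template in bits 4-0 and slots at bits 45-5, 86-46, and 127-87, which matches the bit positions marked on the slide's figure; the slot contents are opaque integers, not real instruction encodings.

```python
SLOT_BITS = 41       # each Itanium instruction is 41 bits
TEMPLATE_BITS = 5    # 128 - 3 * 41 = 5 bits left for the template


def pack_bundle(template, slots):
    """Pack a template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots


def major_opcode(slot):
    """Return the left-most 4 bits (40-37) of a 41-bit instruction."""
    return (slot >> 37) & 0xF


# Template 0x02 is the MI_I encoding quoted on the encoding slide.
bundle = pack_bundle(0x02, [1, 2, 3])
print(unpack_bundle(bundle))
```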
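
The frame-marker bookkeeping on the three RSE slides (call, alloc, return) can be replayed with a few lines of arithmetic. This is a simplified sketch under stated assumptions: `Frame`, `call`, `alloc`, and `ret` are hypothetical names, only sol and sof are modeled, and rotation, spills, and fills are ignored. The numbers are the ones shown in the slide figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    sol: int  # size of locals: inputs + locals
    sof: int  # size of frame: inputs + locals + outputs


def call(cfm):
    """On a call, the caller's outputs become the callee's whole frame.

    Returns (new CFM for the callee, previous frame marker saved in PFS.pfm).
    """
    return Frame(sol=0, sof=cfm.sof - cfm.sol), cfm


def alloc(inputs, local_count, outputs):
    """Model of `alloc rX=ar.pfs,i,l,o,r`, ignoring the rotating size r."""
    return Frame(sol=inputs + local_count, sof=inputs + local_count + outputs)


def ret(pfm):
    """On return, the caller's frame marker is restored from PFS.pfm."""
    return pfm


caller = Frame(sol=14, sof=21)
callee, pfm = call(caller)       # CFM becomes sol=0, sof=7
callee = alloc(7, 9, 3)          # CFM becomes sol=16, sof=19
print(callee, ret(pfm))
```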
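
The priority rule from the List Scheduling slide can be checked mechanically. In the sketch below, the edges of `DAG` are an assumption: they are reconstructed from the four source statements on the slide rather than read off the figure, and the tie between A2 and M2 is broken alphabetically, which happens to match the slide's schedule. The function names are hypothetical.

```python
LATENCY = {"mem": 1, "add": 2, "mul": 2}   # latencies given on the slide

# operation -> (functional unit, successors); edges reconstructed from
# P = Mem[A++] + C1; Q = P * C2; Y = P * C3 + (P + Q) * (P * C3); Mem[B++] = Y
DAG = {
    "X1": ("mem", ["A1"]),              # load Mem[A++]
    "A1": ("add", ["M1", "M2", "A2"]),  # P = load + C1
    "M1": ("mul", ["A2"]),              # Q = P * C2
    "M2": ("mul", ["M3", "A3"]),        # P * C3
    "A2": ("add", ["M3"]),              # P + Q
    "M3": ("mul", ["A3"]),              # (P + Q) * (P * C3)
    "A3": ("add", ["X2"]),              # Y
    "X2": ("mem", []),                  # store Mem[B++]
}


def priorities(dag, latency):
    """Priority 0 for sinks; otherwise max over successors of priority + latency."""
    memo = {}

    def prio(op):
        if op not in memo:
            successors = dag[op][1]
            memo[op] = max(
                (prio(s) + latency[dag[s][0]] for s in successors), default=0
            )
        return memo[op]

    return {op: prio(op) for op in dag}


def priority_order(dag, latency):
    """Operations sorted by decreasing priority, ties broken by name."""
    p = priorities(dag, latency)
    return sorted(dag, key=lambda op: (-p[op], op))


print(priorities(DAG, LATENCY))
print(priority_order(DAG, LATENCY))
```

This only produces the priority order; placing operations into cycles against a reservation table is a separate step that the sketch does not attempt.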
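
The arithmetic on the resource-constraint and MVE register-allocation slides can be reproduced in a few lines. This is a sketch under two assumptions: ResII is taken as the maximum over resources of the number of uses divided by the number of units, rounded up, and the divisibility test in the copy-count rule is read as a test against the unroll degree (3 in the slides, the same value as the MII). Function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def res_ii(uses, units):
    """Resource-constrained minimum II: max over resources of ceil(N(i) / C(i))."""
    return max(ceil_div(uses[r], units[r]) for r in uses)


def mve_copies(lifetimes, ii):
    """Return (unroll degree, copies per register) for modulo variable expansion."""
    spans = {reg: ceil_div(life, ii) for reg, life in lifetimes.items()}
    unroll = max(spans.values())
    # Use the span itself when it divides the unroll degree, else the full degree.
    copies = {reg: (n if unroll % n == 0 else unroll) for reg, n in spans.items()}
    return unroll, copies


# First example: 3 adds, 3 multiplies, 2 memory operations on one unit each.
print(res_ii({"add": 3, "mul": 3, "mem": 2}, {"add": 1, "mul": 1, "mem": 1}))
# Second example: 3 adds on 2 adders, 2 multiplies on 1 multiplier.
print(res_ii({"add": 3, "mul": 2}, {"add": 2, "mul": 1}))

lifetimes = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}
unroll, copies = mve_copies(lifetimes, 3)
print(unroll, copies, sum(copies.values()))
```

With the lifetimes listed on the slide, the copy counts add up to the 13 registers quoted there, against 21 if all seven registers were given three copies each.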
