
Iterative Modulo Scheduling



Presentation Transcript


  1. Iterative Modulo Scheduling CS 771 – Optimizing Compilers Fall 2005 – Lecture 16

  2. Homework #2 • Min – 84 Max – 108 CS 771, Fall 2005

  3. Project Proposals
     • 7 proposals for 14 people
     • 1-3 people per project
     • Jikes RVM: 5, GCC: 2
     • Topics
         • Security: 2
         • Fault tolerance: 1
         • Software test: 1
         • Compiler optimizations: 2
         • Architectural support: 1

  4. Project Proposals Continued…
     • Evaluation criteria (30 points)
         • Overall idea [8-10]
         • Demonstrated research potential [6-10]
         • Document flow/clarity/style [7-10]
     • Deadlines
         • Final presentations: last class or exam slot
         • Final paper (10 pages, double-column): one day before grades are due

  5. Midterm
     • Next Thursday (10/27) in class
     • Short answer and problems (just like homework)
     • Will review topics covered on Tuesday (10/25)
     • No LCM (high-level concepts are fair game)

  6. Upcoming Talks…
     • Today at 3:30 in MEC 205
         • Marc Levoy, “Light field photography and videography”
     • Tomorrow at 2:00 in Newcomb Hall South
         • Marc Levoy, “Digital Michelangelo & Digital Forma Urbis Romae”
     • Tomorrow at 3:30
         • My CS 696 talk on “Solving Tomorrow’s Computing Challenges using Symbiotic Optimization”

  7. Last Time … Global Scheduling
     • Speculation
     • Superblock scheduling
     • Software pipelining
     • Modulo scheduling

  8. Recall… Software Pipelining
     • Cycles for 50 iterations:
         • first iteration: 5 cycles; each following iteration: 2 additional cycles
         • 5 cycles + 49 * 2 cycles = 103 cycles
     • See Rau 92 for the general formula
     [Figure: schedule of one iteration (the building block, with operations L, +, S) and the overlapped schedule in which building blocks start 2 cycles apart]
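
The cycle count on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic shown above; the function and parameter names are made up for illustration and do not come from Rau's paper.

```python
def pipelined_cycles(iterations: int, schedule_length: int, ii: int) -> int:
    """Total cycles when one iteration takes `schedule_length` cycles
    and a new iteration is started every `ii` cycles."""
    if iterations <= 0:
        return 0
    # The first iteration pays the full schedule length;
    # every later iteration adds only `ii` cycles.
    return schedule_length + (iterations - 1) * ii


if __name__ == "__main__":
    # Slide example: 50 iterations, 5-cycle schedule, started 2 cycles apart.
    print(pipelined_cycles(50, 5, 2))
```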

  9. Modulo Scheduling
     • A very regular form of software pipelining
         • loop iterations use the same schedule
         • loop iterations are initiated at a constant rate
     • Advantages of modulo scheduling:
         • high performance
             • high throughput (steady-state performance)
             • short schedule length (transient-state performance)
         • simple form of software pipelining
             • efficient scheduling algorithm
             • simple bounds on throughput and register requirements
         • compact code
             • no code replication with sufficient hardware support

  10. Modulo Scheduling Concepts
     [Figure: dependence graph of the loop body (ld, +, *, -, st); its schedule over cycles 0-7 divided into Stage 0 to Stage 3, with SC = SL/II; a trace of iterations 0-3 over time 0-12 showing the Initiation Interval (II) between consecutive iterations; and the resulting Modulo Reservation Table (MRT)]

  11. Modulo Scheduling Concepts (cont.)
     • Initiation Interval (II):
         • number of cycles between two consecutive iterations
         • constant
     • Modulo Reservation Table (MRT):
         • compact table (II rows, 1 column per resource) to track resource usage
         • used to find a modulo schedule that satisfies the resource constraints of the machine
     • Prologue/Epilogue:
         • period of cycles required to fill/drain the software pipeline
         • (Stage count - 1) * II
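
A minimal Python sketch of a modulo reservation table, assuming each operation occupies one resource for a single cycle. The class and method names are hypothetical. The key point is that an operation scheduled at time t occupies row t mod II, so it conflicts with any operation placed a multiple of II cycles away.

```python
class ModuloReservationTable:
    """II rows, one column per resource (single-cycle reservations only)."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.slots = {(row, r): None for row in range(ii) for r in resources}

    def is_free(self, time, resource):
        # Time wraps around modulo II.
        return self.slots[(time % self.ii, resource)] is None

    def reserve(self, op, time, resource):
        if not self.is_free(time, resource):
            raise ValueError(f"{resource} is busy in row {time % self.ii}")
        self.slots[(time % self.ii, resource)] = op


def prologue_cycles(stage_count, ii):
    """Cycles needed to fill (or drain) the pipeline: (SC - 1) * II."""
    return (stage_count - 1) * ii


if __name__ == "__main__":
    mrt = ModuloReservationTable(ii=2, resources=["mem", "alu"])
    mrt.reserve("ld", 0, "mem")
    print(mrt.is_free(4, "mem"))   # time 4 maps to the same row as time 0
    print(mrt.is_free(5, "mem"))   # time 5 maps to the other row
    print(prologue_cycles(stage_count=4, ii=2))
```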

  12. Iterative Modulo Scheduling Approach
     Overview of algorithm: [Rau, MICRO 92]
     • Compute lower bound on II (MII for Minimum II)
         • due to resources (ResMII for Resource MII)
         • due to latencies or recurrences (RecMII for Recurrence MII)
     • Try to find a schedule for II = MII
     • If the attempt fails, try again with a larger II

  13. Minimum II due to Resources (ResMII)
     • Compute ResMII
         • max among all resources of: ceiling(# of uses of the resource / # of available units of the resource)
     [Figure: dependence graph (a, b); reservation tables for a and b over resources r0-r3 and times t0-t3; combined table for iterations 1, 2, 3. Due to resources, iterations cannot be initiated less than 2 cycles apart.]
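
The ResMII bound above can be written as a short Python function. This is a sketch: the resource names and usage counts in the example are invented, not read off the slide's figure.

```python
def res_mii(uses, available):
    """Resource-constrained lower bound on II.

    uses[r]      = cycles of resource r needed by one loop iteration
    available[r] = number of units of resource r in the machine
    """
    # Ceiling division per resource, then the maximum over all resources.
    # II can never be below 1, hence the default.
    return max((-(-uses[r] // available[r]) for r in uses), default=1)


if __name__ == "__main__":
    # Hypothetical loop body: r0 and r1 are each needed twice per iteration.
    uses = {"r0": 2, "r1": 2, "r2": 1, "r3": 1}
    available = {"r0": 1, "r1": 1, "r2": 1, "r3": 1}
    print(res_mii(uses, available))
```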

  14. Minimum II due to Resources (ResMII)
     • Dealing with alternatives:
         • one operation may be executed on multiple functional units with different resource usages
         • Example: "move r2 to r1": r1 = r2; r1 = r2 + 0; r1 = r2 * 1;
     • Algorithm (approximation of ResMII):
         order operations by increasing number of alternatives
         for (each operation, in order) do
           select alternative that yields the lowest ResMII
         end
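
The greedy approximation above might look like this in Python. It is a sketch under assumptions: each alternative is modeled as a dictionary from resource name to cycles used, ties are broken by taking the first alternative listed, and the machine in the example (two ALUs, one multiplier) is hypothetical.

```python
def pick_alternatives(ops, available):
    """ops: {op name: [alternative, ...]}, alternative = {resource: cycles used}.
    Returns (chosen alternative per op, approximate ResMII)."""
    usage = {r: 0 for r in available}
    chosen = {}

    def mii_with(extra):
        # ResMII if `extra` were added on top of the current usage.
        return max(-(-(usage[r] + extra.get(r, 0)) // available[r])
                   for r in available)

    # Operations with the fewest alternatives are committed first.
    for name in sorted(ops, key=lambda n: len(ops[n])):
        best = min(ops[name], key=mii_with)
        for r, cycles in best.items():
            usage[r] += cycles
        chosen[name] = best
    return chosen, mii_with({})


if __name__ == "__main__":
    ops = {
        "add": [{"alu": 1}],
        "mul": [{"mult": 1}],
        "move": [{"alu": 1}, {"mult": 1}],  # r1 = r2 + 0  or  r1 = r2 * 1
    }
    chosen, mii = pick_alternatives(ops, {"alu": 2, "mult": 1})
    print(chosen["move"], mii)
```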

  15. Recurrence Constraints
     • Dependences
         • Inter-iteration dependences
         • Intra-iteration dependences
         • Anti and output dependences are assumed to have been eliminated
     • Recurrence: an operation in one iteration depends on the same operation in a previous iteration
         • Direct or indirect
         • Data or control dependence
     • Distance: number of iterations separating the two dependent instructions (0 = same iteration)

  16. Minimum II due to Recurrences (RecMII)
     • Compute Recurrence Minimum II (RecMII):
         • Delay(c): sum of latencies along cycle c
         • Distance(c): sum of dependence distances along cycle c
         • smallest RecMII for which: RecMII * Distance(c) >= Delay(c), for all cycles c
         • equivalently, max among all cycles of: ceiling(Delay(c) / Distance(c))
     [Figure: two dependence graphs over operations a and b, each with its schedule; edges are labeled [x] for dependence distance, with labels [0] and [1] in one graph and [0] and [3] in the other]
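
A small Python sketch of the RecMII formula, assuming the dependence cycles have already been enumerated. Each cycle is given as a list of (latency, distance) edges; the latencies in the example are invented, since the slide's figure only shows the distances.

```python
def rec_mii(cycles):
    """cycles: list of dependence cycles, each a list of (latency, distance) edges.
    Returns max over cycles of ceil(Delay(c) / Distance(c))."""
    worst = 1  # assumption: report 1 when there are no recurrences
    for cycle in cycles:
        delay = sum(latency for latency, _ in cycle)
        distance = sum(dist for _, dist in cycle)
        if distance == 0:
            # A cycle within a single iteration can never be satisfied.
            raise ValueError("dependence cycle with zero total distance")
        worst = max(worst, -(-delay // distance))  # ceiling division
    return worst


if __name__ == "__main__":
    # a -> b with distance 0, b -> a with distance 1, unit latencies (hypothetical).
    print(rec_mii([[(1, 0), (1, 1)]]))
    # Same latencies, but the back edge spans 3 iterations.
    print(rec_mii([[(1, 0), (1, 3)]]))
```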

  17. Effective Dependence Latency
     • What is the latency of a dependence:
         • def and use operation in same iteration: latency of operation
         • def operation in this iteration, use operation x iterations later: latency - x * II
     • Example: the effective latency between * and - is 4 - 1*2 = 2, because the dependence spans one iteration (distance [1], II = 2)
     [Figure: trace of iterations 0-3 over time 0-6 for the loop body ld, +, *, -, st, with the distance-[1] dependence from * to - marked]
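
The effective-latency rule is a one-line computation. The sketch below uses hypothetical names and reproduces the slide's example (latency 4, distance 1, II = 2).

```python
def effective_latency(latency: int, distance: int, ii: int) -> int:
    """Minimum separation, within one iteration's schedule, between a def and
    a use that sits `distance` iterations later. Each iteration of distance
    already supplies II cycles of separation."""
    return latency - distance * ii


if __name__ == "__main__":
    print(effective_latency(latency=4, distance=1, ii=2))  # slide example
    print(effective_latency(latency=4, distance=0, ii=2))  # same iteration
```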

  18. Iterative Modulo Scheduler
     • greedy: fail & increase II
     • iterative: unschedule conflicting ops and reschedule them later
     [Figure: a greedy attempt at II=4 that fails and is retried at II=6, compared with an iterative attempt that succeeds at II=4]

  19. Iterative Modulo Scheduler (cont.)
     Algorithm:
       II := minimum feasible initiation interval;
       while (true) do
         initialize schedule and budget;
         while (not all operations scheduled and budget > 0) do
           op := highest priority operation;
           min-time := earliest scheduling time of op;
           max-time := min-time + II - 1;
           time-slot := find timeslot for op between min-time and max-time;
           schedule op at time-slot, unscheduling all conflicting ops;
           budget := budget - 1;
         od;
         if (scheduled all operations) then break fi;
         II := II + 1;
       od;
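
The pseudocode above can be turned into a small runnable Python sketch. Several simplifications are assumptions of this sketch rather than part of the slide: priority is simply the listing order of the operations, each operation occupies one resource for one cycle, the rule for forcing a slot when none is free is one reading of Rau's description, and `budget_ratio` and `max_ii` are hypothetical parameters.

```python
def try_schedule(ops, deps, units, ii, budget):
    """ops: {name: (resource, latency)}; deps: [(src, dst, distance)];
    units: {resource: count}. Returns {name: time} or None on failure."""
    sched, last = {}, {}

    def occupants(t, res):
        # Ops already holding `res` in the MRT row that time t maps to.
        return [o for o, s in sched.items()
                if ops[o][0] == res and s % ii == t % ii]

    while len(sched) < len(ops) and budget > 0:
        op = next(o for o in ops if o not in sched)  # priority = listing order
        res, lat = ops[op]
        # Earliest time allowed by already-scheduled predecessors.
        min_time = max([0] + [sched[s] + ops[s][1] - d * ii
                              for s, t, d in deps if t == op and s in sched])
        slot = next((t for t in range(min_time, min_time + ii)
                     if len(occupants(t, res)) < units[res]), None)
        if slot is None:
            # No free slot in [min_time, min_time + II - 1]: force one
            # and unschedule whatever holds the resource there.
            if op not in last or min_time > last[op]:
                slot = min_time
            else:
                slot = last[op] + 1
            for victim in occupants(slot, res):
                del sched[victim]
        sched[op] = last[op] = slot
        # Unschedule successors whose dependence is now violated.
        for s, t, d in deps:
            if s == op and t in sched and sched[t] < slot + lat - d * ii:
                del sched[t]
        budget -= 1
    return sched if len(sched) == len(ops) else None


def iterative_modulo_schedule(ops, deps, units, mii, budget_ratio=3, max_ii=64):
    for ii in range(mii, max_ii + 1):
        sched = try_schedule(ops, deps, units, ii, budget_ratio * len(ops))
        if sched is not None:
            return ii, sched
    raise RuntimeError("no schedule found up to max_ii")


if __name__ == "__main__":
    ops = {"ld": ("mem", 1), "add": ("alu", 1), "st": ("mem", 1)}
    deps = [("ld", "add", 0), ("add", "st", 0)]
    units = {"mem": 1, "alu": 1}
    print(iterative_modulo_schedule(ops, deps, units, mii=1))
```

Starting the search at `mii=1` in this toy example exercises the outer loop: with one memory unit and two memory operations, the II = 1 attempt exhausts its budget and the scheduler retries at II = 2.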

  20. Iterative Modulo Scheduler (cont.)
     • Benchmark example:
         • 1327 loops from the Perfect Club (1002), SPEC-89 (298), and Livermore (27)
         • compiled for the Cydra5 machine
     • Optimizations:
         • load-store elimination
         • recurrence back-substitution
         • IF-conversion

  21. What is IF-Conversion?
     • Many branches can be removed if we have architectural support for predication
     • Converts control dependences to data dependences
     Before (four basic blocks):
       B1: Inst 1; Inst 2; If (A) B2 else B3
       B2: Inst 3; Inst 4; Goto B4
       B3: Inst 5; Inst 6; Goto B4
       B4: Inst 7; Inst 8; Inst 9
     After (one predicated block):
       Inst 1; Inst 2; (A) Inst 3; (A) Inst 4; (!A) Inst 5; (!A) Inst 6; Inst 7; Inst 8; Inst 9
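
A toy Python model of the transformation on this slide, assuming a simple if/else diamond. Instructions are plain strings and a guard of `None` means "always execute"; the function names are hypothetical and this is not how a real compiler represents predicated code.

```python
def if_convert(before, cond, then_block, else_block, after):
    """Flatten an if/else diamond into one list of (guard, instruction) pairs."""
    code = [(None, inst) for inst in before]
    code += [(cond, inst) for inst in then_block]        # guarded by A
    code += [("!" + cond, inst) for inst in else_block]  # guarded by !A
    code += [(None, inst) for inst in after]
    return code


def run(code, predicates):
    """Return the instructions that take effect for the given predicate values."""
    return [inst for guard, inst in code
            if guard is None or predicates[guard]]


if __name__ == "__main__":
    code = if_convert(["Inst 1", "Inst 2"], "A",
                      ["Inst 3", "Inst 4"], ["Inst 5", "Inst 6"],
                      ["Inst 7", "Inst 8", "Inst 9"])
    print(run(code, {"A": True, "!A": False}))
    print(run(code, {"A": False, "!A": True}))
```

All nine instructions are fetched on every pass; the predicates decide which of them take effect, so the branch disappears from the loop body.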

  22. Code Generation for Modulo Schedules
     • Depending on the machine support, different code generation techniques are used:
         • rotating registers
             • r1 becomes r2, r2 becomes r3, ... r32 becomes r0
         • special branch operation & predicated operations
             • special branch manipulates predicates controlling the prologue & epilogue

  23. No Support for Modulo Scheduling
     • Example:
         • how many registers are needed?
     [Figure: overlapped iterations of the loop body ld, *, ld, -, st]

  24. Minimum Unrolling due to Reg Renaming
     Minimum unrolling for registers (Modulo Variable Expansion)
     • K_min = max over all lifetimes i of: ceiling((end_i - start_i + 1) / II)
     • previous example: K_min =
     [Figure: code generation for the loop body (ld, *, ld, -, st) split into stages A, B, C, D, and the unrolled schedule with stage instances A1 B1 A2 C1 B2 A3 D1 C2 B3 A1 D2 C3 …]
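
The K_min formula can be sketched in Python as follows. The lifetimes in the example are invented for illustration and are not the ones from the slide's figure.

```python
def k_min(lifetimes, ii):
    """Minimum unroll factor for modulo variable expansion.

    lifetimes: list of (start, end) cycles, inclusive, one per register value.
    A value that lives longer than II cycles overlaps with its own copy from
    the next iteration, so it needs ceil(length / II) distinct registers.
    """
    return max((-(-(end - start + 1) // ii) for start, end in lifetimes),
               default=1)


if __name__ == "__main__":
    # Hypothetical lifetimes with II = 2.
    print(k_min([(0, 5), (2, 3)], ii=2))
```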

  25. Loop Trip Count
     • A loop unrolled K times with SC stages:
         • will execute (SC - 1) + K * i iterations
     • For a loop that is executed L times:
         • preconditioning loop will execute M times
             • M = L if L < SC - 1
             • M = (L - (SC - 1)) % K otherwise
         • modulo scheduled loop then executes
             • (L - M) / K times
     • (here K = 3, SC = 4)
     [Figure: unrolled schedule with stage instances A1 B1 A2 C1 B2 A3 D1 C2 B3 A1 D2 C3 …]
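
A short Python sketch of how the trip count might be split between the preconditioning loop and the modulo scheduled loop, following the formulas for M above. The function name is hypothetical; it returns M together with the L - M iterations left for the pipelined code and checks that this remainder has the form (SC - 1) + K * i.

```python
def split_trip_count(trip_count, stage_count, unroll):
    """Return (M, remaining): M iterations run in the preconditioning loop,
    `remaining` = L - M iterations run in the modulo scheduled code."""
    L, SC, K = trip_count, stage_count, unroll
    if L < SC - 1:
        m = L                      # too short to enter the pipelined loop
    else:
        m = (L - (SC - 1)) % K     # peel off the iterations that do not fit
    remaining = L - m
    if remaining:
        # What is left must be (SC - 1) + K * i for some integer i >= 0.
        assert (remaining - (SC - 1)) % K == 0
    return m, remaining


if __name__ == "__main__":
    # Slide parameters: K = 3, SC = 4.
    for L in (2, 3, 10, 11):
        print(L, split_trip_count(L, stage_count=4, unroll=3))
```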

  26. Code without Preconditioning
     • Problems with preconditioning [Rau, MICRO 92]
         • preconditioned loop is not software pipelined
         • large performance loss if short trip count and large unrolling factors
         • code expansion is large
     • Preconditioned loop can be eliminated:
         • code expansion is even larger
     [Figure: replicated stage sequences (A1 B1 A2 C1 B2 A3 …) labeled 1x, 2x, 3x, (3i + 1)x, and (3i + 2)x]

  27. Rotating Registers
     • Hardware:
         • Register + ICP = Physical Register
         • ICP (iteration control pointer) is decremented at the branch of the software pipelined loop
     • Example:
     [Figure: overlapped iterations of the loop body (ld, *, ld, -, st), with successive iterations labeled ICP = 6, ICP = 5, ICP = 4, ICP = 3, ICP = 2]
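
The register mapping above can be sketched in Python. The wraparound modulo the size of the rotating register file, and the file size of 32, are assumptions of this sketch; the slide only states Register + ICP = Physical Register.

```python
def physical_register(logical: int, icp: int, num_rotating: int = 32) -> int:
    """Map a logical register number to a physical one using the ICP.
    The modulo wraparound is an assumption of this sketch."""
    return (logical + icp) % num_rotating


if __name__ == "__main__":
    # A value written to logical r3 while ICP = 6 ...
    written = physical_register(3, icp=6)
    # ... is found under logical r4 after the loop branch decrements ICP to 5.
    read_next_iteration = physical_register(4, icp=5)
    print(written, read_next_iteration)
```

Because the ICP drops by one per iteration, each iteration's definition of a logical register lands in a fresh physical register, which is what removes the need for unrolling.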

  28. Allocation of Rotating Registers (cont.)
     • Algorithm: [Rau et al, PLDI 94]
         • place lifetimes in a 2-dimensional graph such that no lifetimes overlap
     • Once an allocation is found:
         • set each def and use of an operand using the logical register name
         • note that over a lifetime, a new logical name is needed each time the branch is encountered

  29. Rotating Registers / No Precondition
     • Code generation:
         • no unrolling is needed
         • preconditioning may still be needed to handle trip counts smaller than SC
         • may replicate code to avoid preconditioning
     • Example:
     [Figure: overlapped stages A, B, C, D of successive iterations: A; B A; C B A; D C B A; D C B; D C; D]

  30. Branch Support for Modulo Scheduling
     • Special branch:
         • LC: loop count
         • ESC: stages to drain pipe
         • ICP: for rotating registers
         • Pred: guard operations in loop
     • BRTOP:
         • if LC > 0: ICP--; Pred = 1; LC--; branch
         • else if ESC > 0: ICP--; Pred = 0; ESC--; branch
         • else: branch not taken
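
The BRTOP decision logic can be simulated with a small Python function. This is a sketch of the flowchart on this slide only: the state is a plain dictionary, and the initial values in the example are hypothetical.

```python
def brtop(state):
    """Apply one BRTOP to `state` (keys: LC, ESC, ICP, Pred).
    Returns True if the branch is taken."""
    if state["LC"] > 0:
        # More iterations to start: new iterations are enabled (Pred = 1).
        state["ICP"] -= 1
        state["Pred"] = 1
        state["LC"] -= 1
        return True
    if state["ESC"] > 0:
        # Draining the pipeline: no new iteration is started (Pred = 0).
        state["ICP"] -= 1
        state["Pred"] = 0
        state["ESC"] -= 1
        return True
    return False  # branch not taken: the loop exits


if __name__ == "__main__":
    state = {"LC": 2, "ESC": 1, "ICP": 6, "Pred": 1}
    taken = 0
    while brtop(state):
        taken += 1
    print(taken, state)
```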

  31. Modulo Schedules for While Loops
     • First approach: No speculation across iterations
     [Figure: loop body with operations ld, *, ld, =0, -, st grouped into stages A, B, C, D; schedule with II = 6]

  32. Modulo Schedules for While Loops (cont.)
     Second approach: With speculation across iterations
     • non-speculative stage: D because of store
     • all non-speculative operations must be after the branch
     • search for a modulo schedule as for do-loops
     [Figure: loop body with operations ld, *, ld, =0, -, st grouped into stages A, B, C, D; schedule with II = 2; speculative stages A, B]

  33. Modulo Schedules for While Loops (cont.)
     • Code generation issues:
         • remaining stages of speculatively initiated loop iterations are discarded
     • Examples:
         • branch is at the end of stage C
         • branch is at the end of stage B
     [Figure: for each example, overlapped stages A, B, C, D of successive iterations, marking the iterations that are kept and the discarded stages (x)]

  34. Summary
     • Modulo Scheduling: regular form of SWP
     • Generating a schedule
     • Allocating registers for the schedule
     • Effects of hardware support
