
Advanced Compilers CMPSCI 710 Spring 2003 Balanced Scheduling

This lecture discusses balanced scheduling, a method that spreads out instructions to cover load latency and improve performance by exploiting load-level parallelism. The lecture covers the algorithm, weight calculation, and limitations of balanced scheduling.



Presentation Transcript


  1. Advanced Compilers, CMPSCI 710, Spring 2003: Balanced Scheduling • Emery Berger • University of Massachusetts, Amherst

  2. Topics • Last time • Instruction scheduling • Gibbons & Muchnick • This time • Balanced scheduling • Kerns & Eggers

  3. List Scheduling, Redux • Build dependence dag • Choose instructions from ready list • Schedule using heuristics [Gibbons & Muchnick]: • Instruction with greatest latency • Instruction with most successors • Instruction on critical path
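A minimal Python sketch of the list-scheduling loop on this slide, assuming a single-issue machine. The function name `list_schedule`, the five-instruction dag, and the latencies are illustrative assumptions, not taken from the lecture. The ready instruction is chosen by greatest latency, then by most successors; the critical-path heuristic is left out.

```python
def list_schedule(succs, latency):
    """Top-down list scheduling; returns instructions in issue order.

    succs:   dict mapping each instruction to a list of its successors in the dag
    latency: dict mapping each instruction to its assumed latency in cycles
    """
    nodes = list(succs)
    unscheduled_preds = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs[n]:
            unscheduled_preds[s] += 1

    operands_ready = {n: 0 for n in nodes}  # earliest cycle each instruction may issue
    ready = [n for n in nodes if unscheduled_preds[n] == 0]
    order, cycle = [], 0
    while ready:
        issuable = [n for n in ready if operands_ready[n] <= cycle]
        if not issuable:
            cycle += 1  # nothing can issue this cycle: the pipeline would stall
            continue
        # Heuristics: greatest latency first, then most successors.
        pick = max(issuable, key=lambda n: (latency[n], len(succs[n])))
        ready.remove(pick)
        order.append(pick)
        for s in succs[pick]:
            operands_ready[s] = max(operands_ready[s], cycle + latency[pick])
            unscheduled_preds[s] -= 1
            if unscheduled_preds[s] == 0:
                ready.append(s)
        cycle += 1
    return order


# Hypothetical dag: L0 -> X1 -> X4 and X2 -> X3 -> X4, where L0 is a load.
dag = {"L0": ["X1"], "X1": ["X4"], "X2": ["X3"], "X3": ["X4"], "X4": []}
hit = {"L0": 1, "X1": 1, "X2": 1, "X3": 1, "X4": 1}
miss = dict(hit, L0=3)  # assume the load takes 3 cycles
print(list_schedule(dag, hit))
print(list_schedule(dag, miss))
```

On this assumed dag, the only input that changes between the two calls is the latency the scheduler believes the load has, which is exactly the knob the following slides discuss.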

  4. Fly in the Ointment • When scheduling loads, assume hit in primary cache • On older architectures, this makes sense: • Stall execution on cache miss • But newer architectures are nonblocking: • Processor executes other instructions while load in progress • Good – creates more ILP – but…

  5. Scheduling Options • Now what? • Assume cache miss takes N cycles • N typically 10 or more • Do we schedule load: • Anticipating 1 cycle delay (a hit)? • optimistic • Or N cycle delay (a miss)? • pessimistic

  6. Optimistic vs. Pessimistic • Optimistic: fine for hits, inferior for misses • Pessimistic: fine for hits, better for misses • Optimistic schedule: L0 X2 X1 X3 X4 • Pessimistic schedule: L0 X2 X3 X1 X4
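To make the hit/miss comparison concrete, here is a small sketch that counts interlock cycles for a fixed instruction order on an in-order, single-issue machine. The dependences (X1 uses L0, X3 uses X2, X4 uses X1 and X3), the 3-cycle miss latency, and the name `stall_cycles` are assumptions chosen for illustration; the slide does not spell out the dag.

```python
def stall_cycles(order, deps, actual_latency):
    """Count cycles lost waiting for operands when `order` is executed as given.

    deps:           dict mapping an instruction to the instructions it depends on
    actual_latency: dict of latencies that differ from the default of 1 cycle
    """
    issue = {}  # cycle at which each instruction issued
    cycle = 0
    stalls = 0
    for instr in order:
        # An instruction may issue once every producer has finished.
        earliest = max(
            (issue[p] + actual_latency.get(p, 1) for p in deps.get(instr, [])),
            default=0,
        )
        if earliest > cycle:
            stalls += earliest - cycle
            cycle = earliest
        issue[instr] = cycle
        cycle += 1
    return stalls


deps = {"X1": ["L0"], "X3": ["X2"], "X4": ["X1", "X3"]}  # assumed dag
optimistic = ["L0", "X2", "X1", "X3", "X4"]
pessimistic = ["L0", "X2", "X3", "X1", "X4"]

for name, order in (("optimistic", optimistic), ("pessimistic", pessimistic)):
    print(name,
          "hit:", stall_cycles(order, deps, {"L0": 1}),
          "miss:", stall_cycles(order, deps, {"L0": 3}))
```

Under these assumptions both orders run without stalls on a hit, while on a miss only the optimistic order loses a cycle, because X1 sits closer to the load.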

  7. Optimistic vs. Pessimistic, Multiple Loads • Optimistic: better for hits, same for misses • Pessimistic: worse for hits, same for misses • Optimistic schedule: L1 X1 L2 X2 X3 • Pessimistic schedule: L1 X1 X2 L2 X3

  8. Balanced Scheduling • Key insights: • No fixed estimate of memory latency is best • Schedule based on the available parallelism in the code • Load-level parallelism • Balanced scheduling: • Computes each load’s weight separately • Takes other possible instructions into account • Spaces out loads, using available instructions as “filler”

  9. Balanced Scheduling, Example • Maximizes distance between L0 & X1 • Good in case of miss • Balanced schedule: L0 X2 X3 X1 X4

  10. Balanced Scheduling, Example • W: load instruction weight • W=5 – over-estimate • Greedy schedule • W=1 – under-estimate • Lazy schedule • Balanced scheduler: • W=3 (= load-level parallelism)

  11. Balanced Scheduling, Results • Always achieves fewest interlocks

  12. Algorithm Idea • Examine each instruction i in dag • Determine which loads can run in parallel with i • Use all (or part) of i’s execution time to cover latency of loads

  13. Balanced Scheduling, Weight Calculation • Time complexity?

  14. Balanced Scheduling, Example • Locate longest load paths in connected components • Add 1/(# of loads) to each load’s weight

  15. Balanced Scheduling, Example II • Consider instruction X1 • Locate longest load paths in connected components • Add 1/(# of loads) to each load’s weight • “contributions of X1”
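The weight calculation outlined on the last few slides can be sketched as follows. This is one reading of the slides, not the authors' code: for each instruction i, take the instructions that are neither ancestors nor descendants of i, split them into connected components, find the largest number of loads on any path in each component, and add the reciprocal of that number to every load in the component. The starting weight of 1, the function name `balanced_weights`, and the example dag are assumptions.

```python
from fractions import Fraction


def balanced_weights(succs, loads):
    """Return a weight per instruction; only loads end up above the base weight.

    succs: dict mapping each instruction to a list of its successors in the dag
    loads: collection of the instructions that are loads
    """
    loads = set(loads)
    nodes = list(succs)
    preds = {n: [] for n in nodes}
    for n in nodes:
        for s in succs[n]:
            preds[s].append(n)

    def closure(start, edges):
        """All nodes reachable from `start` by following `edges` (start excluded)."""
        seen, stack = set(), list(edges[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(edges[n])
        return seen

    weight = {n: Fraction(1) for n in nodes}  # assumed base weight of 1
    for i in nodes:
        # Instructions that could execute in parallel with i.
        unvisited = set(nodes) - closure(i, succs) - closure(i, preds) - {i}
        while unvisited:
            # Grow one connected component, ignoring edge direction.
            comp, stack = set(), [unvisited.pop()]
            while stack:
                n = stack.pop()
                comp.add(n)
                for m in succs[n] + preds[n]:
                    if m in unvisited:
                        unvisited.remove(m)
                        stack.append(m)

            memo = {}

            def loads_on_path(n):
                """Most loads on any directed path inside comp that starts at n."""
                if n not in memo:
                    below = max((loads_on_path(s) for s in succs[n] if s in comp),
                                default=0)
                    memo[n] = below + (1 if n in loads else 0)
                return memo[n]

            chances = max(loads_on_path(n) for n in comp)
            if chances:  # components without loads contribute nothing
                for load in comp & loads:
                    weight[load] += Fraction(1, chances)
    return weight


# Hypothetical dag: L0 -> X1 -> X4 and X2 -> X3 -> X4, where L0 is the only load.
dag = {"L0": ["X1"], "X1": ["X4"], "X2": ["X3"], "X3": ["X4"], "X4": []}
weights = balanced_weights(dag, {"L0"})
print(weights["L0"])
```

On this assumed dag only X2 and X3 are independent of the load, each contributes 1, and L0 ends with weight 3. The sketch performs a constant number of graph traversals per instruction, so its own cost is roughly proportional to n × (n + e) for n instructions and e edges; that figure describes this sketch, not necessarily the published algorithm.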

  16. Balanced Scheduling, All Weights

  17. Balanced Scheduling Algorithm • After computing weights, perform list scheduling where: • Priority = weight plus max priority of successors • Break ties: • Largest delta between consumed & defined registers • Rank based on successors in dag that would be exposed • Select instruction generated earliest • Bottom-up scheduler: • Reverse-order, schedule from leaves toward roots
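The priority rule on this slide (weight plus the maximum priority of any successor) can be sketched as a memoized walk over the dag. The function name `priorities`, the example dag, and the weights (3 for the load, 1 for everything else) are assumptions for illustration; the tie-breaking rules and the bottom-up scheduling order are not modeled.

```python
def priorities(succs, weight):
    """Priority of an instruction = its weight + the max priority of its successors."""
    prio = {}

    def visit(n):
        if n not in prio:
            prio[n] = weight[n] + max((visit(s) for s in succs[n]), default=0)
        return prio[n]

    for n in succs:
        visit(n)
    return prio


# Hypothetical dag and weights: L0 is a load with an assumed balanced weight of 3.
dag = {"L0": ["X1"], "X1": ["X4"], "X2": ["X3"], "X3": ["X4"], "X4": []}
weight = {"L0": 3, "X1": 1, "X2": 1, "X3": 1, "X4": 1}
prio = priorities(dag, weight)
print(prio)
```

With these inputs the load gets the highest priority (5), so a list scheduler that prefers higher priorities would place it first and use X2 and X3 as filler before X1.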

  18. Balanced Scheduling, Example I • Balanced schedule: L0 X2 X3 X1 X4

  19. Balanced Scheduling, Example II

  20. Limitations • Performed after register allocation • But: introduces false dependences • Reuse of registers ⇒ dag has extra edges • Can be fixed with software register renaming • Had to modify gcc’s RTL • Approach required manual pipelining • Profile-based feedback… • Benchmark based on FORTRAN converted to C with f2c • Can’t disambiguate memory • Adds many edges to dag

  21. “Workaround”: Simulate Fortran • Modify code to avoid aliases • Improves results, but incorrect! • Needs advanced alias analysis

  22. Empirical Results • Evaluated using simulation • 3% to 18% improvement over regular scheduler across different models • Mean: 9.9% • Unfortunately: • No results presented without above-mentioned modifications…

  23. Conclusion • Balanced scheduling • Spreads out instructions to cover load latency • Based on exploitable load-level parallelism • Effective at improving performance • Modulo methodological limitations… • Not so great for C/C++, possibly useful for Java • Next time: interprocedural analysis • ACDI: Ch. 19, pp. 607-636, 641-656
