
“Flea-flicker” Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense



Presentation Transcript


  1. “Flea-flicker” Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense
     Ronald Barnes, George Mason University
     Shane Ryoo and Wen-mei Hwu, University of Illinois Urbana-Champaign
     Dr. Ronald D. Barnes, Department of Electrical and Computer Engineering

  2. Dynamic scheduling approach:
     • Tolerating memory latency and finding ILP at runtime comes at heavy cost
     • Aggressive out-of-order execution incompatible with overriding power/power density concerns
       • Alpha 21264: 18% of chip power, as much as int + fp exec
       • POWER4: 10% of core power, scheduler highest power density
     • Power concerns influencing development towards efficiency rather than wide inst. window (Pentium M)
     In-order approach:
     • Rely on compiler-planned execution
     • Compiler techniques (e.g. prefetching) not solving problem of unanticipated memory latency

  4. Compiler Expressed Parallelism
     • Compiler can find a significant number of instructions for parallel execution on 6-issue processor

  5. Compiler Expressed Parallelism
     • Dynamic stalls (of which cache misses are most important [Sias04]) drastically reduce observed performance
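
As a rough illustration of why dynamic stalls hurt a compiler-scheduled machine, the sketch below computes the achieved IPC of an in-order core that stalls for the full miss latency on every cache miss. The function name and all numbers (planned IPC, miss rate, miss penalty) are assumptions chosen for illustration, not measurements from the talk.

```python
def observed_ipc(planned_ipc, misses_per_kilo_inst, miss_penalty_cycles):
    """Achieved IPC of an in-order core that stalls fully on each miss.

    planned_ipc: IPC of the compiler's schedule with no dynamic stalls.
    misses_per_kilo_inst: long-latency cache misses per 1000 instructions.
    miss_penalty_cycles: stall cycles per miss (no overlap assumed).
    """
    insts = 1000.0
    planned_cycles = insts / planned_ipc
    stall_cycles = misses_per_kilo_inst * miss_penalty_cycles
    return insts / (planned_cycles + stall_cycles)


# Hypothetical values, chosen only to show the shape of the effect.
print(observed_ipc(planned_ipc=4.0, misses_per_kilo_inst=5, miss_penalty_cycles=150))  # → 1.0
```

Under these assumed values, five misses per thousand instructions are enough to cut a planned IPC of 4 down to 1, because the stall cycles (750) outweigh the planned cycles (250).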

  7. In-order runahead performance

  9. Benefits of multipass approach

  10. Key Multipass Contributions
     • Advance restart allows processing of newly woken insts.
       • Initial implementation relies on compiler-controlled restart
       • No expensive, fine-grain wakeup mechanism is needed
     • Re-use makes results of independent instructions persistent
       • Improves efficiency (no re-computation)
       • Hides long latency operations
     • Instruction regrouping allows schedule-height reduction without reordering instructions
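
The result re-use idea above can be sketched with a toy model: during an advance pass past a cache miss, instructions whose sources are all valid execute and keep their results, while instructions that depend on the missing value are deferred to a later pass. This is a minimal sketch under stated assumptions, not the authors' hardware design; the `Inst` type, the `advance_pass` function, and the register names are hypothetical, and advance restart and instruction regrouping are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    dest: str             # destination register
    srcs: tuple           # source registers
    misses: bool = False  # True for a load that misses in the cache


def advance_pass(program):
    """Walk past a miss in program order and classify each instruction.

    Returns (reusable, deferred): indices whose results can be saved for
    reuse, and indices that must wait for the missing data.
    """
    invalid = set()            # registers whose value is not yet known
    reusable, deferred = [], []
    for i, inst in enumerate(program):
        if inst.misses or any(s in invalid for s in inst.srcs):
            invalid.add(inst.dest)      # result unavailable in this pass
            deferred.append(i)
        else:
            invalid.discard(inst.dest)  # dest now holds a valid value
            reusable.append(i)
    return reusable, deferred


# Hypothetical five-instruction sequence; r0 is assumed valid on entry.
program = [
    Inst("r1", ("r0",), misses=True),  # load that misses
    Inst("r2", ("r1",)),               # depends on the miss
    Inst("r3", ("r0",)),               # independent of the miss
    Inst("r4", ("r3",)),               # independent of the miss
    Inst("r5", ("r2", "r4")),          # depends on the miss through r2
]
reusable, deferred = advance_pass(program)
print(reusable, deferred)  # → [2, 3] [0, 1, 4]
```

In this toy example, instructions 2 and 3 would not need to be recomputed once the miss returns; only the deferred instructions remain to be executed.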

  11. Implementation cost of Multipass
     • Speculative memory state discussed in paper

  12. Experimental configuration
     • Benchmarks compiled with IMPACT C compiler using control-flow profiling and interprocedural alias analysis
     • Simulator augmented with power models of array structures

  13. Comparison with Out-of-Order Execution

  14. Comparison with Out-of-Order Execution

  15. Overheads of Out-of-Order execution
     • Register renaming hardware to overcome output and anti-dependencies
     • Complex scheduling table to issue instructions as dependencies are met
     • Increase in pipeline length
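
To make the first overhead concrete, the following sketch shows what register renaming does in principle: every write receives a fresh physical register, so output (WAW) and anti (WAR) dependences disappear and only true dependences remain. It is a generic textbook-style illustration, not the renaming logic of any processor mentioned in the talk; the `rename` function, the tuple encoding, and the `p0`, `p1`, ... names are hypothetical.

```python
from itertools import count


def rename(program):
    """Rename registers so each write gets a fresh physical register.

    program: list of (dest, srcs) tuples using architectural names.
    Returns the same instructions using physical names.
    """
    fresh = count()
    mapping = {}  # architectural name -> current physical name

    def phys(reg):
        # Registers read before any write are live-ins; map them on first use.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(phys(s) for s in srcs)  # read the latest mappings
        mapping[dest] = f"p{next(fresh)}"        # fresh name for every write
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r4", ("r1",)),       # true dependence on r1
    ("r1", ("r5",)),       # WAW with inst 0, WAR with inst 1
]
for inst in rename(program):
    print(inst)
# → ('p2', ('p0', 'p1'))
# → ('p3', ('p2',))
# → ('p5', ('p4',))
```

After renaming, the third instruction writes `p5` rather than reusing the register read by the second, so it could issue early without corrupting the earlier value. A real design would also need a free list and register reclamation, which this sketch omits.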

  16. Power Ratio Comparison
     • Sequential, in-order access gives multipass structures their advantage

  17. Related approaches
     • In-order runahead [Dundas97]; runahead to extend out-of-order window [Mutlu03]
       • Checkpoint and repair run-ahead execution
       • All “pre-execution” results are thrown away
     • Subordinate microthreads [Chappel99]; speculative precomputation [Collins01]
       • Helper threads initiate memory accesses early
     • Two-pass pipelining [Barnes03]
       • In-order advance execution on a separate, tightly-coupled pipeline
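
For contrast with approaches that keep pre-executed results, the toy model below captures the essence of runahead as summarized above: on a demand miss the core pre-executes a few following loads purely to warm the cache, then discards all results and re-executes from the miss. The `demand_misses` function, the one-load-per-instruction trace, the unbounded cache, and the assumption that no runahead load depends on the missing value are all simplifications introduced here for illustration.

```python
def demand_misses(trace, window):
    """Count demand misses for an in-order core with simple runahead.

    trace: list of load addresses, one load per instruction.
    window: number of following loads pre-executed during each miss.
    """
    cache = set()   # unbounded cache, for simplicity
    misses = 0
    for pc, addr in enumerate(trace):
        if addr in cache:
            continue
        misses += 1              # the pipeline would stall here
        cache.add(addr)
        # Runahead: touch upcoming loads only to prefetch their lines.
        # Their computed results are thrown away and redone later.
        for ahead in trace[pc + 1 : pc + 1 + window]:
            cache.add(ahead)
    return misses


trace = [0x100, 0x200, 0x300, 0x400]  # hypothetical independent loads
print(demand_misses(trace, window=0))  # → 4
print(demand_misses(trace, window=3))  # → 1
```

The benefit in this model comes only from overlapping misses; every instruction touched during runahead is still executed again afterwards.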

  18. Conclusions
     • Multipass execution provides a cache-miss-latency-tolerant microarchitecture
     • Advance restart facilitates the execution of independent, newly ready instructions
       • Initial implementation uses compiler direction
     • Instruction regrouping achieves significant speedup by increasing “rally” mode throughput
     • Future work
       • Microarchitectural mechanism for controlling advance restart
       • Examination of tradeoffs between continuing (perhaps with prediction) vs. restarting advance execution
       • Partial reuse of results
