
From Adaptive to Self-Tuning Systems

Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos, School of Electrical and Computer Engineering.


Presentation Transcript


  1. From Adaptive to Self-Tuning Systems
     Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
     School of Electrical and Computer Engineering

  2. Architectural Challenges
     [Figure labels: Single Thread Performance; chart of Power against ILP for pipeline types in-order, OOO, aggressive OOO. Image source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg]
     Power Wall
     • Leakage current increases 7.5X with each generation [3]
     • Negative returns with power
     • Increasing inefficiencies due to
       • speculation
       • control flow
     Frequency Wall
     • Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) [4]
     Memory Wall
     • Cache area
       • 80% of transistor budget, 50% of total area [1]
       • Defects in cache affect processor yield
       • Significant power consumers (e.g. > 40% of total power in StrongARM) [2]
     • On-chip-DRAM gap continues to grow
     Economic Wall
     • Costs of developing next generation processors
       • Design & manufacturing costs
     • Extreme device variability
     References
     [1] P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000
     [2] Michael Zhang, Krste Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 2002
     [3] S. Borkar. "Design Challenges of Technology Scaling." IEEE Micro, 1999
     [4] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures." ISCA 2000

  3. System View
     [Figure: large-scale, many-core, heterogeneous system made up of many processor (P) and memory (M) nodes]
     1. Capture and adapt to intrinsic application behavior
        • Dynamic, on-line, evolutionary behaviors
        • Static, off-line characterizations
     2. Device-level variations reduce architecture yield
     Solution: systems are self-tuning

  4. The Space of Solutions
     [Figure labels: Structured Workloads, Ill-Structured Workloads; Rigid HW/SW Boundaries, Evolutionary or Self-Tuning Systems; State of the Practice; small processor (P) and memory (M) diagrams]
     • Traditional architectures (fixed)
     • Ability to customize architectures before application deployment
     • Architectures change at SW-determined points of execution
     • Architectures continuously, autonomously evolve and adapt

  5. From Adaptive to Self Tuning
     • Where do we make future investments in transistors and software?
       • Hardware software co-design for continuous monitoring and/or tuning
       • Expose and (dynamically) eliminate design redundancies
     • Two examples
       • Cache memory hierarchy
       • On-chip networks

  6. Generational Behavior of Caches
     [Figure: memory lines plotted over time; each generation of a line begins with a miss, continues with hits, and ends with an idle interval before the new generation starts]
     References
     1. Kaxiras, S., Hu, Z. and Martonosi, M. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001
     2. Jaume Abella, Antonio González, Xavier Vera, Michael F. P. O'Boyle. "IATAC: A Smart Predictor to Turn-Off L2 Cache Lines." TACO 2005
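
The generational pattern above (miss, hits, idle interval) is what the cited cache-decay work exploits: a line that has sat idle long enough is assumed dead and can be powered off. The sketch below is a minimal illustration of that idea, not the published mechanism. The class names, the direct-mapped organization, and the tick-based timer are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int = -1          # -1 means no memory line is cached yet
    idle: int = 0          # ticks since the last access
    powered: bool = False  # a powered-off line holds no data


class DecayCache:
    """Hypothetical direct-mapped cache whose lines power off when idle."""

    def __init__(self, num_lines, decay_interval):
        self.lines = [Line() for _ in range(num_lines)]
        self.decay_interval = decay_interval

    def access(self, line_addr):
        """Return True on a hit; a miss starts a new generation."""
        line = self.lines[line_addr % len(self.lines)]
        hit = line.powered and line.tag == line_addr
        line.tag, line.idle, line.powered = line_addr, 0, True
        return hit

    def tick(self):
        """Advance time one step and power off lines idle too long."""
        for line in self.lines:
            if line.powered:
                line.idle += 1
                if line.idle >= self.decay_interval:
                    line.powered = False  # contents lost, leakage saved

    def powered_lines(self):
        return sum(line.powered for line in self.lines)
```

A short decay interval saves more leakage but risks turning off a line that would have been hit again, which is the trade-off the cited predictors try to manage.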

  7. Cache Tuning: Conceptual Model
     • Remap memory into the cache → shape the cache
     • Match the program footprint → resize the cache

  8. Cache Tuning: System Model & Opportunities
     [Figure: program code annotated with remapping directives, next to a hardware model with processor (P), L1, L2, AT logic, LUT, alternative implementations, and memory (M); axes labelled x, y, z]
     • Static analysis or programmer supplied: Placement( B[][], param ) as a remapping directive for structured accesses
     • Profile based insertion: Placement( B[][], param ) placed inside a loop in Region A
     • Run-time tuning: Thread 1, Thread 2

  9. Static Tuning: Scientific Applications
     • Targeted to programs with predictable access patterns
     • Compiler can both resize and remap
     • Advanced compiler optimizations made possible

  10. Dynamic Tuning: Folding Heuristics
      • Find and utilize redundancies in the design
      • Miss folding: fold misses by remapping memory lines into the same cache set
      Comparisons shown for a 256KB L2 cache
      Reference: S. Ramaswamy, S. Yalamanchili. "Improving Cache Efficiency via Resizing + Remapping." ICCD 2007
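
Folding by remapping can be pictured as a lookup table placed in front of the usual modulo set index. The sketch below is a hypothetical illustration of that idea only. It does not reproduce the heuristics of the cited ICCD 2007 paper, and `RemappedCache`, `fold`, and `active_sets` are names invented for this example.

```python
class RemappedCache:
    """Hypothetical set-index remapping table in front of a cache."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Identity table: equivalent to conventional modulo placement.
        self.remap = list(range(num_sets))

    def set_for(self, line_addr):
        """Return the physical set that a memory line is placed in."""
        return self.remap[line_addr % self.num_sets]

    def fold(self, src, dst):
        """Redirect every index currently mapped to set src onto set dst."""
        self.remap = [dst if s == src else s for s in self.remap]

    def active_sets(self):
        """Number of physical sets still in use after folding."""
        return len(set(self.remap))
```

After `fold(5, 1)` on an 8-set cache, lines that used to land in set 5 share set 1, and set 5 becomes a candidate for being switched off or skipped if it is defective.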

  11. Tuning for Yield: Decreasing Defect Sensitivity*
      • Performance yield: yield at a given performance (e.g. AMAT) for 1000 units
      • Up to four times greater than modulo placement
      • Exploiting redundancies → application to power management
      Recovering design inefficiencies
      * S. Ramaswamy, S. Yalamanchili. "Customizable Fault Tolerant Caches for Embedded Processors." ICCD 2006
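
Performance yield, as defined above, can be made concrete with the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty. The sketch below assumes each manufactured unit is summarized by a single miss rate (defects raising it) and counts the fraction of units that meet an AMAT target. The function names and this per-unit model are assumptions for illustration, not the evaluation method of the cited paper.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(unit_miss_rates, hit_time, miss_penalty, amat_target):
    """Fraction of units whose AMAT is at or below the target."""
    passing = sum(
        amat(hit_time, rate, miss_penalty) <= amat_target
        for rate in unit_miss_rates
    )
    return passing / len(unit_miss_rates)
```

With made-up miss rates of 0.02, 0.05, 0.10 and 0.20, a 1-cycle hit time, a 100-cycle miss penalty and an 8-cycle target, two of the four units pass, giving a performance yield of 0.5.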

  12. Opportunities
      • Voltage scaling
        • Combine voltage scaling and remapping for program phase dependent power management
      • Compiler-directed hardware optimizations
        • For example concurrent data layout + cache placement
      • Application to multi-threaded and multi-core domains
        • Cache sharing across threads
        • Challenge: coherency traffic

  13. The On-Chip Network
      • The network is in the critical path (performance)
        • Operand networks
        • Cache hierarchy
        • System on Chip
      • Increasing impact of wire (channel) delays
        • Wire delays must be actively managed
        • On-demand resource management
      • Initial studies: link tuning
        • Reference: research at EPFL & Stanford on robust link design

  14. A System for Tuning and Actively Reconfiguring SoC Links (STARS)
      [Figure: timing diagram over time showing Latch 1, Latch 2 and Latch 3 capturing Value 1 and Value 2 in three cases: Too Fast, Well Tuned, Too Slow]
      • Variable delays and cascaded registers measure link delay
      • Digital PLL tunes the clock to match the link delay

  15. FPGA Tests
      • Monitoring
        • Find end of link transition
        • Find start of link transition
      • Tuning
        • Adjust clock frequency
        • Determine slack in the link
      • Low speed tests to validate the control strategy
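
The monitoring and tuning steps above amount to a feedback loop: measure the link delay, compute the slack against the current clock period, and step the clock until the slack is small but not negative. The sketch below is a software caricature of that loop under stated assumptions. The function name, the fixed step size, and the use of an already measured `link_delay_ps` are all hypothetical, and the real STARS controller is a hardware digital PLL.

```python
def tune_clock_period(link_delay_ps, period_ps, step_ps, margin_ps=0):
    """Step the clock period until it just covers the link delay.

    Returns a period p with 0 <= p - (link_delay_ps + margin_ps) < step_ps.
    """
    if step_ps <= 0:
        raise ValueError("step_ps must be positive")
    required = link_delay_ps + margin_ps
    while True:
        slack = period_ps - required
        if slack < 0:
            period_ps += step_ps   # "too fast": data arrives after the edge
        elif slack >= step_ps:
            period_ps -= step_ps   # "too slow": a full step of slack is wasted
        else:
            return period_ps       # "well tuned"
```

For example, starting from a 400 ps period on a link measured at 1000 ps with 50 ps steps, the loop lengthens the period until it reaches 1000 ps and then stops.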

  16. Prototyping: 180nm
      • Variable Delay Elements (VDE)
        • Variable delay from 118ps to 1.47ns
        • 10 bits of resolution
        • 502 transistors
      • Digitally Controlled Oscillator (DCO)
        • Clock period from 240ps to 2.97ns
        • 10 bits of resolution
        • 528 transistors
      • Digital Clock Divider (DCD)
        • Min input clock period 480ps
        • 8 bits of resolution
        • 1127 transistors
      • Allows tuning links up to 2.083 GHz
        • From reference clock of 8.13MHz
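
The two frequencies on this slide appear to follow from the divider figures: a 480 ps minimum input period corresponds to about 2.083 GHz, and dividing that by 2^8 (the 8-bit divider) gives about 8.13 MHz. The short calculation below checks that reading. Treating the reference clock as the maximum link clock divided by 256 is an interpretation of the slide, not something it states outright.

```python
MIN_INPUT_PERIOD_PS = 480   # DCD minimum input clock period, from the slide
DIVIDER_BITS = 8            # DCD resolution, from the slide

# 1 / (480 ps) expressed in GHz: 1000 ps per ns divided by the period in ps.
max_link_freq_ghz = 1000 / MIN_INPUT_PERIOD_PS

# Assumed relationship: reference clock = maximum link clock / 2^bits.
reference_clock_mhz = max_link_freq_ghz * 1000 / 2 ** DIVIDER_BITS

print(f"max link clock:  {max_link_freq_ghz:.3f} GHz")
print(f"reference clock: {reference_clock_mhz:.3f} MHz")
```

Both values agree with the slide once the reference clock is truncated to two decimal places.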

  17. Extensions
      • Modulate link widths
      • Modulate buffer organizations
        • Channels/depth
      • Feedback between local congestion detection and link and buffer resources

  18. Summary
      • Application demands will be time varying
      • Technology will introduce time-varying hardware characteristics
      • Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
        • Need the support of abstractions for tuning
      • Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
      • Build on existing research in cache performance & power management
