
Day 4: Symbiotic Optimization

Kim Hazelwood, ACACES Summer School, July 2009.


Presentation Transcript


  1. Day 4: Symbiotic Optimization Kim Hazelwood ACACES Summer School July 2009

  2. Modern Computing Challenges • Performance • Power & temperature • Reliability • Parallelism (multicore) • Heterogeneity • Limited resources (embedded computing)

  3. Typical Approaches • Optimize using SW or HW techniques in isolation • Performance • SW: Compile-time optimizations, OS scheduling • HW: Architectural improvements, VLSI technology • Reliability: Code/data duplication (HW or SW) • Power & Temperature • HW control mechanisms • Profile + recompile cycle

  4. Modern Design Constraints • Compilers – “Compile once, run anywhere” • Cannot ship “MS Office for 3Q08 batch of Core2 3GHz, > 8GB RAM, BrandX power supply, located in high altitudes…” • OSes – Scheduling algorithms don’t scale to modern architectures • Microarchitecture – Limited window of application knowledge • Past must predict the future • Circuits – Guaranteed correctness, reliability • Must design for the worst case

  5. Tortola: Symbiotic Optimization • Enable HW/SW Communication • What small changes can we make to HW/OS to enable collaborative solutions? [Figure: Binary Modifier between SW Applications and OS/HW; arrow labels: runtime traits; hardware feedback, OS load; scheduling hints]

  6. The Power of Virtualization • No longer restricted to a fixed ISA • Reduce hardware complexity • No more backwards compatibility warts • Fix bugs after shipment • Reduce time to market for new architectures [Figure: SW Applications, Binary Modifier, HW; interface labels: x86, SWI, HWI; annotations: Initially, Eventually]

  7. Tortola Applications • Combine global program information with run-time feedback • System-specific power usage • Application-specific heat anomalies • Workload/input specific performance optimization • Two case studies • The di/dt problem – solved using hardware feedback and dynamic optimization • Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and OS hints from the binary modifier

  8. The di/dt Problem • Voltage stability is important for reliability and performance • Low-power techniques have a negative side effect: current variation • ITRS cites noise management as a Grand Challenge for the 5-10 year time frame • Dips (undershoots) in supply voltage can cause incorrect values to be calculated or stored • Spikes (overshoots) in supply voltage can cause reliability problems

  9. di/dt Solutions • Software: Compiler Optimizations • MicroArch: Sensor/Actuator Mechanisms • Circuit-Level: Decoupling capacitors; more Vdd/Gnd pins on package • Co-Designed MicroArch & SW: Binary Modifier

  10. Sensor-Actuator Mechanisms • On-chip voltage sensors detect abnormally high/low voltage levels • On-chip actuator then attempts to quickly raise/lower the processor’s current draw • Phantom firing • increases current (at the expense of power) • Resource throttling • reduces current (at the expense of performance)

  11. Detecting Imminent Emergencies [Figure: Operating Voltage Range with levels marked at 1.05 V, 1.03 V, 1 V, 0.97 V, and 0.95 V; labels: Hard Emergency, Soft Emergency, Control Threshold]
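As a rough illustration of how a sensor-actuator mechanism might act on these voltage levels, the Python sketch below classifies a sampled supply voltage and picks an actuator response. It assumes that 0.97 V and 1.03 V are the control thresholds at which a soft emergency is flagged and that 0.95 V and 1.05 V bound the hard-emergency region. The slide lists the levels, but this mapping and the names `classify` and `actuator_response` are assumptions made for illustration.

```python
# Hypothetical sketch: the constant names and the mapping of levels to
# soft/hard emergencies are assumptions; only the voltages come from the slide.
HARD_LOW, SOFT_LOW = 0.95, 0.97    # volts
SOFT_HIGH, HARD_HIGH = 1.03, 1.05  # volts
NOMINAL = 1.0                      # volts


def classify(voltage):
    """Return 'hard', 'soft', or 'normal' for a sampled supply voltage."""
    if voltage <= HARD_LOW or voltage >= HARD_HIGH:
        return "hard"
    if voltage <= SOFT_LOW or voltage >= SOFT_HIGH:
        return "soft"
    return "normal"


def actuator_response(voltage):
    """Pick an actuator action once a control threshold has been crossed.

    A dip calls for less current draw (resource throttling); a spike calls
    for more current draw (phantom firing).
    """
    if classify(voltage) == "normal":
        return "none"
    return "throttle" if voltage < NOMINAL else "phantom_fire"


for v in (1.00, 0.96, 1.04, 0.94):
    print(f"{v:.2f} V -> {classify(v)}, {actuator_response(v)}")
```

Acting at the inner thresholds rather than the outer ones reflects the idea of detecting an emergency while it is still imminent, leaving the actuator time to respond.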

  12. Targeting Mid-Frequency Di/dt • Problematic: wide current spike • Worst case: pulse at the resonant frequency [Figure, from Joseph et al., HPCA 2003: two panels plotting Processor Current (A) and Supply Voltage (V) against Time (Cycles); annotations: 20 cycles, 60 cycles, Minimum Voltage, Maximum Voltage]

  13. A di/dt Stressmark • But… the actuator engages on every loop iteration, degrading performance • Why not correct the problem in the code?

          BEGIN_LOOP:
            …
            ldt    $f1, ($4)
            divt   $f1, $f2, $f3
            divt   $f3, $f2, $f3
            stt    $f3, 8($4)
            ldq    $7, 8($4)
            cmovne $31, $7, $3
            stq    $3, ($4)
            stq    $3, ($4)
            stq    $3, ($4)
            …
            stq    $3, ($4)
            …
            JUMP BEGIN_LOOP

      [Figure annotations on the loop: Sequential, Low Power; Parallel, High Power]

  14. Why use Dynamic Binary Modification? • Modify the instruction stream at run time • Much easier to react to an emergency than to predict one • Emergencies are processor dependent, but software should not be! • Enables run-time guarantees

  15. Proposed Solution • Leverage our additional software layer to supplement existing solutions • Microarchitecture provides feedback to our software-based virtual layer [Figure: Executable, Binary Modifier (VL), Altered Executable on the SW side; Microprocessor with Sensor+Actuator Ext on the HW side]

  16. Required Investigations • Characterizing emergencies • How often do we see di/dt emergency loops? • Are emergencies usually code-based? • Communication between the microarchitecture and the virtual layer • What information should be passed to the virtual layer during an emergency? • Fixing di/dt via binary modification • Will existing techniques help? • New algorithms?

  17. Last-Executed Branch • Data suggests that modifying a few code sequences will eliminate many voltage emergencies
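One way to examine this kind of claim is to attribute each emergency to the last branch executed before it and ask how much of the total the few most frequent branches account for. The Python sketch below does that. The function name, the branch addresses, and the sample counts are invented for illustration and are not data from the talk.

```python
from collections import Counter


def coverage_of_top_branches(emergency_branches, k):
    """Fraction of emergencies attributed to the k most frequent branches."""
    if not emergency_branches:
        return 0.0
    counts = Counter(emergency_branches)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(emergency_branches)


# Made-up log: address of the last-executed branch for each of 10 emergencies.
log = [0x400] * 6 + [0x520] * 3 + [0x7F0]
print(coverage_of_top_branches(log, 1))
print(coverage_of_top_branches(log, 2))
```

If a small `k` already covers most of the log, a binary modifier would only need to rewrite a handful of code sequences.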

  18. Possible Compiler Optimizations • Our goal is to • Smooth out the current profile, or • Knock pulses off of the resonant frequency • Some existing options • Software pipelining, code motion, instruction padding [Figure: Executable, Binary Modifier (Apply Optimizations), Altered Executable; Microprocessor with Sensor+Actuator Ext’ns]

  19. Loop Unrolling & SW Pipelining • Problematic loop: current profile A A B B, repeated every iteration (Iteration=1, 2, 3) • Software pipelining smooths the profile • Unrolled loop: current profile A A A A B B B B • Loop unrolling disrupts the resonance pulse
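The effect of the two transformations can be seen on an idealized current trace. The Python sketch below models the high-current phase A as 1.0 and the low-current phase B as 0.0, estimates the dominant period of the trace by autocorrelation, and models software pipelining as overlapping each iteration's A phase with the previous iteration's B phase. The resonant period of 4 cycles, the two-level current model, and all function names are assumptions chosen to match the A A B B pattern, not values from the talk.

```python
def current_trace(high_cycles, low_cycles, iterations):
    """Idealized current profile: 1.0 during A cycles, 0.0 during B cycles."""
    return ([1.0] * high_cycles + [0.0] * low_cycles) * iterations


def dominant_period(trace):
    """Smallest lag (in cycles) with the strongest autocorrelation."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    best_lag, best_score = 0, float("-inf")
    for lag in range(1, n // 2 + 1):
        pairs = n - lag
        score = sum(centered[i] * centered[i + lag] for i in range(pairs)) / pairs
        if score > best_score:  # strict '>' keeps the smallest lag on ties
            best_lag, best_score = lag, score
    return best_lag


def overlap(trace, shift):
    """Average the trace with a copy shifted by `shift` cycles (wrapping)."""
    n = len(trace)
    return [(trace[i] + trace[(i + shift) % n]) / 2 for i in range(n)]


def swing(trace):
    """Peak-to-peak variation of the current trace."""
    return max(trace) - min(trace)


RESONANT_PERIOD = 4  # cycles; hypothetical

original = current_trace(2, 2, 16)  # A A B B ...
unrolled = current_trace(4, 4, 8)   # A A A A B B B B ...
pipelined = overlap(original, 2)    # A of one iteration over B of the previous

print(dominant_period(original) == RESONANT_PERIOD)  # True: pulses at resonance
print(dominant_period(unrolled))                     # 8: moved off resonance
print(swing(original), swing(pipelined))             # 1.0 0.0
```

In this toy model unrolling leaves the swing unchanged but doubles the period, while pipelining removes the swing altogether, which mirrors the two goals of knocking pulses off the resonant frequency and smoothing the profile.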

  20. Unrolling the Di/dt Stressmark [Figure labels: H, H1, H2, L, L1, L2]

  21. Two Case Studies • The di/dt problem – solved using hardware feedback and dynamic optimization • Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and hints from the binary modifier

  22. Heterogeneous Multicores • Process Heterogeneity • Process variation • Design Heterogeneity • Specialized processor cores [Figure panels: Today’s Multicore Designs; Future Multicore Designs]

  23. The Challenge: Scheduling • Many OSes assume identical core resources • Bad assumption, even today (hyperthreading) • The OS may not have enough information to make the best scheduling decisions • depending on the type of heterogeneity • Ideal process-to-core mappings change dynamically • Should this be a task for the OS alone?

  24. Our Approach • Claim: OSes could benefit from scheduling hints • Solution: Combine historical performance data with application phase information [Figure: Binary Modifier between SW Applications and OS/HW; arrow labels: phase information; performance counter info, OS load; scheduling hints]

  25. Initial Investigation • Hardware configuration • In-order core • Out-of-order core • Software configuration • Two applications executing (SPEC, all combinations) • Migration indicators • IPC (from HW) • Phase (from SW) • 1M cycle granularity [Figure: App1 and App2 mapped onto an out-of-order core and an in-order core]

  26. Results • Calculated ideal (omniscient) scheduling • Calculated random scheduling • Explored two migration heuristics • IPC-threshold scheduling • IPC-delta scheduling • Metric: Distance from ideal
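The slides name the two heuristics without defining them. One plausible reading, sketched in Python below, is that IPC-threshold scheduling requests a migration when an application's measured IPC falls below a fixed threshold, and IPC-delta scheduling requests one when IPC changes by at least a given amount between consecutive timeslices. Both definitions, the function names, and the sample trace are assumptions, and the backoff parameter that appears in the talk's results is not modeled.

```python
def threshold_triggers(ipc_trace, threshold):
    """Timeslices whose IPC is below `threshold` (assumed trigger condition)."""
    return [t for t, ipc in enumerate(ipc_trace) if ipc < threshold]


def delta_triggers(ipc_trace, delta):
    """Timeslices whose IPC differs from the previous one by at least `delta`."""
    return [
        t
        for t in range(1, len(ipc_trace))
        if abs(ipc_trace[t] - ipc_trace[t - 1]) >= delta
    ]


# Made-up IPC samples for one application, one sample per 1M-cycle timeslice.
trace = [1.5, 1.4, 0.6, 0.7, 1.6]
print(threshold_triggers(trace, threshold=1.0))  # [2, 3]
print(delta_triggers(trace, delta=0.5))          # [2, 4]
```

The threshold variant keeps firing while IPC stays low, whereas the delta variant fires only on the transitions, which is one reason the two could behave differently against the ideal schedule.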

  27. Random Scheduling [Figure: Distance from Ideal (0% to 60%) versus Probability of Swapping at Each 1M Timeslice (10% to 100%)]
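A baseline like this can be reproduced in a few lines. The Python sketch below swaps two applications between an out-of-order and an in-order core with a fixed probability at the start of each timeslice, then compares the combined IPC against an omniscient schedule. It assumes that "distance from ideal" means the relative shortfall in combined IPC and that migration is free; neither is stated in the slides, and the IPC values are invented.

```python
import random


def throughput(slice_ipc, swapped):
    """Combined IPC of both apps for one timeslice under a given mapping."""
    (a_ooo, a_ino), (b_ooo, b_ino) = slice_ipc
    return a_ino + b_ooo if swapped else a_ooo + b_ino


def ideal_total(slices):
    """Omniscient schedule: best mapping per timeslice, migration cost ignored."""
    return sum(max(throughput(s, False), throughput(s, True)) for s in slices)


def random_total(slices, swap_probability, rng):
    """Swap the apps with `swap_probability` at the start of each timeslice."""
    swapped, total = False, 0.0
    for s in slices:
        if rng.random() < swap_probability:
            swapped = not swapped
        total += throughput(s, swapped)
    return total


def distance_from_ideal(achieved, ideal):
    """Relative shortfall versus the ideal schedule (assumed metric)."""
    return (ideal - achieved) / ideal


# Made-up IPC: ((app1 on OoO, app1 in-order), (app2 on OoO, app2 in-order)).
slices = [((2.0, 1.0), (1.0, 0.5)), ((1.0, 0.5), (2.0, 1.0))]
ideal = ideal_total(slices)
for p in (0.0, 0.5, 1.0):
    achieved = random_total(slices, p, random.Random(0))
    print(p, distance_from_ideal(achieved, ideal))
```

Passing a seeded `random.Random` keeps each run repeatable, which makes it easier to compare swap probabilities on the same trace.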

  28. IPC Threshold Scheduling [Figure: Percent Distance from Ideal (0 to 30); series labeled 0, 0.5, 1.0, 1.5, 1.9, and 2.0 Backoff; axis values 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, and 2.0 IPC]

  29. IPC Delta Scheduling [Figure: Percent Distance from Ideal (0 to 30) versus Migration Trigger - IPC Change (0.05 to 2)]

  30. Ongoing Investigations • Varying heterogeneity • cache sizes • floating-point availability • Determining migration indicators • cache misses • Combinations • Other software feedback

  31. Symbiotic Optimization • Cross-layer approaches can be powerful • Minor changes enable communication channels • Hardware feedback and binary modification can help solve the di/dt problem • Hardware feedback and program phases can guide OS scheduling decisions • The Tortola design can also target power reduction, temperature reduction, reliability, etc. • http://www.tortolaproject.com/

  32. Course Summary • Now you know … • Process VMs versus system VMs • Research issues in building process VMs • How to use process VMs for a variety of other research projects • Why cross-layer approaches can be powerful and how to use process VMs to get there • Thanks and enjoy the rest of your summer!
