
Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors

Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors and Implications on Asymmetric Multiple Core Processors. Avshalom Elyada. Based primarily on the work of Greg Semeraro, David H. Albonesi et al., University of Rochester, NY. And also


Presentation Transcript


  1. Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors and Implications on Asymmetric Multiple Core Processors. Avshalom Elyada

  2. Based primarily on the work of Greg Semeraro, David H. Albonesi et al., University of Rochester, NY, and also Diana Marculescu et al., Carnegie Mellon University, PA. DVS, Avshalom Elyada, EE Faculty, Technion

  3. Outline • Multiple Clock Domains • Inter-domain communication and synchronization • Dynamic Frequency-Voltage Scaling • Scaling algorithms: Offline, Attack-Decay, Dynamic Profiling • Results comparison • DVS in Multiple Core Processors...?

  4. End of the Road for Globally-Synchronous • A global high-frequency clock does not scale well • Low clock reachability within a single clock cycle • Interconnect does not scale well • Clock-tree complexity, skew, power inefficiency

  5. Multiple Clock Domains, or Globally Asynchronous Locally Synchronous • Divide the core into separate clock domains • Synchronize communication between synchronous “islands” • Speed up the frequency of separate, smaller domains • Good inter-domain communication design, to minimize synchronization performance costs • Retain the traditional synchronous knowledge base

  6. MCD Processor (Alpha 21264-like Model, Rochester DVS research)

  7. Multi-Synchronous • Globally Synchronous: single clock • Multi-Synchronous: each domain has a separate clock, all at the same frequency • MCD

  8. Dynamic Frequency-Voltage Scaling • Running all domains at max frequency all the time usually wastes power • Only the critical domain needs to run at max frequency; the others can run slower • This saves power • Performance degradation should be minimal

  9. MCD and GALS • Globally Synchronous: single clock • Multi-Synchronous: each domain has a separate clock, all at the same frequency • Globally Async Locally Sync: async domains, different frequency per domain • MCD

  10. Integer Dominated

  11. Load-Store Dominated

  12. D(F)VS Continued • 20-40% energy-delay improvement • Voltage scales down with frequency, saving additional power: potential for X3 savings • Careful: wrong scaling is catastrophic for performance
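
One way to see why lowering voltage together with frequency saves additional power, assuming the standard CMOS dynamic-power model (the slide does not state a model, and whether its "X3" refers to this cubic relation is not confirmed). The symbols α (activity factor), C (switched capacitance) and s (scaling factor) are introduced here for illustration only:

```latex
P_{\text{dyn}} = \alpha C V^{2} f,
\qquad
V \to sV,\; f \to sf
\;\Longrightarrow\;
P_{\text{dyn}} \to s^{3} P_{\text{dyn}},
\quad 0 < s \le 1 .
```

Under the same assumptions, the energy for a fixed amount of work scales as s², since the work takes 1/s times longer at the lower frequency.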

  13. Scaling is Gradual and Occurs During Regular Operation • F may be decreased before V is decreased • V must be increased before F may increase • [Figure: F-V working points, frequency (MHz) against voltage; labeled values 729.6 and 727.3 MHz, 1.000 V and 1.172 V]
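
The ordering rule on this slide can be sketched as a small helper that returns the steps of a transition in a safe order. This is a minimal illustration, not the controller from the original work; the function name and the tuple format are assumptions.

```python
def transition_steps(f_now_mhz, v_now, f_target_mhz, v_target):
    """Return the ordered steps for moving between two F-V points.

    Each step is ("V", volts) or ("F", mhz). Hypothetical helper.
    """
    steps = []
    if f_target_mhz > f_now_mhz:
        # Speeding up: voltage must be raised before frequency may increase.
        if v_target != v_now:
            steps.append(("V", v_target))
        steps.append(("F", f_target_mhz))
    elif f_target_mhz < f_now_mhz:
        # Slowing down: frequency may drop first, then voltage follows.
        steps.append(("F", f_target_mhz))
        if v_target != v_now:
            steps.append(("V", v_target))
    return steps


print(transition_steps(500.0, 0.80, 750.0, 1.00))
print(transition_steps(750.0, 1.00, 500.0, 0.80))
```

In both directions the domain never runs at a frequency that its current voltage cannot support, which is what allows it to keep operating during the change.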

  14. MCD and GALS • Globally Synchronous: single clock • Multi-Synchronous: each domain has a separate clock, all at the same frequency • DVS (C-GALS): different frequency per domain, centrally controlled • GALS: async domains, different frequency per domain, autonomous • MCD

  15. Configuration Parameters (XScale-like) • 320 frequency-voltage working points • Frequency range 250-1000 MHz • Voltage range 0.65-1.20 V • Step between working points: 1.72 mV / 2.34 MHz • Change rate: 0.172 µs per step (55 µs end-to-end) • Time step: change every 50K cycles
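
The arithmetic behind these parameters can be checked with a short sketch. It assumes the 320 working points are evenly spaced and that voltage tracks frequency linearly across the range; the slide gives only the end points and step sizes, so the interpolation is an assumption. Dividing the ranges by 320 gives about 2.34 MHz and about 1.72 mV per step.

```python
N_STEPS = 320                    # number of F-V steps across the range
F_MIN_MHZ, F_MAX_MHZ = 250.0, 1000.0
V_MIN, V_MAX = 0.65, 1.20
STEP_TIME_US = 0.172             # time to move by one step


def working_point(i):
    """Return (frequency in MHz, voltage in V) for step index i.

    Assumes evenly spaced points and a linear F-V relation.
    """
    if not 0 <= i <= N_STEPS:
        raise ValueError("step index out of range")
    frac = i / N_STEPS
    return (F_MIN_MHZ + frac * (F_MAX_MHZ - F_MIN_MHZ),
            V_MIN + frac * (V_MAX - V_MIN))


f_step_mhz = (F_MAX_MHZ - F_MIN_MHZ) / N_STEPS    # about 2.34 MHz
v_step_mv = (V_MAX - V_MIN) / N_STEPS * 1000.0    # about 1.72 mV
end_to_end_us = N_STEPS * STEP_TIME_US            # about 55 microseconds

print(f"{f_step_mhz:.2f} MHz, {v_step_mv:.2f} mV, {end_to_end_us:.1f} us")
```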

  16. DVS per Domain: Considerations • Scaling algorithm: determine the F-V point of each domain at any time • Temporal granularity: how often to change the F-V point • Synchronization • Multi-Sync: all domains run at the same frequency; simple sync solutions exist (phase compensation) • GALS: different and changing frequencies; an asynchronous sync solution impedes performance • Or think of better solutions…

  17. Power-bounded DVS • Given a power envelope • Mobilize energy between domains to attain max performance

  18. Scaling Algorithm • Input: a serial program • Output: a parallel, temporal specification of which domains are slowed and by how much • Temporal granularity • The time step should be short enough to be dynamic • Too short a step is ineffective because of gradual scaling and the overhead of the change

  19. Scaling Algorithms • ‘Offline’ algorithm: full preparation on a simulator; insert F-V config instructions for the actual run • ‘Online’ (Attack-Decay): done entirely in hardware; rescale F-V according to internal queue levels • Dynamic Profiling: short profile run to find program phases; rescale F-V on phase transitions

  20. Offline Algorithm • Run the program on a simulator at max speed and trace primitive events • Primitive event = work performed in a single domain on behalf of a single instruction • Construct a directed acyclic graph of the functional and data dependencies between primitive events • Arcs represent time between events

  21. Offline Algorithm Contd. • Slack appears on non-critical paths • Stretch events that are not on the critical time path • [Figure labels: Stretch, Slack]
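
The slack idea can be sketched on a toy dependency graph. This is a generic critical-path computation under stated assumptions, not the tool used in the original work: events are graph nodes, each arc carries the time between two events, and the event names are hypothetical.

```python
from collections import defaultdict


def arc_slack(arcs):
    """arcs: list of (u, v, time). Return {(u, v): slack} for a DAG."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in arcs:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    remaining = {n: indeg[n] for n in nodes}
    ready = [n for n in nodes if remaining[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for v, _ in succ[n]:
            remaining[v] -= 1
            if remaining[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")

    # Earliest time of each event: longest path from any source.
    earliest = {n: 0 for n in nodes}
    for u in order:
        for v, w in succ[u]:
            earliest[v] = max(earliest[v], earliest[u] + w)

    # Latest time of each event that does not lengthen the program.
    total = max(earliest.values())
    latest = {n: total for n in nodes}
    for u in reversed(order):
        for v, w in succ[u]:
            latest[u] = min(latest[u], latest[v] - w)

    return {(u, v): latest[v] - earliest[u] - w for u, v, w in arcs}


slack = arc_slack([("A", "B", 4), ("A", "C", 2), ("C", "D", 1), ("B", "D", 3)])
print(slack)
```

Arcs with zero slack lie on the critical path; the others can be stretched by up to their slack, which is where the lower-frequency opportunities come from.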

  22. Offline Algorithm Contd. • Now we have the desired scale-down of individual primitive events • Need to scale down whole domains per time step • Construct event histograms per domain per time step: H(domain, time-step) • Assign a tolerable performance degradation of p% • Determine the actual scale-down per domain according to (H, p)
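
One plausible reading of "according to (H, p)" is sketched below; the slide does not give the selection rule, so this is an assumption and not necessarily the rule used in the original work. Here the histogram for one domain and one time step maps each desired frequency to the total event time (measured at full speed) that wanted that frequency, and the function picks the lowest frequency whose estimated extra delay stays within p% of the total.

```python
def choose_frequency(hist, f_max_mhz, p_percent):
    """Pick a domain frequency for one time step (hypothetical rule).

    hist: {desired_freq_mhz: event time at f_max_mhz, in ns}
    """
    base_ns = sum(hist.values())
    budget_ns = base_ns * p_percent / 100.0
    for f in sorted(hist):  # try the lowest candidate first
        # Events that wanted a higher frequency d run longer than their
        # slack allows; the overshoot is t * f_max * (1/f - 1/d).
        extra_ns = sum(t * f_max_mhz * (d - f) / (f * d)
                       for d, t in hist.items() if d > f)
        if extra_ns <= budget_ns:
            return f
    return f_max_mhz


hist = {500.0: 95.0, 1000.0: 5.0}
print(choose_frequency(hist, 1000.0, 10.0))  # generous budget
print(choose_frequency(hist, 1000.0, 2.0))   # tight budget
```

With a generous budget the few events that wanted full speed are allowed to run late, and the whole domain drops to the lower frequency; with a tight budget the domain stays at the higher one.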

  23. Online Algorithm • Each time step, sample input queue levels • Attack: if the queue level is up by ~2%, increase frequency by 6% • Decay: if the level is unchanged, decrease frequency by ~0.2% • Simple, hardware only, results ~70% of offline • Watch out for perturbations, local minima, over-activism and other feedback-related pitfalls • [Figure labels: queue full, freq]
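
The attack and decay rules can be sketched as a per-time-step update. This is a software illustration of a scheme the slide describes as hardware only, and it makes several assumptions: queue level is a fraction of capacity, "up by ~2%" is read as an absolute rise of 0.02, any sample that does not trigger an attack is treated as a decay step, and the frequency limits are taken from the XScale-like range. The function and parameter names are hypothetical.

```python
def attack_decay_step(freq_mhz, prev_level, level,
                      f_min_mhz=250.0, f_max_mhz=1000.0,
                      threshold=0.02, attack=0.06, decay=0.002):
    """Return the domain frequency for the next time step."""
    if level - prev_level > threshold:
        # Attack: the queue is filling up, so speed the domain up sharply.
        freq_mhz *= 1.0 + attack
    else:
        # Decay: no pressure observed, so drift slowly downwards.
        freq_mhz *= 1.0 - decay
    # Stay inside the supported frequency range.
    return min(f_max_mhz, max(f_min_mhz, freq_mhz))


freq = 500.0
levels = [0.40, 0.40, 0.50, 0.50, 0.50]
for prev, cur in zip(levels, levels[1:]):
    freq = attack_decay_step(freq, prev, cur)
    print(round(freq, 2))
```

The asymmetry between a large attack and a small decay is what lets the controller react quickly to rising load while giving frequency back only gradually.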


  25. Dynamic Profiling • Execution shows repeating program phases • A phase is often delimited by a subroutine call or loop • Dynamic Profiling: identify phases with a short profiling run; insert phase marks and F-V config into the program; when the program reaches a mark, reconfigure F-V
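
The mark-and-reconfigure step can be sketched as a table lookup. The phase names, domain names and frequencies below are invented for illustration; the slide does not specify how marks or configurations are encoded.

```python
# Hypothetical output of a profiling run: per-phase frequencies in MHz.
PHASE_TABLE = {
    "integer_loop": {"integer": 1000, "floating_point": 400, "load_store": 700},
    "memory_copy": {"integer": 600, "floating_point": 250, "load_store": 1000},
}


def on_phase_mark(mark, current_config, phase_table=PHASE_TABLE):
    """Return the per-domain config to apply when execution reaches a mark.

    Unknown marks leave the current configuration unchanged.
    """
    new_config = dict(current_config)
    new_config.update(phase_table.get(mark, {}))
    return new_config


config = {"integer": 1000, "floating_point": 1000, "load_store": 1000}
for mark in ["integer_loop", "unprofiled_call", "memory_copy"]:
    config = on_phase_mark(mark, config)
    print(mark, config)
```

Because the decision is made once per phase rather than once per time step, the run-time cost is a lookup at each mark.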

  26. Results Comparison

  27. Improved Dynamic Profiling • Each program carries its phase information as initial setup data • This assumes phase info is not processor-specific; alternatively, use processor-specific compilation • Or the processor itself performs the profile run • Hardware-based dynamic profiling, eliminating the need for a simulation pre-run

  28. DVS in ACCMP • Conceptual difference: in an MCD processor, sub-units run at different frequencies; in an MCP, threads run at different frequencies • ACCMP: cores of different sizes • ACCMP with DVS: cores also change frequency dynamically

  29. DVS: Degree of Freedom • ACCMP: allocate a thread to a static-strength processor (S, M or L performance) • ACCMP with DVS: scale the processor to performance needs; dynamically accommodate • [Figure: S, M and L processors, stretch-fit; labeled ranges 32-38, 36-44, 40-50]

  30. Dynamic Thread Allocation • Three sizes of DVS processors

  31. Dynamic Thread Allocation • Three sizes of DVS processors • Thread “wants” performance between the M and L processors

  32. Dynamic Thread Allocation • Three sizes of DVS processors • Thread “wants” performance between the M and L processors • Allocating to M only hurts performance, but is still better than static ACCMP

  33. Dynamic Thread Allocation • Three sizes of DVS processors • Thread “wants” performance between the M and L processors • Allocating to M only hurts performance, but is still better than static ACCMP • Allocating to L only wastes power

  34. Dynamic Thread Allocation • Three sizes of DVS processors • Thread “wants” performance between the M and L processors • Allocating to M only hurts performance, but is still better than static ACCMP • Allocating to L only wastes power • Or migrate between both, according to performance needs • What is best?

  35. Migration • k migrations between the M↔L processors • Phases φM, φL on each of the processors

  36. The End

  37. DVS in Multiple Core Processors • Asymmetric cores • Cores of asymmetric size have been suggested to better utilize die area when there are too few threads • But research shows symmetric cores perform better when there are enough threads • With DVS, a core’s performance varies dynamically according to frequency • Viewed in a performance/energy metric, this is a more flexible kind of asymmetry… • It also simplifies the software decision of which thread to assign to which asymmetric core

  38. Inter-Domain Communication • To minimize the synchronization penalty, divide the area into domains where a dual-port queue structure inherently exists • Dual-port FIFO synchronization solution • Otherwise, divide where inter-domain communication is minimal • [Figure: producer domain and consumer domain connected by a dual-port FIFO synchronizer; signals wclk, rclk, wen, ren, wdata, rdata, full, empty]

  39. Dual-Port FIFO • Producer and consumer domains can write and read independently as long as the FIFO is not full or empty • Full and Empty are the only signals that need syncing • Therefore the sync penalty is incurred only when the FIFO is full or empty
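
The point about full and empty can be illustrated with a small behavioral model. It is a software sketch only: it does not model clock-domain crossing or metastability, and the class and counter names are hypothetical. The stall counters stand in for the cases where a real design would pay the synchronization penalty.

```python
from collections import deque


class DualPortFifo:
    """Bounded FIFO with independent write and read ports (behavioral)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()
        self.write_stalls = 0  # writes attempted while full
        self.read_stalls = 0   # reads attempted while empty

    @property
    def full(self):
        return len(self.items) == self.depth

    @property
    def empty(self):
        return not self.items

    def write(self, data):
        """Producer side: succeeds unless the FIFO is full."""
        if self.full:
            self.write_stalls += 1
            return False
        self.items.append(data)
        return True

    def read(self):
        """Consumer side: returns None if the FIFO is empty."""
        if self.empty:
            self.read_stalls += 1
            return None
        return self.items.popleft()


fifo = DualPortFifo(depth=2)
print([fifo.write(x) for x in (1, 2, 3)])
print([fifo.read() for _ in range(3)])
print(fifo.write_stalls, fifo.read_stalls)
```

As long as the FIFO is partly filled, neither side touches the other's state, which is why a well-sized queue hides most of the crossing cost.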

  40. Syncing Periodic Domains • Synchronization solutions that exploit no knowledge of clock relations are sub-optimal • Examples: two-flop and even dual-port FIFO • DVS: clock relations are periodic, dynamic, and known • A predictive synchronizer can predict when a conflict will occur between different periodic clocks • But conflict prediction sometimes adapts slowly to frequency changes • DVS makes it possible to exploit the fact that domain frequencies are known • Propose a multi-frequency synchronizer that can detect conflicts by knowing at which frequencies its provider and consumer run
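
The idea of predicting conflicts from known frequencies can be sketched numerically. This is not the proposed synchronizer, only an illustration of the underlying arithmetic: given both clock periods and a relative phase, list the consumer clock edges that fall too close to a producer edge. The function name, the picosecond units and the fixed conflict window are assumptions.

```python
def conflicting_edges(t_prod_ps, t_cons_ps, window_ps, n_edges, phase_ps=0):
    """Return indices k of consumer edges that land near a producer edge.

    Producer edges occur at multiples of t_prod_ps; consumer edge k occurs
    at phase_ps + k * t_cons_ps. All times are integers in picoseconds.
    """
    conflicts = []
    for k in range(n_edges):
        t = phase_ps + k * t_cons_ps
        offset = t % t_prod_ps
        # Distance to the nearest producer edge, before or after.
        distance = min(offset, t_prod_ps - offset)
        if distance < window_ps:
            conflicts.append(k)
    return conflicts


# Producer at 1 GHz (1000 ps), consumer at 800 MHz (1250 ps).
print(conflicting_edges(1000, 1250, 100, 8))
```

Because both periods are known, the conflict pattern repeats and can be recomputed immediately after a frequency change instead of being relearned.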

  41. Gradual Scaling • The device works throughout the change • Necessary for two reasons • The online algorithm is based on steadily changing feedback control • ? Synchronizers can’t cope with a step change • Using Dynamic Profiling plus adequate synchronizers, instant scaling is possible
