## Managing MPSoCs beyond their Thermal Design Power *

Presentation Transcript

**Managing MPSoCs beyond their Thermal Design Power\***

Luca Benini, Università di Bologna & STMicroelectronics, Luca.benini@unibo.it
\*ACK: INTEL, FP7: THERMINATOR, Artist-Design, ERC-Multitherman

**Mobile SoCs are cool… right?**

Wrong!

> "So.. got myself a HTC (T-Mobile) HD2 […]. As I found out the problem is pretty common: damn thing restarts itself - thermally related - the old CPU overheat problem. By searching the net I found out that it's pretty common with some HTC models. HD2 has it, Desire has it, Nexus One has it, hell even some Xperia models have it.. about half of the devices powered by anything from the Snapdragon series could have it."
>
> June 2011 - http://forum.xda-developers.com/showthread.php?t=982454

Why?

> "ARM has unveiled its next-generation 'Eagle' (Cortex-A15) processor, pitching it at everything from smartphones to energy-efficient servers. The A15 will initially be produced at 32nm or 28nm, although ARM claims the roadmap stretches down to 20nm. It will deliver clock speeds of up to 2.5GHz."
>
> Aug. 2011 - http://www.pcpro.co.uk/news/360994/arm-preys-on-smartphones-and-servers-with-eagle

**A 2011 Mobile SoC**

NVIDIA Tegra II SoC:

- Tegra II: TSMC 40nm (LP/G), Cortex-A9 at 1GHz (G) / 330MHz (LP)
- GeForce ULV GPU (8 shaders), 2 separate Vdd rails
- 1MB L2 cache, 32-bit LPDDR2 (600MHz data rate)
- Tegra II 3D: A9 at 1.2GHz, GPU at 400MHz
- SoC + DRAM package-on-package (PoP)

**3D-SoCs are even worse**

The power wall meets the thermal wall.

**Outline**

- Introduction
- Scalable control
- Scalable model learning
- Experimental environment
- Challenges ahead

**DRM - General Architecture**

- System (chip scale)
  - Sensors: performance counters (PMU), core temperatures
  - Actuators (knobs): ACPI states - P-states (DVFS), C-states (power gating) - and task allocation
- Controller
  - Reactive: threshold/heuristic rules or control theory
  - Proactive: predictors
- Layered approach: a stack of controllers
  - Energy controller: minimize energy
  - Thermal controller: bound the CPU temperature

(Figure: per-core energy and thermal controllers sit between the OS and the hardware, reading TCPU, #L1MISS, #BUSACCESS, CYCLEACTIVE, etc. and driving DVFS and power gating.)

**Thermal & Power Modeling**

- Power model: P = g(task, f), nonlinear
- Thermal model: linear dynamic state-space model

(Figure: per-task power Pj feeds the thermal model producing Tj.)

**Thermal transient**

Thermal locality (a direct implication of Fourier's law):

- Continuous model: the thermal neighborhood equals the physical neighborhood
- Discrete model: the thermal neighborhood depends on the sample time

HotSpot simulation of an "Intel SCC-like" 48-core chip (each core: area = 11.82mm², Pmax = 2.6W), powering on only core (5,3): the set of cores heated by more than +0.1°C widens as the sample time Ts grows, from the immediate neighbors at Ts < 2ms toward most of the die for Ts > 0.5s.

Thermal transient and model order:

- Different materials have different time constants [1]: silicon die, heat spreader, heat sink
- A second-order model suffices at the OS time scale

[1] W. Huang et al., "Differentiating the roles of IR measurement and simulation for power and temperature-aware design", 2009.

**Thermal Controller**

- Threshold-based controller: T > Tmax → low frequency, T < Tmin → high frequency. Cannot prevent overshoot and causes thermal cycling.
- Classical feedback (PID) controllers: better than the threshold-based approach, but still cannot prevent overshoot.
- Model Predictive Control (MPC): internal prediction avoids overshoot and an optimizer maximizes performance. A centralized, all-at-once MIMO controller is aware of the thermal influence between neighboring cores, but its complexity is prohibitive. [Intel, ISSCC 2007]

(Figure: the MPC loop - a thermal model predicts future outputs from past inputs and outputs; the optimizer minimizes a cost function under constraints to choose future inputs tracking the target frequency.)

**MPC Scalability**

MPC complexity:

- Implicit (on-line) formulation: high computational burden
- Explicit (off-line) formulation: high memory occupation

Complexity grows superlinearly with the number of cores! (Plot: number of explicit regions vs. number of cores, from 36 regions at 4 cores to 1,296, running out of memory by 16 cores.)

**Addressing Scalability**

On a short time window, power has a local thermal effect! This enables a fully distributed design: one controller per core, each using its local power and thermal model plus its neighbors' temperatures. Complexity scales linearly with the number of cores. (Plot: the per-core explicit controllers stay small where the centralized one runs out of memory.)

**Distributed Control**

Distributed and hierarchical controllers:

- Energy Controller (EC): outputs frequency fEC; minimizes power (CPI-based) with performance degradation < 5%
- Temperature Controller (TC): a distributed MPC; inputs fEC, TCORE and TNEIGHBOURS; outputs the core frequency fTC

**High-Level Architecture**

A single energy controller feeds one thermal controller per core; the thermal controller of core i receives fi,EC, the core's CPI, and Ti together with the neighboring temperatures, and applies fi,TC to the plant.

**Distributed Thermal Controller**

Per-core MPC in implicit formulation (on-line QP optimization):

- Nonlinear frequency-to-power map g(·) and its inverse g⁻¹(·)
- Linear power-to-temperature model with 2 states per core
- A classic Luenberger state observer reconstructs the thermal state from the core temperature and TENV

**MPC trade-off**

Trace-driven simulation in Matlab as the gold model, using PARSEC traces obtained on real hardware, to compare:

- Power model: nonlinear vs. linear
- Thermal model: one state vs. two
- Centralized vs. distributed control

**Outline**

- Introduction
- Scalable control
- Scalable model learning
- Experimental environment
- Challenges ahead

**Model Identification**

MPC needs a thermal model that is accurate yet of low complexity. The model must be known a priori, depends on the user configuration, and changes with system ageing. The answer is "in-field" self-calibration: force test workloads onto the cores, measure core temperatures, and run system identification to obtain an identified state-space thermal model.

**Parametric optimization**

Least-squares system identification: apply a test pattern, record the system response, and choose the model parameters (A, B) that minimize the error function e = Tmodel - Treal.

**Black-box identification**

(Plot: measured Tcore2 vs. the thermal model, roughly 40.5-42.5°C between seconds 20.5 and 23.) Identification based on pure LS fitting has a problem: the resulting model can be unstable.

**Gray-box identification**

The LS model must be constrained by physical properties to avoid over-fitting: Tcore cannot differ from Tamb when P = 0.

**Physical Constraints**

Linear constraints plus a constraint on the initial condition turn the fit into a constrained least-squares problem.

**Model Learning Scalability**

Complexity issue: the state of the art is centralized MIMO, least-squares based, and relies on matrix inversion, which is cubic in the number of cores. The distributed approach lets each core identify its own local thermal model, so complexity scales linearly with the number of cores. (Plot: identification time vs. number of cores; the centralized approach exceeds an hour beyond a few cores while the distributed one stays nearly flat.)

**Distributed Model Learning**

Distributed MISO identification with an ARX model: each core i predicts its temperature from an autoregressive term Ti(k-s), its power input Pi(k-s), its neighbors' temperatures TneighN,i(k-s), TneighE,i(k-s), TneighW,i(k-s), and an error term (disturbances, measurement errors); the ARX model then maps to a state-space model.

Distributed identification procedure:

- Objective: find the parameters ai and bi,j
- Data collection: inputs are PRBS signals applied to all cores (a persistently exciting input sequence); outputs are the temperatures of all cores (Ti, with i the core index)
- Parameter computation: Tp(ai, bi,j) = T(k+1|k) computed with the ARX equation, To the measured output temperature, and a least-squares algorithm fitting the parameters
- Results: (1) mean absolute error between the original and identified models; (2) temperature response of core 1

**Outline**

- Introduction
- Scalable control
- Scalable model learning
- Experimental environment
- Challenges ahead

**Virtual Platform**

Control-strategies development cycle:

- Controller design in the Mathworks Matlab/Simulink framework, with the system represented by a simplified model obtained from physical considerations and identification techniques
- Simulation tests and design adjustments done in Simulink
- Evaluation of the tuned controller against an accurate model of the plant in the virtual platform, with performance analysis by simulating the overall system

Matlab interface:

- A new module named Controller in RUBY
- Initialization starts the Matlab engine as a concurrent process
- Every N cycles, on wake-up: send the current performance-monitor output to the Simulink model, execute one step of the Simulink controller, and propagate its decision to the DVFS module

(Figure: Virtutech Simics-based virtual platform - core power and temperature models fed by per-core statistics such as TCPU, #L1MISS, #BUSACCESS and CYCLEACTIVE, coupled to the Matlab controller through the RUBY interface.)

**Results**

- Energy Controller (EC): energy minimization with performance loss < 5%
- Temperature Controller (TC): strong complexity reduction (2 explicit regions per controller), performs as the centralized controller, thermal capping within 0.3°C (< 3%)

**Working on Real Chips (Intel)**

- Identification toolchain: Matlab System Identification (N4SID, PEM, LS with Levenberg-Marquardt) on top of C/FORTRAN libraries (SLICOT, MINPACK)
- SUN FIRE X4270 server: Intel Nehalem, 8 cores/16 threads, 2.9GHz, 95W TDP; IPMI telemetry for temperature, frequency and workload (..idle,idle,run,run,run,..)
- Intel Single-Chip Cloud Computer (45nm): 567.1mm², 48 cores at 1GHz, 2GHz NoC, 25-125W, 27 frequency and 8 voltage islands
- Flow: a workloader/pattern generator drives PRBS patterns (..0,0,1,1,1,..) on the cores; the collected data (DATA.csv, Ts = 1/10ms) is preprocessed and fed to LS identification to produce the model

**SCC thermal calibration**

Intel Single-Chip Cloud Computer thermal sensors and model calibration:

- The thermal sensors (one per router and per core) are ring oscillators: #cycles = a + b*Temperature, and a, b must be found for each core
- The spatial variation in raw sensor values exceeds 50%, while the actual hot-cold spread is below 20%: calibration is needed!
- System-level sensor characterization at constant Vdd and steady state: since T = k*PCORE + TAMB, we have #cycles = a + b*k*PCORE + b*TAMB, and least squares on (#cycles, PCORE) samples yields {a, b}

**Outline**

- Introduction
- Scalable control
- Scalable model learning
- Experimental environment
- Challenges ahead

**The 1,000 Cores Chip**

- STM-CEA Platform 2012 project: a die with 4 16-core tiles, with L1 & L2, in a few tens of mm² (28nm)
- An SCC-sized die fits 20 of these dies: 1,280 cores and thousands of Vdd and frequency domains
- 3D stacking is currently the only technology that can provide sufficient L3 bandwidth
- Vertical thermal dissipation! Heterogeneous requirements (DRAM ≠ LOGIC) and major static and dynamic variability

**Power Management Challenges**

- Truly scalable algorithms, O(N log N), with hardware support (e.g. a DPM NoC)
- Cross-layer algorithms: real-time intra- and inter-layer communication, abstraction and filtering, multi-scale operation
- The threat of nonlinearity: hybrid control complexity (MILP is NP-hard), lack of robustness (ill-conditioning) and of stability proofs
- SoCs as complex systems (societies/markets): DPM as political sociology/finance?

**Management Loop: Holistic View**

- HW introspection: monitors for performance counters, reliability alarms, temperature and power
- HW control: knobs (Vdd, Vb, f, on/off)
- SW introspection: system tracing and OS-level information (workload, CPU utilization, queue status)
- SW control: scheduling and allocation through the OS scheduler and the middleware job manager, closed by the resource-management policy over the multicore platform (processors, private memories, interconnect)
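The second-order thermal behavior discussed in the "Thermal transient" slides (separate time constants for the silicon die and the heat spreader) can be sketched as a tiny discrete-time simulation. All coefficients below - time constants, thermal resistances, ambient temperature - are illustrative assumptions, not values from the talk; only the 2.6W core power matches the SCC-like figure.

```python
# Minimal sketch of a two-state (die + heat spreader) linear thermal model.
# Hypothetical parameters; the point is the two widely separated poles.

def simulate_thermal(power, steps, ts=0.001):
    tau_die, tau_spread = 0.01, 1.0   # seconds: fast die, slow spreader (illustrative)
    a1, a2 = ts / tau_die, ts / tau_spread
    r_die, r_spread = 2.0, 5.0        # K/W thermal resistances (illustrative)
    t_amb = 40.0
    t_die = t_spread = t_amb
    trace = []
    for _ in range(steps):
        # Die heats from power and couples to the spreader;
        # the spreader leaks to ambient.
        t_die += a1 * (power * r_die + t_spread - t_die)
        t_spread += a2 * (power * r_spread + t_amb - t_spread)
        trace.append(t_die)
    return trace

trace = simulate_thermal(power=2.6, steps=20000)  # Pmax of one SCC-like core
print(round(trace[-1], 1))  # → 58.2 (steady-state die temperature, °C)
```

At the OS time scale only the slow spreader pole matters, which is why the slides argue a second-order model is enough.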
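The threshold-based controller criticized in the "Thermal Controller" slide reduces to a few lines. Thresholds and frequency values here are illustrative, not from the talk; the trace makes the slide's point visible: the frequency only drops *after* the cap has been crossed (overshoot), and repeated band crossings produce thermal cycling.

```python
# Reactive threshold DVFS with a hysteresis band [t_min, t_max] (illustrative).

def threshold_dvfs(temps, t_max=80.0, t_min=75.0, f_high=1.2e9, f_low=0.6e9):
    freq = f_high
    out = []
    for t in temps:
        if t > t_max:
            freq = f_low   # throttle only once the cap is already exceeded
        elif t < t_min:
            freq = f_high  # restore speed once well below the cap
        out.append(freq)
    return out

# 81°C slips past the 80°C cap before the controller reacts.
freqs = threshold_dvfs([70.0, 78.0, 81.0, 79.0, 74.0, 72.0])
print(freqs)
```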
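By contrast, the MPC idea is to pick the frequency whose *predicted* next temperature stays under the cap, so the cap is respected before it is violated. The following is a one-step, single-core caricature with an illustrative linearized power model (all coefficients are assumptions); the controller in the slides solves a QP over a horizon with neighbor temperatures as additional inputs.

```python
# One-step predictive frequency capping (illustrative coefficients).
# Prediction model: T[k+1] = a*T[k] + b*P, with P ~ k*f (linearized).

def mpc_step(t_now, t_max=80.0, f_min=0.6, f_max=1.2, a=0.95, b=8.0, k=0.5):
    # Largest f (GHz) with a*t_now + b*k*f <= t_max, clipped to the DVFS range.
    f_star = (t_max - a * t_now) / (b * k)
    return max(f_min, min(f_max, f_star))

print(round(mpc_step(80.0), 3))  # → 1.0 (backs off before any overshoot)
```

Because the constraint enters the optimization, overshoot is prevented rather than corrected, which is the advantage over both threshold and PID schemes.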
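The gray-box constraint from the identification slides ("Tcore cannot differ from Tamb when P = 0") can be shown in its simplest static form: fix the intercept of the fit to Tamb and estimate only the gain, which is what keeps a pure black-box LS fit from wandering to an unphysical (unstable) model. Data and the resulting 5 K/W gain below are synthetic.

```python
# Constrained least squares: minimize sum((T - (t_amb + b*P))^2) over b only,
# so the model is pinned to T = Tamb at P = 0 (the physical constraint).

def fit_thermal_gain(powers, temps, t_amb):
    num = sum(p * (t - t_amb) for p, t in zip(powers, temps))
    den = sum(p * p for p in powers)
    return num / den  # K/W steady-state thermal resistance

powers = [0.5, 1.0, 1.5, 2.0]
temps = [42.5, 45.0, 47.5, 50.0]  # synthetic: Tamb = 40, 5 K/W, no noise
print(fit_thermal_gain(powers, temps, t_amb=40.0))  # → 5.0
```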
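The per-core ARX identification of the "Distributed model learning" slides can be sketched for one core, dropping the neighbor-temperature terms for brevity: fit T(k+1) = a·T(k) + b·P(k) by least squares via the 2x2 normal equations. The trace below is synthetic, generated from a known (a, b) so the fit can be checked; a PRBS-like power input keeps the regressors persistently exciting, as the slides require.

```python
# Single-core ARX(1) least-squares fit (normal equations, closed form).

def fit_arx(temps, powers):
    s_tt = s_tp = s_pp = s_ty = s_py = 0.0
    for k in range(len(powers)):
        t, p, y = temps[k], powers[k], temps[k + 1]
        s_tt += t * t; s_tp += t * p; s_pp += p * p
        s_ty += t * y; s_py += p * y
    det = s_tt * s_pp - s_tp * s_tp
    return ((s_pp * s_ty - s_tp * s_py) / det,   # autoregressive term a
            (s_tt * s_py - s_tp * s_ty) / det)   # power gain b

# Synthetic trace from a known model (a = 0.9, b = 2.0):
a_true, b_true = 0.9, 2.0
powers = [1.0, 2.0, 0.5, 1.5, 2.5, 1.0, 0.0, 2.0]
temps = [10.0]
for p in powers:
    temps.append(a_true * temps[-1] + b_true * p)

a, b = fit_arx(temps, powers)
print(round(a, 3), round(b, 3))  # → 0.9 2.0
```

Because each core solves its own small problem, the cost per core is constant and total identification work scales linearly with the core count, in contrast to the cubic-cost centralized MIMO fit.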
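The SCC sensor-calibration step ("#cycles = a + b*Temperature, find a, b for each core") is a plain two-parameter least-squares fit once temperatures are recovered from T = k*PCORE + TAMB. The sensor law and samples below are synthetic; real SCC ring-oscillator counts and coefficients differ.

```python
# Two-parameter LS fit of ring-oscillator counts against temperature:
# counts = a + b * T (b is negative: oscillators slow down as T rises).

def calibrate(counts, temps):
    n = len(temps)
    mt = sum(temps) / n
    mc = sum(counts) / n
    b = sum((t - mt) * (c - mc) for t, c in zip(temps, counts)) \
        / sum((t - mt) ** 2 for t in temps)
    a = mc - b * mt
    return a, b

# Synthetic data from a known sensor law a = 1000, b = -4:
temps = [40.0, 50.0, 60.0, 70.0]
counts = [1000 - 4 * t for t in temps]
a, b = calibrate(counts, temps)
print(a, b)  # → 1000.0 -4.0
```

Running this per core is what removes the >50% raw spatial variation the slide reports, leaving the true (<20%) hot-cold spread.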
