
When the Autonomic Cloud Meets the Smart Grid IBM Smarter Planet, 11/20/09


Presentation Transcript


  1. When the Autonomic Cloud Meets the Smart Grid IBM Smarter Planet, 11/20/09 Jeff Chase Duke University

  2. Server Energy • Many servers in large aggregations/farms. • Data centers (DC) • “Warehouse Computers” (WHC) • Modular shipping containers • These facilities burn a lot of energy. • 1% to 4%, depending on the study • …of electricity/carbon • …in the US/world, now/soon • That energy costs a lot of money. EPA 2007 report: data centers at 1.5% of US electricity: 60 TWh for $4.5B. Expected to double by 2011.
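A quick arithmetic check of the EPA figure quoted on this slide; the $/kWh rate below is implied by the slide's numbers, not stated in the talk.

```python
# Sanity check: 60 TWh at roughly $0.075/kWh comes to about $4.5B.
twh = 60                          # TWh of data-center electricity (EPA 2007)
price_per_kwh = 0.075             # implied average rate in $/kWh (assumed)
cost = twh * 1e9 * price_per_kwh  # 60 TWh = 60e9 kWh
print(f"${cost / 1e9:.1f}B")      # $4.5B
```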

  3. How much money? • TCO: energy cost exceeds server cost. [Uptime Institute, for 2010] • Worldwide server market: $60B • Worldwide server power/cooling: $35B [IDC 2008]

  4. Running to stand still • Their energy demand will grow. • Performance/watt doubles every 2 years. • But capacity demand grows faster. • Their share of electricity/carbon will grow. • Many “low hanging fruit” efficiencies elsewhere • The cost will grow too. • Peak of “easy” oil → more substitution of electricity for transport needs • Even the 450 ppm scenario requires massive reductions in climate-disrupting emissions.

  5. IEA: no reason to fear “peak oil” Something will turn up. It always has.

  6. How to reduce IT energy/cost? • Efficiency first • Reduces OpEx at peak demand level • Reduces CapEx for plant/power/cooling • Static optimization: “simply a matter of engineering” • DC Metric: Power Use Efficiency (PUE) • PUE = total power / power to servers • 1/PUE = Data Center Efficiency (DCE) • High-end: 75% of watts make it to the servers • The rest is cooling, power distribution etc. • Most data centers today are much worse!
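The PUE/DCE relationship on this slide is a simple ratio; here is an illustrative worked example in which the wattages are made up to match the slide's "75% of watts make it to the servers" case.

```python
# Illustrative PUE/DCE arithmetic (wattages are hypothetical).
def pue(total_facility_watts, it_watts):
    """Power Use Efficiency: total facility power / power delivered to the servers."""
    return total_facility_watts / it_watts

def dce(total_facility_watts, it_watts):
    """Data Center Efficiency: 1/PUE, the fraction of power reaching the servers."""
    return it_watts / total_facility_watts

print(pue(1000.0, 750.0))  # ~1.33 for a high-end facility
print(dce(1000.0, 750.0))  # 0.75
```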

  7. Key Distinctions • Energy efficiency • Buy lights that generate more lumens per watt when they are on. • Energy proportionality • Turn lights off when you leave the room. • Burn power only when you need lumens. • Conservation, aka reduced service • Shiver in the dark • Short/cold showers and warm beer

  8. Listen to this man • Goal: “uncompromised” design • Design for radical efficiency • But he says: make the software more efficient! Sure, but… • No scripting languages? • No XML? • Are high-productivity software environments “bad design”? • C: efficiency or conservation? Dr. Amory B. Lovins, rmi.org

  9. Focus: Energy Proportionality Servers are rarely fully utilized. Internet services have periodic and variable load. Source: Akamai [Qureshi09] (Bruce Maggs) Source: Google [Barroso/Holzle08]

  10. Focus: Energy Proportionality • Dynamic range • 1 – (idle/peak) • Higher is better • Room to improve! • CPU: 70% + • Server: 50% • Cooling: LOW • Some progress… Source: Google [Barroso/Holzle08]
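The dynamic-range figures on this slide follow from one formula; the wattages below are hypothetical, chosen only to reproduce the slide's rough numbers.

```python
# Dynamic range = 1 - (idle power / peak power); higher is better.
def dynamic_range(idle_watts, peak_watts):
    return 1.0 - idle_watts / peak_watts

print(dynamic_range(30.0, 100.0))   # 0.7  -- "CPU: 70%+"
print(dynamic_range(150.0, 300.0))  # 0.5  -- "Server: 50%"
```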

  11. Focus: Energy Proportionality • Surplus capacity creates an opportunity for dynamic optimization. • Shift load to underutilized resources… • …in some other place • …at some other time. • Key idea: dynamic optimization at the scale of an aggregate can improve proportionality and reduce energy cost. • Reduces OpEx at non-peak demand level • But: does not reduce CapEx for plant/power/cooling

  12. Managing Energy and Server Resources in Hosting Centers Jeff Chase, Darrell Anderson, Ron Doyle, Prachi Thakar, Amin Vahdat Duke University

  13. Managing Energy and Server Resources • Key idea: a hosting center OS maintains the balance of requests and responses, energy inputs, and thermal outputs. • US in 2003: 22 TWh ($1B - $2B+) • Adaptively provision server resources to match request load. • Provision server resources for energy efficiency. • Degrade service on power/cooling failures. [Diagram: requests in, responses out; energy in, waste heat out; power/cooling “browndown”; dynamic thermal management [Brooks].]

  14. Adaptive Provisioning • Efficient resource usage • Load multiplexing • Surge protection • Online capacity planning • Dynamic resource recruitment • Balance service quality with cost • Service Level Agreements (SLAs)

  15. Energy vs. Service Quality • Active set = {A, B, C, D}: λi < λtarget, low latency. • Active set = {A, B}: λi = λtarget, meets quality goals, saves energy. [Diagram: request load spread across servers A–D vs. concentrated on A and B.]

  16. Energy-Conscious Provisioning • Light load: concentrate traffic on a minimal set of servers. • Step down surplus servers to a low-power state. • APM and ACPI • Activate surplus servers on demand. • Wake-On-LAN • Browndown: can provision for a specified energy target.
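A minimal sketch of the provisioning loop this slide describes, not from the talk: it assumes a fixed per-server capacity within SLA, and the two stub helpers stand in for the ACPI low-power and Wake-On-LAN mechanisms named above.

```python
# Energy-conscious provisioning sketch: keep just enough servers active.
import math

PER_SERVER_CAPACITY = 400.0      # requests/sec per server within SLA (assumed)

def acpi_sleep(server):          # step a surplus server down to a low-power state
    print(f"sleeping {server}")

def wake_on_lan(server):         # activate a surplus server on demand
    print(f"waking {server}")

def provision(active, standby, offered_load):
    """Shrink or grow the active set to the minimum that covers the offered load."""
    needed = max(1, math.ceil(offered_load / PER_SERVER_CAPACITY))
    while len(active) < needed and standby:
        server = standby.pop()
        wake_on_lan(server)
        active.append(server)
    while len(active) > needed:
        server = active.pop()
        acpi_sleep(server)
        standby.append(server)

active, standby = ["s1", "s2", "s3", "s4"], []
provision(active, standby, 900.0)   # light load: concentrate traffic on ceil(900/400) = 3 servers
```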

  17. Example 2: “Balance of Power” • Continuous thermal sensors in a data center • Infer “thermal topology” • Place workload to optimize cooling • Dynamic thermal management [Thermal map: CRAC units, racks, temperature scale (°C), hot spots.]

  18. The Importance of Being Idle • At 100% utilization: no choices. • At 0% utilization: no choices. • Only the midrange has a useful spread between good choices and bad choices.

  19. Temperature-Aware Workload Placement • Less heat recirculation → lower cooling power cost • “Hot spots” can be OK and beneficial, provided heat exits • Avoid servers whose exhaust recirculates [Moore05, USENIX]
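A toy version of the placement idea above, not from the talk: it assumes a per-server "recirculation" score has already been inferred from the thermal topology of slide 17, and simply fills the servers whose exhaust recirculates least.

```python
# Temperature-aware placement sketch: prefer servers with low heat recirculation.
def place_workload(jobs, servers, recirculation):
    """Greedily assign jobs to the servers whose exhaust recirculates least."""
    order = sorted(servers, key=lambda s: recirculation[s])
    return {job: server for job, server in zip(jobs, order)}

servers = ["rack1-top", "rack1-mid", "rack2-top"]                      # hypothetical
recirculation = {"rack1-top": 0.30, "rack1-mid": 0.05, "rack2-top": 0.12}
print(place_workload(["web", "batch"], servers, recirculation))
# {'web': 'rack1-mid', 'batch': 'rack2-top'} -- the high-recirculation server stays idle
```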

  20. Demand Side Management for the Smart Grid • Electricity supply and demand are also highly variable. • “Smart grid” matches supply and demand. • If we have: • variable electricity pricing • surplus server capacity • energy proportionality • …can we place workload to minimize cost? • …without violating SLA?

  21. Electricity prices vary across markets: ripe for arbitrage! • Peak: $250 per MWh • Trough: $25 per MWh • Demand is predictable. Source: http://www.ferc.gov/market-oversight/mkt-electric/pjm/2008/12-2008-elec-pjm-dly.pdf
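A back-of-the-envelope arbitrage calculation using the prices on this slide; the size of the deferrable workload is a hypothetical assumption.

```python
# Every deferrable MWh moved from the peak to the trough saves $225.
peak_price, trough_price = 250.0, 25.0   # $/MWh, from the slide
shiftable_mwh = 1.0 * 8                  # assumed: a 1 MW deferrable load run for 8 off-peak hours
print(shiftable_mwh * (peak_price - trough_price))   # $1800 saved per day
```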

  22. Demand peaks → cheap-and-dirty power. [Chart: daily electricity demand curve.]

  23. Shape the Demand Curve? • Statistical multiplexing is not enough • Not for networks • Not for smart clouds or smart (electrical) grids • Wide variance in aggregate demand • Congestion → higher price, higher carbon footprint • Demand-side management offers a smoother ride. Source: James Hamilton, http://perspectives.mvdirona.com/CommentView,guid,e7848cf7-5430-49bf-a3d0-d699bec2a055.aspx

  24. cutting the electric bill for internet-scale systems Asfandyar Qureshi (MIT), Rick Weber (Akamai), Hari Balakrishnan (MIT), John Guttag (MIT), Bruce Maggs (Duke/CMU/Akamai). Image: Éole @ flickr

  25. Context: massive systems • Google (estimated map): tens of locations in the US, >0.5M servers. • Others: thousands of servers across multiple locations, e.g. Amazon, Yahoo!, Microsoft, Akamai, Bank of America (≈50 locations), Reuters. [Map legend: major data center vs. others.]

  26. Request routing framework • Inputs: requests, hourly electricity prices, capacity constraints, latency goals, network topology, bandwidth price model. • Policies: performance-aware routing vs. best-price performance-aware routing. • Output: a map from requests to locations.
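A minimal sketch in the spirit of this slide, not the paper's actual algorithm: among locations that meet the latency goal and have spare capacity, send each request batch to the cheapest electricity. All of the data structures and numbers here are hypothetical.

```python
# Best-price, performance-aware routing sketch.
def route(batches, locations):
    """Map (batch_id, latency_goal) pairs to the cheapest feasible location."""
    assignment = {}
    for batch_id, latency_goal in batches:
        feasible = [loc for loc in locations
                    if loc["latency_ms"] <= latency_goal and loc["capacity"] > 0]
        if not feasible:
            continue                               # fall back to default routing (not shown)
        best = min(feasible, key=lambda loc: loc["price"])
        best["capacity"] -= 1
        assignment[batch_id] = best["name"]
    return assignment

locations = [
    {"name": "east", "price": 250.0, "latency_ms": 20, "capacity": 100},
    {"name": "west", "price": 25.0,  "latency_ms": 80, "capacity": 50},
]
print(route([("b1", 100), ("b2", 30)], locations))  # {'b1': 'west', 'b2': 'east'}
```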

  27. Importance of elasticity [Chart: projected electricity-cost savings vs. energy-model parameters. Savings grow with increasing energy proportionality (idle power from 65% down to 0%) and lower PUE (2.0 down to 1.1): roughly $1M+ (3%) to $3M+ (8%) across the scenarios shown (off-the-rack servers, Google circa 2008, 2011 PUE & active server scaling).]

  28. The Elasticity of Power • Clouds: “boundless infrastructure on demand” • Elasticity: grow/shrink resource slices as required 1. Can I have more resources? 2. Here you go: N more servers

  29. Demand Side Management • Reflection in elastic cloud applications: • Adapt behavior based on resource availability • Opportunistically exploit surplus resources • Defer/avoid work during congestion 2. What useful work should I do? 1. Energy is cheap right now. 3. I will use N more servers.

  30. Reflective Control • Reflection in elastic cloud applications: • Adapt behavior based on resource availability • Opportunistically exploit surplus resources • Defer/avoid work during congestion • Requires deeper integrated control 2. What useful work should I do? 1. Energy is getting more expensive now. 3. I will use fewer servers
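A sketch of the reflective-control loop on slides 29 and 30, not from the talk: it assumes the application exposes a server-count knob and can read a price signal; the thresholds and the doubling/halving rule are purely illustrative.

```python
# Reflective control sketch: adapt the active server count to the energy price.
def reflect(current_servers, price_per_mwh,
            min_servers=2, max_servers=100, cheap=50.0, expensive=150.0):
    """Grow opportunistically when energy is cheap; shed deferrable work when it is not."""
    if price_per_mwh <= cheap:
        return min(max_servers, current_servers * 2)    # exploit surplus/cheap energy
    if price_per_mwh >= expensive:
        return max(min_servers, current_servers // 2)   # defer work during congestion
    return current_servers

print(reflect(8, 25.0))    # 16: "energy is cheap right now" -> do speculative work
print(reflect(8, 200.0))   # 4:  "energy is getting more expensive" -> use fewer servers
```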

  31. DSM/Reflection: Challenges • Multiple objectives: deadline, budget, accuracy • How much parallelism for opportunistic/speculative use? • Does it generalize? To what extent can we “factor out” reflective policies from applications? • What does it require from the “cool cloud”? [Diagram: better / faster / cheaper trade-off.]

  32. Workbench-assisted benchmarking • Goal: a response surface map: peak rate as a function of parallelism and data dependency at each point in the surface • Partition the surface arbitrarily: embarrassingly parallel • What experiments to run? • Need a notion of experiment utility: u(e) • Highly selective sampling
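A sketch of utility-driven experiment selection for the response-surface mapping above; the utility-per-cost rule and the candidate list are assumed stand-ins for the talk's u(e) and its "highly selective sampling".

```python
# Selective sampling sketch: spend a benchmarking budget on high-utility experiments.
def select_experiments(candidates, budget):
    """Greedily pick the highest utility-per-cost experiments that fit the budget."""
    chosen, spent = [], 0.0
    ranked = sorted(candidates, key=lambda e: e["utility"] / e["cost"], reverse=True)
    for exp in ranked:
        if spent + exp["cost"] <= budget:
            chosen.append(exp["name"])
            spent += exp["cost"]
    return chosen

candidates = [   # hypothetical experiments: expected utility vs. cost in node-hours
    {"name": "low-parallelism",  "utility": 0.9, "cost": 2.0},
    {"name": "mid-parallelism",  "utility": 0.6, "cost": 1.0},
    {"name": "high-parallelism", "utility": 0.8, "cost": 4.0},
]
print(select_experiments(candidates, budget=3.0))  # ['mid-parallelism', 'low-parallelism']
```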

  33. Better? Cheaper? Faster? • How long should I run each experiment? • Which search techniques can I use? • How to quantify cost? • Do I need more samples? • How many times to repeat each experiment? • Do I have enough resources? Get more or return some? • Wait for more or run lower rank experiments with what I have now? Can I meet my deadline? Will I have sufficient confidence in the result?

  34. Gang Computing Faculty-owned clusters in closets

  35. Gang Computing [Diagram: “shareholders” aggregate their clusters through a substrate “socket” operated by the provider (University OIT).]

  36. Gang Computing: Value Flow • Ease of use; protect/enhance CapEx; zero OpEx; sharing the surplus (shareholders). • Economies of scale; control over surplus; enhancement; efficiency (provider, University OIT). [Diagram: value flows between shareholders and provider across the substrate “socket”.]

  37. Gang Computing: Value Flow • Ease of use; protect/enhance CapEx; zero OpEx; surplus access (shareholders). • Economies of scale; control over surplus; enhancement; efficiency (provider, OIT). [Diagram: the same value flow, now with payments ($, $$$) crossing the substrate “socket”.]
