
Dynamic Thermal Management for Data Centers: Get Smart with ConSil

ConSil is a system designed to analyze data center thermals, manage heat proactively, and promote an even temperature distribution through temperature-aware workload placement.


Presentation Transcript


  1. ConSil • Jeff Chase, Duke University

  2. Collaborators • Justin Moore • received PhD in April, en route to Google. • Did this research. • Wrote this paper. • Named the system. • Something to do with “Get Smart” (?) • Did not send me slides… • Partha Ranganathan (HP) has led this work.

  3. Context: Dynamic Thermal Management for Data Centers [Figure: data center thermal map; labels: CRAC, rack, heat build-ups; color scale: temperature (C)]

  4. Goals • ConSil is part of a larger system to analyze data center thermals and manage heat proactively. • Temperature-aware workload placement • “Smart cooling” • Preliminary conclusion: it is practical to reduce total energy by about 15% under “typical” conditions. • Your mileage may vary. • Other goals: • Reduce capital cost with “common case” cooling system. • Allow cluster to “burst”, but stop short of meltdown. • Improve long-term reliability and availability • Better data center design

  5. “Green” Workload Placement Place workload intelligently to promote an even temperature distribution, given the “thermal topology” of the data center. Making Scheduling "Cool": Temperature-Aware Resource Assignment in Data Centers by Justin Moore, J. Chase, P. Ranganathan, and R. Sharma. In the 2005 USENIX Annual Technical Conference, April 2005

  6. The Subproblem that ConSil Solves • How hot is point (x, y, z) in your data center? • Placement policies need a thermal map • Option 1: install new instrumentation • Tradeoff $$$ vs. granularity • Option 2: use built-in sensors • But: how to derive the inlet temperatures? • If we can do that, then we can obtain a precise and accurate thermal map with low instrumentation cost.

  7. Thermal Instrumentation • Observed: $Q_{observed} = f(Q_{workload}, Q_{inlet})$ • Learn: $Q_{inlet} = g(Q_{workload}, Q_{observed})$ • Legend: heat sources ($Q_{workload}$), inlet heat ($Q_{inlet}$), temperature sensors ($Q_{observed}$)

  8. ConSil in Context Workload measures

  9. Learning a Model

     Samples | Attributes              | Y
             | X1    X2    ...   Xn    |
     s1      | s11   s12   ...   s1n   | Y1
     s2      | s21   s22   ...   s2n   | Y2
     ...     | ...   ...   ...   ...   | ...
     sm      | sm1   sm2   ...   smn   | Ym

     • Learn statistical model for Y from m samples of

  10. First Cut: Neural Nets • Infer ambient temperature from an input sample: • Last N workload measure samples (epoch E) • Internal temperature sensor readings • Use off-the-shelf FANN library • Some static (SWAG) structural choices: • Four layers of neurons • Input → hidden → hidden → output • Neurons use FANN sigmoid transform function • Train the net using FANN back-propagation to set input weights on each neuron.
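
The structure on this slide (input → hidden → hidden → output, sigmoid neurons, back-propagation setting the input weights on each neuron) can be sketched in a few lines. This is a minimal illustration in plain Python, not the FANN code used in the talk; the class name `TinyNet`, the learning rate, the layer widths, and the synthetic data are all assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """Input -> hidden -> hidden -> output, sigmoid activations throughout."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        sizes = [n_in, n_hidden, n_hidden, n_out]
        # weights[l][j]: input weights (plus a trailing bias) of neuron j in layer l + 1
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(sizes[l] + 1)]
             for _ in range(sizes[l + 1])]
            for l in range(3)
        ]

    def forward(self, x):
        """Return the activations of all four layers, input included."""
        acts = [list(x)]
        for layer in self.weights:
            prev = acts[-1] + [1.0]  # constant bias input
            acts.append([sigmoid(sum(w * a for w, a in zip(neuron, prev)))
                         for neuron in layer])
        return acts

    def train_step(self, x, target, rate=0.5):
        """One back-propagation update; returns the squared error before it."""
        acts = self.forward(x)
        out = acts[-1]
        # Output-layer deltas for the error 0.5 * sum((out - target)^2).
        deltas = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
        for l in range(2, -1, -1):
            prev = acts[l] + [1.0]
            layer = self.weights[l]
            if l > 0:
                # Propagate deltas backward before changing this layer's weights.
                back = [sum(layer[j][i] * deltas[j] for j in range(len(layer)))
                        * acts[l][i] * (1.0 - acts[l][i])
                        for i in range(len(acts[l]))]
            for j, neuron in enumerate(layer):
                for i in range(len(neuron)):
                    neuron[i] -= rate * deltas[j] * prev[i]
            if l > 0:
                deltas = back
        return sum((o - t) ** 2 for o, t in zip(out, target))

# Synthetic, made-up data: two utilization readings in [0, 1] and a
# temperature target already scaled into [0, 1]. Not measurements from the talk.
rng = random.Random(1)
samples = []
for _ in range(50):
    u1, u2 = rng.random(), rng.random()
    samples.append(([u1, u2], [0.7 + 0.1 * u1 + 0.1 * u2]))

net = TinyNet(n_in=2, n_hidden=4, n_out=1)
first_epoch_sse = sum(net.train_step(x, y) for x, y in samples)
for _ in range(300):
    last_epoch_sse = sum(net.train_step(x, y) for x, y in samples)
print(first_epoch_sse, last_epoch_sse)
```

Because the output neuron is a sigmoid, targets have to be scaled into (0, 1) before training and scaled back afterward; the sketch skips the scaling step and generates targets that already sit in that range.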

  11. Experiments with ConSil • Collected data for 12 servers in a data center. • Pick servers whose inlet temperatures are known • i.e., they have a sensor near them • 45 hours of data collected under active/varying load • Two server models (HP DL360 G3, Dell 1425) • CPU data: 1 second granularities • temperature data: 5 or 30 second granularities • CPU utilization only • CPU uses 80% of power (225/275 watts peak) • 266 lines of FANN code
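
CPU data arrives every second while temperature readings arrive every 5 or 30 seconds, so the two streams have to be brought to a common granularity before they can form one input sample. The slides do not say how this was done; the sketch below assumes simple averaging over fixed-length epochs, and the function name `to_epochs` and the sample values are hypothetical.

```python
from statistics import mean

def to_epochs(cpu_per_second, epoch_seconds):
    """Average 1-second CPU utilization samples into fixed-length epochs.

    A trailing partial epoch is dropped so every epoch covers the same span.
    """
    full = len(cpu_per_second) - len(cpu_per_second) % epoch_seconds
    return [mean(cpu_per_second[i:i + epoch_seconds])
            for i in range(0, full, epoch_seconds)]

# Made-up utilization percentages, one per second.
cpu = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 0.0, 0.0]
epochs = to_epochs(cpu, 5)  # one value per 5-second temperature reading
print(epochs)
```

With a 5-second epoch the twelve samples collapse into two epoch averages, and the last two samples are discarded as an incomplete epoch.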

  12. Methodology • FFCV (five-fold cross-validation) • Divide observations into fifths • Train on one fifth, test on four • Do it for each fifth • Compute SSE (sum of squared errors) • Output: CDFs of errors • Sensitivity study • Training time • Accuracy
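
The fold procedure on this slide can be sketched as follows. The code follows the slide as written (train on one fifth, test on the other four); the more common convention trains on four fifths and tests on one, so a flag switches between the two. The helper names and the straight-line stand-in model `fit_line` are assumptions for illustration, not the talk's neural net.

```python
def fifths(n):
    """Split indices 0..n-1 into five contiguous, near-equal folds."""
    bounds = [round(k * n / 5) for k in range(6)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(5)]

def sse(predicted, actual):
    """Sum of squared errors between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def cross_validate(xs, ys, fit, train_on_one_fifth=True):
    """Return one SSE per fold; `fit(xs, ys)` must return a predictor function."""
    folds = fifths(len(xs))
    results = []
    for k, fold in enumerate(folds):
        rest = [i for j, f in enumerate(folds) if j != k for i in f]
        train, test = (fold, rest) if train_on_one_fifth else (rest, fold)
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        results.append(sse([model(xs[i]) for i in test],
                           [ys[i] for i in test]))
    return results

def fit_line(xs, ys):
    """Least-squares straight line; a stand-in for the learned model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

# Made-up, perfectly linear data so every fold should score near zero.
xs = [float(i) for i in range(20)]
ys = [20.0 + 0.5 * x for x in xs]
fold_errors = cross_validate(xs, ys, fit_line)
print(fold_errors)
```

The per-fold SSE values are what a sensitivity study would compare across parameter settings.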

  13. ConSil: Accuracy • Accurate inference using workload and onboard data • 75% of inferred values are within 1C of actual value
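
A statement such as "75% of inferred values are within 1C" is a single point read off the error CDF. A minimal sketch of that calculation, using made-up error values rather than data from the talk (the function names are hypothetical):

```python
def fraction_within(errors, threshold):
    """Fraction of errors whose absolute value is at most `threshold`."""
    return sum(1 for e in errors if abs(e) <= threshold) / len(errors)

def empirical_cdf(errors):
    """Sorted absolute errors paired with the cumulative fraction at each one."""
    ordered = sorted(abs(e) for e in errors)
    n = len(ordered)
    return [(e, (i + 1) / n) for i, e in enumerate(ordered)]

# Made-up inference errors in degrees C (inferred minus actual).
errors = [0.2, -0.5, 1.5, 0.9, -2.0]
within_one_degree = fraction_within(errors, 1.0)
cdf = empirical_cdf(errors)
print(within_one_degree)
print(cdf)
```

Three of the five made-up errors fall within 1 degree, so the first value printed is 0.6.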

  14. Sensitivity • Time-to-train • Most significant: FFCV sub-experiment • Training time is highly data-dependent • Epoch length • Number of sensor/workload epochs • Accuracy (SSE) • Most significant: FFCV sub-experiment • Indicates not enough variation in behavior • Coarse granularity (more history) improves

  15. ConSil in Context Workload measures

  16. Predicting Thermal Effects • Model relationship using machine learning • Inputs: Workload data, AC settings, fan speeds • Output: Predicted thermal map • Learns from observations during normal operation • FANN neural net library • Active “burn in” may speed learning Weatherman: Automated, Online, and Predictive Thermal Mapping and Management for Data Centers by Justin Moore, J. Chase, and P. Ranganathan. Third IEEE International Conference on Autonomic Computing, June 2006.

  17. Weatherman: Accuracy • Accurate inferences using workload and AC data • Data from validated Flovent CFD models • 92% of predicted values are within 1.0C of actual value

  18. Summary/Conclusion • Machine learning is a useful tool for “autonomic” self-optimization. • Sense and respond • Optimizing control loops based on learned models • Neural nets don’t always suck. • Initial results suggest they work well here. • Maybe we can do better. • Need good baseline datasets for training/validation. • Variance • History

  19. Why “ConSil”? • Cone of Silence • “Mask out” unwanted signals

  20. http://www.cs.duke.edu/~chase

  21. The maximum number of training iterations was set to $10^5$. Each neural net contained one input layer, one output layer, and two hidden layers. Each hidden layer contained twice as many neurons as the input layer. By varying the number of recent epochs used as input, we vary the number of workload epochs (parameter $B$) and internal sensor epochs (parameter $C$) independently. Using general full factorial design analysis, we can identify which parameters have a significant effect when changed, and for which parameters we can simply select a ``reasonable'' value.
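
To make the sizing rule and the factorial sweep concrete, here is a small sketch. It assumes each workload epoch and each sensor epoch contributes one input value and that the output layer has a single neuron (the inferred temperature); neither detail is stated in the text. The function name `layer_sizes` and the parameter levels are hypothetical.

```python
from itertools import product

def layer_sizes(b_workload_epochs, c_sensor_epochs, values_per_epoch=1):
    """Layer widths: input, two hidden layers of twice the input width, output.

    Assumes one input value per epoch and a single output neuron.
    """
    n_in = (b_workload_epochs + c_sensor_epochs) * values_per_epoch
    return [n_in, 2 * n_in, 2 * n_in, 1]

# Hypothetical levels for B and C; a full factorial design runs every combination.
b_levels = (1, 2, 4)
c_levels = (1, 2, 4)
configs = [(b, c, layer_sizes(b, c)) for b, c in product(b_levels, c_levels)]

for b, c, sizes in configs:
    print(f"B={b} C={c} layers={sizes}")
```

Each configuration would then be trained and scored, and the factorial analysis attributes the variation in training time or SSE to each parameter.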
