
Addressing shared resource contention in datacenter servers


Presentation Transcript


  1. Addressing shared resource contention in datacenter servers Colloquium Talk by Sergey Blagodurov http://www.sfu.ca/~sba70/ Stony Brook University Fall 2013

  2. My research (a 40,000-foot view) • Academic research at Simon Fraser University: I am finishing my PhD with Prof. Alexandra Fedorova; my work is on scheduling in High Performance Computing (HPC) clusters • Industrial research at Hewlett-Packard Laboratories: I am a Research Associate in the Sustainable Ecosystems Research Group; my work is on designing a net-zero energy cloud infrastructure • In short: I prototype better datacenters! -2- Talk by Sergey Blagodurov Stony Brook University

  3. Why are datacenters important? • #1 Dematerialization: online shopping means less driving; working from home; digital content delivery -3- Talk by Sergey Blagodurov Stony Brook University

  4. Why are datacenters important? • #2 Moving into the cloud -4- Talk by Sergey Blagodurov Stony Brook University

  5. Why are datacenters important? • #3 Increasing demand for supercomputers: the biggest scientific discoveries, tremendous cost savings, medical innovations -5- Talk by Sergey Blagodurov Stony Brook University

  6. Why do research in datacenters? • Datacenters consume lots of energy, and it's getting worse! Consumption rose by 60% in the last five years and is now ~1-2% of world electricity (more than the entire country of Mexico!) • [Image: seawater pumped-storage hydroelectric plant on Okinawa, Japan] • Typical electricity costs per year: Google (>500K servers, ~72MW): $38M; Microsoft (>200K servers, ~68MW): $36M; Sequoia (~100K nodes, 8MW): $7M -6- Talk by Sergey Blagodurov Stony Brook University

  7. Why do research in datacenters? • A 20 MW datacenter running 24/7 for one year is equivalent to: 23k cars in annual greenhouse gas emissions, or the CO2 emissions from the electricity use of 15k homes for one year • A single datacenter generates as much greenhouse gas as a small city! -7- Talk by Sergey Blagodurov Stony Brook University

  8. Where do datacenters spend energy? • Servers: 70-90% • Cooling and other infrastructure: 10-30% • Within servers, CPU and memory are the biggest consumers -8- Talk by Sergey Blagodurov Stony Brook University

  9. An AMD Opteron 8356 Barcelona domain • [Diagram: one NUMA domain with Cores 0-3, each with private L1/L2 caches; a shared L3 cache; a System Request Interface and crossbar switch; a memory controller; HyperTransport links to other domains; and attached Memory node 0] -9- Talk by Sergey Blagodurov Stony Brook University

  10. An AMD Opteron system with 4 domains • [Diagram: four NUMA domains with four cores each (Cores 0-15), private L1/L2 caches per core, a shared L3 cache per domain, a memory controller (MC), HyperTransport (HT) links between domains, and a local memory node per domain (Memory nodes 0-3)] -10- Talk by Sergey Blagodurov Stony Brook University

  11. Contention for the shared last-level cache (CA) • [Same four-domain diagram, highlighting threads within one domain competing for its shared L3 cache] -11- Talk by Sergey Blagodurov Stony Brook University

  12. Contention for the memory controller (MC) • [Same diagram, highlighting threads within one domain competing for its memory controller] -12- Talk by Sergey Blagodurov Stony Brook University

  13. Contention for the inter-domain interconnect (IC) • [Same diagram, highlighting traffic competing for the HyperTransport links between domains] -13- Talk by Sergey Blagodurov Stony Brook University

  14. Remote access latency (RL) • [Same diagram, with a thread A running in one domain while its memory resides on another domain's memory node] -14- Talk by Sergey Blagodurov Stony Brook University

  15. Isolating memory controller contention (MC) • [Same diagram, with threads A and B placed so as to isolate memory controller contention from the other degradation factors] -15- Talk by Sergey Blagodurov Stony Brook University

  16. Dominant degradation factors • Memory controller (MC) and interconnect (IC) contention are the key factors hurting performance -16- Talk by Sergey Blagodurov Stony Brook University

  17. Contention-Aware Scheduling • Characterization method: given two threads, decide if they will hurt each other's performance if co-scheduled • Scheduling algorithm: separate threads that are expected to interfere -17- Talk by Sergey Blagodurov Stony Brook University

  18. Characterization Method • Limited observability: we do not know for sure whether threads compete and how severely! • Trial and error is infeasible on large systems: we can't try all possible combinations, and even sampling becomes difficult • A good trade-off: measure the LLC miss rate! Threads are assumed to interfere if they both have high miss rates (see the sketch below) • Caveat: this does not account for the impact of cache contention itself -18- Talk by Sergey Blagodurov Stony Brook University
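A minimal sketch of how this measurement could be taken on Linux with the perf tool (illustrative: the helper name is hypothetical, and it assumes the generic LLC-loads / LLC-load-misses events are available on the CPU; the talk's own prototype may compute the miss rate differently, e.g., per instruction):

```python
import subprocess

def llc_miss_rate(pid, seconds=5):
    """Sample a thread's LLC miss rate (misses per access) by
    attaching `perf stat` to it for a few seconds."""
    result = subprocess.run(
        ["perf", "stat", "-e", "LLC-loads,LLC-load-misses",
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():   # perf prints stats to stderr
        parts = line.split()
        if len(parts) >= 2 and parts[1] in ("LLC-loads", "LLC-load-misses"):
            counts[parts[1]] = int(parts[0].replace(",", ""))
    loads = counts.get("LLC-loads", 0)
    return counts.get("LLC-load-misses", 0) / loads if loads else 0.0
```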

  19. Miss rate as a predictor for contention penalty -19- Talk by Sergey Blagodurov Stony Brook University

  20. Server-level scheduling • Sort threads by LLC miss rate • Goal: isolate threads that compete for shared resources, and pull each thread's memory to the local node upon migration • Migrate competing threads, along with their memory, to different domains • [Diagram: threads A-D and W-Z, sorted by miss rate, are redistributed across Domains 1 and 2 so the aggressive threads no longer share a domain, with Memory nodes 1 and 2 following them] -20- Talk by Sergey Blagodurov Stony Brook University
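A toy sketch of the sorting-and-spreading idea (illustrative only, not the exact Clavis algorithm): after ranking threads by miss rate, deal the most and least memory-intensive remaining threads into each domain so that aggressive threads are kept apart:

```python
def distribute(threads, n_domains):
    """threads: list of (tid, llc_miss_rate) tuples.
    Returns per-domain thread lists, pairing the currently most
    memory-intensive thread with the least intensive one."""
    ranked = sorted(threads, key=lambda t: t[1], reverse=True)
    domains = [[] for _ in range(n_domains)]
    lo, hi, i = 0, len(ranked) - 1, 0
    while lo <= hi:
        domains[i % n_domains].append(ranked[lo])      # aggressive thread
        if lo != hi:
            domains[i % n_domains].append(ranked[hi])  # quiet thread
        lo, hi, i = lo + 1, hi - 1, i + 1
    return domains
```

A real implementation would then bind each thread to its assigned domain and migrate its pages to that domain's memory node (e.g., through Linux NUMA migration interfaces).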

  21. Server-level results • [Charts: results for SPEC CPU 2006, LAMP, and SPEC MPI 2007 workloads] -21- Talk by Sergey Blagodurov Stony Brook University

  22. Possibilities of datacenter-wide scheduling • [Diagram: six compute nodes (Nodes 0-5) connected by the datacenter network, each with multiple cores and memory nodes, running a mix of jobs A-D that could be rearranged across nodes] -22- Talk by Sergey Blagodurov Stony Brook University

  23. Clavis-HPC features • Contention-aware cluster scheduling: • See: online detection of contention, communication overhead, and power consumption • Think: approximate an optimal cluster schedule (cast the problem as a multi-objective one) • Do: use low-overhead virtualization (OpenVZ) to migrate jobs across the nodes -23- Talk by Sergey Blagodurov Stony Brook University

  24. Enumeration tree search • Finding an optimal schedule: an implementation using the Choco solver • Minimizes a weighted sum of the objectives [equation on slide] • [Diagram: branch-and-bound enumeration search tree] (a simplified sketch follows below) -24- Talk by Sergey Blagodurov Stony Brook University
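A self-contained sketch of branch-and-bound over job-to-node assignments (illustrative: the talk's actual implementation uses the Choco constraint solver, and the real objective is a weighted sum over contention, communication, and power terms; `cost` here is a placeholder):

```python
def branch_and_bound(jobs, nodes, cost):
    """Assign each job to a node, pruning any partial assignment
    whose cost already exceeds the best complete schedule found.
    Assumes cost(assignment) never decreases as jobs are added."""
    best = {"cost": float("inf"), "schedule": None}

    def recurse(i, assignment):
        c = cost(assignment)
        if c >= best["cost"]:                  # bound: prune this subtree
            return
        if i == len(jobs):                     # complete schedule
            best["cost"], best["schedule"] = c, dict(assignment)
            return
        for node in nodes:                     # branch on job i's placement
            assignment[jobs[i]] = node
            recurse(i + 1, assignment)
            del assignment[jobs[i]]

    recurse(0, {})
    return best["schedule"], best["cost"]
```

With a weighted-sum objective, `cost` would combine terms such as w1*contention(a) + w2*communication(a) + w3*power(a), mirroring the multi-objective formulation on the slide.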

  25. Solver evaluation • [Charts: solver evaluation with a custom branching strategy] -25- Talk by Sergey Blagodurov Stony Brook University

  26. Cluster-wide scheduling (a case for HPC) • [Diagrams: the vanilla HPC framework vs. the Clavis-HPC framework] -26- Talk by Sergey Blagodurov Stony Brook University

  27. Results -27- Talk by Sergey Blagodurov Stony Brook University

  28. What's the impact? • Faster execution saves money: for a datacenter with a $30M electricity bill, 20% less energy due to faster execution means $6M/year in savings (0.2 × $30M)! -28- Talk by Sergey Blagodurov Stony Brook University

  29. What's next? • Eric Schmidt, former CEO of Google: "Every two days now we create as much data as we did from the dawn of civilization up until 2003." • Big Data: big responsibility, big money -29- Talk by Sergey Blagodurov Stony Brook University

  30. Big Data has many facets -30- Talk by Sergey Blagodurov Stony Brook University

  31. Use case: sensor data from a cross-country flight -31- Talk by Sergey Blagodurov Stony Brook University

  32. Future research directions • #1 Memory hierarchy in the Exascale era • [Diagram: today's compute nodes with cores and memory nodes backed by a storage tier will turn into nodes with PCRAM memory and FLASH layers under software-defined storage] -32- Talk by Sergey Blagodurov Stony Brook University

  33. Future research directions • #2 Big Data placement: move the data or the analysis? • [Diagram] -33- Talk by Sergey Blagodurov Stony Brook University

  34. Future research directions • #3 How to choose a datacenter for a given Big Data analytic task: a warehouse? an HPC cluster? the cloud? something else? -34- Talk by Sergey Blagodurov Stony Brook University

  35. Conclusion • In a nutshell: • Datacenters are the platform of choice • Datacenter servers are major energy consumers • Energy is wasted because of resource contention • I address resource contention automatically and on the fly • Future plans: Big Data retrieval and analysis -35- Talk by Sergey Blagodurov Stony Brook University

  36. Addressing shared resource contention in datacenter servers Any [time for] questions? Talk by Sergey Blagodurov Stony Brook University

  37. Clavis-HPC framework • 1) A user connects to the HPC cluster via a client and submits a job with a PBS script; the user can characterize the job with a contention metric (devil, comm-devil). • 2) The Resource Manager (RM) on the head node receives the submission request and passes it to the Job Scheduler (JS). • 3) JS determines which jobs execute on which containers and passes the scheduling decision to RM. • 4) RM starts/stops the jobs on the given containers. • 5) The virtualized jobs execute in the containers under the contention-aware user-level scheduler (Clavis-DINO); they access cluster storage to get their input files and store the results. • 6) RM generates a contention-aware report about resource usage in the cluster during the last scheduling interval. • 7) Users or sysadmins analyze the contention-aware resource usage report. • 8) Users can checkpoint their jobs (OpenVZ snapshots). • 9) Sysadmins can perform automated job migration across the nodes through OpenVZ live migration, dynamically consolidate the workload on fewer nodes, and turn the rest off to save power. • 10) RM passes the contention-aware resource usage report to JS. • Components: clients (tablet, laptop, desktop, etc.); head node (RM, JS, Clavis-HPC); cluster network (Ethernet, InfiniBand); compute nodes with contention monitors (Clavis), OpenVZ containers, libraries (OpenMPI, etc.), and RM daemons (pbs_mom); monitoring (JS GUI) and control (IPMI, iLO3, etc.); centralized cluster storage (NFS, Lustre). A simplified sketch of this loop follows below. -37- Talk by Sergey Blagodurov Stony Brook University
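A highly simplified sketch of the scheduling loop this workflow implies (illustrative: the collect_usage and compute_schedule callables stand in for the RM/JS logic, the interval is a made-up value, and live migration is shown via OpenVZ's vzmigrate utility):

```python
import subprocess
import time

SCHED_INTERVAL = 60  # seconds between scheduling decisions (assumed)

def migrate(ctid, dest_node):
    """Live-migrate an OpenVZ container to another node (step 9)."""
    subprocess.run(["vzmigrate", "--online", dest_node, str(ctid)], check=True)

def scheduling_loop(collect_usage, compute_schedule, placement):
    """collect_usage() -> contention-aware usage report (step 6);
    compute_schedule(report) -> {ctid: node} mapping (steps 10, 3);
    placement: current {ctid: node} state."""
    while True:
        report = collect_usage()
        target = compute_schedule(report)
        for ctid, node in target.items():
            if placement.get(ctid) != node:
                migrate(ctid, node)
                placement[ctid] = node
        time.sleep(SCHED_INTERVAL)
```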

  38. Clavis-HPC additional results • [Charts: results of the contention-aware experiments] -38- Talk by Sergey Blagodurov Stony Brook University

  39. Cluster-wide scheduling (a case for HPC) • [Diagrams: the vanilla HPC framework vs. the Clavis-HPC framework] -39- Talk by Sergey Blagodurov Stony Brook University

  40. Where do datacenters spend energy? • Servers: 70-90% • Cooling and other infrastructure: 10-30% • Within servers, CPU and memory are the biggest consumers -40- Talk by Sergey Blagodurov Stony Brook University

  41. Cloud datacenter workloads • Critical (preferred access to resources): RUBiS, WikiBench • Non-critical: datacenter batch load (Swaptions, Facesim, FDS) and HPC jobs (LU, BT, CG) -41- Talk by Sergey Blagodurov Stony Brook University

  42. Automated collocation • Server under-utilization is a long-standing problem: it increases both CapEx and OpEx costs, and even for modern servers, energy efficiency at 30% load can be less than half the efficiency at 100% load • Work-conserving vs. non-work-conserving collocation: managing with caps (limits) vs. managing with weights (priorities); improving isolation vs. improving server utilization • Solution: collocate critical and non-critical applications, and manage resource access through Linux control group (cgroup) mechanisms, as sketched below -42- Talk by Sergey Blagodurov Stony Brook University
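For instance, a non-work-conserving cap can be enforced with the cgroup v1 CPU bandwidth controller (a minimal sketch; the group name, PID, and the 25% quota are made-up values):

```python
import os

def cap_cpu(cgroup, pid, quota_us, period_us=100000):
    """Limit a process to quota_us/period_us of one CPU using the
    cgroup v1 cpu controller (non-work-conserving: the cap holds
    even when the machine is otherwise idle)."""
    path = f"/sys/fs/cgroup/cpu/{cgroup}"
    os.makedirs(path, exist_ok=True)
    with open(f"{path}/cpu.cfs_period_us", "w") as f:
        f.write(str(period_us))
    with open(f"{path}/cpu.cfs_quota_us", "w") as f:
        f.write(str(quota_us))
    with open(f"{path}/tasks", "w") as f:       # attach the process
        f.write(str(pid))

# e.g., cap a non-critical batch job to 25% of one core:
# cap_cpu("batch", 12345, quota_us=25000)
```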

  43. What's the impact? • Automated collocation enables net-zero energy usage: [chart] -43- Talk by Sergey Blagodurov Stony Brook University

  44. Workload collocation using static prioritization • [Charts: Scenario A (Swaptions, Facesim, FDS); Scenario B (LU, BT, CG)] -44- Talk by Sergey Blagodurov Stony Brook University

  45. Workload collocation during spikes • Weight-based collocation keeps the critical workload's performance loss tolerable -45- Talk by Sergey Blagodurov Stony Brook University

  46. Workload collocation during spikes • With weights, a value twice as high for one process compared to another grants it twice as many CPU cycles under contention (see the sketch below) -46- Talk by Sergey Blagodurov Stony Brook University
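Under the cgroup v1 cpu controller this weight is cpu.shares (a minimal sketch; the group names and the 2:1 ratio are illustrative):

```python
def set_cpu_weight(cgroup, shares):
    """Set a work-conserving CPU weight: shares matter only while
    groups actually compete for CPU time."""
    with open(f"/sys/fs/cgroup/cpu/{cgroup}/cpu.shares", "w") as f:
        f.write(str(shares))

# e.g., give the critical tier twice the CPU of the batch tier
# whenever both are runnable:
# set_cpu_weight("critical", 2048)
# set_cpu_weight("batch", 1024)
```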

  47. Future research directions • #4 What storage organization is most suitable for each datacenter type? Cloud: key/value stores? Warehouse: databases? HPC cluster: a parallel filesystem? -47- Talk by Sergey Blagodurov Stony Brook University

  48. Data warehouse project • Data assurance for power delivery networks • [Diagram: records from meters A, B, and C pass through assurance rules; meter C is detected as broken, so only the records from meters A and B enter the data warehouse] -48- Talk by Sergey Blagodurov Stony Brook University

  49. Increasing prediction accuracy • The LLC miss rate works, but it is not very accurate • What if we want a more accurate metric? Then we need to profile many performance counters simultaneously, and we need to build a model that predicts the degradation • We would have to train the model beforehand on a representative workload: the need to train the model is the price of higher accuracy! (see the sketch below) -49- Talk by Sergey Blagodurov Stony Brook University
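A minimal sketch of such a degradation model (illustrative: it assumes a pre-collected training set where each row holds one thread's performance-counter readings together with its measured performance loss under co-scheduling, and it uses a plain linear fit where the real model could be anything):

```python
import numpy as np

def train_degradation_model(counters, degradation):
    """counters: (n_samples, n_counters) matrix of performance-counter
    readings from the representative training workload;
    degradation: (n_samples,) measured co-scheduling penalty.
    Fits ordinary least squares and returns the weight vector."""
    X = np.hstack([counters, np.ones((counters.shape[0], 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, degradation, rcond=None)
    return w

def predict_degradation(w, counter_vector):
    """Predict the contention penalty for a new thread online."""
    return np.append(counter_vector, 1.0) @ w
```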

  50. Devising an accurate metric (outline) • [Diagram: our solution] -50- Talk by Sergey Blagodurov Stony Brook University
