
Scheduling and Guided Search for Cloud Applications


Presentation Transcript


  1. Scheduling and Guided Search for Cloud Applications Cristiana Amza

  2. Big Data is Here • Data growth (by 2015) = 100x in ten years [IDC 2012] • Population growth = 10% in ten years • Monetizing data for commerce, health, science, services, … [source: Economist] courtesy of Babak Falsafi

  3. Data Growing Faster than Technology • Growing technology gap [WinterCorp Survey, www.wintercorp.com] courtesy of Babak Falsafi

  4. Challenge 1: Costs of a Datacenter • Estimated costs of a datacenter with 46,000 servers: $3,500,000 per month to run (3-year server and 10-year infrastructure amortization) • Servers & power are 88% of total cost • Data courtesy of James Hamilton [SIGMOD’11 Keynote]

  5. Datacenter Energy Not Sustainable • A modern datacenter: 17x a football stadium, $3 billion, 20 MW • In the modern world, datacenters consume 6% of all electricity, growing at >20%! • [Figure: datacenter energy in billion kilowatt-hours/year, 2001–2017, versus the consumption of 50 million homes] courtesy of Babak Falsafi

  6. Challenge 2: Data Management (Anomalies) • “Cloudy with a chance of failure” courtesy of Haryadi S. Gunawi • Whoops – Facebook loses 1 billion photos (Chris Keall, 10 March 2009, The National Business Review) • When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data (Om Malik, 10 October 2009, gigaom.com) • Amazon Can’t Recover All Its Cloud Data From Outage (Max Eddy, 27 April 2011, www.geekosystem.com) • Cloud Storage Often Results in Data Loss (Chad Brooks, 10 October 2011, www.businessnewsdaily.com)

  7. Problems are entrenched • I have been working in this area since 2001 • Problems have only grown more complex/intractable • Same old Distributed Systems problems • New: levels of indirection (remote processing, deep software stacks, VMs, etc.) • E.g., Cloud monitoring and logging data (terabytes per day) • But no notable success stories with analyzing such data

  8. Challenge 3: Paradigm Limitations • MapReduce parallelism: embarrassing/simplistic • Works for aggregate ops • Simple scheduling • [Diagram: three Map tasks feeding three Reduce tasks]
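The aggregate pattern the slide calls simplistic can be sketched in a few lines. The word count below is a minimal illustrative sketch of the map/reduce idea, not Hadoop's actual API; all function names are hypothetical:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Emit (key, value) pairs independently for each input record.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    # Group values by key, then apply the aggregate operator per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}

# Word count: the canonical "embarrassingly parallel" aggregate.
docs = ["big data", "big cloud"]
pairs = map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])
counts = reduce_phase(pairs, sum)
# counts == {"big": 2, "data": 1, "cloud": 1}
```

Because each map call and each per-key reduce is independent, scheduling is trivially parallel, which is exactly why the paradigm handles only this restricted shape of computation.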

  9. Hadoop/Enterprise: Separate Storage Silos • Hardware $$$ • Cross-silo data management $$$ • Periodic data ingest

  10. What can we do? Find Meaningful Apps • We can produce/find tons of data • We need to analyze something of vital importance to justify draining vital resources • Otherwise, the simplest solution is to stop creating the problem(s)

  11. What can we do? Consolidate Research Agendas • Find overarching, mission-critical paradigms • State of the art: MapReduce is too simplistic • Develop standards, common tools, and benchmarks • Integrate solutions, think holistically • Enforce accountability for the data center/Cloud provider

  12. Opportunity 1: The Brain Challenge • Started to explore Neuroscience workloads in 2010 • A Brain Summit/Workshop held at IBM TJ Watson • Started a collaboration with Stephen Strother at Baycrest a year later • An application that is both data and compute intensive • Boils down to an optimization problem in a highly parametrized search space

  13. Opportunity 2: Guided Modeling • Performance modeling, energy modeling, anomaly modeling, biophysical modeling • All tend to be interpolations/searches/optimizations in highly parametrized spaces • Key idea: Develop a common framework that works for all • Extend the way MapReduce standardized aggregation ops • Guidance: Operator “Reduction”, “Linear Interpolation”, etc
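The idea of standardizing guidance operators the way MapReduce standardized aggregation ops can be sketched as a small operator registry. This is a hypothetical illustration of the concept; the registry, decorator, and operator names below are assumptions for the sketch, not an API from SelfTalk or any published framework:

```python
# Hypothetical registry of guidance operators, by analogy with
# MapReduce's standardized aggregation ops.
GUIDANCE_OPS = {}

def guidance_op(name):
    def register(fn):
        GUIDANCE_OPS[name] = fn
        return fn
    return register

@guidance_op("Reduction")
def reduction(values):
    # Collapse a sampled dimension into a single aggregate (here: mean).
    return sum(values) / len(values)

@guidance_op("LinearInterpolation")
def linear_interpolation(x0, y0, x1, y1, x):
    # Estimate an unsampled point from its two sampled neighbours.
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

avg = GUIDANCE_OPS["Reduction"]([10.0, 20.0, 30.0])            # 20.0
mid = GUIDANCE_OPS["LinearInterpolation"](0, 0.0, 10, 5.0, 4)  # 2.0
```

A modeling task would then declare which operators apply to which dimensions of its search space, and the framework would supply the implementations, mirroring how MapReduce jobs declare only the map and reduce functions.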

  14. Building models takes time • Actuate a live system and take experimental samples: vary DB memory (16 GB, sampled in 512 MB chunks, 32 data points) against storage memory; 15 minutes per point • 32x32 = 1024 sampling points, so exhaustive sampling takes 11 days! • [Figure: average latency surface over storage memory and DB memory, ranging from low to high latency]

  15. Goal: Reduce Time by Model Reuse • Provide a resource-to-performance mapping (average latency as a function of storage and DB resources) • Dynamic Resource Allocation [FAST’09] • Capacity Planning [SIGMOD’13] • What-if Queries [SIGMETRICS’10] • Anomaly Detection [SIGMETRICS’10] • Towards A Virtual Brain Model [HPCS’14]

  16. Management Interactions • Service Provider: Use fewer resources: the customer wants 1000 TPS; what is the most efficient configuration (e.g., CPU/memory) to deliver it? Share resources: can I place customer A’s DB alongside customer B’s DB? Will their service levels be met? • Customer DBA: Use the right amount of resources: what will the performance (e.g., query latency) be if I use 8 GB of RAM instead of 16 GB? Solve performance problems: I’m only getting 500 TPS; what’s wrong? Is the cloud to blame? • Need to build performance models to understand

  17. Libraries/archive of models • Analytical models (knowledge-driven): no samples required, but difficult to derive and fragile to maintain • Gray-box models: few samples needed and can be adapted, but still need to be derived • Black-box models (data-driven): minimal assumptions, but need lots of samples and could over-fit • Use an ensemble of models

  18. Model Ensemble approach • 1. Guidance as trends and patterns • 2. Automatically tune the models using data • 3. Test and rank the models; use a blend • Repeat (if needed)
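The test-and-rank step can be sketched in miniature. This is a minimal sketch, assuming two hypothetical candidate model families (linear and constant) and a single held-out split in place of the full n-fold procedure; none of the names come from the actual system:

```python
def fit_linear(xs, ys):
    # Closed-form least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return lambda x, a=a, b=my - a * mx: a * x + b

def fit_constant(xs, ys):
    # Baseline black-box model: always predict the training mean.
    c = sum(ys) / len(ys)
    return lambda x, c=c: c

def rank_models(fitters, xs, ys):
    # Hold out every other point; fit each candidate on the rest and
    # rank by squared error on the held-out points.
    train = list(zip(xs[::2], ys[::2]))
    test = list(zip(xs[1::2], ys[1::2]))
    scored = []
    for fit in fitters:
        model = fit([x for x, _ in train], [y for _, y in train])
        err = sum((model(x) - y) ** 2 for x, y in test)
        scored.append((err, model))
    scored.sort(key=lambda t: t[0])
    return scored[0][1]  # best-ranked model

xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x + 1 for x in xs]   # latency grows linearly with load here
best = rank_models([fit_linear, fit_constant], xs, ys)
# best(10) == 21: the linear candidate wins on held-out data
```

In the full approach, this ranking runs per region of the search space, and the winners are blended rather than a single model being kept globally.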

  19. How to specify guidance • SelfTalk: a language to describe relationships • Specifies model inputs and parameters • Provides a catalog of common functions • Curve-fitting and validation algorithms • Details in SIGMETRICS’10 paper

  20. Refine models using data • Use hints to link relations to metrics: HINT myHint RELATION LINEAR(x,y) METRIC (x,y) { x.name=MySQL.CPU y.name=MySQL.QPS } CONTEXT (a) { a.name=MySQL.BufPoolAlloc a.value >= 512MB } • Hints that CPU is linearly correlated to QPS, but the working set should be in RAM • Learns parameters using data (or requests more data)
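How a hint like this might drive curve fitting can be sketched as follows. This is a hypothetical rendering of the idea, not SelfTalk's implementation: the `fit_linear_hint` helper, the sample dictionaries, and the metric keys are assumptions made for the sketch:

```python
def fit_linear_hint(samples, context, x_metric, y_metric):
    # Keep only samples satisfying the hint's CONTEXT guard, then learn
    # the parameters of the hinted LINEAR relation from what remains.
    usable = [s for s in samples if context(s)]
    if len(usable) < 2:
        return None  # not enough data: request more samples
    xs = [s[x_metric] for s in usable]
    ys = [s[y_metric] for s in usable]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope and intercept of y = a*x + b

samples = [
    {"MySQL.CPU": 10, "MySQL.QPS": 100, "MySQL.BufPoolAlloc": 1024},
    {"MySQL.CPU": 20, "MySQL.QPS": 200, "MySQL.BufPoolAlloc": 1024},
    {"MySQL.CPU": 30, "MySQL.QPS": 150, "MySQL.BufPoolAlloc": 128},  # fails CONTEXT
]
a, b = fit_linear_hint(samples,
                       context=lambda s: s["MySQL.BufPoolAlloc"] >= 512,
                       x_metric="MySQL.CPU", y_metric="MySQL.QPS")
# a == 10.0, b == 0.0, learned from the two in-context samples only
```

The third sample is discarded because its buffer pool is below the 512 MB context threshold, which is precisely the "working set should be in RAM" condition the hint encodes.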

  21. Rank models and blend • 1. Divide the search space into regions • 2. n-fold cross-validation to rank • 3. Associate the best model to each region

  22. Prototype • “How should I partition resources between two applications A and B?” • Inputs: SelfTalk, catalog of models, data • MySQL & storage server

  23. Runtime Engine • Model a new workload by reusing similar data/models (Model Matching) • Expand samples and refine if necessary (Model Validation and Refinement) • A selective, iterative, and ensemble learning process backed by a Model Repository

  24. Ex 1: Predicting buffer pool latencies Analytical model replaced with data-driven models

  25. Ex 2: Model Transformation

  26. Guidance: Step function (3D to 2D reduction)

  27. With minimum new samples …

  28. Ex 3: Modeling and Job Scheduling for Brain • Data centers usually have a heterogeneous structure: a variety of multicores, GPUs, etc. • Different stages of the application have different resource demands (CPU- versus data-intensive) • Job scheduling to available resources becomes non-trivial • Guided modeling helps

  29. Functional MRI • Goal: studying brain functionality • Procedure: asking patients (subjects) to do a task and capturing brain slices, measuring blood oxygen level • Correlating images to identify brain activity

  30. Functional MRI • Overall pipeline: Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → Analysis Model → Results

  31. Functional MRI

  32. NPAIRS As Our Application • NPAIRS goal: processing images to find image correlations • Feature extraction: a common technique in image-processing applications (e.g., face recognition) • Using Principal Component Analysis to extract eigenvectors • Finding a set of eigenvectors that is a good representative of the whole set of subjects • Machine learning methods, heuristic search, etc.

  33. NPAIRS • [Figure: split-half resampling. The full data is split (Split J) into split-half 1 and split-half 2; each half’s “design” and scans produce a statistical parametric map (SPM SJ1, SPM SJ2), and the two maps are compared to yield a reproducibility estimate (r)]

  34. Output of NPAIRS

  35. NPAIRS Flowchart

  36. NPAIRS Profiling Results

  37. GPU Execution Profile

  38. NPAIRS Execution on Different Nodes

  39. Job Modeling: Exhaustive Sampling • Sample set: 1 to 99 • Fitness score R²: 0.995 • Total run time: 64933

  40. Uniform Sampling • Sample set: 2, 12, 22, 32, 42, 52, 62, 72, 82, 92 • Using 5-fold cross-validation • Fitness score R²: 0.990 • Total run time: 6368

  41. Guidance: Step Function + Fast Sampling • Sample set: 2, 4, 8, 12, 16, 20, 24, 32, 48, 96 • Fitness score R²: 0.993 • Total run time: 5313 • 16.6% time saving!

  42. Heterogeneous, CPU Only (3 Fat + 5 Light Nodes)

  43. 3 Fat, 3 Light, 3 GPU nodes

  44. Resource Utilization

  45. Overall Execution Time Comparison

  46. Conclusions • Big Data processing is driving a quantum leap in IT • It is hampered by slow progress in data center management • We propose to investigate guided modeling • Promising preliminary results with Neuroscience workloads: 7x speedup of NPAIRS on a small CPU+GPU cluster

  47. Backup Slides

  48. Modeling Procedure • Get the sample set and split it using 5-fold cross-validation • Fit the model using 4 folds of sample data • Test the model using the remaining fold • Try all 5 splits and sum the model error • If the error is less than the threshold, stop: we have found the model
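The procedure above can be sketched as follows. This is a minimal sketch of the 5-fold split-and-score loop; the constant model family used to exercise it is a hypothetical stand-in for the real candidate models:

```python
def five_fold_error(points, fit, predict, folds=5):
    # Split samples into 5 folds; for each split, fit on 4 folds,
    # test on the held-out fold, and sum the error over all 5 splits.
    total = 0.0
    for k in range(folds):
        test = points[k::folds]
        train = [p for i, p in enumerate(points) if i % folds != k]
        model = fit(train)
        total += sum((predict(model, x) - y) ** 2 for x, y in test)
    return total

# Hypothetical model family: predict the mean of the training ys.
fit = lambda pts: sum(y for _, y in pts) / len(pts)
predict = lambda model, x: model
points = [(x, 5.0) for x in range(10)]
err = five_fold_error(points, fit, predict)
# err == 0.0: a constant model fits constant data on every split
```

In the full procedure, this summed error is compared against the stopping threshold; if it is still too high, more samples are requested and the loop repeats.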

  49. Notes • Total run time is the sum of the sampling times; modeling time is negligible • Use the exhaustive data set as ground truth and the fitted model to predict values, then compute the coefficient of determination R²
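The R² computation described in the note is the standard coefficient of determination; the toy values below are illustrative, not data from the experiments:

```python
def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot, where SS_tot
    # is the variance of the ground truth around its mean and SS_res
    # is the model's residual sum of squares.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]   # exhaustive ("ground truth") samples
y_pred = [1.1, 1.9, 3.2, 3.8]   # fitted model's predictions
score = r_squared(y_true, y_pred)
# score near 1 means the fitted model explains the exhaustive data well
```

An R² of 0.990 or 0.993, as in the sampling experiments, therefore means the cheap sampled model reproduces the exhaustive 99-point curve almost exactly.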
