
Scheduling and Guided Search for Cloud Applications


Presentation Transcript


  1. Scheduling and Guided Search for Cloud Applications Cristiana Amza

  2. Big Data is Here • Data growth (by 2015) = 100x in ten years [IDC 2012] • Population growth = 10% in ten years • Monetizing data for commerce, health, science, services, … [source: Economist] courtesy of Babak Falsafi

  3. Data Growing Faster than Technology • Growing technology gap [WinterCorp Survey, www.wintercorp.com] courtesy of Babak Falsafi

  4. Challenge 1: Costs of a Datacenter • Estimated costs of a datacenter with 46,000 servers: $3,500,000 per month to run (3-year server and 10-year infrastructure amortization) • Servers & power are 88% of total cost • Data courtesy of James Hamilton [SIGMOD’11 Keynote]

  5. Datacenter Energy Not Sustainable • A modern datacenter: 17x a football stadium, $3 billion, 20 MW • In the modern world, datacenters consume 6% of all electricity, growing at >20%! • [Figure: datacenter energy in billion kilowatt-hours/year, 2001–2017, versus the consumption of 50 million homes] courtesy of Babak Falsafi

  6. Challenge 2: Data Management (Anomalies) • “Cloudy with a chance of failure” courtesy of Haryadi S. Gunawi • Whoops – Facebook loses 1 billion photos (Chris Keall, 10 March 2009, The National Business Review) • When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data (Om Malik, 10 October 2009, gigaom.com) • Amazon Can’t Recover All Its Cloud Data From Outage (Max Eddy, 27 April 2011, www.geekosystem.com) • Cloud Storage Often Results in Data Loss (Chad Brooks, 10 October 2011, www.businessnewsdaily.com)

  7. Problems are entrenched • I have been working in this area since 2001 • Problems have only grown more complex/intractable • Same old Distributed Systems problems • New: levels of indirection (remote processing, deep software stacks, VMs, etc.) • E.g., Cloud monitoring and logging data (terabytes per day) • But no notable success stories with analyzing such data

  8. Challenge 3: Paradigm Limitations • MapReduce parallelism: embarrassing/simplistic • Works for aggregate ops • Simple scheduling • [Diagram: three Map tasks feeding three Reduce tasks]
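The aggregate pattern the slide calls simplistic can be sketched in a few lines. The word count below is a minimal illustrative sketch of the map/reduce idea, not Hadoop's actual API; all function names are hypothetical:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Emit (key, value) pairs independently for each input record.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    # Group values by key, then apply the aggregate operator per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}

# Word count: the canonical "embarrassingly parallel" aggregate.
docs = ["big data", "big cloud"]
pairs = map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])
counts = reduce_phase(pairs, sum)
# counts == {"big": 2, "data": 1, "cloud": 1}
```

Because each map call and each per-key reduce is independent, scheduling is trivially parallel, which is exactly why the paradigm handles only this restricted shape of computation.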

  9. Hadoop/Enterprise: Separate Storage Silos • Hardware $$$ • Cross-silo data management $$$ • Periodic data ingest

  10. What can we do? Find Meaningful Apps • We can produce/find tons of data • We need to analyze something of vital importance to justify draining vital resources • Otherwise, the simplest solution is to stop creating the problem(s)

  11. What can we do? Consolidate Research Agendas • Find overarching, mission-critical paradigms • State of the art: MapReduce is too simplistic • Develop standards, common tools, and benchmarks • Integrate solutions, think holistically • Enforce accountability for the data center/Cloud provider

  12. Opportunity 1: The Brain Challenge • Started to explore Neuroscience workloads in 2010 • A Brain Summit/Workshop held at IBM TJ Watson • Started a collaboration with Stephen Strother at Baycrest a year later • An application that is both data and compute intensive • Boils down to an optimization problem in a highly parametrized search space

  13. Opportunity 2: Guided Modeling • Performance modeling, energy modeling, anomaly modeling, biophysical modeling • All tend to be interpolations/searches/optimizations in highly parametrized spaces • Key idea: Develop a common framework that works for all • Extend the way MapReduce standardized aggregation ops • Guidance: Operator “Reduction”, “Linear Interpolation”, etc
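The idea of standardizing guidance operators the way MapReduce standardized aggregation ops can be sketched as a small operator registry. This is a hypothetical illustration of the concept; the registry, decorator, and operator names below are assumptions for the sketch, not an API from SelfTalk or any published framework:

```python
# Hypothetical registry of guidance operators, by analogy with
# MapReduce's standardized aggregation ops.
GUIDANCE_OPS = {}

def guidance_op(name):
    def register(fn):
        GUIDANCE_OPS[name] = fn
        return fn
    return register

@guidance_op("Reduction")
def reduction(values):
    # Collapse a sampled dimension into a single aggregate (here: mean).
    return sum(values) / len(values)

@guidance_op("LinearInterpolation")
def linear_interpolation(x0, y0, x1, y1, x):
    # Estimate an unsampled point from its two sampled neighbours.
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

avg = GUIDANCE_OPS["Reduction"]([10.0, 20.0, 30.0])            # 20.0
mid = GUIDANCE_OPS["LinearInterpolation"](0, 0.0, 10, 5.0, 4)  # 2.0
```

A modeling task would then declare which operators apply to which dimensions of its search space, and the framework would supply the implementations, mirroring how MapReduce jobs declare only the map and reduce functions.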

  14. Building models takes time • Actuate a live system and take experimental samples: vary DB memory (16 GB, sampled in 512 MB chunks, 32 data points) against storage memory; 15 minutes per point • 32x32 = 1024 sampling points, so exhaustive sampling takes 11 days! • [Figure: average latency surface over storage memory and DB memory, ranging from low to high latency]

  15. Goal: Reduce Time by Model Reuse • Provide a resource-to-performance mapping (average latency as a function of storage and DB resources) • Dynamic Resource Allocation [FAST’09] • Capacity Planning [SIGMOD’13] • What-if Queries [SIGMETRICS’10] • Anomaly Detection [SIGMETRICS’10] • Towards A Virtual Brain Model [HPCS’14]

  16. Management Interactions • Service Provider: Use fewer resources: the customer wants 1000 TPS; what is the most efficient configuration (e.g., CPU/memory) to deliver it? Share resources: can I place customer A’s DB alongside customer B’s DB? Will their service levels be met? • Customer DBA: Use the right amount of resources: what will the performance (e.g., query latency) be if I use 8 GB of RAM instead of 16 GB? Solve performance problems: I’m only getting 500 TPS; what’s wrong? Is the cloud to blame? • Need to build performance models to understand

  17. Libraries/archive of models • Analytical models (knowledge-driven): no samples required, but difficult to derive and fragile to maintain • Gray-box models: few samples needed and can be adapted, but still need to be derived • Black-box models (data-driven): minimal assumptions, but need lots of samples and could over-fit • Use an ensemble of models

  18. Model Ensemble approach • 1. Guidance as trends and patterns • 2. Automatically tune the models using data • 3. Test and rank the models; use a blend • Repeat (if needed)
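The test-and-rank step can be sketched in miniature. This is a minimal sketch, assuming two hypothetical candidate model families (linear and constant) and a single held-out split in place of the full n-fold procedure; none of the names come from the actual system:

```python
def fit_linear(xs, ys):
    # Closed-form least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return lambda x, a=a, b=my - a * mx: a * x + b

def fit_constant(xs, ys):
    # Baseline black-box model: always predict the training mean.
    c = sum(ys) / len(ys)
    return lambda x, c=c: c

def rank_models(fitters, xs, ys):
    # Hold out every other point; fit each candidate on the rest and
    # rank by squared error on the held-out points.
    train = list(zip(xs[::2], ys[::2]))
    test = list(zip(xs[1::2], ys[1::2]))
    scored = []
    for fit in fitters:
        model = fit([x for x, _ in train], [y for _, y in train])
        err = sum((model(x) - y) ** 2 for x, y in test)
        scored.append((err, model))
    scored.sort(key=lambda t: t[0])
    return scored[0][1]  # best-ranked model

xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x + 1 for x in xs]   # latency grows linearly with load here
best = rank_models([fit_linear, fit_constant], xs, ys)
# best(10) == 21: the linear candidate wins on held-out data
```

In the full approach, this ranking runs per region of the search space, and the winners are blended rather than a single model being kept globally.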

  19. How to specify guidance • SelfTalk: a language to describe relationships • Specifies model inputs and parameters • Provides a catalog of common functions • Curve-fitting and validation algorithms • Details in SIGMETRICS’10 paper

  20. Refine models using data • Use hints to link relations to metrics: HINT myHint RELATION LINEAR(x,y) METRIC (x,y) { x.name=MySQL.CPU y.name=MySQL.QPS } CONTEXT (a) { a.name=MySQL.BufPoolAlloc a.value >= 512MB } • Hints that CPU is linearly correlated to QPS, but the working set should be in RAM • Learns parameters using data (or requests more data)
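How a hint like this might drive curve fitting can be sketched as follows. This is a hypothetical rendering of the idea, not SelfTalk's implementation: the `fit_linear_hint` helper, the sample dictionaries, and the metric keys are assumptions made for the sketch:

```python
def fit_linear_hint(samples, context, x_metric, y_metric):
    # Keep only samples satisfying the hint's CONTEXT guard, then learn
    # the parameters of the hinted LINEAR relation from what remains.
    usable = [s for s in samples if context(s)]
    if len(usable) < 2:
        return None  # not enough data: request more samples
    xs = [s[x_metric] for s in usable]
    ys = [s[y_metric] for s in usable]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope and intercept of y = a*x + b

samples = [
    {"MySQL.CPU": 10, "MySQL.QPS": 100, "MySQL.BufPoolAlloc": 1024},
    {"MySQL.CPU": 20, "MySQL.QPS": 200, "MySQL.BufPoolAlloc": 1024},
    {"MySQL.CPU": 30, "MySQL.QPS": 150, "MySQL.BufPoolAlloc": 128},  # fails CONTEXT
]
a, b = fit_linear_hint(samples,
                       context=lambda s: s["MySQL.BufPoolAlloc"] >= 512,
                       x_metric="MySQL.CPU", y_metric="MySQL.QPS")
# a == 10.0, b == 0.0, learned from the two in-context samples only
```

The third sample is discarded because its buffer pool is below the 512 MB context threshold, which is precisely the "working set should be in RAM" condition the hint encodes.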

  21. Rank models and blend • 1. Divide the search space into regions • 2. n-fold cross-validation to rank • 3. Associate the best model to each region

  22. Prototype • “How should I partition resources between two applications A and B?” • Inputs: SelfTalk, catalog of models, data • MySQL & storage server

  23. Runtime Engine • Model a new workload by reusing similar data/models (Model Matching) • Expand samples and refine if necessary (Model Validation and Refinement) • A selective, iterative, and ensemble learning process backed by a Model Repository

  24. Ex 1: Predicting buffer pool latencies Analytical model replaced with data-driven models

  25. Ex 2: Model Transformation

  26. Guidance: Step function (3D to 2D reduction)

  27. With minimum new samples …

  28. Ex 3: Modeling and Job Scheduling for Brain • Data centers usually have a heterogeneous structure: a variety of multicores, GPUs, etc. • Different stages of the application have different resource demands (CPU- versus data-intensive) • Job scheduling to available resources becomes non-trivial • Guided modeling helps

  29. Functional MRI • Goal: studying brain functionality • Procedure: asking patients (subjects) to do a task and capturing brain slices, measuring blood oxygen level • Correlating images to identify brain activity

  30. Functional MRI • Overall pipeline: Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → Analysis Model → Results

  31. Functional MRI

  32. NPAIRS As Our Application • NPAIRS goal: processing images to find image correlations • Feature extraction: a common technique in image-processing applications (e.g., face recognition) • Using Principal Component Analysis to extract eigenvectors • Finding a set of eigenvectors that is a good representative of the whole set of subjects • Machine learning methods, heuristic search, etc.

  33. NPAIRS • [Figure: split-half resampling. The full data is split (Split J) into split-half 1 and split-half 2; each half’s “design” and scans produce a statistical parametric map (SPM SJ1, SPM SJ2), and the two maps are compared to yield a reproducibility estimate (r)]

  34. Output of NPAIRS

  35. NPAIRS Flowchart

  36. NPAIRS Profiling Results

  37. GPU Execution Profile

  38. NPAIRS Execution on Different Nodes

  39. Job Modeling: Exhaustive Sampling • Sample set: 1 to 99 • Fitness score R²: 0.995 • Total run time: 64933

  40. Uniform Sampling • Sample set: 2, 12, 22, 32, 42, 52, 62, 72, 82, 92 • Using 5-fold cross-validation • Fitness score R²: 0.990 • Total run time: 6368

  41. Guidance: Step Function + Fast Sampling • Sample set: 2, 4, 8, 12, 16, 20, 24, 32, 48, 96 • Fitness score R²: 0.993 • Total run time: 5313 • 16.6% time saving!

  42. Heterogeneous, CPU Only (3 Fat + 5 Light Nodes)

  43. 3 Fat, 3 Light, 3 GPU nodes

  44. Resource Utilization

  45. Overall Execution Time Comparison

  46. Conclusions • Big Data processing is driving a quantum leap in IT • It is hampered by slow progress in data center management • We propose to investigate guided modeling • Promising preliminary results with Neuroscience workloads: 7x speedup of NPAIRS on a small CPU+GPU cluster

  47. Backup Slides

  48. Modeling Procedure • Get the sample set and split it using 5-fold cross-validation • Fit the model using 4 folds of sample data • Test the model using the remaining fold • Try all 5 splits and sum the model error • If the error is less than the threshold, stop: we have found the model
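The procedure above can be sketched as follows. This is a minimal sketch of the 5-fold split-and-score loop; the constant model family used to exercise it is a hypothetical stand-in for the real candidate models:

```python
def five_fold_error(points, fit, predict, folds=5):
    # Split samples into 5 folds; for each split, fit on 4 folds,
    # test on the held-out fold, and sum the error over all 5 splits.
    total = 0.0
    for k in range(folds):
        test = points[k::folds]
        train = [p for i, p in enumerate(points) if i % folds != k]
        model = fit(train)
        total += sum((predict(model, x) - y) ** 2 for x, y in test)
    return total

# Hypothetical model family: predict the mean of the training ys.
fit = lambda pts: sum(y for _, y in pts) / len(pts)
predict = lambda model, x: model
points = [(x, 5.0) for x in range(10)]
err = five_fold_error(points, fit, predict)
# err == 0.0: a constant model fits constant data on every split
```

In the full procedure, this summed error is compared against the stopping threshold; if it is still too high, more samples are requested and the loop repeats.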

  49. Notes • Total run time is the sum of the sampling times; modeling time is negligible • Use the exhaustive data set as ground truth and the fitted model to predict values, then compute the coefficient of determination R²
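The R² computation described in the note is the standard coefficient of determination; the toy values below are illustrative, not data from the experiments:

```python
def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot, where SS_tot
    # is the variance of the ground truth around its mean and SS_res
    # is the model's residual sum of squares.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]   # exhaustive ("ground truth") samples
y_pred = [1.1, 1.9, 3.2, 3.8]   # fitted model's predictions
score = r_squared(y_true, y_pred)
# score near 1 means the fitted model explains the exhaustive data well
```

An R² of 0.990 or 0.993, as in the sampling experiments, therefore means the cheap sampled model reproduces the exhaustive 99-point curve almost exactly.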
