
Using Statistical Machine Learning in Cloud Computing


Presentation Transcript


  1. Using Statistical Machine Learning in Cloud Computing David Patterson, UC Berkeley, Reliable Adaptive Distributed Systems Lab Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

  2. Datacenter is new “server” “Program” == Web search, email, map/GIS, … “Computer” == 1000’s of computers, storage, network Warehouse-sized facilities and workloads New datacenter ideas (2007-2008): truck container (Sun), floating (Google), datacenter-in-a-tent (Microsoft) How to enable innovation in new services without first building & capitalizing a large company? 2 photos: Sun Microsystems & datacenterknowledge.com

  3. Outline • Cloud Computing • RAD Lab • Case Study 1: Director & SCADS • Cloud storage and automatic management • Case Study 2: Automatic Log Analysis • Find anomalous behavior • Case Study 3: Predicting MapReduce jobs • Help with scheduling • Summary

  4. Cloud Computing is Hot • 9/15/09 Federal CIO Vivek Kundra embraces Cloud Computing • “We’ve been building data center after data center, acquiring application after application, and frankly what it’s done is drive up the cost of technology immensely across the board. What we need is to find a more innovative path in addressing these problems.”

  5. But... What is cloud computing, exactly?

  6. “It’s nothing (new)” “...we’ve redefined Cloud Computing to include everything that we already do... I don’t understand what we would do differently ... other than change the wording of some of our ads.” Larry Ellison, CEO, Oracle (Wall Street Journal, Sept. 26, 2008)

  7. Above the Clouds: A Berkeley View of Cloud Computing abovetheclouds.cs.berkeley.edu • 2/09 White paper by RAD Lab PI’s and students • Clarify terminology around Cloud Computing • Quantify comparison with conventional computing • Identify Cloud Computing challenges & opportunities • Why can we offer new perspective? • Strong engagement with industry • Users of cloud computing in our own research and teaching in the last 18 months • Goal: stimulate discussion on what’s really new • without resorting to weather analogies ad nauseam

  8. Utility Computing Arrives • Amazon Elastic Compute Cloud (EC2) • “Compute unit” rental: $0.10-0.80/hr. • 1 CU ≈ 1.0-1.2 GHz 2007 AMD Opteron/Xeon core • No up-front cost, no contract, no minimum • Billing rounded to nearest hour; pay-as-you-go storage also available • A new paradigm (!) for deploying services?

  9. What is it? What’s new? • Old idea: Software as a Service (SaaS) • Basic idea predates MULTICS • Software hosted in the infrastructure vs. installed on local servers or desktops; dumb (but brawny) terminals • New: pay-as-you-go utility computing • Illusion of infinite resources on demand • Fine-grained billing: release == don’t pay • Earlier examples: Sun, Intel Computing Services—longer commitment, more $$$/hour, no storage • Public (utility) vs. private clouds

  10. Why Now (not then)? • “The Web Space Race”: Build-out of extremely large datacenters (10,000’s of commodity PCs) • Build-out driven by growth in demand (more users) => Infrastructure software: e.g., Google File System => Operational expertise: failover, DDoS, firewalls... • Discovered economy of scale: 5-7x cheaper than provisioning a medium-sized (100’s machines) facility • More pervasive broadband Internet • Commoditization of HW & SW • Fast Virtualization • Standardized software stacks

  11. Classifying Clouds • Instruction Set VM (Amazon EC2) • Managed runtime VM (Microsoft Azure) • Framework VM (Google AppEngine) • Tradeoff: flexibility/portability vs. “built in” functionality • [Diagram: a spectrum from lower-level, less managed (EC2) through Azure to higher-level, more managed (AppEngine)]

  12. Cloud Economics 101 • Cloud Computing User: Static provisioning for peak - wasteful, but necessary for SLA • [Figure: capacity and demand vs. time for a “statically provisioned” data center (machines; unused resources sit between flat capacity and varying demand) and a “virtual” data center in the cloud ($; capacity tracks demand)]

  13. Cloud Economics 101 • Cloud Computing Provider: Could save energy • [Figure: capacity and demand vs. time for a “statically provisioned” data center (energy spent on unused resources) and a real data center in the cloud (capacity tracks demand)]

  14. Risk of Under Utilization • Underutilization results if “peak” predictions are too optimistic • [Figure: static data center capacity vs. demand over time, with unused resources between them]

  15. Risks of Under Provisioning • [Figure: three capacity-vs-demand plots over time (days 1-3); when demand exceeds capacity the shortfall is first lost revenue, and as turned-away users stop returning, lost users]
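
A minimal sketch (in Python, with invented demand numbers) of the economics in the last three slides: static provisioning pays for peak capacity around the clock, pay-as-you-go capacity tracks demand, and under-provisioning turns excess demand into lost revenue.

    # Illustrative only: the demand series and $0.10/server-hour rate are
    # made up to show the shape of the argument, not measured data.
    demand = [40, 55, 90, 70, 30, 20, 60, 85]   # servers needed, per hour
    rate = 0.10                                 # $ per server-hour

    peak = max(demand)
    static_cost = peak * len(demand) * rate     # pay for peak every hour
    cloud_cost = sum(demand) * rate             # pay only for what is used
    idle = peak * len(demand) - sum(demand)     # unused server-hours

    # Under-provisioning: any capacity below peak loses the excess demand.
    capacity = 60
    lost = sum(max(d - capacity, 0) for d in demand)

    print(f"static ${static_cost:.2f} vs cloud ${cloud_cost:.2f}; "
          f"{idle} idle server-hours avoided; {lost} demand-hours lost at {capacity}")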

  16. New Scenarios Enabled by “Risk Transfer” to Cloud • “Cost associativity”: 1,000 CPUs for 1 hour same price as 1 CPU for 1,000 hours (@$0.10/hour) • Washington Post converted Hillary Clinton’s travel documents to post on WWW <1 day after released • RAD Lab graduate students demonstrate improved Hadoop (batch job) scheduler on 1,000 servers • Major enabler for SaaS startups • Animoto traffic doubled every 12 hours for 3 days when released as Facebook plug-in • Scaled from 50 to >3500 servers • ...then scaled back down
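
The “cost associativity” arithmetic, as a hedged sketch: the function name is invented, and the whole-hour rounding follows slide 8's billing description.

    import math

    def rental_cost(servers: int, hours: float, rate: float = 0.10) -> float:
        # Pay-as-you-go: cost = servers x billed hours x hourly rate
        # (whole-hour billing assumed, per slide 8).
        return servers * math.ceil(hours) * rate

    # 1,000 CPUs for 1 hour costs the same as 1 CPU for 1,000 hours...
    assert rental_cost(1000, 1) == rental_cost(1, 1000) == 100.0
    # ...but the 1,000-CPU answer arrives roughly 1,000x sooner.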

  17. Cloud Computing & Statistical Machine Learning? • Before CC, only performance optimization on mostly small-scale systems • CC => detailed cost-performance model • Optimization more difficult with more metrics • CC => everyone can use 1000+ servers • Optimization more difficult at large scale • Economics rewards scaling up AND down • Optimization more difficult if servers are added/dropped • => SML as optimization difficulty increases

  18. RAD Lab 5-year Mission Enable 1 person to develop, deploy, operate next-generation Internet application • Key enabling technology: Statistical machine learning • management, scaling, anomaly detection, performance prediction, ... • Highly interdisciplinary faculty & students • PI’s: Fox/Katz/Patterson (systems/networks), Jordan (machine learning), Stoica (networks & P2P), Joseph (systems/security), Franklin (databases) • 2 postdocs, ~30 PhD students, ~5 undergrads

  19. RAD Lab Prototype: System Architecture • [Architecture diagram: Web 2.0 apps with drivers run in a Ruby on Rails environment with web service APIs, over SCADS, local OS functions, and a VM monitor; Chukwa trace collection and X-Trace provide monitoring; offered load, resource utilization, etc. flow as training data to Log Mining and to performance & cost models, which inform the Director; the Director applies new apps, equipment, and global policies (e.g. SLA); Automatic Workload Evaluation (AWE) drives the system]

  20. Successes • Automatically add/drop servers to fit demand, without violating Service Level Agreement (SLA) • Predict performance of complex software system when demand is scaled up • Distill millions of lines of log messages into an operator-friendly “decision tree” that pinpoints “unusual” incidents/conditions • Recurring theme: cutting-edge Statistical Machine Learning (SML) works where simpler methods have failed

  21. Outline • Cloud Computing • RAD Lab • Case Study 1: Director & SCADS • Cloud storage and automatic management • Case Study 2: Automatic Log Analysis • Find anomalous behavior • Case Study 3: Predicting MapReduce jobs • Help with scheduling • Summary

  22. Automatic Management of a Datacenter • As datacenters grow, need to automatically manage the applications and resources • examples: • deploy applications • change configuration, add/remove virtual machines • recover from failures • Director: • mechanism for executing datacenter actions • Advisors: • intelligence behind datacenter management

  23. Director Framework • [Diagram: Advisors, informed by performance and workload models and by monitoring data, send requests to the Director; the Director maintains the configuration and uses drivers to act on VMs in the datacenter(s)]

  24. Director Framework • Director • issues low-level/physical actions to the DC/VMs • request a VM, start/stop a service • manage configuration of the datacenter • list of applications, VMs, … • Advisors • update performance, utilization metrics • use workload, performance models • issue logical actions to the Director • start an app, add 2 app servers
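
A minimal sketch of the Advisor/Director split just described; every class, method, and threshold here is invented for illustration, not the RAD Lab API.

    from dataclasses import dataclass, field

    @dataclass
    class Director:
        """Executes low-level datacenter actions and tracks configuration."""
        vms: list = field(default_factory=list)

        def add_vm(self, app: str) -> str:           # low-level/physical action
            vm = f"{app}-vm-{len(self.vms)}"
            self.vms.append(vm)
            return vm

        def remove_vm(self, vm: str) -> None:
            self.vms.remove(vm)

    @dataclass
    class ScalingAdvisor:
        """Issues logical actions based on monitored performance vs. the SLA."""
        sla_p99_ms: float = 100.0

        def advise(self, director: Director, p99_ms: float, app: str) -> None:
            if p99_ms > self.sla_p99_ms:
                director.add_vm(app)                 # scale up to protect SLA
            elif p99_ms < 0.5 * self.sla_p99_ms and director.vms:
                director.remove_vm(director.vms[-1]) # scale down to save money

    director = Director()
    ScalingAdvisor().advise(director, p99_ms=140.0, app="web")  # adds a VM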

  25. What About Storage? • Easy to imagine how to scale up and scale down computation • Databases don’t scale down, and usually run into limits when scaling up • What would it mean to have datacenter storage that could scale up and down as well, so as to save money for storage in idle times?

  26. DC Storage Motivation • Most popular websites follow the same pattern • Rapidly developed on SQLServer / PostgreSQL / MySQL / etc. • Become popular and realize scaling limitations • Build large, complicated ad-hoc systems to deal with scaling limitations as they arise • Websites that can’t scale fast enough lose customers

  27. SCADS: Scalable, Consistency-Adjustable Data Storage • Scale Independence - as the user base grows: • No changes to application • Cost per user doesn’t increase • Request latency doesn’t change • Key Innovations • Performance-safe query language • Declarative performance/consistency tradeoffs • Automatic scale up and down using machine learning

  28. Scale Independence Architecture • Developers provide performance-safe queries along with consistency requirements • Use ML, workload information, and requirements to provision proactively via repartitioning keys and replicas

  29. SCADS Performance Model (on m1.small, all data in memory) • [Plot: median and 99th-percentile latency vs. workload, for 1% and 5% write rates, with the SLA threshold marked]
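
A minimal sketch of how a per-server performance model like this can drive provisioning, with invented benchmark numbers: fit predicted 99th-percentile latency against request rate, then invert the model to find how much workload one server can carry under the SLA.

    import numpy as np

    # Invented single-server benchmark: request rate (req/s) vs. observed
    # 99th-percentile latency (ms). SCADS fits real measurements instead.
    rate = np.array([100, 200, 300, 400, 500], dtype=float)
    p99 = np.array([18, 25, 37, 60, 110], dtype=float)

    model = np.polynomial.Polynomial.fit(rate, p99, deg=2)  # stand-in model

    SLA_MS = 50.0
    grid = np.linspace(rate.min(), rate.max(), 1000)
    per_server = grid[model(grid) <= SLA_MS].max()  # max safe rate per server

    def servers_needed(offered_load: float) -> int:
        """Proactive provisioning: servers required to keep p99 under the SLA."""
        return int(np.ceil(offered_load / per_server))

    print(servers_needed(2000.0))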

  30. Low workload, low put rate

  31. Outline • Cloud Computing • RAD Lab • Case Study 1: Director & SCADS • Cloud storage and automatic management • Case Study 2: Automatic Log Analysis • Find anomalous behavior • Case Study 3: Predicting MapReduce jobs • Help with scheduling • Summary

  32. Mining Console Logs for Large-Scale System Problem Detection • Why console logs? • Console logs are everywhere • In almost every software system • Hand-picked information by developers • Expressive, convenient to use • Low programmer overhead • Especially useful in large scale Internet services • Open source code + in-house development • Continuously changing system • “Detecting Large-Scale System Problems by Mining Console Logs,” Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan, Symposium on Operating Systems Principles (SOSP), Oct 2009

  33. Console logs use awkward languages • HODIE NATUS EST RADICI FRATER* (“Today unto the Root a brother is born”) • * “that crazy Multics error message in Latin.” http://www.multicians.org/hodie-natus-est.html

  34. Console logs are not operator friendly • [Diagram: operators attack console logs with grep, Perl scripts, and search] • Problem - Don’t know what to look for! • Console logs are intended for a single developer • Assumption: log writer == log reader • Today many developers => massive textual logs • Our goal - Discover the most interesting log messages without any prior input

  35. Console logs are hard for machines too • [Pipeline diagram: parsing → feature creation → machine learning → visualization] • Problem • Highly unstructured, looks like free text • Free text cannot capture detailed program state • Hard for operators to understand detection results • Our contribution • A general framework for processing console logs • Efficient parsing and features

  36. Key observations • Parsing • Console logs are inherently structured • Determined by log printing statement • Constant strings = markers of message structure • Source code is usually available • Features • Common information reported in logs • Identifiers and execution paths • State variables • Correlations among messages • Many ways to group related log messages • i.e. not just by time

  37. Case studies • Surveyed a number of software apps • Linux kernel, OpenSSH, Apache, MySQL, Jetty • Hadoop, Cassandra, Nutch … • Parsing works on 20/22 • In this talk – Hadoop file system • Per block operation logging (usually ignored) • Experiment on EC2 cloud • 203 nodes * 48 hours • ~ 300 TB HDFS data (550,000 blocks) • ~24 million lines of console logs

  38. Step 1: Using source code analysis to help log parsing • Basic ideas • Non-trivial in object oriented languages • toString() method of objects • Dynamic binding (sub-classing) • Our Solution • Analyze source code to abstract syntax tree (AST) level • Pre-compute all possible message types • Free text -> semistructured text • Example: the statement Log.info(“Creating file ” + filename); produces the log line “091012 03-00-00 INFO Creating file mydata”, which parses to message type “Creating file (.*)” with variable [filename]
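
A minimal sketch of the matching step, assuming message templates like the one in the example above have already been extracted from the source; the regex comes from the slide, the helper names are invented.

    import re

    # Message types precomputed from source analysis; in the real system these
    # come from walking the AST of every log printing statement.
    TEMPLATES = [
        (re.compile(r"Creating file (.*)"), ["filename"]),
    ]

    def parse(line: str):
        """Turn a free-text log line into (message_type, {variable: value})."""
        for pattern, variables in TEMPLATES:
            m = pattern.search(line)
            if m:
                return pattern.pattern, dict(zip(variables, m.groups()))
        return None, {}

    print(parse("091012 03-00-00 INFO Creating file mydata"))
    # -> ('Creating file (.*)', {'filename': 'mydata'})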

  39. Step 2: Feature - Message count vector • Identifiers are common in logs • file names, object keys, user ids • Identifiers can be discovered automatically • Have many distinct values • Appear in multiple message types • Reported many times • Grouping by identifiers • Similar to execution paths • Numerical representation of these “paths” • Similar to the bag-of-words model in IR • Example path for myfile: Creating → Write → Backup → Done

  40. Message count vector example
  datanode_r16 | Receiving block blk_100 src: … dest: …
  namenode_r10 | allocateBlock: blk_100
  namenode_r10 | allocateBlock: blk_200
  datanode_r16 | Receiving block blk_200 src: … dest: …
  datanode_r14 | Receiving block blk_100 src: … dest: …
  datanode_r16 | Received block blk_100 of size 49486737 from …
  datanode_r14 | Received block blk_100 of size 49486737 from …
  datanode_r16 | Error Receiving block blk_200 of size 49486737 from …
  Resulting vectors, one dimension per message type:
  blk_100: [2 2 0 1 2 0 0 2 0 0 0 0 0 0 0 0]
  blk_200: [1 0 0 1 2 0 0 2 0 0 0 0 0 0 0 1]
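
A minimal sketch of turning parsed messages into count vectors, echoing the block IDs above; the event list and helper names are invented for illustration.

    from collections import defaultdict

    # (identifier, message type) pairs as they come out of parsing.
    events = [
        ("blk_100", "allocateBlock"), ("blk_100", "Receiving block"),
        ("blk_100", "Receiving block"), ("blk_100", "Received block"),
        ("blk_100", "Received block"), ("blk_200", "allocateBlock"),
        ("blk_200", "Receiving block"), ("blk_200", "Error Receiving block"),
    ]

    message_types = sorted({t for _, t in events})  # fixed vector dimensions

    vectors = defaultdict(lambda: [0] * len(message_types))
    for ident, mtype in events:
        vectors[ident][message_types.index(mtype)] += 1

    # One row per identifier, ready for PCA-based detection (next step).
    print(message_types)
    print(dict(vectors))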

  41. Step 3: Mining - PCA detection • 55,000 vectors, one per block, e.g. [0 2 2 1 2 0 0 2 0 1 0 0 0 0 0 0] • This example - detect abnormal vectors • Dimensions highly correlated • Rare correlations indicate abnormal executions • PCA separates normal pattern from abnormal, making anomalies easy to detect • From text -> structured feature • With details of program execution • Can apply many other machine learning algorithms
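
A minimal sketch of PCA-based detection in NumPy, following the standard residual-subspace recipe (project out the top principal components and flag vectors with a large remaining residual); the data and the percentile threshold are invented stand-ins for a proper statistical threshold.

    import numpy as np

    def pca_anomalies(X: np.ndarray, k: int = 2, pct: float = 99.0):
        """Flag rows whose energy outside the top-k principal components
        (squared prediction error) is unusually large."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        P = Vt[:k].T                                 # "normal" subspace
        residual = Xc - Xc @ P @ P.T
        spe = (residual ** 2).sum(axis=1)
        return spe > np.percentile(spe, pct)

    # Mostly similar message count vectors plus one odd block.
    rng = np.random.default_rng(0)
    X = np.tile([2, 2, 0, 1, 2, 0, 0, 2], (100, 1)) + rng.poisson(0.1, (100, 8))
    X[42] = [1, 0, 0, 1, 2, 0, 0, 9]                 # anomalous execution
    print(np.nonzero(pca_anomalies(X))[0])           # -> [42]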

  42. PCA detection results • [Chart: detection results, including false positives] • Making results easy for operators to understand!

  43. Explaining detection results with decision tree • [Decision tree over per-block message counts; each node tests a count against a threshold, e.g. “writeBlock # received exception” >= 1, “# Starting thread to transfer block # to #” >= 3, “#: Got exception while serving # to #:#” >= 1, “Unexpected error trying to delete block #. BlockInfo not found in volumeMap” >= 1, “addStoredBlock request received for # on # size # But it does not belong to any file” >= 1, “# Verification succeeded for #” >= 1, “Receiving block # src: # dest: #” >= 3; leaves label the block normal (0) or abnormal (1)]
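
A minimal sketch of this visualization step using scikit-learn: fit a shallow decision tree to imitate the detector's labels, then print it as operator-readable rules. The feature names echo the message types above; the data and labels are invented.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)
    X = rng.poisson(1.0, size=(500, 3))     # message counts per block
    y = (X[:, 2] >= 1).astype(int)          # pretend these are the PCA labels

    feature_names = [
        "Receiving block # src: # dest: #",
        "Received block # of size # from #",
        "writeBlock # received exception",
    ]

    # A shallow tree trades a little fidelity for human readability.
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=feature_names))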

  44. Summary: Mining Console Logs for Large-Scale System Problem Detection • [Pipeline: Extract (200 nodes, >24 million lines of logs) → Detect (abnormal log segments) → Visualize (a single decision tree to visualize system behavior)] • Contributions: how to parse logs; how to turn them into features ready for SML; how to make results accessible to operators; which SML algorithm is used matters less

  45. Outline • Cloud Computing • RAD Lab • Case Study 1: Director & SCADS • Cloud storage and automatic management • Case Study 2: Automatic Log Analysis • Find anomalous behavior • Case Study 3: Predicting MapReduce jobs • Help with scheduling • Summary

  46. Using SML to Characterize, Predict, Optimize Resources Given a system (e.g. Web Service, multicore …), its workload, and performance metrics: • Answer “what-if” questions for • Changes in workload (mix/rate/distribution) • Changes in system configuration (hw/sw) • Changes in both • 3 Examples for Case Study 3: • Parallel Database Performance Prediction • Hadoop Job Performance Prediction • Multi-core Performance Optimization (if time permits)

  47. Problem Statement • Find relationships between workload and performance • How to generate the model? • Manually constructed queueing model • Automatically constructed • Regression, genetic algorithms, correlation analysis, … • This talk: Canonical Correlation Analysis • [Diagram: Workload → SYSTEM → Behavior; the model maps one to the other]

  48. Finding Correlations

  49. Kernel Canonical Correlation Analysis (KCCA) • 2003, Bach & Jordan at Berkeley • Find dimensions of maximal correlation between kernelized datasets • “Kernel functions” impose system-specific definition of similarity • Dimensionality reduction + clustering effect Advantages: • Relative distance between datapoints (defines neighborhoods) • Interpolation (preserves neighborhoods across both datasets)
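
A compact sketch of regularized KCCA in NumPy, following the standard dual formulation (center the kernel matrices, then solve a regularized eigenproblem whose eigenvalues are squared canonical correlations); this is a toy stand-in, not the Bach & Jordan implementation, and all data is invented.

    import numpy as np

    def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def center(K: np.ndarray) -> np.ndarray:
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def kcca(Kx, Ky, reg=0.1, n_components=2):
        """Dual weights for the X view; eigenvalues ~ squared correlations."""
        n = Kx.shape[0]
        Kx, Ky = center(Kx), center(Ky)
        Rx, Ry = Kx + reg * np.eye(n), Ky + reg * np.eye(n)
        M = np.linalg.solve(Rx, Ky) @ np.linalg.solve(Ry, Kx)
        vals, vecs = np.linalg.eig(M)
        order = np.argsort(-vals.real)[:n_components]
        return vals.real[order], vecs.real[:, order]

    # Toy "workload" features X and correlated "behavior" metrics Y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(50, 2))
    corr2, alphas = kcca(rbf_kernel(X), rbf_kernel(Y))
    projections = center(rbf_kernel(X)) @ alphas   # runs in the shared space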

  50. 3 Challenges in using SML • Mapping function from traces to feature vectors • Workload -> multidimensional space • System behavior -> multidimensional space • How to compare similarity of two feature vectors • How to leverage results of SML techniques to find actionable answers
