Revolutionizing Data Systems: Strategies for Scalability

Solving the Scalability Dilemma with Clouds, Crowds, and Algorithms
Michael Franklin, UC Berkeley
Joint work with: Michael Armbrust, Peter Bodik, Kristal Curtis, Armando Fox, Randy Katz, Mike Jordan, Nick Lanham, David Patterson, Scott Shenker, Ion Stoica, Beth Trushkowsky, Stephen Tu and Matei Zaharia
Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

Save the Date(s): CIDR 2011 Conference
5th Biennial Conference on Innovative Data Systems Research
Jan 9-12, Asilomar, CA
• Abstracts Due: Sept 24, 2010
• Papers Due: October 1, 2010
• Focus: innovative and visionary approaches to data systems architecture and use.
• Regular CIDR track plus CCC-sponsored “outrageous ideas” track.
• Website coming soon!

Continuous Improvement of Client Devices

Computing as a Commodity

Ubiquitous Connectivity

AMP: Algorithms, Machines, People Massive and Diverse Data

The Scalability Dilemma
• State-of-the-art Machine Learning techniques do not scale to large data sets.
• Data Analytics frameworks can’t handle lots of incomplete, heterogeneous, dirty data.
• Processing architectures struggle with increasing diversity of programming models and job types.
• Adding people to a late project makes it later.
Exactly the Opposite of What We Expect and Need

RAD Lab 5-year Mission
Enable 1 person to develop, deploy, operate a next-generation Internet application at scale
Initial Technical Bet:
• Machine Learning to make large-scale systems self-managing
Multi-area faculty, postdocs, & students
• Systems, Networks, Databases, Security, Statistical Machine Learning all in a single, open, collaborative space
Corporate Sponsorship and intensive industry interaction
• Bi-annual 2.5 day offsite research retreats with sponsors

PIQL + SCADS
[Figure: architecture stack. Labels: “Active PIQL” (don’t ask); PIQL: Query Interface & Executor; Flexible Consistency Management; SCADS: Distributed Key Value Store]

SCADS: Scale Independent Storage

Scale Independence
• As a site’s user base grows and workload volatility increases:
  • No changes to application required
  • Cost per user remains constant
  • Request latency SLA is unchanged
• Key techniques
  • Model-Driven Scale Up and Scale Down
  • Performance Insightful Query Language
  • Declarative Performance/Consistency Tradeoffs

Over-provisioning a stateless system
Wikipedia example: overprovision by 25% to handle the spike (Michael Jackson dies)

Over-provisioning a stateful system
Wikipedia example: overprovision by 300% to handle the spike (assuming data stored on ten servers)

Data storage configuration
• Shared-nothing storage cluster
• (key,value) pairs in a namespace, e.g. (user,email)
• Each node stores a set of data ranges
• Data ranges can be split until some minimum size promised by PIQL, to ensure range queries don’t touch more than one node
[Figure: storage nodes holding key ranges such as A-C, D-E, F-G, F, G; some ranges appear on several nodes]
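
A minimal Scala sketch of the range layout described on this slide, not SCADS code. The names `KeyRange`, `nodesFor`, `split` and the node names are hypothetical, and ranges are assumed to be half-open intervals over string keys.

```scala
object RangePartitioning {
  // Half-open key range [start, end) and the nodes that hold a copy of it.
  final case class KeyRange(start: String, end: String, nodes: Vector[String]) {
    def contains(key: String): Boolean =
      key.compareTo(start) >= 0 && key.compareTo(end) < 0
  }

  // Nodes that can serve a key; empty if no range covers it.
  def nodesFor(ranges: Vector[KeyRange], key: String): Vector[String] =
    ranges.find(_.contains(key)).map(_.nodes).getOrElse(Vector.empty)

  // Split one range into two adjacent ranges that keep the same replicas.
  def split(r: KeyRange, at: String): Vector[KeyRange] = {
    require(r.contains(at) && at != r.start, "split point must fall strictly inside the range")
    Vector(r.copy(end = at), r.copy(start = at))
  }

  // Hypothetical layout loosely following the figure: A-C, D-E, F-G.
  val example: Vector[KeyRange] = Vector(
    KeyRange("a", "d", Vector("node1", "node2")),
    KeyRange("d", "f", Vector("node2", "node3", "node4")),
    KeyRange("f", "h", Vector("node3"))
  )
}
```

For instance, `RangePartitioning.nodesFor(RangePartitioning.example, "carol")` returns the two nodes holding the first range. A real system would also enforce the minimum range size mentioned above, which this sketch leaves out.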

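Workload-based policy stages
Stage 1: Replicate
[Figure: workload per bin against the threshold, with bins mapped to storage nodes]

Workload-based policy stages
Stage 2: Data Movement
[Figure: workload per bin against the threshold, with bins mapped to storage nodes and the destination node marked]

Workload-based policy stages
Stage 3: Server Allocation
[Figure: workload per bin against the threshold, with bins mapped to storage nodes]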

Workload-based policy
Policy input:
• Workload per histogram bin
• Cluster configuration
Policy output:
• Short actions (per bin)
Considerations:
• Performance model
• Overprovision buffer
Action Executor
• Limit actions to X kb/s
[Figure: control loop. Labels: SCADS namespace, sampled workload, Workload Histogram, smoothed workload, Policy, Performance Model, config, actions, Action Executor]
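
The three stages (replicate, move data, allocate a server) can be illustrated with a deliberately simplified Scala sketch. This is an assumption-laden toy, not the policy from the SCADS papers: `Bin`, `plan`, the action names, and the single fixed `threshold` are all hypothetical, and the performance model, overprovision buffer, and action rate limit are not modeled.

```scala
import scala.collection.mutable.ArrayBuffer

object WorkloadPolicy {
  sealed trait Action
  final case class Replicate(bin: String) extends Action
  final case class Move(bin: String, from: String, to: String) extends Action
  final case class AddServerFor(bin: String, from: String) extends Action

  // Observed workload (requests/sec) for one histogram bin and the server holding it.
  final case class Bin(name: String, server: String, load: Int)

  // Assumes every bin's server appears in `servers`.
  def plan(bins: Seq[Bin], servers: Seq[String], threshold: Int): Seq[Action] = {
    val actions = ArrayBuffer.empty[Action]

    // Stage 1: a bin hotter than one server can sustain is replicated.
    val (hot, rest) = bins.partition(_.load > threshold)
    hot.foreach(b => actions += Replicate(b.name))

    // Per-server load from the remaining bins, updated as bins are reassigned.
    var load: Map[String, Int] =
      servers.map(s => s -> rest.filter(_.server == s).map(_.load).sum).toMap

    // Stages 2 and 3: relieve overloaded servers, heaviest bins first.
    for (b <- rest.sortBy(b => -b.load)) {
      if (load(b.server) > threshold) {
        // Least-loaded server that can take the bin without crossing the threshold.
        val dest = servers
          .filter(s => s != b.server && load(s) + b.load <= threshold)
          .sortBy(s => load(s))
          .headOption
        load = load.updated(b.server, load(b.server) - b.load)
        dest match {
          case Some(d) =>
            load = load.updated(d, load(d) + b.load)
            actions += Move(b.name, b.server, d)
          case None =>
            // No existing server has room, so ask for a new one.
            actions += AddServerFor(b.name, b.server)
        }
      }
    }
    actions.toList
  }
}
```

The greedy ordering is a design choice made for brevity; any policy that maps a workload histogram and a cluster configuration to per-bin actions would fit the same interface.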

Example Experiment
Workload
• Ebates.com + Wikipedia’s MJ spike (see Bodik et al. SOCC 2010 for workload generation)
• One million (key,value) pairs, each ~256 bytes
Model: max sustainable workload per server
Cost:
• machine cost: 1 unit/10 minutes
• SLA: 99th percentile of get/put latency
Deployment
• using m1.small instances on EC2, 1GB of RAM
• server boot up time: 48 seconds
• Delay server removal until 2 minutes left

Goal: selectively absorb hotspot
[Figure: workload in thousand req / sec]

Actions during the spike
[Figure: data movement (Kb/s) and actions during the spike, 10:00 to 10:14. Actions shown: add replica; move data, partition; move data, coalesce]

Configuration at end of Spike Per server workload and # keys after added replicas

Cost-comparison to fixed and optimal • Fixed allocation policy: 648 server units • Optimal policy: 310 server units
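
As a small worked example of the comparison above, here is a Scala sketch that assumes one server unit means one server running for one 10-minute interval, matching the machine cost stated for the experiment. `serverUnits` and `savings` are hypothetical helper names, and the slide does not say how the two totals were actually computed.

```scala
object ServerCost {
  // Total cost in server units, given how many servers ran in each 10-minute interval.
  def serverUnits(serversPerInterval: Seq[Int]): Int = serversPerInterval.sum

  // Fraction of the baseline cost that the alternative saves.
  def savings(baseline: Int, actual: Int): Double =
    1.0 - actual.toDouble / baseline
}
```

With the slide's totals, `ServerCost.savings(648, 310)` is roughly 0.52, so the optimal policy needs a little under half the server units of the fixed allocation.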

PIQL [Armbrust et al. SIGMOD 2010 (demo) and SOCC 2010 (design paper)]
• “Performance Insightful” language subset
• Compiler reasons about operation bounds
• Unbounded queries are disallowed
• Queries above specified threshold generate a warning
• Predeclare query templates: Optimizer decides what indexes are needed (i.e., materialized views)
• Provides: Bounded number of operations
• + Strong SLAs = Predictable performance?
[Figure labels: NoSQL, RDBMS]

PIQL DDL
    ENTITY Thought {
      int timestamp,
      string owner,
      string text,
      FOREIGN KEY owner REFERENCES User,
      PRIMARY KEY(owner, timestamp)
    }
    ENTITY User {
      string username,
      string password,
      PRIMARY KEY(username)
    }
    ENTITY Subscription {
      boolean approved,
      string owner,
      string target,
      FOREIGN KEY owner REF User,
      FOREIGN KEY target REF User MAX 5000,
      PRIMARY KEY(owner, target)
    }
• F.K.s are Required for Joins
• Cardinality Limits required for un-paginated Joins

More Queries “Return the most recent thoughts from all of my “approved” subscriptions.” Operations are bounded via schema and limit max
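
One way to picture how schema limits and LIMIT clauses bound a query is the Scala sketch below. It is a crude illustration, not the PIQL compiler's analysis: `check`, the verdict names, and the choice of multiplying per-step fan-outs are all assumptions.

```scala
object BoundCheck {
  sealed trait Verdict
  case object Rejected extends Verdict             // some step has no bound
  final case class Warn(ops: Long) extends Verdict // bounded, but above the threshold
  final case class Ok(ops: Long) extends Verdict

  // Each entry is the fan-out of one query step: Some(n) when capped by a schema
  // limit (MAX n) or a LIMIT n clause, None when nothing caps it.
  def check(fanOuts: Seq[Option[Long]], warnThreshold: Long): Verdict =
    if (fanOuts.exists(_.isEmpty)) Rejected
    else {
      // Crude upper bound: multiply the caps of the nested steps.
      val ops = fanOuts.flatten.product
      if (ops > warnThreshold) Warn(ops) else Ok(ops)
    }
}
```

For the subscription query, a cap of 5000 subscriptions per user combined with a hypothetical per-subscription limit of 10 thoughts gives `BoundCheck.check(Seq(Some(5000L), Some(10L)), 100000L)`, which evaluates to `Ok(50000)`. Dropping either cap yields `Rejected`.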

PIQL: Help Fix “Bad” Queries
• Interactive Query Visualizer
• Shows record counts and # ops
• Highlights unbounded parts of query
• SIGMOD’10 Demo: piql.knowsql.org
[Figure labels: NoSQL, RDBMS]

PIQL + SCADS
• Goals are “Scale Independence” and “Performance Insightfulness”
• SCADS provides scalable foundation with SLA adherence
• PIQL uses language restrictions, schema limits, and precomputed views to bound # of SCADS operations per query.
• These work together to bridge the gap between “SQL” and “NoSQL” worlds.

Spark: Support for Iterative Data-Intensive Computing
M. Zaharia et al., HotCloud Workshop 2010

Analytics: Logistic Regression
Goal: find best line separating 2 datasets
[Figure: scatter of + and – points, with a random initial line and the target line]

Serial Version
    val data = readData(...)
    var w = Vector.random(D)           // random initial line
    for (i <- 1 to ITERATIONS) {
      var gradient = Vector.zeros(D)
      for (p <- data) {
        // contribution of point p to the logistic-loss gradient
        val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
        gradient += scale * p.x
      }
      w -= gradient
    }
    println("Final w: " + w)
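
The slide code depends on helpers (`readData`, `Vector`, `D`, `ITERATIONS`) that are not shown. The following self-contained Scala sketch applies the same update rule to plain arrays so it can be run as is. `SerialLogReg`, `Point`, `train`, `predict`, and `toyData` are hypothetical names, the toy data is made up, and the initial `w` is passed in rather than drawn at random so that runs are reproducible.

```scala
import scala.math.exp

object SerialLogReg {
  // A labeled point: features x and a label y that is either +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Same gradient step as the slide, with no learning rate.
  def train(data: Seq[Point], w0: Array[Double], iterations: Int): Array[Double] = {
    val w = w0.clone()
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(w.length)(0.0)
      for (p <- data) {
        val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        for (j <- w.indices) gradient(j) += scale * p.x(j)
      }
      for (j <- w.indices) w(j) -= gradient(j)
    }
    w
  }

  // Classify by which side of the line w . x = 0 the point falls on.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else -1.0

  // Made-up data: three "+" points and their mirror images as "-" points.
  val toyData: Seq[Point] = Seq(
    Point(Array(1.0, 2.0), 1.0),
    Point(Array(2.0, 1.0), 1.0),
    Point(Array(2.0, 3.0), 1.0),
    Point(Array(-1.0, -2.0), -1.0),
    Point(Array(-2.0, -1.0), -1.0),
    Point(Array(-2.0, -3.0), -1.0)
  )
}
```

Starting from a deliberately wrong line, `SerialLogReg.train(SerialLogReg.toyData, Array(-1.0, -1.0), 5)` ends with a `w` that separates the two toy classes.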

Spark Version
    // cache() keeps the parsed points in memory across iterations
    val data = spark.hdfsTextFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      // accumulator: workers add to it, the driver reads .value
      var gradient = spark.accumulator(Vector.zeros(D))
      for (p <- data) {
        val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
        gradient += scale * p.x
      }
      w -= gradient.value
    }
    println("Final w: " + w)

Spark Version (loop body written as an explicit closure)
    val data = spark.hdfsTextFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      var gradient = spark.accumulator(Vector.zeros(D))
      // the closure passed to foreach is the same loop body as before
      data.foreach(p => {
        val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
        gradient += scale * p.x
      })
      w -= gradient.value
    }
    println("Final w: " + w)

Iterative Processing Dataflow
[Figure: dataflow of f(x,w) over inputs x and w across iterations, comparing Hadoop / Dryad with Spark]

Performance
[Figure: time per iteration. Labels: 40s / iteration; first iteration 60s; further iterations 2s]

What about the People?

Participatory Culture – “Indirect” John Murrell: GM SV 9/17/09 …every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.

Participatory Culture - Direct

Crowdsourcing Example From: Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones, Mobisys 2010.

Mechanical Turk vs. Cluster Computing • What challenges are similar? • What challenges are new? • Allocation, Cost, Reliability, Quality, Bias, Making jobs appealing, ….

AMP: Algorithms, Machines, People Massive and Diverse Data

Clouds and Crowds The Future Cloud will be a Hybrid of These.

AMPLab Technical Plan
• Machine Learning & Analytics (Jordan, Fox, Franklin)
  • Error Bars on all Answers
  • Active learning, continuous/adaptive improvement
• Data Management (Franklin, Joseph)
  • Pay-as-you-go integration and structure
  • Privacy
• Infrastructure (Stoica, Shenker, Patterson, Katz)
  • Nexus cloud OS and analytics languages
• Hybrid Crowd/Cloud Systems (Bayen, Waddell)
  • Incentive structures, systems aspects

Guiding Use Cases • Crowdsourced Sensing, Work, Policy, Journalism • Urban Micro-Simulation

Algorithms, Machines & People
Enable many people to collaborate to collect, generate, clean, make sense of and utilize lots of data.
• A holistic view of the entire stack.
• Highly interdisciplinary faculty & students
• Developing a five-year plan; will dovetail with RAD Lab completion
For more information: franklin@cs.berkeley.edu
