Sampling Based Range Partition for Big Data Analytics + Some Extras

Milan Vojnović, Microsoft Research, Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, and Jingren Zhou
INQUEST Workshop, September 2012

Big Data Analytics
• Our goal: innovation in algorithms for large-scale computations, to move the frontier of the computer science of big data
• Some figures of scale
  • Petabytes/terabytes of online services data processed daily
  • 200M tweets per day (Twitter)
  • 1B pieces of content shared per day (Facebook)
  • 8,000 exabytes of global data by 2015 (The Economist)

Research Agenda
[Figure labels: Machine learning, Database queries, Optimization, Distributed computing system]

Outline
• Range Partition, with Fei Xu and Jingren Zhou
• Count Tracking, with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (definition only), with Charalampos Tsourakakis and Bozidar Radunovic

Range Partition
• Special interest: balanced range partition
[Figure: data items held at sites 1, 2, ..., k are assigned to ranges 1, 2, ..., m; example ranges: 1-100, 101-250, ..., 950-1024]

Range Partition Requirements
• Given and and desired relative partition sizes
• -accurate range partition: with probability at least
  = number of data items assigned to range

Two Approaches
• Sampling based methods
  • Take a sample of data items
  • Compute partition boundaries using the sample
• Quantile summary methods
  • At each node compute a local quantile summary
  • Merge at the coordinator node
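The sampling based approach can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `boundaries_from_sample` and `assign_range`, the uniform sampling without replacement, and the sample size of 2,000 are all assumptions made for the example.

```python
import bisect
import random

def boundaries_from_sample(sample, rel_sizes):
    """Return m-1 boundaries that split the sorted sample according to
    the desired relative partition sizes (which should sum to 1)."""
    s = sorted(sample)
    boundaries = []
    cum = 0.0
    for p in rel_sizes[:-1]:
        cum += p
        # Index of the sample quantile at cumulative mass `cum`.
        idx = min(len(s) - 1, max(0, int(round(cum * len(s))) - 1))
        boundaries.append(s[idx])
    return boundaries

def assign_range(x, boundaries):
    """Range index of x: range i holds the items in (b[i-1], b[i]]."""
    return bisect.bisect_left(boundaries, x)

random.seed(0)
data = [random.randint(1, 1024) for _ in range(100_000)]
sample = random.sample(data, 2_000)   # sample size is an arbitrary choice
bounds = boundaries_from_sample(sample, [0.25, 0.25, 0.25, 0.25])
counts = [0] * 4
for x in data:
    counts[assign_range(x, bounds)] += 1
print(bounds, counts)
```

The printed counts show how close the resulting ranges come to the desired equal sizes for this particular sample.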

Related Work
• Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998)
  Required sample size:
• Communication cost to draw samples without replacement (Tirthapura and Woodruff, 2011):
  For:
  Otherwise:

Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001)
  Communication cost =
• Pros
  • Deterministic guarantee
• Cons
  • It requires sorting of data items
  • Largest frequency of an item must be at most

Problem
• Range-partition the data in one pass through the data, with minimal communication between the coordinator and the sites

Sampling Based Method
• Collect samples and partition using the samples
[Figure: sites 1, 2, ..., k and the coordinator]
• Pros
  • Simplicity, scalability
• Cons
  • How many samples to take from each site? Data size imbalance: the number of input data records may differ from one machine to another

Data Sizes Imbalance

Origins of Data Sizes Imbalance
• JOIN
  SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL
• Lookup Table
  If the record value of column X is in the lookup table, then return the row
• UNPIVOT
  Input:
    Col 1 | Col 2
    1     | 2, 3
    2     | 3, 9, 8, 13
    …
  Output: (1,2), (1,3), (2,3), (2,9), …
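The UNPIVOT case can be made concrete with a small sketch. The helper name `unpivot` and the split of the two input rows across two machines are assumptions for illustration; the input values are the ones listed above.

```python
def unpivot(rows):
    """Expand (key, [v1, v2, ...]) rows into one (key, v) record per value."""
    return [(key, v) for key, values in rows for v in values]

rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
print(unpivot(rows))
# [(1, 2), (1, 3), (2, 3), (2, 9), (2, 8), (2, 13)]

# Two machines with one input row each end up with different output sizes.
machine_a, machine_b = rows[:1], rows[1:]
print(len(unpivot(machine_a)), len(unpivot(machine_b)))
# 2 4
```

Equal numbers of input rows per machine therefore do not imply equal numbers of records entering the range partition step.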

Weighted Sampling Scheme
• SAMPLE: Each site reports a random sample of t/k data items and the total number of items
• MERGE: Summary created by adding each data item from site for times
• PARTITION: Use the summary to determine partition boundaries
Note: a site can report its total number of data items only once that number is available, that is, after the site has made one pass through its local data
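The three steps can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the replication count in the MERGE step is not legible in the source, so the sketch assumes that an item from a site holding n_i items and reporting s_i samples carries weight n_i / s_i. The function names and the two example sites are also hypothetical.

```python
import random
from itertools import accumulate

def site_sample(local_data, s):
    """SAMPLE: up to s items drawn without replacement, plus the local
    item count (known once the site has passed over its data)."""
    s = min(s, len(local_data))
    return random.sample(local_data, s), len(local_data)

def merge(reports):
    """MERGE: build a weighted summary, sorted by item value.
    ASSUMPTION: an item from a site with n_i items and s_i samples
    gets weight n_i / s_i (stands in for replicating the item)."""
    summary = []
    for items, n_i in reports:
        if items:
            w = n_i / len(items)
            summary.extend((x, w) for x in items)
    summary.sort(key=lambda pair: pair[0])
    return summary

def partition(summary, rel_sizes):
    """PARTITION: read boundaries off the weighted empirical CDF.
    Expects a non-empty summary."""
    total = sum(w for _, w in summary)
    cdf = list(accumulate(w for _, w in summary))
    boundaries, target, j = [], 0.0, 0
    for p in rel_sizes[:-1]:
        target += p * total
        while j < len(cdf) - 1 and cdf[j] < target:
            j += 1
        boundaries.append(summary[j][0])
    return boundaries

random.seed(0)
site_a = [random.randint(1, 512) for _ in range(90_000)]      # large site
site_b = [random.randint(513, 1024) for _ in range(10_000)]   # small site
t, k = 2_000, 2
reports = [site_sample(d, t // k) for d in (site_a, site_b)]
print(partition(merge(reports), [0.25, 0.25, 0.25, 0.25]))
```

Because both sites report the same number of samples, the weights are what keep the small site from pulling the boundaries toward its own value range.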

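SAMPLE
[Figure: sites 1, 2, ..., k report to the coordinator]

MERGE
[Figure: the coordinator adds replicas of the reported items to the summary]

PARTITION
[Figure: empirical CDF of the data summary at the coordinator, from 0 to 1, split into ranges 1 to 5]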

Sufficient Sample Size • Assume For sample size-accurate range partition w. p. • largest frequency of a data value

Constant Factor Imbalance • Suppose that for some • Then

Proof Outline • Large deviation analysis of the error exponent:

Performance • DataSet-1 • 100K data records per range,

Performance (cont’d)

Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
• Code transfer to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, Vojnović, Xu, Zhou, MSR-TR-2012-18, Mar 2012

SUM Tracking Problem: Maintain estimate
[Figure: sites 1, 2, 3, ..., k and their SUM]

SUM Tracking

Applications
• Ex 1: database queries
  SELECT SUM(AdBids) FROM Ads
• Ex 2: iterative solving
  input data

State of the Art
• Count tracking [Huang, Yi and Zhang, 2011]
  • Worst-case input, monotonic sum
  • Expected total communication: messages
• Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009]
  • Expected total communication: messages

The Challenge
• Q: Which algorithms are communication-efficient for the sum tracking problem with random input streams?
  • Random permutation
  • Random i.i.d.
  • Fractional Brownian motion

Communication Complexity Bounds • Lower bound: • Upper bound: Sublinear, “price of non-monotonicity”:

Communication Complexity Bounds: Unknown Drift Case
• Input: i.i.d. Bernoulli
  : unknown drift parameter
  Expected total communication: messages
• Generalizes the monotonic case to the constant drift case

Our Tracker Algorithm
• Each site reports to the coordinator upon receiving a value update with probability
• Sync all whenever the coordinator receives an update from a site
[Figure: sites and coordinator; labels: Xi, Mi = 1, S1, ..., Sk, S = S1 + … + Sk]
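The two rules above can be simulated with a short sketch. The reporting probability is not legible in the source, so `report_prob` is a hypothetical parameter, and the rule used in the demo (report more often while the last synchronized sum is small) is an assumption chosen only to make the example run. The function name `track_sum` and the choice to count synchronizations rather than individual messages are also illustrative.

```python
import random

def track_sum(stream, k, report_prob, rng):
    """Simulate the tracker on a stream of (site, value) updates.
    report_prob(s_hat) is a HYPOTHETICAL reporting probability given the
    last synchronized sum s_hat. Returns the coordinator's estimate after
    each update and the number of synchronizations."""
    local = [0] * k   # exact local sums S_1, ..., S_k held at the sites
    s_hat = 0         # last synchronized global sum
    syncs = 0
    estimates = []
    for site, x in stream:
        local[site] += x
        if rng.random() < report_prob(s_hat):
            # The site reports; the coordinator then syncs all k sites,
            # which costs a number of messages proportional to k.
            s_hat = sum(local)
            syncs += 1
        estimates.append(s_hat)
    return estimates, syncs

rng = random.Random(1)
k, n, eps = 4, 10_000, 0.1
stream = [(rng.randrange(k), rng.choice((-1, 1))) for _ in range(n)]
# HYPOTHETICAL rule, not taken from the source.
prob = lambda s: min(1.0, 1.0 / (eps * max(abs(s), 1)) ** 2)
estimates, syncs = track_sum(stream, k, prob, rng)
print(syncs, "syncs for", n, "updates")
```

With a reporting probability of 1 the estimate is always exact and every update triggers a sync; lowering the probability trades accuracy for communication.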

Two Applications • Second Frequency Moment • Bayesian Linear Regression

App 1: Second Frequency Moment • Input: • Counter of value : • Second frequency moment: • Goal: track within relative accuracy

AMS Sketch {0,1} valued hash • For and , within w. p.
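A minimal sketch of an AMS-style estimator of the second frequency moment is given below. It follows the standard textbook construction with random ±1 signs (a {0,1}-valued hash h maps to signs via 2h - 1) combined by a median of means; the class name `AMSSketch`, the lazily drawn per-item signs standing in for a 4-wise independent hash, and the group sizes are assumptions, not details taken from the slides.

```python
import random
from statistics import mean, median

class AMSSketch:
    """Estimate F2 = sum_j f_j^2 from a stream of (item, delta) updates."""

    def __init__(self, groups, per_group, seed=0):
        self.rng = random.Random(seed)
        self.groups, self.per_group = groups, per_group
        self.signs = {}                       # item -> list of +/-1 signs
        self.z = [0] * (groups * per_group)   # one signed counter per hash

    def update(self, item, delta=1):
        # Draw the item's signs on first sight, then reuse them.
        if item not in self.signs:
            self.signs[item] = [self.rng.choice((-1, 1)) for _ in self.z]
        for i, s in enumerate(self.signs[item]):
            self.z[i] += s * delta

    def estimate(self):
        # Each z^2 is an unbiased estimate of F2; average within groups,
        # then take the median across groups.
        sq = [z * z for z in self.z]
        g = self.per_group
        means = [mean(sq[i * g:(i + 1) * g]) for i in range(self.groups)]
        return median(means)

sk = AMSSketch(groups=5, per_group=20)
for item in [1, 2, 2, 3, 3, 3]:
    sk.update(item)
print(sk.estimate())   # true F2 is 1 + 4 + 9 = 14; the estimate is random
```

Each counter is a plain sum of signed updates, and negative updates are allowed, so every counter is itself a non-monotonic sum of the kind the tracker maintains.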

App 1: Second Frequency Moment (cont’d) • Sum tracking: • Expected total communication:

App 2: Bayesian Linear Regression
• Feature vector , output
• Prior, posterior

App 2: Bayesian Linear Regression (cont’d) • Posterior mean and precision: • Sum tracking: • Under random permutation input, the expected communication cost =
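To illustrate why the posterior reduces to sum tracking, here is a minimal sketch for a scalar feature. It uses the standard textbook conjugate-Gaussian model, which is an assumption since the slide's own formulas are not legible: y = w·x + noise, prior w ~ N(prior_mean, 1/prior_prec), and known noise precision noise_prec. The function names are hypothetical.

```python
def site_stats(xs, ys):
    """Local sufficient statistics at one site: (sum of x^2, sum of x*y)."""
    return sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys))

def posterior(stats, prior_mean=0.0, prior_prec=1.0, noise_prec=1.0):
    """Posterior (mean, precision) of the scalar weight w, given the
    per-site statistics. Only the two global sums are needed."""
    sxx = sum(s[0] for s in stats)
    sxy = sum(s[1] for s in stats)
    prec = prior_prec + noise_prec * sxx
    mean = (prior_prec * prior_mean + noise_prec * sxy) / prec
    return mean, prec

# Two sites observing points on the line y = 2x.
stats = [site_stats([1, 2], [2, 4]), site_stats([3], [6])]
print(posterior(stats))
```

The posterior depends on the data only through two sums that add up across sites, so tracking the posterior amounts to tracking those sums.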

Summary for Sum Tracking
• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012

Problem
• Partition a graph with two objectives
  • Sparsely connected components
  • Balanced number of vertices per component
• Applications
  • Parallel processing
  • Community detection
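The two objectives can be measured for any given partition with a few lines of code. This sketch only evaluates a partition; it is not the streaming algorithm from the talk. The function name `partition_quality` and the example graph are assumptions.

```python
from collections import Counter

def partition_quality(edges, assignment):
    """Return (number of cut edges, size of the largest component).
    `assignment` maps each vertex to its component index."""
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    sizes = Counter(assignment.values())
    return cut, max(sizes.values())

# A 4-cycle split into two components of two vertices each.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
assignment = {1: 0, 2: 0, 3: 1, 4: 1}
print(partition_quality(edges, assignment))
# (2, 2)
```

A good partition keeps the first number small (sparse connections between components) while keeping the second close to the number of vertices divided by the number of components (balance).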

Problem (cont’d)
• Requirements
  • Streaming algorithm
  • Single pass / incremental
  • Efficient computing
• Desired
  • Approximation guarantees
  • Average-case efficient

Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to that of previously proposed online heuristics
• Provable approximation guarantees
• More details available soon
