This talk presents algorithms for large-scale computations in big data analytics, focusing on sampling-based range partitioning. Data now arrives at the scale of hundreds of millions of tweets and a billion shared content pieces per day, and the approach aims to handle such datasets efficiently. The talk covers balanced range partitioning, a weighted sampling scheme, and the merging of per-site samples at a coordinator, with provable performance guarantees. It also outlines results on sum tracking over distributed streams and on streaming graph partitioning.
Sampling Based Range Partition for Big Data Analytics + Some Extras • Milan Vojnović, Microsoft Research, Cambridge, United Kingdom • Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou • INQUEST Workshop, September 2012
Big Data Analytics • Our goal: innovation in the area of algorithms for large scale computations to move the frontier of the computer science of big data • Some figures of scale • Petabytes / terabytes of online services data processed daily • 200M tweets per day (Twitter) • 1B content pieces shared per day (Facebook) • 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda • Machine learning • Database queries • Optimization • Distributed computing system
Outline • Range Partition with Fei Xu and Jingren Zhou • Count Tracking with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only) with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition • Special interest: balanced range partition [Figure: data items held at sites 1, 2, ..., k are assigned by value to ranges 1, 2, ..., m, for example 1-100, 101-250, ..., 950-1024]
Range Partition Requirements • Given and and desired relative partition sizes • -accurate range partition: with probability at least = number of data items assigned to range
Two Approaches • Sampling based methods • Take a sample of data items • Compute partition boundaries using the sample • Quantile summary methods • At each node compute a local quantile summary • Merge at the coordinator node
Related Work • Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998) Required sample size: • Communication cost to draw samples without replacement (Tirthapura and Woodruff, 2011): For Otherwise:
Related Work (cont’d) • Quantile summaries based approach (Greenwald and Khanna, 2001) Communication cost = • Pros • Deterministic guarantee • Cons • It requires sorting of data items • Largest frequency of an item must be at most
Problem • Range partition data while making one pass through data with minimal communication between the coordinator and sites
Sampling Based Method • Collect samples and partition using the samples [Figure: sites 1, 2, ..., k send samples to the coordinator] • Pros • simplicity, scalability • Cons • how many samples to take from each site? • data size imbalance: the number of input records may differ from one machine to another
Origins of Data Size Imbalance • JOIN: SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL • Lookup Table: if the record value of column X is in the lookup table, then return the row • UNPIVOT: Input rows (Col 1, Col 2): (1; 2, 3), (2; 3, 9, 8, 13), … Output: (1,2), (1,3), (2,3), (2,9), …
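The UNPIVOT case can be made concrete with a minimal Python sketch. The function name `unpivot` and the representation of a row as a `(key, values)` pair are assumptions made for illustration, not part of the original slides. The sketch shows how one input row can expand into a different number of output records, which is one way record counts become unequal across machines.

```python
def unpivot(rows):
    """Expand each (key, values) row into one (key, value) pair per value."""
    return [(key, value) for key, values in rows for value in values]


# The two input rows shown on the slide.
rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
pairs = unpivot(rows)
print(pairs)
```

Here the first row yields two output records and the second yields four, so two machines holding one input row each would end up with different output sizes.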
Weighted Sampling Scheme • SAMPLE: Each site reports a random sample of t/k data items and its total number of data items • MERGE: Summary created by adding each data item from site for times • PARTITION: Use the summary to determine partition boundaries • Note: a site can report its total number of data items only after it has made one pass through its local data
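SAMPLE [Figure: sites 1, 2, ..., k each send a sample to the coordinator]
MERGE [Figure: the coordinator adds replicas of each sampled item to the summary]
PARTITION [Figure: empirical CDF of the data summary at the coordinator, on a vertical axis from 0 to 1, split into ranges 1 to 5]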
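The SAMPLE, MERGE and PARTITION steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The replica count is missing from the extracted slide, so the sketch assumes each sampled item is weighted by its site's item count divided by the number of samples that site returned, which is proportional to the site's item count when all sites return equally many samples. The function names and the weighted-pair representation of the summary are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def sample_site(items, per_site, rng):
    """SAMPLE: a site reports a uniform random sample and its item count."""
    size = min(per_site, len(items))
    return rng.sample(items, size), len(items)


def merge(reports):
    """MERGE: build a sorted summary of (value, weight) pairs.

    Assumed weighting: each sampled item stands for n_i / len(sample)
    items of its site, instead of materialising that many replicas.
    """
    summary = []
    for sample, n_i in reports:
        if sample:
            weight = n_i / len(sample)
            summary.extend((value, weight) for value in sample)
    summary.sort()
    return summary


def partition_boundaries(summary, rel_sizes):
    """PARTITION: read boundaries off the weighted empirical CDF."""
    if not summary:
        raise ValueError("summary is empty")
    values = [value for value, _ in summary]
    cum = list(accumulate(weight for _, weight in summary))
    total = cum[-1]
    boundaries = []
    target = 0.0
    for p in rel_sizes[:-1]:          # the last range needs no upper boundary
        target += p
        idx = bisect_left(cum, target * total)
        boundaries.append(values[min(idx, len(values) - 1)])
    return boundaries
```

An item `x` would then be assigned to range `bisect_left(boundaries, x)`. Weighting by site size is what corrects for data size imbalance: a site holding nine times more data contributes nine times more mass to the summary even though it sends the same number of samples.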
Sufficient Sample Size • Assume For sample size : -accurate range partition w.p. • largest frequency of a data value
Constant Factor Imbalance • Suppose that for some • Then
Proof Outline • Large deviation analysis of the error exponent:
Performance • DataSet-1 • 100K data records per range,
Summary for Range Partitioning • Novel weighted sampling scheme • Provable performance guarantees • Simple and practical • Code transfer to Cosmos • More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline • Range Partition with Fei Xu and Jingren Zhou • Count Tracking with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only) with Charalampos Tsourakakis and Bozidar Radunovic
SUM Tracking • Problem: Maintain estimate • SUM: [Figure: sites 1, 2, 3, ..., k]
Applications • Ex 1: database queries: SELECT SUM(AdBids) FROM Ads • Ex 2: iterative solving input data
State of the Art • Count tracking [Huang, Yi and Zhang, 2011] • Worst-case input, monotonic sum • Expected total communication: messages • Lower bound for worst case input[Arackaparambil, Brody and Chakrabarti, 2009] • Expected total communication messages
The Challenge • Q: What are communication cost efficient algorithms for the sum tracking problem with random input streams? • Random permutation • Random i.i.d. • Fractional Brownian motion
Communication Complexity Bounds • Lower bound: • Upper bound: Sublinear, “price of non-monotonicity”:
Communication Complexity Bounds: Unknown Drift Case • Input: i.i.d. Bernoulli : unknown drift parameter Expected total communication: messages • Generalizes monotonic case to constant drift case
Our Tracker Algorithm • Each site reports to the coordinator upon receiving a value update with probability • Sync all whenever the coordinator receives an update from a site [Figure: site i receives update Xi and reports when Mi = 1; the sites hold S1, …, Sk and the estimate S, and the coordinator maintains S = S1 + … + Sk]
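The two rules of the tracker can be simulated with a short Python sketch. The reporting probability is missing from the extracted slide, so the sketch takes it as a caller-supplied function `report_prob`. That name, the function `track_sum`, and the example probability in the usage note are assumptions for illustration only. The sketch counts sync rounds rather than individual messages, on the assumption that each sync costs a number of messages proportional to the number of sites.

```python
import random


def track_sum(updates, num_sites, report_prob, rng):
    """Simulate a coordinator tracking the sum of updates spread over sites.

    updates: iterable of (site, value) pairs, values may be negative.
    report_prob: hypothetical function mapping the current estimate to a
        reporting probability in [0, 1].
    Returns (estimate after each update, number of sync rounds).
    """
    local = [0] * num_sites   # exact local sums S_i held at the sites
    estimate = 0              # coordinator's estimate of S = S_1 + ... + S_k
    syncs = 0
    history = []
    for site, value in updates:
        local[site] += value
        # The site reports this update with the given probability.
        if rng.random() < report_prob(estimate):
            # Sync all: the coordinator collects every S_i and refreshes S.
            estimate = sum(local)
            syncs += 1
        history.append(estimate)
    return history, syncs
```

As one illustrative choice, `report_prob = lambda s: min(1.0, 1.0 / (0.1 * max(abs(s), 1)) ** 2)` reports every update while the sum is close to zero and reports more rarely as its magnitude grows. This matches the intuition that a relative-accuracy guarantee is hardest to keep near zero, but the constant and the exact form are assumptions.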
Two Applications • Second Frequency Moment • Bayesian Linear Regression
App 1: Second Frequency Moment • Input: • Counter of value : • Second frequency moment: • Goal: track within relative accuracy
AMS Sketch {0,1} valued hash • For and , within w. p.
App 1: Second Frequency Moment (cont’d) • Sum tracking: • Expected total communication:
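The link between the AMS sketch and sum tracking can be seen in a small Python sketch. It is a minimal illustration under assumptions: the {0,1}-valued hash is mapped to a ±1 sign as `2h - 1`, a seeded BLAKE2b digest stands in for the 4-wise independent hash family the analysis requires, and the class name and the median-of-means parameters are hypothetical. Each counter is a signed running sum that can move up or down, which is the kind of non-monotonic quantity the sum tracker handles.

```python
import hashlib
from statistics import median


def hash_bit(seed, item):
    """{0,1}-valued hash; a stand-in for a 4-wise independent family."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=1).digest()
    return digest[0] & 1


class AMSSketch:
    def __init__(self, groups=5, per_group=8):
        self.groups = groups
        self.per_group = per_group
        self.counters = [0] * (groups * per_group)

    def update(self, item, delta=1):
        """Add delta occurrences of item; each counter is a signed sum."""
        for seed in range(len(self.counters)):
            sign = 2 * hash_bit(seed, item) - 1   # map {0,1} to {-1,+1}
            self.counters[seed] += sign * delta

    def estimate_f2(self):
        """Median of group means of squared counters."""
        means = []
        for g in range(self.groups):
            start = g * self.per_group
            block = self.counters[start:start + self.per_group]
            means.append(sum(z * z for z in block) / self.per_group)
        return median(means)
```

For a stream containing a single distinct item with frequency f, every counter equals +f or -f, so the estimate is exactly f squared. With several distinct items the estimate is only approximate, and its accuracy depends on the number of counters.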
App 2: Bayesian Linear Regression • Feature vector , output • Prior and posterior
App 2: Bayesian Linear Regression (cont’d) • Posterior mean and precision: • Sum tracking: • Under random permutation input, the expected communication cost =
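For reference, the standard conjugate Gaussian form of Bayesian linear regression is written out below. The notation is an assumption, since the slide's own symbols were lost in extraction: prior $w \sim \mathcal{N}(m_0, \Lambda_0^{-1})$, observations $y_s = w^\top x_s + \text{noise}$ with known noise precision $\beta$, and $t$ observations so far.

```latex
\begin{align*}
\Lambda_t &= \Lambda_0 + \beta \sum_{s=1}^{t} x_s x_s^\top
  && \text{(posterior precision)} \\
m_t &= \Lambda_t^{-1} \Big( \Lambda_0 m_0 + \beta \sum_{s=1}^{t} y_s x_s \Big)
  && \text{(posterior mean)}
\end{align*}
```

Under this form, both the precision and the mean depend on the data only through the running sums $\sum_s x_s x_s^\top$ and $\sum_s y_s x_s$. Each entry of these sums is a sum over the distributed input and may be non-monotonic, so each can be maintained by a sum tracker.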
Summary for Sum Tracking • Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion inputs • Proposed a novel algorithm with nearly optimal communication complexity • Details: ACM PODS 2012
Outline • Range Partition with Fei Xu and Jingren Zhou • Count Tracking with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only) with Charalampos Tsourakakis and Bozidar Radunovic
Problem • Partition a graph with two objectives • Sparsely connected components • Balanced number of vertices per component • Applications • Parallel processing • Community detection
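The two objectives can be quantified for any given partition with a short Python sketch. The function name `partition_quality` and the particular balance measure (largest component size relative to a perfectly even split) are assumptions chosen for illustration; the slides do not state which measures the authors use.

```python
from collections import Counter


def partition_quality(edges, assignment, num_parts):
    """Return (edge cut, balance) for a vertex-to-component assignment.

    edge cut: number of edges whose endpoints lie in different components.
    balance: largest component size divided by the ideal size n / num_parts,
        so 1.0 means perfectly balanced.
    """
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    sizes = Counter(assignment.values())
    balance = max(sizes.values()) * num_parts / len(assignment)
    return cut, balance


# A 4-cycle split into two components of two vertices each.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1}
cut, balance = partition_quality(edges, assignment, 2)
print(cut, balance)
```

The two objectives pull against each other: placing every vertex in one component gives zero cut but the worst possible balance.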
Problem (cont’d) • Requirements • Streaming algorithm • Single pass / incremental • Efficient computing • Desired • Approximation guarantees • Average-case efficient k 1 2 3
Summary for Graph Partitioning • Designed a streaming algorithm whose average-case performance appears superior to that of any previously proposed online heuristic • Provable approximation guarantees • More details available soon