Sketching, Sampling and other Sublinear Algorithms: Streaming

Alex Andoni (MSR SVC)

A scenario
Challenge: compute something on the table, using small space.
• Example of “something”:
  • # distinct IPs
  • max frequency
  • other statistics…
(Table of observed IPs: 131.107.65.14, 18.9.22.69, 131.107.65.14, 80.97.56.20, 18.9.22.69, 80.97.56.20, 131.107.65.14)

Sublinear: a panacea?
• Sub-linear space algorithm for solving the Travelling Salesperson Problem?
  • Sorry, perhaps a different lecture
• Even very simple problems are hard to solve sublinearly:
  • Ex: what is the count of distinct IPs seen?
• Will settle for:
  • Approximate algorithms: $(1+\varepsilon)$-approximation
    true answer ≤ output ≤ $(1+\varepsilon)$ · (true answer)
  • Randomized: the above holds with probability 95%
• Quick and dirty way to get a sense of the data

Streaming data
• Data through a router
• Data stored on a hard drive, or streamed remotely
  • More efficient to do a linear scan on a hard drive
  • Working memory is the (smaller) main memory

Application areas
• Data can come from:
  • Network logs, sensor data
  • Real time data
  • Search queries, served ads
  • Databases (query planning)
  • …

Problem 1: # distinct elements
• Problem: compute the number of distinct elements in the stream
• Trivial solution: $O(m)$ space for $m$ distinct elements
• Will see: $O(\log m)$ space (approximate)
(Example stream: 2 5 7 5 5)

Distinct Elements: idea 1
[Flajolet-Martin’85, Alon-Matias-Szegedy’96]
• Algorithm:
  • Hash function $h: U \to [0,1]$
  • Compute $\mathrm{minHash} = \min_i h(i)$ over the stream elements $i$
  • Output is $1/\mathrm{minHash} - 1$
• “Analysis”:
  • repeats of the same element $i$ don’t matter
  • $E[\mathrm{minHash}] = 1/(m+1)$, for $m$ distinct elements
Algorithm DISTINCT:
    Initialize:
        minHash = 1
        hash function h into [0,1]
    Process(int i):
        if (h(i) < minHash)
            minHash = h(i);
    Output: 1/minHash - 1
(Example stream: 7 2 5)
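Algorithm DISTINCT above can be tried out with a few lines of Python. This is only a sketch under stated assumptions: the helper `h` (SHA-256 of a seed and the item, scaled into [0, 1)) is a stand-in for the random hash function and is not taken from the slides, and the class name `Distinct` is illustrative.

```python
import hashlib


def h(item, seed=0):
    """Assumed stand-in for a random hash: maps item to a float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Distinct:
    """Keeps only the smallest hash value seen so far."""

    def __init__(self, seed=0):
        self.seed = seed
        self.min_hash = 1.0

    def process(self, item):
        # A repeated item hashes to the same value, so it never changes min_hash.
        self.min_hash = min(self.min_hash, h(item, self.seed))

    def output(self):
        # For m distinct items the minimum of m uniform values has mean 1/(m+1),
        # which motivates inverting it as 1/minHash - 1.
        return 1.0 / self.min_hash - 1.0


sketch = Distinct()
for ip in [2, 5, 7, 5, 5]:
    sketch.process(ip)
print(sketch.output())  # a rough, high-variance estimate of the 3 distinct items
```

A single copy of the sketch is very noisy; the later slide on accuracy addresses this by repeating with different hash functions.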

Distinct Elements: idea 2
• Store minHash approximately
  • Store just the count of zeros right after the binary point of the hash value: ZEROS(x)
    (e.g. x = 0.0000001100101 in binary has ZEROS(x) = 6)
  • Need only $O(\log \log n)$ bits
• Randomness: 2-wise independence is enough!
  • $O(\log n)$ bits
• Better accuracy using more space:
  • error $1+\varepsilon$
  • repeat $O(1/\varepsilon^2)$ times with different hash functions
  • HyperLogLog: can also do this with just one hash function [FFGM’07]
Algorithm DISTINCT:
    Initialize:
        minHash2 = 0
        hash function h into [0,1]
    Process(int i):
        if (h(i) < 1/2^minHash2)
            minHash2 = ZEROS(h(i));
    Output: 2^minHash2
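The ZEROS-based variant can be sketched the same way. Assumptions here: the hash is modelled as a 64-bit integer read as a binary fraction (so ZEROS is the number of zero bits before the first 1), SHA-256 stands in for the hash function, and the names `h_bits`, `zeros` and `DistinctLog` are illustrative rather than from the slides.

```python
import hashlib


def h_bits(item, seed=0):
    """Assumed stand-in hash: 64 pseudo-random bits, read as the fraction 0.b1b2...b64."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def zeros(value, width=64):
    """Number of zero bits right after the binary point of value / 2**width."""
    return width - value.bit_length()


class DistinctLog:
    """Stores only the largest ZEROS count seen, a number that fits in a few bits."""

    def __init__(self, seed=0):
        self.seed = seed
        self.max_zeros = 0

    def process(self, item):
        # Equivalent to the slide's test h(i) < 1/2^minHash2: a smaller hash
        # value has at least as many zeros after the binary point.
        self.max_zeros = max(self.max_zeros, zeros(h_bits(item, self.seed)))

    def output(self):
        return 2 ** self.max_zeros


# The slide's example: x = 0.0000001100101 (13 binary digits) has 6 such zeros.
print(zeros(0b0000001100101, width=13))  # 6
```

The estimate is always a power of two, so on its own it is only accurate up to a constant factor.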

Problem 2: max count / heavy hitters
• Problem: compute the maximum frequency of an element in the stream
• Bad news:
  • Hard to distinguish whether any element is repeated (max = 1 vs 2)
• Good news:
  • Can find “heavy hitters”
    • elements with frequency > total frequency / s
    • using space proportional to s
(Example stream: 2 5 7 5 5)

Heavy Hitters: CountMin
[Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]
Algorithm CountMin:
    Initialize(w, L):
        array Sketch[L][w]
        L hash functions h[L], into {0,…,w-1}
    Process(int i):
        for (j=0; j<L; j++)
            Sketch[j][ h[j](i) ] += 1;
    Output:
        foreach i in PossibleIP {
            freq[i] = int.MaxValue;
            for (j=0; j<L; j++)
                freq[i] = min(freq[i], Sketch[j][h[j](i)]);
        }
        // freq[] is the frequency estimate
(Figure: the stream 7 2 5 5 5 hashed into the L rows of counters; freq is the minimum over the rows.)
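A runnable rendering of the CountMin pseudocode might look as follows. It is a minimal sketch, assuming SHA-256 (salted with the row index) as a stand-in for the L hash functions; the method names and the optional `count` argument are additions for illustration.

```python
import hashlib


class CountMin:
    def __init__(self, w, L):
        self.w = w  # counters per row
        self.L = L  # number of rows, one hash function each
        self.sketch = [[0] * w for _ in range(L)]

    def _cell(self, j, item):
        """Assumed stand-in for h[j]: maps item into {0, ..., w-1}."""
        digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def process(self, item, count=1):
        for j in range(self.L):
            self.sketch[j][self._cell(j, item)] += count

    def estimate(self, item):
        # Every row over-counts (collisions only add mass), so take the minimum.
        return min(self.sketch[j][self._cell(j, item)] for j in range(self.L))


cm = CountMin(w=16, L=4)
for ip in [7, 2, 5, 5, 5]:
    cm.process(ip)
print(cm.estimate(5))  # never below the true frequency 3
```

With non-negative updates the estimate can only err upward, which is why the minimum over rows is the natural combiner.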

Heavy Hitters: analysis
• Sketch[j][h[j](5)] = frequency of 5, plus “extra mass”
• Expected “extra mass” ≤ total mass / w
• Markov: “extra mass” ≤ 2 · total mass / w with probability ≥ 1/2
• $L = O(\log n)$ to get high probability (for all elements)
• Compute heavy hitters from freq[]
(Algorithm CountMin as on the previous slide.)

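Problem 3: Moments
• Problem: compute the frequency moment $F_k = \sum_i x_i^k$, where $x_i$ is the frequency of element $i$
  • variance ($k=2$), or
  • higher moments, for $k > 2$
  • Skewness ($k=3$), kurtosis ($k=4$), etc.
  • a different proxy for max: $F_k^{1/k}$ approaches the maximum frequency as $k$ grows
(Example with frequencies 1, 3, 2: $F_2 = 1+9+4 = 14$ and $F_4 = 1+81+16 = 98$)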
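The two sums on this slide correspond to frequencies 1, 3 and 2. The snippet below recomputes them exactly (no sketching yet); the stream itself is a hypothetical one chosen to have those frequencies.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum over distinct items of frequency**k."""
    return sum(f**k for f in Counter(stream).values())


# Hypothetical stream in which 2 appears once, 5 three times, 7 twice.
stream = [2, 5, 7, 5, 5, 7]
print(frequency_moment(stream, 2))  # 1 + 9 + 4 = 14
print(frequency_moment(stream, 4))  # 1 + 81 + 16 = 98
```

Higher k weights the most frequent element more heavily, which is the sense in which a moment acts as a proxy for the maximum frequency.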

$k=2$ moment
• Use the Johnson-Lindenstrauss lemma! (2nd lecture)
• Store sketch $S = Gx$
  • $x$ = frequency vector
  • $G$ = $d$ by $n$ matrix of Gaussian entries
• Update on element $i$: $G(x + e_i) = Gx + Ge_i$
• Guarantees:
  • $d = O(1/\varepsilon^2)$ counters (words)
  • $O(d)$ time to update
• Better: $\pm 1$ entries, $O(1)$ update [AMS’96, TZ’04]
• $k > 2$: precision sampling => next
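A sign-based linear sketch for the second moment, in the spirit of [AMS’96], can be sketched as follows. Assumptions: the random ±1 matrix entries are derived from SHA-256 rather than a limited-independence hash family, the sketch dimension is called `d`, and each update touches all `d` counters (the constant-time update attributed to [TZ’04] is not shown).

```python
import hashlib


class F2Sketch:
    """d counters; counter r holds sum_i sign(r, i) * frequency(i)."""

    def __init__(self, d, seed=0):
        self.d = d
        self.seed = seed
        self.s = [0] * d

    def _sign(self, row, item):
        """Assumed stand-in for a random +/-1 matrix entry G[row][item]."""
        digest = hashlib.sha256(f"{self.seed}:{row}:{item}".encode()).digest()
        return 1 if digest[0] & 1 else -1

    def process(self, item, count=1):
        # Linear update: adds count times the item's column of G to the sketch.
        for r in range(self.d):
            self.s[r] += count * self._sign(r, item)

    def estimate(self):
        # Each squared counter has expectation F_2; averaging reduces variance.
        return sum(v * v for v in self.s) / self.d


sk = F2Sketch(d=400)
for ip in [2, 5, 7, 5, 5, 7]:
    sk.process(ip)
print(sk.estimate())  # close to the exact second moment 14
```

Because the sketch is a linear function of the frequency vector, negative counts (deletions) work with the same `process` call.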

Scenario 2: distributed traffic
• Statistics on traffic difference/aggregate between two routers
  • E.g.: by how many packets does the traffic differ?
• Linearity is the power!
  • Sketch(data 1) + Sketch(data 2) = Sketch(data 1 + data 2)
  • Sketch(data 1) - Sketch(data 2) = Sketch(data 1 - data 2)
• Two sketches should be sufficient to compute something on the difference or sum
(Figure: two routers, each observing its own stream of IPs such as 131.107.65.14, 35.8.10.140, 18.9.22.69.)
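Linearity can be checked directly on a small example. The snippet is a sketch under assumptions: it uses a ±1 linear sketch with SHA-256-derived signs, and the split of the slide's IP addresses between the two routers is made up for illustration.

```python
import hashlib
from collections import Counter


def sign(row, item):
    """Assumed stand-in for a shared random +/-1 entry (both routers use the same ones)."""
    digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def sketch(freq, d=64):
    """Linear sketch of a frequency vector given as {item: count}."""
    return [sum(c * sign(r, item) for item, c in freq.items()) for r in range(d)]


# Hypothetical traffic seen by each router.
router1 = Counter(["131.107.65.14", "18.9.22.69", "18.9.22.69"])
router2 = Counter(["131.107.65.14", "35.8.10.140", "18.9.22.69"])

# Sketch(data 1) - Sketch(data 2), computed from the two sketches alone.
s_diff = [a - b for a, b in zip(sketch(router1), sketch(router2))]

# It equals the sketch of the difference of the frequency vectors.
diff = {ip: router1[ip] - router2[ip] for ip in set(router1) | set(router2)}
assert s_diff == sketch(diff)

# Estimate of the squared norm of the difference (exact value here is 2).
print(sum(v * v for v in s_diff) / len(s_diff))
```

The routers only need to agree on the hash functions; neither has to see the other's raw stream.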

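Common primitive: estimate sum
• Given: $n$ quantities $a_1, \ldots, a_n$ in the range $[0,1]$
• Goal: estimate $S = a_1 + \cdots + a_n$ “cheaply”
• Standard sampling: pick random set $J = \{j_1, \ldots, j_m\}$ of size $m$
  • Estimator: $\tilde{S} = \frac{n}{m} (a_{j_1} + \cdots + a_{j_m})$
• Chebyshev bound: with 90% success probability, $0.5 \cdot S - O(n/m) < \tilde{S} < 2 \cdot S + O(n/m)$
• For constant additive error, need $m = \Omega(n)$
(Figure: compute an estimate $\tilde{S}$ from the sampled terms, e.g. $a_1$ and $a_3$ out of $a_1, a_2, a_3, a_4$.)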
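The standard sampling estimator is short enough to write out. This is a minimal sketch; the function name, the explicit `rng` argument and the test vector are illustrative choices.

```python
import random


def sample_sum_estimate(a, m, rng):
    """Estimate sum(a) from m terms chosen uniformly without replacement."""
    n = len(a)
    picked = rng.sample(range(n), m)  # random index set of size m
    return n / m * sum(a[j] for j in picked)


rng = random.Random(0)

# When all terms are equal, any sample gives the exact sum.
print(sample_sum_estimate([0.5] * 1000, 10, rng))  # 500.0

# When the sum is concentrated on three terms, a small sample usually misses
# them all and reports 0, which is why constant additive error needs m ~ n.
sparse = [1.0] * 3 + [0.0] * 9997
print(sample_sum_estimate(sparse, 100, rng))
```

The second case is the one precision sampling is designed to handle more cheaply.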

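Precision Sampling Framework
• Alternative “access” to the $a_i$’s:
  • For each term $a_i$, we get a (rough) estimate $\tilde{a}_i$
  • up to some precision $u_i$, chosen in advance: $|a_i - \tilde{a}_i| < u_i$
• Challenge: achieve a good trade-off between
  • quality of approximation to $S$
  • use only weak precisions $u_i$ (minimize the “cost” of estimating $\tilde{a}_i$)
(Figure: compute an estimate $\tilde{S}$ from the estimates $\tilde{a}_1, \ldots, \tilde{a}_4$ of $a_1, \ldots, a_4$, obtained with precisions $u_1, \ldots, u_4$.)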

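Formalization
• What is cost?
  • Here, average cost = $\frac{1}{n} \sum_i 1/u_i$
  • to achieve precision $u_i$, use $1/u_i$ “resources”: e.g., if $a_i$ is itself a sum $a_i = \sum_j a_{ij}$ computed by subsampling, then one needs $\Theta(1/u_i)$ samples
• For example, can choose all $u_i = 1/n$
  • Average cost ≈ $n$
Protocol (Sum Estimator vs. Adversary):
  1. Sum Estimator: fix precisions $u_i$
  1. Adversary: fix $a_1, \ldots, a_n$
  2. Adversary: fix $\tilde{a}_1, \ldots, \tilde{a}_n$ s.t. $|a_i - \tilde{a}_i| < u_i$
  3. Sum Estimator: given $\tilde{a}_1, \ldots, \tilde{a}_n$, output $\tilde{S}$ s.t. $\tilde{S} \approx \sum_i a_i$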

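Precision Sampling Lemma
[A-Krauthgamer-Onak’11]
• Goal: estimate $S = \sum_i a_i$ from $\{\tilde{a}_i\}$ satisfying $|a_i - \tilde{a}_i| < u_i$.
• Precision Sampling Lemma: can get, with 90% success:
  • $O(1)$ additive error and 1.5 multiplicative error: $S - O(1) < \tilde{S} < 1.5 \cdot S + O(1)$
  • with average cost equal to $O(\log n)$
• More generally: $\varepsilon$ additive error and $1+\varepsilon$ multiplicative error, $S - \varepsilon < \tilde{S} < (1+\varepsilon) S + \varepsilon$, with average cost $O(\varepsilon^{-3} \log n)$
• Example: distinguish $\sum_i a_i = 3$ vs $\sum_i a_i = 0$
  • Consider two extreme cases:
    • if three $a_i = 1$: enough to have a crude approximation for all ($u_i = 0.1$)
    • if all $a_i = 3/n$: only a few with good approximation $u_i = 1/n$, and the rest with $u_i = 1$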

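Precision Sampling Algorithm
• Precision Sampling Lemma: can get, with 90% success:
  • $O(1)$ additive error and 1.5 multiplicative error: $S - O(1) < \tilde{S} < 1.5 \cdot S + O(1)$
  • with average cost equal to $O(\log n)$
• Algorithm:
  • Choose each $u_i \in [0,1]$ i.i.d. (uniformly at random)
  • Estimator: $\tilde{S}$ = count the number of $i$’s s.t. $\tilde{a}_i / u_i > 6$ (up to a normalization constant)
• Proof of correctness:
  • we use only the $\tilde{a}_i$ which are 1.5-approximations to $a_i$ (if $\tilde{a}_i > 6 u_i$, the error $u_i$ is below $\tilde{a}_i / 6$)
  • $E[\tilde{S}] \approx \sum_i \Pr[a_i / u_i > 6] = \sum_i a_i / 6$
  • $E[1/u_i] = O(\log n)$ w.h.p.
• General version ($\varepsilon$ additive and $1+\varepsilon$ multiplicative error, $S - \varepsilon < \tilde{S} < (1+\varepsilon) S + \varepsilon$, average cost $O(\varepsilon^{-3} \log n)$):
  • $u_i$ drawn from a concrete distribution: the minimum of $O(\varepsilon^{-3})$ uniform random variables
  • the estimator is a function of $[\tilde{a}_i / u_i - 4/\varepsilon]_+$ and the $u_i$’s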
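The basic version of the algorithm (uniform precisions, threshold 6) can be simulated as below. Assumptions made for the simulation: the adversary is replaced by random noise of magnitude at most $u_i$, the normalization constant is taken to be the threshold itself (so the count is multiplied by 6), and the function name and return values are illustrative.

```python
import random


def precision_sampling_estimate(a, rng, threshold=6.0):
    """Return (estimate of sum(a), average cost) using uniform random precisions."""
    count = 0
    total_cost = 0.0
    for a_i in a:
        u_i = 1.0 - rng.random()                # precision, uniform in (0, 1]
        a_tilde = a_i + rng.uniform(-u_i, u_i)  # rough estimate within u_i of a_i
        total_cost += 1.0 / u_i                 # cost of obtaining that precision
        if a_tilde / u_i > threshold:
            count += 1
    # Pr[a_i / u_i > 6] = a_i / 6 for a_i in [0, 1], so rescale the count by 6.
    return threshold * count, total_cost / len(a)


rng = random.Random(0)
a = [0.5] * 20000  # true sum S = 10000
estimate, average_cost = precision_sampling_estimate(a, rng)
print(estimate, average_cost)
```

The estimator looks only at the rough values and the precisions, never at the exact terms; the average cost printed is the empirical mean of $1/u_i$, which is small for most draws but has a heavy tail.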

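Moments ($k > 2$) via precision sampling
• Theorem: linear sketch for $F_k$ with $O(1)$ approximation, and $O(n^{1-2/k} \log n)$ space (90% succ. prob.).
• Sketch:
  • Pick random $r_i \in \{\pm 1\}$, and $u_i$ as exponential r.v.
  • let $y_i = x_i \cdot r_i / u_i^{1/k}$
  • throw the $y_i$ into one hash table $H$, with $m = O(n^{1-2/k} \log n)$ cells
• Estimator: $\max_c |H[c]|^k$
• Randomness: bounded independence suffices
(Figure: the vector $x$ hashed into the table $H$.)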
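The following is a rough sketch of how such a hash-table sketch could be coded, assuming the construction from [A-Krauthgamer-Onak’11] as it is usually described: each coordinate $x_i$ is multiplied by a random sign, divided by $u_i^{1/k}$ for an exponential random variable $u_i$, and added to one cell of a table; the estimate is the largest cell magnitude raised to the power $k$. All names, the SHA-256-derived randomness and the table size used are assumptions for illustration, and no attempt is made to match the constants of the theorem.

```python
import hashlib
import math


def _coins(i, seed):
    """Assumed randomness for coordinate i: uniform in (0,1), a sign, and cell bits."""
    d = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    unif = ((int.from_bytes(d[:8], "big") >> 12) + 0.5) / 2**52  # strictly in (0, 1)
    sign = 1.0 if d[8] & 1 else -1.0
    return unif, sign, int.from_bytes(d[9:17], "big")


def fk_sketch(x, k, m, seed=0):
    """Linear sketch of x (dict: index -> value) as a table of m cells."""
    table = [0.0] * m
    for i, x_i in x.items():
        unif, r_i, cell_bits = _coins(i, seed)
        u_i = -math.log(unif)  # exponential random variable with rate 1
        table[cell_bits % m] += x_i * r_i / u_i ** (1.0 / k)
    return table


def fk_estimate(table, k):
    return max(abs(v) for v in table) ** k


x = {i: 1.0 for i in range(500)}  # F_3 = 500
print(fk_estimate(fk_sketch(x, k=3, m=256), k=3))
```

A single run is accurate only up to a constant factor and with constant probability, because the largest scaled coordinate is distributed like $F_k$ divided by an exponential random variable.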

Streaming++
• LOTS of work in the area:
  • Surveys
    • Muthukrishnan: http://algo.research.googlepages.com/eight.ps
    • McGregor: http://people.cs.umass.edu/~mcgregor/papers/08-graphmining.pdf
    • Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf
  • Open problems: http://sublinear.info
• Examples:
  • Moments, sampling
  • Median estimation, longest increasing sequence
  • Graph algorithms
    • E.g., dynamic graph connectivity [AGG’12, KKM’13,…]
  • Numerical algorithms (e.g., regression, SVD approximation)
    • Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13]
    • related to Compressed Sensing
