Leveraging Big Data: Lecture 2
http://www.cohenwang.com/edith/bigdataclass2013
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
Counting Distinct Elements
Stream: 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7, …
• Elements occur multiple times; we want to count the number of distinct elements.
• The number of distinct elements is $n$ ($n = 6$ in the example).
• The total number of elements is 11 in this example.
Exact counting of distinct elements requires a structure of size $\Omega(n)$. We are happy with an approximate count that uses small working memory.
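For contrast, a minimal Python illustration (not from the lecture) of exact counting: it must remember every distinct element, so its memory grows linearly with $n$.

```python
def exact_distinct_count(stream):
    """Exact counting: stores every distinct element, so memory grows linearly with n."""
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)

print(exact_distinct_count([4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]))  # 6
```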
Distinct Elements: Approximate Counting
Stream: 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7, …
We want to compute and maintain a small sketch $s(N)$ of the set $N$ of distinct items seen so far.
Distinct Elements: Approximate Counting
• Size of sketch: much smaller than $n$.
• We can query the sketch to get a good estimate $\hat{n}$ of $n$ (small relative error).
• For a new element $x$, the sketch $s(N \cup \{x\})$ is easy to compute from $s(N)$ and $x$ ⇒ suitable for data stream computation.
• If $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch from their sketches: $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$ ⇒ suitable for distributed computation.
Distinct Elements: Approximate Counting
Stream: 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7, …
Size-estimation / minimum-value technique [Flajolet–Martin 85, Cohen 94]:
$h$ is a random hash function from element IDs to uniform random numbers in $[0,1]$.
Maintain the min-hash value $y$:
• Initialize $y \leftarrow 1$.
• Processing an element $x$: $y \leftarrow \min\{y, h(x)\}$.
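A minimal sketch of this technique in Python, assuming a hash function that maps element IDs to (pseudo)uniform values in $[0,1)$; the `uniform_hash` helper below is illustrative, not part of the lecture.

```python
import hashlib

def uniform_hash(x, seed=0):
    """Illustrative hash of an element ID to a (pseudo)uniform value in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

y = 1.0                                         # initialize y <- 1
for x in [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]:
    y = min(y, uniform_hash(x))                 # y <- min(y, h(x)); repeats have no effect
print(y)
```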
Distinct Elements: Approximate Counting
Stream: 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7, …
The minimum hash value $y = \min_{x \in N} h(x)$:
• is unaffected by repeated elements;
• is non-increasing with the number of distinct elements $n$.
Distinct Elements: Approximate Counting
How does the minimum hash value give information on the number of distinct elements $n$?
The expectation of the minimum is $\mathbb{E}[y] = \frac{1}{n+1}$.
A single value gives only limited information. To boost the information, we maintain $k \ge 1$ values.
Why is the expectation $\frac{1}{n+1}$?
• Take a circle of circumference 1 (circle points map to $[0,1)$).
• Throw a random red point to “mark” the start of a segment.
• Throw another $n$ points independently at random.
• The circle is cut into $n+1$ segments by these points.
• By symmetry, the expected length of each segment is $\frac{1}{n+1}$.
• The same holds for the segment clockwise from the red point, whose length is distributed like the minimum of $n$ uniform hash values.
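A quick Monte Carlo check of this claim (standard library only, illustrative code): the empirical mean of the minimum of $n$ uniform values should be close to $\frac{1}{n+1}$.

```python
import random

def mean_min_of_uniforms(n, trials=100_000):
    """Empirical mean of the minimum of n i.i.d. Uniform[0,1) values."""
    return sum(min(random.random() for _ in range(n)) for _ in range(trials)) / trials

n = 6
print(mean_min_of_uniforms(n), 1 / (n + 1))    # both approximately 0.1428...
```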
Min-Hash Sketches
These sketches maintain $k \ge 1$ values from the range of the hash function (distribution).
• k-mins sketch: Use $k$ “independent” hash functions $h_1, \dots, h_k$. Track the respective minimum $y_i$ for each function $h_i$.
• Bottom-k sketch: Use a single hash function $h$. Track the $k$ smallest hash values.
• k-partition sketch: Use a single hash function $h$. Use the first $\log_2 k$ bits of $h(x)$ to map $x$ uniformly to one of $k$ parts; call the remaining bits $h'(x)$. For $i = 1, \dots, k$: track the minimum value of $h'$ over the elements in part $i$.
All three sketches are the same for $k = 1$.
Min-Hash Sketches: k-mins, bottom-k, k-partition
Why study all 3 variants? They offer different tradeoffs between update cost, accuracy, and usage.
Beyond distinct counting:
• Min-Hash sketches correspond to sampling schemes of large data sets.
• Similarity queries between datasets.
• Selectivity/subset queries.
• These patterns generally apply as methods to gather increased confidence from a random “projection”/sample.
Min-Hash Sketches: Examples (k-mins, k-partition, bottom-k)
Distinct elements: 6, 4, 32, 7, 12, 14
The min-hash value and the sketches depend only on:
• the random hash function(s), and
• the set of distinct elements,
and not on the order in which elements appear or on their multiplicity.
Min-Hash Sketches: Example — k-mins sketch of the distinct elements 6, 4, 32, 7, 12, 14 (table of per-function hash values not reproduced here).
Min-Hash Sketches: k-mins
k-mins sketch: Use $k$ “independent” hash functions $h_1, \dots, h_k$; track the respective minimum $y_i$ for each function.
Processing a new element $x$: for $i = 1, \dots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$.
Computation: $O(k)$ per element, whether the sketch is actually updated or not.
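A possible k-mins implementation, reusing the illustrative `uniform_hash(x, seed)` helper from the earlier snippet as a stand-in for $k$ “independent” hash functions (one per seed).

```python
def kmins_init(k):
    """k-mins sketch: one tracked minimum per hash function."""
    return [1.0] * k

def kmins_add(sketch, x):
    """Process element x: O(k) hash evaluations, updated or not."""
    for i in range(len(sketch)):
        sketch[i] = min(sketch[i], uniform_hash(x, seed=i))  # h_i = uniform_hash(., seed=i)
    return sketch
```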
Min-Hash Sketches: Example — k-partition sketch of the distinct elements 6, 4, 32, 7, 12, 14 (table of part-hash and value-hash per element not reproduced here).
Min-Hash Sketches: k-partition
k-partition sketch: Use a single hash function; the first bits of $h(x)$ map $x$ uniformly to one of $k$ parts, and the remaining bits give the value $h'(x)$. For each part $i$, track the minimum value of $h'$ over the elements in part $i$.
Processing a new element $x$: $i \leftarrow \mathrm{part}(x)$; $y_i \leftarrow \min\{y_i, h'(x)\}$.
Computation: $O(1)$ per element to test or update.
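A possible k-partition implementation under the same illustrative assumptions: one uniform hash value is split into a part index and a value, so each element costs $O(1)$ to test or update.

```python
def kpartition_init(k):
    """k-partition sketch: one tracked minimum per part."""
    return [1.0] * k

def kpartition_add(sketch, x):
    """Process element x: O(1) work; a single hash supplies both the part and the value."""
    k = len(sketch)
    h = uniform_hash(x)           # single hash function
    part = int(h * k)             # "first bits": which of the k parts
    value = h * k - part          # "remaining bits": rescaled back to [0, 1)
    sketch[part] = min(sketch[part], value)
    return sketch
```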
Min-Hash Sketches: Example — bottom-k sketch of the distinct elements 6, 4, 32, 7, 12, 14 (table of hash values not reproduced here).
Min-Hash Sketches: bottom-k
Bottom-k sketch: Use a single hash function $h$; track the $k$ smallest hash values $y_1 < y_2 < \dots < y_k$.
Processing a new element $x$: if $h(x) < y_k$, insert $h(x)$ and discard $y_k$.
Computation: the sketch is maintained as a sorted list or as a priority queue.
• $O(1)$ to test whether an update is needed.
• $O(k)$ to update a sorted list; $O(\log k)$ to update a priority queue.
We will see that #changes ≪ #distinct elements.
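A possible bottom-k implementation (again using the illustrative `uniform_hash`) that keeps the $k$ smallest hash values in a max-heap; Python's `heapq` is a min-heap, so values are stored negated. Testing costs $O(1)$ and an actual update costs $O(\log k)$.

```python
import heapq

def bottomk_init(k):
    """Bottom-k sketch: the k smallest hash values seen so far (stored negated in a max-heap)."""
    return {"k": k, "heap": [], "values": set()}

def bottomk_add(sketch, x):
    h, heap, values = uniform_hash(x), sketch["heap"], sketch["values"]
    if h in values:                        # repeated element: its hash is already tracked
        return sketch
    if len(heap) < sketch["k"]:
        heapq.heappush(heap, -h)           # O(log k) insertion while filling up
        values.add(h)
    elif h < -heap[0]:                     # O(1) test against the current k-th smallest
        evicted = -heapq.heappushpop(heap, -h)   # O(log k) update
        values.discard(evicted)
        values.add(h)
    return sketch
```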
Min-Hash Sketches: Number of Updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \log n)$.
Proof: First consider $k = 1$. Look at the distinct elements in the order they first occur. The $i$-th distinct element has a lower hash value than the current minimum with probability $\frac{1}{i}$: this is the probability of being first in a random permutation of $i$ elements. The total expected number of updates is therefore $\sum_{i=1}^{n} \frac{1}{i} \approx \ln n$.
Stream:        4,  32,   6,  12, 12,  14, 32,   7, 12, 32, 7, …
Update prob.:  1, 1/2, 1/3, 1/4,  0, 1/5,  0, 1/6,  0,  0, 0 (repeats can never update).
Min-Hash Sketches: Number of Updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \log n)$.
Proof (continued): Recap for $k = 1$ (a single min-hash value): the $i$-th distinct element causes an update with probability $\frac{1}{i}$; the expected total is $\approx \ln n$.
• k-mins: $k$ min-hash values, so apply the $k = 1$ argument $k$ times.
• Bottom-k: We keep the $k$ smallest values, so the update probability of the $i$-th distinct element is $\min\{1, \frac{k}{i}\}$ (the probability of being among the first $k$ in a random permutation of $i$ elements).
• k-partition: $k$ min-hash values, each over roughly $\frac{n}{k}$ distinct values.
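With $H_m = \sum_{i=1}^{m} \frac{1}{i} \approx \ln m$ denoting the harmonic number, the per-element probabilities above sum to the following expected update counts (a compact restatement of the argument, not an additional claim):

```latex
\begin{aligned}
\text{single min-hash } (k=1):\;& \textstyle\sum_{i=1}^{n} \tfrac{1}{i} = H_n \approx \ln n\\
k\text{-mins}:\;& k\,H_n \approx k\ln n\\
\text{bottom-}k:\;& \textstyle\sum_{i=1}^{n}\min\!\big(1,\tfrac{k}{i}\big) = k + k(H_n - H_k) \approx k + k\ln\tfrac{n}{k}\\
k\text{-partition}:\;& \approx k\,H_{n/k} \approx k\ln\tfrac{n}{k}
\end{aligned}
```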
Merging Min-Hash Sketches
We apply the same set of hash functions to all elements/data sets/streams.
The union sketch $s(N_1 \cup N_2)$ from the sketches of two sets $N_1, N_2$:
• k-mins: take the minimum per hash function.
• k-partition: take the minimum per part.
• Bottom-k: the $k$ smallest hash values of the union must be among the $k$ smallest of each set, so take the $k$ smallest values in the union of the two sketches.
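Possible merge routines for the illustrative sketch representations used above; the key requirement is that both sketches were built with the same hash function(s).

```python
import heapq

def merge_kmins(s1, s2):
    """Union sketch for k-mins: coordinate-wise minimum (one entry per hash function)."""
    return [min(a, b) for a, b in zip(s1, s2)]

def merge_kpartition(s1, s2):
    """Union sketch for k-partition: coordinate-wise minimum (one entry per part)."""
    return [min(a, b) for a, b in zip(s1, s2)]

def merge_bottomk(values1, values2, k):
    """Union sketch for bottom-k: the k smallest hash values among both sketches' values."""
    return heapq.nsmallest(k, set(values1) | set(values2))
```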
Using Min-Hash Sketches
Recap:
• We defined Min-Hash sketches (3 types).
• Adding elements, merging Min-Hash sketches.
• Some properties of these sketches.
Next: we put Min-Hash sketches to work.
• Estimating the distinct count from a Min-Hash sketch.
• Tools from estimation theory.
The Exponential Distribution $\mathrm{Exp}(\lambda)$
• PDF $f(x) = \lambda e^{-\lambda x}$; CDF $F(x) = 1 - e^{-\lambda x}$; mean and standard deviation both $\frac{1}{\lambda}$.
• Very useful properties:
  • Memorylessness: $\Pr[X > s + t \mid X > s] = \Pr[X > t]$.
  • Min-to-Sum conversion: the minimum of independent $\mathrm{Exp}(\lambda_1), \dots, \mathrm{Exp}(\lambda_n)$ random variables is $\mathrm{Exp}(\lambda_1 + \dots + \lambda_n)$.
  • Relation with uniform: if $U \sim U[0,1]$ then $-\frac{\ln U}{\lambda} \sim \mathrm{Exp}(\lambda)$.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Change the hash range to the exponential distribution: $h_i(x) \sim \mathrm{Exp}(1)$.
• Using the Min-to-Sum property, each minimum $y_i = \min_{x \in N} h_i(x)$ is $\mathrm{Exp}(n)$-distributed.
• In fact, we can keep uniform hash values and apply the monotone map $u \mapsto -\ln(1-u)$ to the stored minima only when estimating (a monotone map commutes with taking the minimum).
• The number of distinct elements becomes a parameter estimation problem: given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Each $y_i \sim \mathrm{Exp}(n)$ has expectation $\frac{1}{n}$ and variance $\frac{1}{n^2}$.
• The average $\bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i$ has expectation $\frac{1}{n}$ and variance $\frac{1}{k n^2}$. The CV is $\frac{1}{\sqrt{k}}$.
• $\bar{y}$ is a good unbiased estimator for $\frac{1}{n}$.
• But $\frac{1}{n}$ is the inverse of what we want. What about estimating $n$?
Estimating Distinct Count from a Min-Hash Sketch: k-mins
What about estimating $n$?
• We can use the biased estimator $\hat{n} = \frac{1}{\bar{y}} = \frac{k}{\sum_i y_i}$.
• To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that $\bar{y}$ is far from its expectation $\frac{1}{n}$, and thus that $\hat{n}$ is far from $n$.
• Alternatively, we can use Maximum Likelihood Estimation (a general and powerful technique).
Chebyshev's Inequality
For any random variable $X$ with expectation $\mu$ and standard deviation $\sigma$, and for any $c > 0$:
$\Pr[\,|X - \mu| \ge c\,\sigma\,] \le \frac{1}{c^2}$.
For $X = \bar{y}$ (the average of the $k$ sketch values): $\mu = \frac{1}{n}$ and $\sigma = \frac{1}{n\sqrt{k}}$, so
$\Pr\big[\,|\bar{y} - \tfrac{1}{n}| \ge \tfrac{c}{n\sqrt{k}}\,\big] \le \frac{1}{c^2}$.
Using $c = \varepsilon\sqrt{k}$: $\Pr\big[\,|\bar{y} - \tfrac{1}{n}| \ge \tfrac{\varepsilon}{n}\,\big] \le \frac{1}{\varepsilon^2 k}$, so a sketch of size $k = \frac{1}{\varepsilon^2 \delta}$ gives relative error at most $\varepsilon$ with probability at least $1 - \delta$.
Maximum Likelihood Estimation
Given a set of independent samples $y_1, \dots, y_k$ from a distribution with unknown parameter $\theta$, the MLE $\hat\theta$ is the value of $\theta$ that maximizes the likelihood (joint density) function $L(\theta; y_1, \dots, y_k)$: the maximum over $\theta$ of the probability of observing $y_1, \dots, y_k$.
Properties:
• A principled way of deriving estimators.
• Converges in probability to the true value (with enough i.i.d. samples)… but is generally biased.
• (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér–Rao lower bound.
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
Given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.
• Likelihood function for $n$ (joint density): $L(n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_i y_i}$.
• Take a logarithm (does not change the maximizer): $\ln L(n) = k \ln n - n \sum_i y_i$.
• Differentiate to find the maximum: $\frac{k}{n} - \sum_i y_i = 0$.
• MLE estimate: $\hat{n} = \frac{k}{\sum_i y_i}$.
We get the same estimator as before; it depends on the sample only through the sum!
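A possible end-to-end estimator for a k-mins sketch with exponential hashes: a uniform hash value is converted to $\mathrm{Exp}(1)$ via $-\ln u$ (the “relation with uniform” above), and the distinct count is estimated by the MLE $\frac{k}{\sum_i y_i}$ or by the unbiased variant $\frac{k-1}{\sum_i y_i}$ discussed later in the lecture. Helper names are illustrative.

```python
import hashlib, math

def exp_hash(x, seed):
    """Illustrative hash of an element ID to an Exp(1) value: u uniform in (0,1], then -ln(u)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64      # uniform in (0, 1]
    return -math.log(u)

def kmins_exp_sketch(stream, k):
    """k-mins sketch with k 'independent' exponential hash functions (one per seed)."""
    sketch = [float("inf")] * k
    for x in stream:
        for i in range(k):
            sketch[i] = min(sketch[i], exp_hash(x, seed=i))
    return sketch

def estimate_distinct(sketch, unbiased=True):
    k, s = len(sketch), sum(sketch)
    return (k - 1) / s if unbiased else k / s                # unbiased vs. MLE estimate

stream = [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]
print(round(estimate_distinct(kmins_exp_sketch(stream, k=200)), 2))  # should be near 6
```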
Given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.
We can think of several ways to combine these samples and decrease the variance:
• average (sum),
• median,
• remove outliers and average the remaining, …
We want to get the most value (best estimate) from the information we have (the sketch). Which combinations should we consider?
Sufficient Statistic
A function $T(y_1, \dots, y_k)$ is a sufficient statistic for estimating (any function of) the parameter $\theta$ if the likelihood function has the factored form
$L(\theta; y_1, \dots, y_k) = g\big(T(y_1, \dots, y_k), \theta\big)\, h(y_1, \dots, y_k)$.
Likelihood function (joint density) for $k$ i.i.d. exponential random variables from $\mathrm{Exp}(n)$:
$L(n; y_1, \dots, y_k) = n^k e^{-n \sum_i y_i}$.
The sum $T = \sum_i y_i$ is a sufficient statistic for $n$ (here $h \equiv 1$).
Sufficient Statistic
A function $T$ is a sufficient statistic for $\theta$ if the likelihood function has the factored form $L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta)\, h(\mathbf{y})$.
In particular, the MLE depends on the sample $\mathbf{y}$ only through $T$:
• The factor $h(\mathbf{y})$ does not depend on $\theta$, so it does not affect the maximization over $\theta$.
• The maximum of $g(T, \theta)$, computed by differentiating with respect to $\theta$, is a function of $T$ only.
Sufficient Statistic
$T$ is a sufficient statistic for $\theta$ if the likelihood function has the form $L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta)\, h(\mathbf{y})$.
Lemma: The conditional distribution of the sample $\mathbf{y}$ given $T(\mathbf{y}) = t$ does not depend on $\theta$.
If we fix $T(\mathbf{y}) = t$, the density is proportional to $h(\mathbf{y})$, which does not involve $\theta$; if we know a density up to a fixed factor, it is determined completely by normalizing it to 1.
Rao-Blackwell Theorem
Recap: $T$ is a sufficient statistic for $\theta$, so the conditional distribution of the sample given $T$ does not depend on $\theta$.
Rao-Blackwell Theorem: Given an estimator $\hat\theta(\mathbf{y})$ of $\theta$ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on $T$:
$\hat\theta'(t) = \mathbb{E}\big[\hat\theta(\mathbf{y}) \mid T(\mathbf{y}) = t\big]$.
• $\hat\theta'$ does not depend on $\theta$ (critical; this is exactly where sufficiency is used).
• This process is called Rao-Blackwellization of $\hat\theta$.
Rao-Blackwell Theorem (illustration)
A sequence of slides (figures not reproduced here) illustrates the construction on a small discrete sample space of points: first the density function of the sample for a given parameter value, then the partition of the sample space induced by the sufficient statistic $T$, then the values an estimator assigns to the individual sample points, and finally the Rao-Blackwellized estimator, which replaces the estimate at each point by the average of the estimates over that point's $T$-class.
Rao-Blackwell Theorem
Rao-Blackwell: $\hat\theta'(t) = \mathbb{E}\big[\hat\theta(\mathbf{y}) \mid T(\mathbf{y}) = t\big]$.
• Law of total expectation: the expectation (and hence the bias) remains the same.
• The MSE (mean square error) can only decrease.
Why does the MSE decrease?
• Suppose we have two sample points with equal probabilities, and an estimator of the true value $q$ that gives estimates $a$ and $b$ on these points.
• We replace it by an estimator that instead returns the average $\frac{a+b}{2}$ on both points.
• The (scaled) contribution of these two points to the square error changes from $(a-q)^2 + (b-q)^2$ to $2\big(\frac{a+b}{2} - q\big)^2$.
Why does the MSE decrease?
Show that $2\big(\frac{a+b}{2} - q\big)^2 \le (a-q)^2 + (b-q)^2$.
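A one-line verification, in the notation introduced above ($q$ the true value, $a$ and $b$ the two estimates):

```latex
(a-q)^2 + (b-q)^2 - 2\Big(\tfrac{a+b}{2}-q\Big)^2
  \;=\; \tfrac{1}{2}\big((a-q)-(b-q)\big)^2
  \;=\; \tfrac{(a-b)^2}{2} \;\ge\; 0 .
```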
Sufficient Statistic for estimating $n$ from k-mins sketches
Given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.
• The sum $\sum_i y_i$ is a sufficient statistic for estimating any function of $n$.
• Rao-Blackwell ⇒ we cannot gain by using estimators with a different dependence on the sample (e.g., functions of the median or of a sum over a subset of the values).
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
MLE estimate: $\hat{n} = \frac{k}{\sum_i y_i}$.
• The sum $s = \sum_i y_i$ of $k$ i.i.d. $\mathrm{Exp}(n)$ random variables has the Gamma$(k, n)$ PDF $f(s) = \frac{n^k s^{k-1} e^{-n s}}{(k-1)!}$.
• The expectation of the MLE estimate is $\mathbb{E}\big[\tfrac{k}{s}\big] = \frac{k}{k-1}\, n$, so the MLE over-estimates $n$ by a factor of $\frac{k}{k-1}$.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased estimator (for $k > 1$): $\hat{n} = \frac{k-1}{\sum_i y_i}$.
The variance of the unbiased estimate is $\frac{n^2}{k-2}$ (for $k > 2$), so the CV is $\frac{1}{\sqrt{k-2}}$.
Is this the best we can do?
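A quick simulation of this estimator's behavior (standard library only; the $\mathrm{Exp}(n)$ samples model the $k$ min-hash values over $n$ distinct elements): the empirical mean should be close to $n$ and the empirical CV close to $\frac{1}{\sqrt{k-2}}$.

```python
import math, random, statistics

def simulate_unbiased_estimator(n, k, trials=20_000):
    """Empirical mean and CV of (k-1)/sum(y) where y_1..y_k are i.i.d. Exp(n)."""
    estimates = [(k - 1) / sum(random.expovariate(n) for _ in range(k))
                 for _ in range(trials)]
    return statistics.mean(estimates), statistics.pstdev(estimates) / n

print(simulate_unbiased_estimator(n=1000, k=50), 1 / math.sqrt(50 - 2))
# mean should be close to 1000, CV close to 1/sqrt(48) ~ 0.144
```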
Cramér-Rao Lower Bound (CRLB)
Are we using the information in the sketch in the best possible way?
Cramér-Rao Lower Bound (CRLB)
An information-theoretic lower bound on the variance of any unbiased estimator of a parameter $\theta$.
• Likelihood function: $L(\theta; \mathbf{y})$.
• Log likelihood: $\ell(\theta; \mathbf{y}) = \ln L(\theta; \mathbf{y})$.
• Fisher information: $I(\theta) = -\mathbb{E}\big[\tfrac{\partial^2}{\partial \theta^2}\, \ell(\theta; \mathbf{y})\big]$.
• CRLB: any unbiased estimator $\hat\theta$ has $\mathrm{Var}[\hat\theta] \ge \frac{1}{I(\theta)}$.
CRLB for estimating $n$
• Likelihood function for $n$: $L(n) = n^k e^{-n \sum_i y_i}$.
• Log likelihood: $\ell(n) = k \ln n - n \sum_i y_i$.
• Negated second derivative: $-\frac{\partial^2 \ell}{\partial n^2} = \frac{k}{n^2}$.
• Fisher information: $I(n) = \frac{k}{n^2}$.
• CRLB: any unbiased estimator has $\mathrm{Var}[\hat{n}] \ge \frac{n^2}{k}$, i.e., CV $\ge \frac{1}{\sqrt{k}}$.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased estimator (for $k > 2$): $\hat{n} = \frac{k-1}{\sum_i y_i}$.
Our estimator has CV $\frac{1}{\sqrt{k-2}}$, while the Cramér-Rao lower bound on the CV is $\frac{1}{\sqrt{k}}$ ⇒ we are using the information in the sketch nearly optimally!