
Leveraging Big Data: Lecture 2


Presentation Transcript


  1. http://www.cohenwang.com/edith/bigdataclass2013 Leveraging Big Data: Lecture 2. Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo.

  2. Counting Distinct Elements 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7 • Elements occur multiple times; we want to count the number of distinct elements. • The number of distinct elements is n (n = 6 in the example). • The total number of elements is 11 in this example. Exact counting of distinct elements requires a structure whose size is linear in n. We are happy with an approximate count that uses a small working memory.

  3. Distinct Elements: Approximate Counting 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7, We want to be able to compute and maintain a small sketch of the set of distinct items seen so far

  4. Distinct Elements: Approximate Counting • Size of the sketch: much smaller than n • Can query the sketch to get a good estimate of n (small relative error) • For a new element x, the sketch of N ∪ {x} is easy to compute from the sketch of N and x ⇒ suitable for data stream computation • If N1 and N2 are (possibly overlapping) sets, then we can compute the union sketch from their sketches: the sketch of N1 ∪ N2 from the sketches of N1 and N2 ⇒ suitable for distributed computation

  5. Distinct Elements: Approximate Counting 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7 Size-estimation/Minimum value technique [Flajolet-Martin 85, C 94]: h is a random hash function from element IDs to uniform random numbers in [0,1]. Maintain the Min-Hash value y: • Initialize: y ← 1 • Processing an element x: y ← min(y, h(x))
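
A minimal sketch of this minimum-value technique (my own illustration, not code from the lecture; the helper uniform_hash simulates the random hash function h by mapping each element ID to a fixed pseudo-random value in [0,1)):

```python
import hashlib

def uniform_hash(x):
    """Stand-in for h: a fixed pseudo-random value in [0,1) per element ID."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class MinHashValue:
    def __init__(self):
        self.y = 1.0                      # initialize y <- 1

    def process(self, x):
        self.y = min(self.y, uniform_hash(x))   # y <- min(y, h(x))

stream = [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]
sketch = MinHashValue()
for x in stream:
    sketch.process(x)
print(sketch.y)   # unaffected by repeats; for n distinct elements, E[y] = 1/(n+1)
```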

  6. Distinct Elements: Approximate Counting 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7 The minimum hash value y = min_x h(x) is: • unaffected by repeated elements • non-increasing in the number of distinct elements n.

  7. Distinct Elements: Approximate Counting How does the minimum hash value give information on the number of distinct elements n? The expectation of the minimum of n uniform hash values is E[y] = 1/(n+1). A single value gives only limited information. To boost the information, we maintain k values.

  8. Why is the expectation 1/(n+1)? • Take a circle of circumference 1 • Throw a random red point to "mark" the point 0 (circle points map to [0,1)) • Throw n more points independently at random (the hash values) • The circle is cut into n+1 segments by these points • The expected length of each segment is 1/(n+1) • The same holds for the segment clockwise from the red point, whose length is distributed like the minimum hash value.
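
A quick numerical sanity check of this fact (an illustration with numpy, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 6, 200_000
mins = rng.random((trials, n)).min(axis=1)   # minimum of n uniform [0,1) values
print(mins.mean())                           # ~ 1/(n+1) ~ 0.1429
print(1 / (n + 1))
```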

  9. Min-Hash Sketches These sketches maintain k values from the range of the hash function (distribution). k-mins sketch: Use k "independent" hash functions h_1, …, h_k. Track the respective minimum y_i for each function h_i. Bottom-k sketch: Use a single hash function h. Track the k smallest hash values y_1 < y_2 < … < y_k. k-partition sketch: Use a single hash function h. Use the first log2(k) bits of h(x) to map x uniformly to one of k parts. Call the remaining bits h'(x). For i = 1, …, k: track the minimum value y_i of h'(x) over the elements in part i. All sketches are the same for k = 1.

  10. Min-Hash Sketches k-mins, bottom-k, k-partition Why study all 3 variants? Different tradeoffs between update cost, accuracy, usage… Beyond distinct counting: • Min-Hash sketches correspond to sampling schemes of large data sets • Similarity queries between datasets • Selectivity/subset queries • These patterns generally apply as methods to gather increased confidence from a random "projection"/sample.

  11. Min-Hash Sketches: Examples k-mins, k-partition, bottom-k [Example used on slides 11-16: the distinct elements 6, 4, 32, 7, 12, 14 with their hash values.] The min-hash value and the sketches depend only on: • the random hash function(s) • the set of distinct elements. They do not depend on the order in which elements appear or on their multiplicity.

  12. Min-Hash Sketches: Example, k-mins. [Slide shows the k-mins sketch computed for the elements 6, 4, 32, 7, 12, 14 under k hash functions.]

  13. Min-Hash Sketches: k-mins k-mins sketch: Use k "independent" hash functions h_1, …, h_k. Track the respective minimum y_i for each function. Processing a new element x: For i = 1, …, k: y_i ← min(y_i, h_i(x)). Computation: O(k) per element, whether the sketch is actually updated or not.
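
A possible k-mins implementation along these lines (my own sketch; uniform_hash with a salt stands in for the k "independent" hash functions h_i):

```python
import hashlib

def uniform_hash(x, salt):
    """Pseudo-random value in [0,1): simulates hash function h_salt."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class KMinsSketch:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k           # y_i <- 1 for each hash function h_i

    def process(self, x):
        # O(k) work per element, whether or not any y_i actually changes.
        for i in range(self.k):
            self.y[i] = min(self.y[i], uniform_hash(x, i))

sk = KMinsSketch(k=4)
for x in [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]:
    sk.process(x)
print(sk.y)
```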

  14. Min-Hash Sketches: Example, k-partition. [Slide shows, for the elements 6, 4, 32, 7, 12, 14, each element's part-hash and value-hash and the resulting per-part minima.]

  15. Min-Hash Sketches: k-partition k-partition sketch: Use a single hash function h. Use the first log2(k) bits of h(x) to map x uniformly to one of k parts i(x). Call the remaining bits h'(x). For i = 1, …, k: track the minimum value y_i of h'(x) over the elements in part i. Processing a new element x: y_{i(x)} ← min(y_{i(x)}, h'(x)). Computation: O(1) to test or update.
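
A corresponding k-partition sketch (again an illustrative sketch, not the lecture's code; the part index and the value are taken from two parts of one hash digest, mimicking the "first bits" and "remaining bits"):

```python
import hashlib

class KPartitionSketch:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k           # one minimum per part

    def _part_and_value(self, x):
        d = hashlib.sha256(str(x).encode()).digest()
        part = d[0] % self.k                              # "first bits": part index
        value = int.from_bytes(d[1:9], "big") / 2**64     # "remaining bits": h'(x)
        return part, value

    def process(self, x):
        # O(1) work: test/update only the minimum of x's part.
        part, value = self._part_and_value(x)
        if value < self.y[part]:
            self.y[part] = value

sk = KPartitionSketch(k=4)
for x in [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]:
    sk.process(x)
print(sk.y)   # parts with no elements keep the initial value 1.0
```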

  16. Min-Hash Sketches: Example, Bottom-k. [Slide shows the k smallest hash values among the elements 6, 4, 32, 7, 12, 14.]

  17. Min-Hash Sketches: bottom-k Bottom-k sketch: Use a single hash function h. Track the k smallest hash values y_1 < … < y_k. Processing a new element x: If h(x) < y_k: insert h(x) into the sketch and discard y_k. Computation: The sketch is maintained as a sorted list or as a priority queue. • O(1) to test whether an update is needed • O(k) to update a sorted list; O(log k) to update a priority queue. We will see that #changes ≪ #distinct elements.
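
A bottom-k sketch kept in a priority queue (illustrative only; Python's heapq is a min-heap, so the retained values are negated to get a max-heap whose top is y_k):

```python
import hashlib
import heapq

def uniform_hash(x):
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class BottomKSketch:
    def __init__(self, k):
        self.k = k
        self.heap = []        # max-heap (negated) of the k smallest hash values
        self.values = set()   # retained values, to skip repeated elements quickly

    def process(self, x):
        v = uniform_hash(x)
        if v in self.values:
            return                           # repeated element: no change
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)    # O(log k)
            self.values.add(v)
        elif v < -self.heap[0]:              # O(1) test against current y_k
            evicted = -heapq.heapreplace(self.heap, -v)   # O(log k)
            self.values.discard(evicted)
            self.values.add(v)

    def sketch(self):
        return sorted(-h for h in self.heap)

sk = BottomKSketch(k=3)
for x in [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7]:
    sk.process(x)
print(sk.sketch())
```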

  18. Min-Hash Sketches: Number of updates Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k log n). Proof: First consider k = 1. Look at the distinct elements in the order they first occur. The i-th distinct element has a lower hash value than the current minimum with probability 1/i: this is the probability of being first in a random permutation of i elements. The total expected number of updates is Σ_{i=1..n} 1/i = H_n ≈ ln n. Example stream: 4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7. Update probabilities: 1, 1/2, 1/3, 1/4, 0, 1/5, 0, 1/6, 0, 0, 0 (repeated elements never cause an update).
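
A small simulation of this claim for k = 1 (my own illustration; it counts how often the running minimum changes over n distinct elements in random order and compares to the harmonic number H_n):

```python
import random

n, trials = 1000, 200
total_updates = 0
for _ in range(trials):
    hashes = [random.random() for _ in range(n)]   # hash values of n distinct elements
    y, updates = 1.0, 0
    for h in hashes:
        if h < y:
            y, updates = h, updates + 1
    total_updates += updates

harmonic = sum(1 / i for i in range(1, n + 1))     # H_n ~ ln n
print(total_updates / trials, harmonic)            # both ~ 7.5 for n = 1000
```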

  19. Min-Hash Sketches: Number of updates Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k log n). Proof (continued): Recap for k = 1 (single min-hash value): the i-th distinct element causes an update with probability 1/i; the expected total is H_n ≈ ln n. k-mins: k min-hash values, so apply the argument k times: ≈ k ln n. Bottom-k: We keep the k smallest values, so the update probability of the i-th distinct element is min(1, k/i), the probability of being among the first k in a random permutation of i elements; the expected total is ≈ k ln(n/k) + k. k-partition: k min-hash values, each over ≈ n/k distinct elements: ≈ k ln(n/k).

  20. Merging Min-Hash Sketches !! We apply the same set of hash functions to all elements/data sets/streams. The union sketch s(N1 ∪ N2) from the sketches of two sets N1, N2: • k-mins: take the minimum per hash function • k-partition: take the minimum per part • Bottom-k: the k smallest values in the union must be among the k smallest values of their own set, so take the k smallest values in s(N1) ∪ s(N2).
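
Possible merge routines for the three variants, operating on the raw value lists (my own sketch; it assumes both sketches were built with the same hash functions, as the slide stresses):

```python
import heapq

def merge_kmins(y1, y2):
    # Elementwise minimum, one entry per hash function h_i.
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_kpartition(y1, y2):
    # Elementwise minimum, one entry per part.
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    # The k smallest values of the union are among the union of the two sketches.
    return heapq.nsmallest(k, set(y1) | set(y2))

print(merge_kmins([0.1, 0.5, 0.3], [0.2, 0.4, 0.6]))       # [0.1, 0.4, 0.3]
print(merge_bottomk([0.1, 0.2, 0.5], [0.2, 0.3, 0.4], 3))  # [0.1, 0.2, 0.3]
```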

  21. Using Min-Hash Sketches • Recap: • We defined Min-Hash Sketches (3 types) • Adding elements, merging Min-Hash sketches • Some properties of these sketches • Next: We put Min-Hash sketches to work • Estimating Distinct Count from a Min-Hash Sketch • Tools from estimation theory

  22. The Exponential Distribution Exp(n) • PDF: f(x) = n e^{-n x}; CDF: F(x) = 1 - e^{-n x} (for x ≥ 0) • Very useful properties: • Memorylessness: Pr[X > s + t | X > s] = Pr[X > t] • Min-to-Sum conversion: the minimum of independent Exp(n_1), …, Exp(n_r) variables is Exp(n_1 + … + n_r) • Relation with uniform: if U ~ U[0,1] then -ln(1 - U)/n ~ Exp(n).
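
A quick numerical check of the Min-to-Sum and uniform-relation properties (an illustration with numpy, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples = 6, 200_000

# Min-to-Sum: the minimum of n i.i.d. Exp(1) variables is Exp(n), with mean 1/n.
mins = rng.exponential(1.0, size=(samples, n)).min(axis=1)
print(mins.mean(), 1 / n)

# Relation with uniform: -ln(1 - U) with U ~ U[0,1] is Exp(1), with mean 1.
u = rng.random(samples)
print((-np.log(1 - u)).mean())
```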

  23. Estimating Distinct Count from a Min-Hash Sketch: k-mins • Change the hash range to the exponential distribution: h(x) ~ Exp(1). • Using the Min-to-Sum property, the minimum y over n distinct elements is Exp(n). • In fact, we can keep a uniform hash h(x) ~ U[0,1] and use -ln(1 - y) when estimating. • Counting the number of distinct elements becomes a parameter estimation problem: Given k independent samples y_1, …, y_k from Exp(n), estimate n.

  24. Estimating Distinct Count from a Min-Hash Sketch: k-mins • Each y_i has expectation 1/n and variance 1/n^2. • The average ȳ = (1/k) Σ_i y_i has expectation 1/n and variance 1/(k n^2). The CV is 1/√k. • So ȳ is a good unbiased estimator for 1/n. • But 1/n is the inverse of what we want. What about estimating n?

  25. Estimating Distinct Count from a Min-Hash Sketch: k-mins What about estimating n? • We can use the biased estimator 1/ȳ = k / Σ_i y_i. • To say something useful on the estimate quality: we apply Chebyshev's inequality to bound the probability that ȳ is far from its expectation 1/n, and thus that 1/ȳ is far from n. • Another tool: Maximum Likelihood Estimation (a general and powerful technique).

  26. Chebyshev's Inequality For any random variable X with expectation μ and standard deviation σ, and for any c > 0: Pr[|X - μ| ≥ c σ] ≤ 1/c^2. For example, for c = 2 the probability of deviating by more than 2σ is at most 1/4; for c = 4 it is at most 1/16. We use this for the average ȳ.

  27. Using Chebyshev's Inequality For ȳ, with expectation 1/n and standard deviation 1/(n√k): Pr[|ȳ - 1/n| ≥ c/(n√k)] ≤ 1/c^2. So with probability at least 1 - 1/c^2, the relative error of ȳ is at most c/√k.

  28. Maximum Likelihood Estimation Set of independent samples y_1, …, y_k from a distribution with unknown parameter θ. The MLE θ̂ is the value of θ that maximizes the likelihood (joint density) function f(y_1, …, y_k; θ): the maximum over θ of the "probability" of observing y_1, …, y_k. Properties: • Principled way of deriving estimators • Converges in probability to the true value (with enough i.i.d. samples)… but is generally biased • (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound.

  29. Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE Given k independent samples y_1, …, y_k from Exp(n), estimate n. • Likelihood function (joint density): f(y; n) = Π_i n e^{-n y_i} = n^k e^{-n Σ_i y_i} • Take a logarithm (does not change the maximum): ln f = k ln n - n Σ_i y_i • Differentiate and set to zero to find the maximum: k/n - Σ_i y_i = 0 • MLE estimate: n̂ = k / Σ_i y_i. We get the same estimator as before; it depends only on the sum!

  30. Given k independent samples y_1, …, y_k from Exp(n), estimate n. We can think of several ways to combine these samples and decrease the variance: • average (sum) • median • remove outliers and average the rest, … We want to get the most value (best estimate) from the information we have (the sketch). Which combinations should we consider?

  31. Sufficient Statistic A function T(y_1, …, y_k) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form f(y; θ) = g(T(y), θ) · h(y). Likelihood function (joint density) for k i.i.d. exponential random variables from Exp(n): f(y; n) = n^k e^{-n Σ_i y_i}. The sum Σ_i y_i is a sufficient statistic for n.

  32. Sufficient Statistic A function T(y) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form f(y; θ) = g(T(y), θ) · h(y). In particular, the MLE depends on y only through T(y): • The factor h(y) does not depend on θ, so the maximum with respect to θ does not depend on it. • The maximum of g(T(y), θ), computed by differentiating with respect to θ, is a function of T(y).

  33. Sufficient Statistic T(y) is a sufficient statistic for θ if the likelihood function has the form f(y; θ) = g(T(y), θ) · h(y). Lemma: The conditional distribution of y given T(y) does not depend on θ. If we fix T(y) = t, the density over {y : T(y) = t} is proportional to h(y), which does not involve θ. If we know the density up to a fixed factor, it is determined completely by normalizing to 1.

  34. Rao-Blackwell Theorem Recap: T(y) is a sufficient statistic for θ ⇒ the conditional distribution of y given T(y) does not depend on θ. Rao-Blackwell Theorem: Given an estimator θ̂(y) of θ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on T(y): θ̂'(T(y)) = E[θ̂(y) | T(y)]. • The conditional expectation does not depend on θ (critical, since θ is unknown). • The process is called Rao-Blackwellization of θ̂.

  35. Rao-Blackwell Theorem [Figure: sample points (2,2), (2,1), (1,3), (3,2), (3,0), (4,0), (1,2), (1,4), (3,1) with the density of each point under the parameter.]

  36. Rao-Blackwell Theorem [Figure: the same sample points, partitioned according to the value of the sufficient statistic T.]

  37. Rao-Blackwell Theorem [Figure: continuation of the partition by T.]

  38. Rao-Blackwell Theorem [Figure: continuation of the partition by T.]

  39. Rao-Blackwell Theorem [Figure: an estimator's value shown next to each sample point within each T-class.]

  40. Rao-Blackwell Theorem [Figure: the Rao-Blackwellized estimator, which replaces each estimate by the average of the estimates within its T-class.]

  41. Rao-Blackwell Theorem The Rao-Blackwellized estimator: θ̂'(T(y)) = E[θ̂(y) | T(y)]. • Law of total expectation: the expectation (bias) remains the same. • The MSE (Mean Square Error) can only decrease.

  42. Why does the MSE decrease? • Suppose we have two sample points with equal probabilities, and an estimator of θ that gives estimates a and b on these points. • We replace it by an estimator that instead returns the average (a+b)/2 on both points. • The (scaled) contribution of these two points to the square error changes from (a-θ)^2 + (b-θ)^2 to 2((a+b)/2 - θ)^2.

  43. Why does the MSE decrease? Exercise: show that 2((a+b)/2 - θ)^2 ≤ (a-θ)^2 + (b-θ)^2.
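
One way to verify this inequality, expanding the square (a short LaTeX derivation using the a, b, θ notation above):

```latex
2\left(\frac{a+b}{2}-\theta\right)^2
  = \frac{\bigl((a-\theta)+(b-\theta)\bigr)^2}{2}
  = \frac{(a-\theta)^2+(b-\theta)^2+2(a-\theta)(b-\theta)}{2}
  \le (a-\theta)^2+(b-\theta)^2 ,
```

where the last step uses 2(a-θ)(b-θ) ≤ (a-θ)^2 + (b-θ)^2, which is just ((a-θ)-(b-θ))^2 ≥ 0.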

  44. Sufficient Statistic for estimating n from k-mins sketches Given k independent samples y_1, …, y_k from Exp(n), estimate n. • The sum Σ_i y_i is a sufficient statistic for estimating any function of n (including n and 1/n). • Rao-Blackwell: we cannot gain by using estimators with a different dependence on the samples (e.g. functions of the median or of a sum over a subset of the samples).

  45. Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE MLE estimate: n̂ = k / Σ_i y_i. • The sum s = Σ_i y_i of k i.i.d. Exp(n) random variables has PDF f(s) = n^k s^{k-1} e^{-n s} / (k-1)!. The expectation of the MLE estimate is E[k/s] = k n / (k-1), so the MLE is biased upward by a factor of k/(k-1).

  46. Estimating Distinct Count from a Min-Hash Sketch: k-mins Unbiased Estimator (for k > 1): n̂ = (k-1) / Σ_i y_i. The variance of the unbiased estimate is n^2 / (k-2) (for k > 2). The CV is 1/√(k-2). Is this the best we can do?
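
Putting the pieces together: a k-mins sketch with exponentially distributed hashes and the unbiased estimate (k-1)/Σ y_i (an illustrative end-to-end sketch; exp_hash and kmins_sketch are my own stand-ins, not code from the lecture):

```python
import hashlib
import math

def exp_hash(x, salt):
    """Exp(1)-distributed pseudo-random hash of x: -ln(1 - U) with U uniform."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    u = int.from_bytes(d[:8], "big") / 2**64
    return -math.log(1.0 - u)

def kmins_sketch(stream, k):
    y = [math.inf] * k
    for x in stream:
        for i in range(k):
            y[i] = min(y[i], exp_hash(x, i))
    return y

def estimate_distinct(y):
    k = len(y)
    return (k - 1) / sum(y)    # unbiased for k > 1; CV is 1/sqrt(k-2) for k > 2

stream = [4, 32, 6, 12, 12, 14, 32, 7, 12, 32, 7] * 3   # repeats do not matter
print(estimate_distinct(kmins_sketch(stream, k=200)))    # ~ 6 distinct elements
```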

  47. Cramér-Rao lower bound (CRLB) Are we using the information in the sketch in the best possible way?

  48. Cramér-Rao lower bound (CRLB) An information-theoretic lower bound on the variance of any unbiased estimator of a parameter θ. Likelihood function: f(y; θ). Log likelihood: ℓ(y; θ) = ln f(y; θ). Fisher information: I(θ) = -E[∂^2 ℓ / ∂θ^2]. CRLB: Any unbiased estimator θ̂ has Var[θ̂] ≥ 1/I(θ).

  49. CRLB for estimating n • Likelihood function for k i.i.d. samples from Exp(n): f(y; n) = n^k e^{-n Σ_i y_i} • Log likelihood: k ln n - n Σ_i y_i • Negated second derivative: k/n^2 • Fisher information: I(n) = k/n^2 • CRLB: Var[n̂] ≥ n^2/k, i.e. the CV is at least 1/√k.
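
The same computation written out in full (a LaTeX rendering of the steps above, for k i.i.d. Exp(n) samples):

```latex
\ln f(y;n) = k\ln n - n\sum_{i=1}^{k} y_i, \qquad
-\frac{\partial^2}{\partial n^2}\ln f(y;n) = \frac{k}{n^2}, \qquad
I(n) = \mathbb{E}\!\left[\frac{k}{n^2}\right] = \frac{k}{n^2},
```

so any unbiased estimator n̂ satisfies Var[n̂] ≥ n^2/k, and the CV is at least 1/√k.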

  50. Estimating Distinct Count from a Min-Hash Sketch: k-mins Unbiased Estimator (for k > 2): n̂ = (k-1)/Σ_i y_i has CV 1/√(k-2). The Cramér-Rao lower bound on the CV is 1/√k ⇒ we are using the information in the sketch nearly optimally!
