
Algorithms for Large Data Sets



  1. Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006 http://www.ee.technion.ac.il/courses/049011

  2. Data Streams (cont.)

  3. Outline • Distinct elements • Lp norms • Notation: for integers a < b, [a,b] = {a, a+1, …, b}

  4. Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02] • Input: a vector x ∈ [1,m]^n • Goal: find D = number of distinct elements of x • Exact algorithms: need Ω(m) bits of space • Deterministic algorithms: need Ω(m) bits of space • Approximate randomized algorithms: O(log m) bits of space

  5. Distinct Elements, 1st Attempt • Let M >> m² • Pick a “random hash function” h: [1,m] → [1,M] • h(1),…,h(m) are chosen uniformly and independently from [1,M] • Since M >> m², the probability of a collision is tiny • min ← M • for i = 1 to n do • read x_i from stream • if h(x_i) < min, min ← h(x_i) • output M/min
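A minimal Python sketch of this first attempt (the function name and the choice M = 100m² are illustrative; a dictionary stands in for the truly random hash, which is exactly the space problem identified on the next slide):

import random

def distinct_elements_first_attempt(stream, m):
    # A truly random hash h: [1,m] -> [1,M], simulated by a lazily
    # filled table; storing it is what costs O(m log m) bits.
    M = 100 * m * m                     # M >> m^2 keeps collisions unlikely
    h = {}
    minimum = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        minimum = min(minimum, h[x])
    # The minimum of D uniform values in [1,M] is around M/(D+1),
    # so M/minimum estimates the number of distinct elements D.
    return M / minimum

For example, distinct_elements_first_attempt([3, 1, 3, 2, 1], m=10) should typically return a value near 3.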

  6. Distinct Elements: Analysis • Space: • O(log M) = O(log m) for min • O(m log M) = O(m log m) for h • Too much! • Worse than the naïve O(m) space algorithm • Next: show how to use more “space-efficient” hash functions

  7. Small Families of Hash Functions • H = {h | h: [1,m] → [1,M]}: a family of hash functions • |H| = O(m^c) for some constant c • Therefore, each h ∈ H can be represented in O(log m) bits • Need H to be “explicit”: given the representation of h, we can compute h(x) efficiently for any x • How do we make sure H has the “random-like” properties of random hash functions?

  8. Universal Hash Functions [Carter, Wegman 79] • H is a 2-universal family of hash functions if: for all x ≠ y ∈ [1,m] and for all z,w ∈ [1,M], when h is chosen from H uniformly at random, Pr[h(x) = z and h(y) = w] = 1/M² • Conclusions: • For each x, h(x) is uniform in [1,M] • For all x ≠ y, h(x) and h(y) are independent • h(1),…,h(m) is a sequence of uniform pairwise-independent random variables • k-universal families: straightforward generalization

  9. Construction of a Universal Family • Suppose M is a prime power • [1,M] can be viewed as a finite field F_M • [1,m] can be viewed as a subset of F_M • H = { h_{a,b} | a,b ∈ F_M }, where h_{a,b}(x) = ax + b • Note: • |H| = M² • If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w • Since x ≠ y, this linear system has a unique solution (a,b) • Hence, Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M².
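A sketch of this construction in Python (M must be prime here; storing h amounts to storing the pair (a,b), i.e., O(log M) bits):

import random

def make_universal_hash(M):
    # Draw h_{a,b}(x) = (a*x + b) mod M with a,b uniform in F_M.
    # For x != y, the pair (h(x), h(y)) is uniform over F_M x F_M,
    # which is exactly the 2-universality property.
    a = random.randrange(M)
    b = random.randrange(M)
    return lambda x: (a * x + b) % M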

  10. Distinct Elements, 2nd Attempt • Use a 2-universal hash function rather than a truly random hash function • Space: • O(log m) for tracking the minimum • O(log m) for storing the hash function • Correctness (let a_1,…,a_D be the distinct elements of x): • Part 1 (expectation): • h(a_1),…,h(a_D) are still uniform in [1,M] • Linearity of expectation holds regardless of whether Z_1,…,Z_k are independent or not • Part 2 (variance): • h(a_1),…,h(a_D) are now pairwise independent • Main point: the variance of pairwise-independent variables is additive: Var[Σ_i Z_i] = Σ_i Var[Z_i]
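Combining the two previous sketches gives this second attempt; only the swapped-in hash changes (the fixed Mersenne prime M = 2^61 - 1 is an illustrative choice that is >> m² for any realistic universe size):

def distinct_elements_second_attempt(stream):
    M = (1 << 61) - 1                   # a prime >> m^2
    h = make_universal_hash(M)          # from the previous sketch: O(log M) bits
    minimum = M
    for x in stream:
        minimum = min(minimum, h(x))    # h is evaluated on the fly, no stored table
    return M / minimum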

  11. Distinct Elements, Better Approximation • So far we had a factor-6 approximation • How do we get a better one? • (1 + ε)-approximation algorithm: • Find the t = O(1/ε²) smallest hash values, rather than just the smallest one • If v is the largest among these, output tM/v • Space: O((1/ε²) · log m) • Better algorithm: O(1/ε² + log m)
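A sketch of the (1+ε) version, keeping the t smallest distinct hash values in a bounded max-heap (the constant in t = 1/ε², and the helper make_universal_hash from the earlier sketch, are illustrative):

import heapq

def distinct_elements_eps(stream, eps):
    M = (1 << 61) - 1
    h = make_universal_hash(M)
    t = max(1, int(1 / eps ** 2))
    heap, kept = [], set()              # negated max-heap of the t smallest distinct hashes
    for x in stream:
        v = h(x)
        if v in kept:
            continue                    # repeated element: this hash is already counted
        if len(heap) < t:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:              # smaller than the current t-th smallest
            kept.discard(-heapq.heappushpop(heap, -v))
            kept.add(v)
    if len(heap) < t:
        return float(len(heap))         # fewer than t distinct values: exact count
    return t * M / (-heap[0])           # -heap[0] is v, the t-th smallest hash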

  12. Lp Norms • Input: an integer vector x ∈ [-m,+m]^n • Goal: find ||x||p = the Lp norm of x • Popular instantiations: • L2: Euclidean distance • L1: Manhattan distance • L∞: max_i |x_i| • L0: # of non-zeros (with the convention |x_i|⁰ = 1 for x_i ≠ 0 and 0⁰ = 0) • Not a norm • Data stream algorithm: • Can be done trivially in O(log m) space

  13. Lp Norms: The “Cash Register” Model • Input: a sequence X of N pairs (i_1,a_1),…,(i_N,a_N) • For each j, i_j ∈ {1,…,n} • For each j, a_j ∈ [-m,m] • Ex: X = (1,3), (3,-2), (1,-5), (2,4), (2,1) • For each i = 1,…,n, let S_i = { j | i_j = i } • Ex: S_1 = {1,3}, S_2 = {4,5}, S_3 = {2} • Define: x_i = Σ_{j∈S_i} a_j • Ex: x_1 = -2, x_2 = 5, x_3 = -2 • Goal: find ||x||p = the Lp norm of x
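The model is easy to pin down with a non-streaming reference computation (lp_norm_offline is an illustrative helper, checked against the slide's example):

from collections import defaultdict

def lp_norm_offline(pairs, p):
    # Aggregate the updates per index: x_i = sum of a_j over j in S_i.
    x = defaultdict(int)
    for i, a in pairs:
        x[i] += a
    return sum(abs(v) ** p for v in x.values()) ** (1 / p)

# The slide's example: x = (-2, 5, -2), so ||x||1 = 9.
X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
assert lp_norm_offline(X, 1) == 9.0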

  14. Lp Norms in the “Cash Register” Model: Applications • Standard Lp norms • Lp distances • Input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily) • Goal: find ||x – y||p • Frequency moments: • Input: a vector X ∈ [1,n]^N • Ex: X = (1 2 3 1 1 2) • For each i = 1,…,n, define: x_i = frequency of i in X • Ex: x_1 = 3, x_2 = 2, x_3 = 1 • Goal: output ||x||p • Special cases: • p = ∞: most frequent element • p = 0: distinct elements
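For instance, a frequency-moment input reduces to the cash register model by turning each occurrence into a +1 update (reusing the illustrative lp_norm_offline helper above):

pairs = [(i, 1) for i in (1, 2, 3, 1, 1, 2)]   # yields x = (3, 2, 1)
assert lp_norm_offline(pairs, 1) == 6.0        # ||x||1 = stream length N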

  15. Lp Norms: State of the Art Results • 0 < p ≤ 2: O(log n log m) space algorithm [Indyk 00] • 2 < p < ∞: O(n^{1-2/p} log m) space algorithm [Indyk, Woodruff 05] • Ω(n^{1-2/p-o(1)}) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03] • p = ∞: O(n) space algorithm [Alon, Matias, Szegedy 96] • Ω(n) space lower bound [Alon, Matias, Szegedy 96] • p = 0 (distinct elements): O(log n + 1/ε²) space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02] • Ω(log n + 1/ε²) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03]

  16. Stable Distributions • D: a distribution on R, x ∈ R^n, p ∈ (0,2] • The distribution D_x: • Z_1,…,Z_n: i.i.d. random variables with distribution D • D_x = distribution of Σ_i x_i Z_i • The distribution D_{p,x}: • Z: a random variable with distribution D • D_{p,x} = distribution of ||x||p · Z • Definition: D is p-stable if, for every x, D_x = D_{p,x} • Examples: • p = 2: standard normal distribution • p = 1: Cauchy distribution • Other p’s: no closed-form pdf
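1-stability is easy to check empirically (a sketch with numpy; the vector x and sample count are arbitrary choices). Since a Cauchy variable has no mean, we compare medians of absolute values, which should both come out near ||x||1 · median(|Z|) = 6:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.0, -1.0, 2.0])                  # ||x||1 = 6
Z = rng.standard_cauchy((100_000, 3))
combo = Z @ x                                    # draws of sum_i x_i Z_i
scaled = 6.0 * rng.standard_cauchy(100_000)      # draws of ||x||1 * Z
print(np.median(np.abs(combo)), np.median(np.abs(scaled)))   # both ~ 6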

  17. Indyk’s Algorithm • For simplicity, assume p = 1 • Input: a sequence X = (i_1,a_1),…,(i_N,a_N) • Output: a value z s.t. with probability ≥ 1 – δ, (1 – ε)||x||1 ≤ z ≤ (1 + ε)||x||1 • “Cauchy hash function”: h: [1,n] → R • h(1),…,h(n) are i.i.d. with the Cauchy distribution • In practice, use bounded precision

  18. Indyk’s Algorithm, 1st Attempt • k ← O(1/ε² · log(1/δ)) • generate k Cauchy hash functions h_1,…,h_k • for t = 1,…,k do • A_t ← 0 • for j = 1,…,N do • read (i_j,a_j) from data stream • for t = 1,…,k do • A_t ← A_t + a_j · h_t(i_j) • output median(|A_1|,…,|A_k|)
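A runnable sketch of this first attempt (numpy-based; the constant 8 in k is illustrative, and the Cauchy hash functions are fully materialized, which is exactly the space problem addressed below):

import numpy as np

def indyk_l1(pairs, n, eps, delta, seed=0):
    rng = np.random.default_rng(seed)
    k = int(8 / eps ** 2 * np.log(1 / delta))
    H = rng.standard_cauchy((k, n))     # h_t(i) = H[t, i-1], stored explicitly
    A = np.zeros(k)
    for i, a in pairs:                  # read (i_j, a_j) from the stream
        A += a * H[:, i - 1]            # A_t <- A_t + a_j * h_t(i_j), all t at once
    return np.median(np.abs(A))         # estimates ||x||1

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
print(indyk_l1(X, n=3, eps=0.1, delta=0.05))   # should be close to ||x||1 = 9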

  19. Correctness Analysis • Fix some t ∈ [1,k] • What value does A_t have at the end of the execution? • A_t = Σ_i x_i · h_t(i) • Recall: h_t(1),…,h_t(n) are i.i.d. with a 1-stable distribution • Therefore, A_t is distributed the same as: ||x||1 · Z • Z: a random variable with the Cauchy distribution

  20. Correctness Analysis (cont.) • Z_1,…,Z_k: i.i.d. random variables with the Cauchy distribution • Output of the algorithm: median(|A_1|,…,|A_k|) • Same as: median(||x||1·|Z_1|,…,||x||1·|Z_k|) = ||x||1 · median(|Z_1|,…,|Z_k|) • Conclusion: it is enough to show that Pr[1 – ε ≤ median(|Z_1|,…,|Z_k|) ≤ 1 + ε] ≥ 1 – δ

  21. Correctness Analysis (cont.) • Claim: Let Z be distributed Cauchy. Then median(|Z|) = 1, i.e., Pr[|Z| ≤ 1] = 1/2 • Proof: The cdf of |Z| is F(z) = (2/π) · arctan(z) • Therefore, F(1) = (2/π) · (π/4) = 1/2 • Claim: Let Z be distributed Cauchy. For any sufficiently small ε > 0, Pr[|Z| ≤ 1 – ε] ≤ 1/2 – ε/4 and Pr[|Z| ≤ 1 + ε] ≥ 1/2 + ε/4
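Spelled out in LaTeX (a reconstruction of the computation behind both claims; the constant ε/4 works because the density of |Z| at 1 is 1/π > 1/4):

\[
F(z) = \Pr[|Z| \le z] = \frac{2}{\pi}\arctan(z),
\qquad
F(1) = \frac{2}{\pi}\cdot\frac{\pi}{4} = \frac{1}{2}.
\]
\[
F'(1) = \frac{2}{\pi}\cdot\frac{1}{1+1^2} = \frac{1}{\pi} > \frac{1}{4}
\;\Longrightarrow\;
\Pr[|Z| \le 1-\varepsilon] \le \frac{1}{2} - \frac{\varepsilon}{4},
\quad
\Pr[|Z| \le 1+\varepsilon] \ge \frac{1}{2} + \frac{\varepsilon}{4}
\quad \text{for sufficiently small } \varepsilon.
\]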

  22. Correctness Analysis (cont.) • Claim: Let Z_1,…,Z_k be k = O(1/ε² · log(1/δ)) i.i.d. Cauchy random variables. Then Pr[1 – ε ≤ median(|Z_1|,…,|Z_k|) ≤ 1 + ε] ≥ 1 – δ • Proof: • For j = 1,…,k, let Y_j = 1 if |Z_j| < 1 – ε and Y_j = 0 otherwise • Then, median(|Z_1|,…,|Z_k|) < 1 – ε iff Σ_j Y_j ≥ k/2 • E[Σ_j Y_j] ≤ k/2 – kε/4 • By the Chernoff-Hoeffding bound, Pr[Σ_j Y_j ≥ k/2] < δ/2 • A similar analysis shows: Pr[median(|Z_1|,…,|Z_k|) > 1 + ε] < δ/2

  23. Space Analysis • Space used: k = O(1/ε² · log(1/δ)) times: • A_t: O(log m) bits • h_t: O(n log m) bits • Too much! • This time we really need h_t(1),…,h_t(n) to be totally independent • Otherwise, the resulting distribution is not stable • Cannot use universal hashing • What can we do?

  24. Pseudo-Random Generators for Space-Bounded Computations [Nisan 90] • Notation: U_k = a random sequence of k bits • An S-space, R-random-bits randomized algorithm A: • Uses at most S bits of space • Uses at most R random bits • Accesses its random bits sequentially • A(x,U_R): the (random) output of A on input x • Nisan’s pseudo-random generator: G: {0,1}^{S log R} → {0,1}^R s.t. • for every S-space, R-random-bits randomized algorithm A, • and for every input x, • A(x,U_R) has almost the same distribution as A(x,G(U_{S log R}))

  25. Space Analysis • Suppose the input stream is guaranteed to come in the following order: • first all pairs of the form (1,*), • then all pairs of the form (2,*), • … • finally all pairs of the form (n,*) • Then we can generate the values h_t(1),…,h_t(n) on the fly, with no need to store them • O(log m) bits suffice to store the current hash value • Therefore, for such input streams, Indyk’s algorithm uses: • O(log m) bits of space • O(n log m) random bits
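A sketch of this ordered-stream version (the seeded numpy generator stands in for a sequential stream of random bits; Nisan's generator from the previous slide is what makes this rigorous, and the constant in k is illustrative as before):

import numpy as np

def indyk_l1_ordered(pairs, eps, delta, seed=0):
    rng = np.random.default_rng(seed)
    k = int(8 / eps ** 2 * np.log(1 / delta))
    A = np.zeros(k)
    current_i, h = None, None
    for i, a in pairs:                  # pairs are assumed grouped by index i
        if i != current_i:
            h = rng.standard_cauchy(k)  # generate h_1(i),...,h_k(i) on the fly
            current_i = i               # earlier values are discarded, not stored
        A += a * h
    return np.median(np.abs(A))

X_ordered = [(1, 3), (1, -5), (2, 4), (2, 1), (3, -2)]
print(indyk_l1_ordered(X_ordered, eps=0.1, delta=0.05))   # again close to 9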

  26. Space Analysis (cont.) • Conclusion: For “ordered” input streams, Indyk’s algorithm is an O(log m)-space, O(n log m)-random-bits randomized algorithm • Can use Nisan’s generator • h_t can now be generated from only O(log m log n) random bits • Space needed: O(log n log m) bits • Crucial observation: Indyk’s algorithm does not depend on the order of the input stream • Conclusion: If we generate the Cauchy hash functions using Nisan’s generator, then Indyk’s algorithm works even for “unordered” streams.

  27. Wrapping Up • Space used: k = O(1/ε² · log(1/δ)) times: • A_t: O(log m) bits • h_t: O(log n log m) bits (using Nisan’s generator) • Total: O(1/ε² · log(1/δ) · log n · log m) bits

  28. End of Lecture 13
