
Lower bounds on data stream computations



  1. Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld Lower bounds on data stream computations

  2. Previously... • We proved 3 theorems concerning the space complexity of data stream algorithms. • Using the streaming model discussed earlier, we derived lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems. • And now, for something completely different.

  3. Today • In this lecture, I introduce lower bounds from communication complexity. • Trust me, they are correct. • Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good. • I'll prove 3 of them. • Starting with “Theorem 4”.

  4. Theorem 4 • Setting: Sequence of m numbers in {1,...,n}. • Multiple occurrences are allowed. • Claim: Finding the k most frequent items requires Ω(n/k) space. • Moreover, random sampling yields an upper bound of O(n (log m + log n) / k). • We're going to use a black box to prove it.

  5. Theorem 4 black box • Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m in range {1,...,n} takes Ω(n) space. • Proof outline: Reduction. Namely, we create a new stream that we can (ab)use this black box on. • The reduction replaces each number in the sequence with a block of k numbers: • Each i in {1,...,n} is replaced with ki+1,...,ki+k. • The new stream thus ranges over nk values in total.

  6. Reduction example • Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want the 2 most frequent numbers. • The reduction creates the numbers: • {9,10}, {11,12}, {7,8}, {5,6}, {15,16}, {7,8}, {9,10}, {11,12}, {3,4} • The most frequent numbers in the new sequence are exactly the replacements of the most frequent numbers in the original sequence. A small sketch follows.
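A minimal Python sketch of this reduction (my illustration, not from the original lecture; the function name is made up):

    def reduce_stream(stream, k):
        # Expand each element i into the block ki+1, ..., ki+k.
        out = []
        for i in stream:
            out.extend(range(k * i + 1, k * i + k + 1))
        return out

    print(reduce_stream([4, 5, 3, 2, 7, 3, 4, 5, 1], k=2))
    # [9, 10, 11, 12, 7, 8, 5, 6, 15, 16, 7, 8, 9, 10, 11, 12, 3, 4]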

  7. Proof outline • If x_i=x_j, then the blocks created by the reduction coincide. Otherwise, they are disjoint. • If x_i occurs l times in the stream, each of its k replacements occurs l times in the new stream. • So the k most frequent items of the new stream are exactly the k replacements of the most frequent original item: an algorithm for the k most frequent items solves the AMS problem. • Since the new stream ranges over nk values and AMS requires Ω(n) space, finding the k most frequent items over a range of size N takes Ω(N/k) space. • Great success.

  8. As for the upper bound • Reminder: a Monte-Carlo algorithm is a randomized algorithm that may err, but succeeds with high probability. • So we'll exhibit a Monte-Carlo algorithm that succeeds with high probability to get the matching upper bound.

  9. The Monte-Carlo algorithm • Before reading the stream: • Sample each value in {1,...,n} with probability 1/k. • Keep a counter only for the sampled values. • Read the stream normally. • Output the sampled value with the largest count. • With constant probability, one of the k most frequent numbers has been sampled. • In expectation we store n/k counters, each with an O(log n)-bit identifier and an O(log m)-bit count, i.e. O(n (log m + log n) / k) space. Epic win. A sketch follows.
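A minimal Python sketch of this algorithm (my illustration; the names and the tie-breaking are my own choices):

    import random
    from collections import defaultdict

    def sample_most_frequent(stream, n, k, seed=0):
        rng = random.Random(seed)
        # Sample each value in {1,...,n} independently with probability 1/k.
        sampled = {v for v in range(1, n + 1) if rng.random() < 1 / k}
        # Keep counters only for sampled values: ~n/k of them in expectation.
        counts = defaultdict(int)
        for x in stream:
            if x in sampled:
                counts[x] += 1
        # Output the sampled value with the largest count (None if the
        # random sample came up empty).
        return max(counts, key=counts.get) if counts else None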

  10. And now for something completely different • Introducing the approximate median problem (AMP). • Reminder: The median is the value which separates the higher half of the set from the lower half. • We want to approximate that. Why? Because it's cool.

  11. This slide isn't the median problem • First, a black box from communication complexity. • Consider the bit-vector probing problem: • A has a bit string x of length m and B has an index i. B needs to know x_i, the i-th bit. • But the communication is one-way: B cannot send anything to A. • Ideas?

  12. Black box cont. • It turns out there is no better method for A to convey the i-th bit than to send the entire string to B. • So it takes Ω(m) bits of communication. • But what about randomization? • Too bad: any protocol that guesses x_i with probability better than (1+ε)/2 requires at least εm bits of communication.
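The trivial one-way protocol is simply "A sends everything"; the theorem says this is essentially optimal. A toy rendering (my illustration):

    def alice_message(bits):
        # A's single one-way message: the whole string, m bits.
        return bits

    def bob_answer(message, i):
        # B recovers x_i by indexing (i is 1-based, as in the slides).
        return message[i - 1]

    x = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
    assert bob_answer(alice_message(x), 5) == 1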

  13. Approximate median problem • Goal: Find a number whose rank is in the interval [m/2 – εm, m/2 + εm]. • It can be solved by a one-pass Monte-Carlo algorithm with 1/10 error probability. • Takes O(log n (log 1/ε)² / ε) space. • I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.
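Spelled out as a checkable predicate (my sketch; ranks are 1-based and ties are glossed over):

    def is_eps_approx_median(value, data, eps):
        m = len(data)
        # 1-based rank of `value` in sorted order (distinct values assumed).
        rank = sum(1 for x in data if x < value) + 1
        return m / 2 - eps * m <= rank <= m / 2 + eps * m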

  14. AMP cont. • Motivation: We want to prove a corresponding lower bound on this problem. • How: We show that any one-pass Las Vegas algorithm that solves the ε-AMP requires Ω(1/ε) space. • The proof is a reduction from the bit-vector probing problem.

  15. AMP lower bound proof • Let b = (b_1,...,b_n) be A's bit vector, followed by B's query index i. • This is translated to a sequence of numbers as follows: • First, output 2j+b_j for each j in {1,...,n}. • Then, upon receiving the query, output n-i+1 copies of 0 and i+1 copies of 2(n+1).

  16. Reduction example • b = (0,1,0,1,1,0,1,1,0,1), i=5. • The reduction maps: • 2j+b_j: [2,5,6,9,11,12,15,17,18,21] • n-i+1=6 copies of 0: [0,0,0,0,0,0] • i+1=6 copies of 2(n+1)=22: [22,22,22,22,22,22] • The median of this multiset is 11. Its LSB is 1, which is exactly the value of b_5.
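A small sketch tying the construction together (my illustration): build the reduced stream, take the lower median, and read the queried bit off its least significant bit.

    def probe_via_median(b, i):
        n = len(b)
        stream = [2 * j + b[j - 1] for j in range(1, n + 1)]  # 2j + b_j
        stream += [0] * (n - i + 1)               # n-i+1 copies of 0
        stream += [2 * (n + 1)] * (i + 1)         # i+1 copies of 2(n+1)
        median = sorted(stream)[(len(stream) - 1) // 2]  # lower median
        return median & 1                         # LSB of the median = b_i

    b = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
    assert probe_via_median(b, 5) == 1            # b_5 = 1, as on the slide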

  17. AMP proof cont. • It is easily verified that the least significant bit of the median of this sequence is the value of b_i (the bit we seek): the n-i+1 zeros below and the i+1 large values above force the median onto 2i+b_i, which sits at rank n+1. • Choose ε = Θ(1/n), small enough that εm < 1: then the rank window [m/2 – εm, m/2 + εm] contains only the true median rank n+1, so the ε-approximate median is the exact median (the “reduced” stream has m = 2n+2 numbers). • Therefore any one-pass algorithm that uses fewer than 1/(2ε) = Θ(n) bits of memory can be used...

  18. AMP proof cont. • … to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit-vector probing: A runs the algorithm on the 2j+b_j part and sends its memory state to B, who finishes the stream and outputs the median's LSB. • But every protocol that solves bit-vector probing must communicate n bits. • Contradiction. Quod erat demonstrandum.

  19. Corollary • What's the point I've been trying to make? • Randomization can sometimes reduce space complexity significantly, at the cost of guaranteed correctness. • Moving right along.

  20. Some graph theory • A graph can be presented as a stream. • Example: a stream of edges read off its adjacency lists. • This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques. • I'll address a few of them.

  21. Why is this good? • Suppose we can read the stream more than once (we don't have enough memory to store it, but we do have access). • The number of passes we can make over the stream is still limited. • What graph-theoretic problems could we approximate with this method?

  22. Theorem 6 • In P passes, the following problems on an n-node graph take Ω(n / P) space: • Computing connected components. • Computing k-edge connected components. • Computing k-vertex connected components. • Testing graph planarity. • Finding the sinks of a directed graph. • I'll prove graph connectivity.

  23. Connected components • Proof by reduction from set disjointness to the graph connectivity problem. Reminder: DISJOINT(x,y) asks whether there is an index i with x_i = y_i = 1; any protocol for it must communicate Ω(n) bits. • Given A's bit vector x and B's bit vector y, construct a graph with vertices {a,b,1,...,n}. • Insert an edge (a,i) iff x_i = 1 and an edge (i,b) iff y_i = 1. • Restricting to inputs in which every index is set in at least one vector (so no vertex is isolated), the graph is connected iff some bit is set in both vectors. A sketch follows.
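A sketch of the reduction (my illustration), with the no-isolated-vertex caveat baked into the test case:

    def graph_is_connected(x, y):
        # Vertices {a, b, 1, ..., n}; edge (a,i) iff x_i = 1, edge (i,b) iff y_i = 1.
        n = len(x)
        adj = {v: set() for v in ['a', 'b', *range(1, n + 1)]}
        for i in range(1, n + 1):
            if x[i - 1]:
                adj['a'].add(i); adj[i].add('a')
            if y[i - 1]:
                adj['b'].add(i); adj[i].add('b')
        seen, todo = {'a'}, ['a']                 # DFS from vertex a
        while todo:
            for u in adj[todo.pop()]:
                if u not in seen:
                    seen.add(u); todo.append(u)
        return len(seen) == n + 2

    # Every index below is set in at least one vector, so connectivity
    # coincides with "some bit is set in both vectors":
    x, y = [1, 0, 1, 1], [0, 1, 1, 0]
    assert graph_is_connected(x, y) == any(a and b for a, b in zip(x, y))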

  24. Connectivity cont. • From communication complexity, we know that every DISJOINT-solving protocol sends Ω(n) bits. • A P-pass algorithm with space s yields a protocol in which the memory state crosses between A and B O(P) times, i.e. O(Ps) bits of communication, so s = Ω(n / P). This is a total cheating hack by the way. Blame HRR. • QED anyway. • That's all folks!
