- 92 Views
- Uploaded on
- Presentation posted in: General

Sublinear Algorihms for Big Data

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

SublinearAlgorihms for Big Data

Lecture 3

GrigoryYaroslavtsev

http://grigory.us

- URL: http://www.sofsem.cz
- 41st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15)
- When and where?
- January 24-29, 2015. Czech Republic, Pec pod Snezkou

- Deadlines:
- August 1st (tomorrow!): Abstract submission
- August 15th: Full papers (proceedings in LNCS)

- I am on the Program Committee ;)

- Count-Min (continued), Count Sketch
- Sampling methods
- -sampling
- -sampling

- Sparse recovery
- -sparse recovery

- Graph sketching

- Stream: elements from universe , e.g.
- = frequency of in the stream = # of occurrences of value

- are -wise independent hash functions
- Maintain counters with values:
elements in the stream with

- For every the value and so:
- If and then:
.

- Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]
- Count-Min is linear:
Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)

- Deterministic version: CR-Precis
- Count-Min vs. Bloom filters
- Allows to approximate values, not just 0/1 (set membership)
- Doesn’t require mutual independence (only 2-wise)

- FAQ and Applications:
- https://sites.google.com/site/countminsketch/home/
- https://sites.google.com/site/countminsketch/home/faq

- Stream: updates that define vector where .
- Example: For
- Count Sketch: Count-Min with random signs and median instead of min:

- In addition to use random signs
- Estimate:
- Parameters:

- Stream: updates that define vector where .
- -Sampling: Return random and :

- Each of people in a social network is friends with some arbitrary set of other people
- Each person knows only about their friends
- With no communication in the network, each person sends a postcard to Mark Z.
- If Mark wants to know if the graph is connected, how long should the postcards be?

- Yesterday: -approximate
- space for
- space for

- New algorithm: Let be an -sample.
Return , where is an estimate of

- Expectation:

- New algorithm: Let be an -sample.
Return , where is an estimate of

- Variance:
- Exercise: Show that
- Overall:
- Apply average + median to copies

- Assume Weight by where
where

- For some value return if there is a unique such that
- Probability is returned if is large enough:
- Probability some value is returned repeat times.

- Use Count-Sketch with parameters to sketch
- To estimate
and

- Lemma: With high probability if
- Corollary: With high probability if and
- Exercise: for large

- Let
- By the analysis of Count Sketch and by Markov:
- If then
- If , then
where the last inequality holds with probability

- Take median over repetitions high probability

- Let if and otherwise
- If there is a unique with then return
- Note that if then and so
,

therefore

- Lemma: With probability there is a unique such that . If so then
- Thm: Repeat times. Space:

- Let We can upper-bound
Similarly,

- Using independence of , probability of unique with :

- Let We can upper-bound
Similarly,

- We just showed:
- If there is a unique i, probability is:

- Maintain and -approximation to
- Hash items using for
- For each , maintain:
- Lemma: At level there is a unique element in the streams that maps to (with constant probability)
- Uniqueness is verified if . If so, then output as the index and as the count.

- Let and note that
- For any
- Probability there exists a unique such that ,
- Holds even if are only 2-wise independent

- Goal: Find such that is minimized among s with at most non-zero entries.
- Definition:
- Exercise: where are indices of largest
- Using space we can find such that and

- Use Count-Min with
- For let for some row
- Let be the indices with max. frequencies. Let be the event there doesn’t exist with
- Then for
- Because w.h.p. all ’s approx . up to

- Use Count-Min with
- Let be frequency estimates:
- Let be with all but the k-th largest entries replaced by 0.
- Lemma:

- Let be indices corresponding to largest value of and .
- For a vector and denote as the vector formed by zeroing out all entries of except for those in .
+ 2

+ 2

- Questions?