1 / 25

# Sublinear Algorihms for Big Data - PowerPoint PPT Presentation

Sublinear Algorihms for Big Data. Lecture 3. Grigory Yaroslavtsev http://grigory.us. SOFSEM 2015. URL: http://www.sofsem.cz 41 st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15) When and where?

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Sublinear Algorihms for Big Data

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## SublinearAlgorihms for Big Data

Lecture 3

GrigoryYaroslavtsev

http://grigory.us

### SOFSEM 2015

• URL: http://www.sofsem.cz

• 41st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15)

• When and where?

• January 24-29, 2015. Czech Republic, Pec pod Snezkou

• August 1st (tomorrow!): Abstract submission

• August 15th: Full papers (proceedings in LNCS)

• I am on the Program Committee ;)

### Today

• Count-Min (continued), Count Sketch

• Sampling methods

• -sampling

• -sampling

• Sparse recovery

• -sparse recovery

• Graph sketching

### Recap

• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

### Count-Min

• are -wise independent hash functions

• Maintain counters with values:

elements in the stream with

• For every the value and so:

• If and then:

.

• Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]

• Count-Min is linear:

Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)

• Deterministic version: CR-Precis

• Count-Min vs. Bloom filters

• Allows to approximate values, not just 0/1 (set membership)

• Doesn’t require mutual independence (only 2-wise)

• FAQ and Applications:

### Fully Dynamic Streams

• Stream: updates that define vector where .

• Example: For

• Count Sketch: Count-Min with random signs and median instead of min:

### Count Sketch

• In addition to use random signs

• Estimate:

• Parameters:

### -Sampling

• Stream: updates that define vector where .

• -Sampling: Return random and :

### Application: Social Networks

• Each of people in a social network is friends with some arbitrary set of other people

• Each person knows only about their friends

• With no communication in the network, each person sends a postcard to Mark Z.

• If Mark wants to know if the graph is connected, how long should the postcards be?

### Optimal estimation

• Yesterday: -approximate

• space for

• space for

• New algorithm: Let be an -sample.

Return , where is an estimate of

• Expectation:

### Optimal estimation

• New algorithm: Let be an -sample.

Return , where is an estimate of

• Variance:

• Exercise: Show that

• Overall:

• Apply average + median to copies

### -Sampling: Basic Overview

• Assume Weight by where

where

• For some value return if there is a unique such that

• Probability is returned if is large enough:

• Probability some value is returned repeat times.

### -Sampling: Part 1

• Use Count-Sketch with parameters to sketch

• To estimate

and

• Lemma: With high probability if

• Corollary: With high probability if and

• Exercise: for large

### Proof of Lemma

• Let

• By the analysis of Count Sketch and by Markov:

• If then

• If , then

where the last inequality holds with probability

• Take median over repetitions high probability

### -Sampling: Part 2

• Let if and otherwise

• If there is a unique with then return

• Note that if then and so

,

therefore

• Lemma: With probability there is a unique such that . If so then

• Thm: Repeat times. Space:

### Proof of Lemma

• Let We can upper-bound

Similarly,

• Using independence of , probability of unique with :

### Proof of Lemma

• Let We can upper-bound

Similarly,

• We just showed:

• If there is a unique i, probability is:

### -sampling

• Maintain and -approximation to

• Hash items using for

• For each , maintain:

• Lemma: At level there is a unique element in the streams that maps to (with constant probability)

• Uniqueness is verified if . If so, then output as the index and as the count.

### Proof of Lemma

• Let and note that

• For any

• Probability there exists a unique such that ,

• Holds even if are only 2-wise independent

### Sparse Recovery

• Goal: Find such that is minimized among s with at most non-zero entries.

• Definition:

• Exercise: where are indices of largest

• Using space we can find such that and

### Count-Min Revisited

• Use Count-Min with

• For let for some row

• Let be the indices with max. frequencies. Let be the event there doesn’t exist with

• Then for

• Because w.h.p. all ’s approx . up to

### Sparse Recovery Algorithm

• Use Count-Min with

• Let be frequency estimates:

• Let be with all but the k-th largest entries replaced by 0.

• Lemma:

• Let be indices corresponding to largest value of and .

• For a vector and denote as the vector formed by zeroing out all entries of except for those in .

+ 2

+ 2

• Questions?