Sublinear Algorithms for Big Data
Sublinear Algorithms for Big Data. Lecture 3. Grigory Yaroslavtsev, SOFSEM 2015.



Sublinear Algorithms for Big Data

Lecture 3


SOFSEM 2015


  • URL:

  • 41st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15)

  • When and where?

    • January 24–29, 2015, Pec pod Sněžkou, Czech Republic

  • Deadlines:

    • August 1st (tomorrow!): Abstract submission

    • August 15th: Full papers (proceedings in LNCS)

  • I am on the Program Committee ;)



  • Count-Min (continued), Count Sketch

  • Sampling methods

    • ℓ₂-sampling

    • ℓ₀-sampling

  • Sparse recovery

    • k-sparse recovery

  • Graph sketching



  • Stream: m elements from universe [n] = {1, 2, …, n}, e.g. ⟨x₁, x₂, …, x_m⟩

  • f_i = frequency of i in the stream = # of occurrences of value i

Count-Min


  • h_1, …, h_r : [n] → [k] are 2-wise independent hash functions

  • Maintain r · k counters with values:

    c_{s,j} = # of elements i in the stream with h_s(i) = j  (i.e. c_{s,j} = Σ_{i : h_s(i) = j} f_i)

  • For every i the value c_{s, h_s(i)} ≥ f_i, and so: f̃_i = min_s c_{s, h_s(i)} ≥ f_i

  • If k = 2/ε and r = log(1/δ), then with probability ≥ 1 − δ: f̃_i ≤ f_i + ε · m


More about Count-Min

  • Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]

  • Count-Min is linear:

    Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)

  • Deterministic version: CR-Precis

  • Count-Min vs. Bloom filters

    • Approximates frequencies, not just 0/1 set membership

    • Doesn’t require mutual independence (only 2-wise)

  • FAQ and Applications:
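The linearity claim is easy to check directly: each counter is a fixed linear function of the frequency vector, so sketching the concatenation of two streams gives the same counters as summing the two sketches entry-wise. A one-row demonstration (hash constants are arbitrary):

```python
# One row of a Count-Min sketch as a pure function of the stream.
K, A, B, P = 16, 7, 3, 2**31 - 1
h = lambda i: ((A * i + B) % P) % K

def sketch_row(stream):
    row = [0] * K
    for i in stream:
        row[h(i)] += 1
    return row

s1, s2 = [1, 2, 2, 5], [2, 5, 9]
merged = [x + y for x, y in zip(sketch_row(s1), sketch_row(s2))]
# Count-Min(S1) + Count-Min(S2) = Count-Min(S1 + S2)
assert merged == sketch_row(s1 + s2)
```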




Fully Dynamic Streams

  • Stream: updates (i, Δ) that define a vector f, where f_i = Σ of the Δ's of all updates to coordinate i

  • Example: updates may both increment and decrement, e.g. (i, +1) followed by (i, −1) leaves f_i = 0

  • Count Sketch: Count-Min with random signs and median instead of min


Count Sketch

  • In addition to h_1, …, h_r, use random signs σ_1, …, σ_r : [n] → {−1, +1}:

    c_{s,j} = Σ_{i : h_s(i) = j} σ_s(i) · f_i

  • Estimate: f̃_i = median_s ( σ_s(i) · c_{s, h_s(i)} )

  • Parameters: k = O(1/ε²), r = O(log n) give f̃_i = f_i ± ε‖f‖₂ with high probability
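A matching sketch of Count Sketch, reusing the Count-Min layout but with random signs and a median estimate; the sign hash below is a simple seeded family chosen for illustration:

```python
import random, statistics

class CountSketch:
    """Count Sketch: like Count-Min, but counters carry random signs and
    the estimate is a median rather than a min."""
    P = 2**31 - 1

    def __init__(self, k, r, seed=1):
        rng = random.Random(seed)
        self.k, self.r = k, r
        self.hp = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(r)]
        self.sp = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(r)]
        self.c = [[0] * k for _ in range(r)]

    def _h(self, s, i):
        a, b = self.hp[s]
        return ((a * i + b) % self.P) % self.k

    def _sigma(self, s, i):
        a, b = self.sp[s]
        return 1 if ((a * i + b) % self.P) % 2 == 0 else -1

    def add(self, i, delta=1):
        for s in range(self.r):
            self.c[s][self._h(s, i)] += self._sigma(s, i) * delta

    def estimate(self, i):
        # colliding items cancel in expectation thanks to the random signs
        return statistics.median(self._sigma(s, i) * self.c[s][self._h(s, i)]
                                 for s in range(self.r))

cs = CountSketch(k=128, r=7)
for _ in range(100):
    cs.add(0)                    # one heavy item
for i in range(1, 51):
    cs.add(i)                    # light tail noise
assert abs(cs.estimate(0) - 100) <= 25
```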



  • Stream: updates (i, Δ) that define a vector f, where f_i = Σ of the Δ's of all updates to coordinate i

  • ℓ_p-Sampling: Return a random i and f̃_i = (1 ± ε) · f_i, where: Pr[i] = (1 ± ε) · |f_i|^p / ‖f‖_p^p
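For reference, the target distribution is easy to sample offline when f is known exactly; a streaming sketch can only approximate it. A quick empirical check with p = 2 (the vector here is illustrative):

```python
import random

def lp_sample(f, p, rng):
    """Draw index i with probability |f_i|^p / ||f||_p^p (offline, exact f)."""
    weights = [abs(v) ** p for v in f]
    x = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        x -= w
        if x < 0:
            return i
    return len(f) - 1

rng = random.Random(42)
f = [3, 0, 4]                  # ||f||_2^2 = 25, so Pr[0] = 9/25, Pr[2] = 16/25
counts = [0, 0, 0]
for _ in range(10000):
    counts[lp_sample(f, 2, rng)] += 1
assert counts[1] == 0                              # zero coordinates never sampled
assert abs(counts[2] / 10000 - 16 / 25) < 0.03     # empirical freq near 0.64
```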


Application: Social Networks

  • Each of the n people in a social network is friends with some arbitrary set of the other people

  • Each person knows only about their friends

  • With no communication in the network, each person sends a postcard to Mark Z.

  • If Mark wants to know if the graph is connected, how long should the postcards be?


Optimal F_k Estimation

  • Yesterday: (1 ± ε)-approximate F_k = Σ_i f_i^k

    • Õ(n^{1−1/k}) space via sampling

    • Ω(n^{1−2/k}) space is necessary for k > 2

  • New algorithm: Let i be an ℓ₂-sample. Return T = F̃₂ · f_i^{k−2}, where F̃₂ is an estimate of F₂ = Σ_i f_i²

  • Expectation: E[T] = Σ_i (f_i² / F₂) · F₂ · f_i^{k−2} = F_k


Optimal F_k Estimation

  • New algorithm: Let i be an ℓ₂-sample. Return T = F̃₂ · f_i^{k−2}, where F̃₂ is an estimate of F₂

  • Variance: Var[T] ≤ E[T²] = Σ_i (f_i² / F₂) · F₂² · f_i^{2k−4} = F₂ · F_{2k−2}

  • Exercise: Show that F₂ · F_{2k−2} ≤ n^{1−2/k} · F_k²

  • Overall: Õ(n^{1−2/k} / ε²) space

    • Apply average + median to O(n^{1−2/k} / ε²) copies
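The expectation calculation above can be sanity-checked by Monte Carlo, using an idealized ℓ₂-sampler with exact probabilities f_i²/F₂ (a streaming sampler only approximates these; the vector and trial count are illustrative):

```python
import random

rng = random.Random(7)
f = [5, 3, 3, 2, 1, 1, 1]
k = 3
F2 = sum(v * v for v in f)           # = 50
Fk = sum(v ** k for v in f)          # = 190

def l2_sample_value():
    """Return the value f_i of an index drawn with Pr[i] = f_i^2 / F2."""
    x = rng.random() * F2
    for v in f:
        x -= v * v
        if x < 0:
            return v
    return f[-1]

# T = F2 * f_i^{k-2} is unbiased: E[T] = sum_i (f_i^2/F2) * F2 * f_i^{k-2} = F_k
n_trials = 20000
avg = sum(F2 * l2_sample_value() ** (k - 2) for _ in range(n_trials)) / n_trials
assert abs(avg - Fk) / Fk < 0.05
```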


ℓ₂-Sampling: Basic Overview

  • Assume ‖f‖₂² = 1. Weight each f_i by 1/√u_i, where u_i ~ Uniform[0, 1]: g_i = f_i / √u_i

  • For some threshold t, return (i, f_i) if there is a unique i such that g_i² ≥ t

  • Probability i is returned if t is large enough: Pr[f_i² / u_i ≥ t] = f_i² / t

  • Probability some value is returned ≈ ‖f‖₂² / t = 1/t ⇒ repeat O(t · log(1/δ)) times
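One round of this recipe, simulated offline with fresh u_i's per round (i.e. independent repetitions of the sampler); the threshold t is an illustrative choice, and successes should split roughly in proportion to f_i²:

```python
import random

rng = random.Random(0)
f = [3, 0, 4, 1]
t = 10 * sum(v * v for v in f)   # large t => rare success, low bias

def one_round():
    """Return i if exactly one g_i^2 = f_i^2 / u_i clears threshold t, else None."""
    winners = [i for i, v in enumerate(f)
               if v != 0 and v * v / max(rng.random(), 1e-12) >= t]
    return winners[0] if len(winners) == 1 else None

counts = {}
for _ in range(200000):
    i = one_round()
    if i is not None:
        counts[i] = counts.get(i, 0) + 1

assert 1 not in counts                                    # zero entries never win
assert counts.get(2, 0) > counts.get(0, 0) > counts.get(3, 0)   # ordered by f_i^2
```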


ℓ₂-Sampling: Part 1

  • Use Count-Sketch with k = O(1/ε) buckets per row and r = O(log n) rows to sketch g = (f_1/√u_1, …, f_n/√u_n)

  • To estimate g_i: take the Count-Sketch estimate g̃_i


  • Lemma: With high probability, g̃_i² = g_i² ± εt if ‖g‖₂² = O(t)

  • Corollary: With high probability, (i, f_i) is returned correctly if g_i² ≥ t and ‖g‖₂² = O(t)

  • Exercise: ‖g‖₂² = O(log n), excluding its single largest coordinate, with constant probability for large n


Proof of Lemma

  • Let g̃_i be the Count-Sketch estimate of g_i

  • By the analysis of Count Sketch and by Markov: |g̃_i − g_i|² = O(ε · ‖g‖₂²) with constant probability

  • If g_i² ≥ t, then g̃_i² = (1 ± O(ε)) · g_i²

  • If g_i² < t, then g̃_i² < (1 + O(ε)) · t,

    where the last inequality holds with constant probability

  • Take the median over O(log n) repetitions ⇒ high probability


ℓ₂-Sampling: Part 2

  • Let z_i = g̃_i² if g̃_i² ≥ t, and z_i = 0 otherwise

  • If there is a unique i with z_i > 0, then return (i, u_i · g̃_i²) (≈ (i, f_i²))

  • Note that if g_i² = f_i² / u_i ≥ t, then u_i ≤ f_i² / t, and so: Pr[g_i² ≥ t] = f_i² / t



  • Lemma: With probability Ω(1/t) there is a unique i such that g_i² ≥ t. If so, then Pr[i] = (1 ± ε) · f_i² / ‖f‖₂²

  • Thm: Repeat O(t · log(1/δ)) times. Space: poly(1/ε, log n)


Proof of Lemma

  • Let p_i = Pr[g_i² ≥ t] = f_i² / t. We can upper-bound: Σ_i p_i ≤ ‖f‖₂² / t = 1/t


  • Using independence of the u_i's, probability of a unique i with g_i² ≥ t:

    Σ_i p_i · Π_{j ≠ i} (1 − p_j) ≥ (1/t) · (1 − 1/t) = Ω(1/t)

Proof of Lemma

  • Let p_i = Pr[g_i² ≥ t] = f_i² / t as before; recall Σ_i p_i ≤ 1/t


  • We just showed: Pr[there is a unique i with g_i² ≥ t] = Ω(1/t)

  • If there is a unique such index, the probability that it equals a fixed i is:

    p_i · Π_{j ≠ i} (1 − p_j) / Pr[unique] = (1 ± O(1/t)) · f_i² / ‖f‖₂²



ℓ₀-Sampling

  • Maintain F̃₀, a (1 ± ε)-approximation to F₀ = #{i : f_i ≠ 0}

  • Hash items using h_j : [n] → [2^j] for j = 1, …, log n

  • For each j, maintain:  C_j = Σ_{i : h_j(i) = 1} f_i  and  S_j = Σ_{i : h_j(i) = 1} i · f_i

  • Lemma: At level j ≈ log F̃₀ there is a unique element in the stream that maps to 1 (with constant probability)

  • Uniqueness is verified if S_j / C_j is an integer in [n]. If so, then output S_j / C_j as the index and C_j as the count.
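A simplified simulation of the level scheme above, with seeded hashing standing in for the h_j's. The fingerprint check that certifies uniqueness in the full algorithm is omitted, so the S_j/C_j integrality test here is only a heuristic stand-in:

```python
import random

P = 2**61 - 1
rng = random.Random(3)
LEVELS = 20
hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(LEVELS)]

def kept(j, i):
    """Index i survives level j with probability ~2^-j."""
    a, b = hashes[j]
    return ((a * i + b) % P) % (2 ** j) == 0

C = [0] * LEVELS   # sum of f_i over surviving i
S = [0] * LEVELS   # sum of i * f_i over surviving i

def update(i, delta):
    for j in range(LEVELS):
        if kept(j, i):
            C[j] += delta
            S[j] += delta * i

# fully dynamic stream: insertions and deletions; the support ends up {17, 99}
for i, d in [(5, 2), (17, 1), (99, 3), (5, -2)]:
    update(i, d)

recovered = set()
for j in range(LEVELS):
    # if exactly one nonzero index survived at level j, S_j / C_j recovers it
    if C[j] != 0 and S[j] % C[j] == 0 and 0 < S[j] // C[j]:
        recovered.add(S[j] // C[j])

assert recovered <= {17, 99}   # deleted item 5 can never be reported
```

At level 0 everything survives, so C[0] = 4 and S[0] = 314 regardless of the hashes; with constant probability per level, some level isolates a single survivor.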


Proof of Lemma

  • Let D = F₀ and note that only the D indices with f_i ≠ 0 matter

  • For any such i: Pr[h_j(i) = 1] = 2^{−j}

  • Probability there exists a unique i such that h_j(i) = 1: D · 2^{−j} · (1 − 2^{−j})^{D−1}, which is Ω(1) when 2^j = Θ(D)

  • Holds even if the h_j's are only 2-wise independent: by inclusion–exclusion the probability is ≥ D · 2^{−j} − D² · 2^{−2j}


Sparse Recovery

  • Goal: Find f̂ such that ‖f − f̂‖₁ is minimized among all f̂ with at most k non-zero entries.

  • Definition: Err_k(f) = min_{k-sparse f̂} ‖f − f̂‖₁

  • Exercise: Err_k(f) = ‖f − f_T‖₁, where T are the indices of the k largest |f_i|

  • Using O((k/ε) · log n) space we can find a k-sparse f̂ such that: ‖f − f̂‖₁ ≤ (1 + 2ε) · Err_k(f)


Count-Min Revisited

  • Use Count-Min with w = 2k/ε counters per row and r = O(log n) rows

  • For every i, let f̃_i = c_{s, h_s(i)} for some row s

  • Let T be the k indices with max. frequencies. Let E be the event that there doesn't exist j ∈ T, j ≠ i, with h_s(j) = h_s(i)

  • Then Pr[E] ≥ 1 − k/w = 1 − ε/2, and conditioned on E: E[f̃_i − f_i] ≤ ‖f − f_T‖₁ / w = (ε/2k) · Err_k(f)

  • By Markov and the min over r rows: w.h.p. all f̃_i's approximate f_i up to (ε/k) · Err_k(f)


Sparse Recovery Algorithm

  • Use Count-Min with w = 2k/ε counters per row and r = O(log n) rows

  • Let f̃ be the frequency estimates: f_i ≤ f̃_i ≤ f_i + (ε/k) · Err_k(f) w.h.p.

  • Let f̂ be f̃ with all but the k largest entries replaced by 0.

  • Lemma: ‖f − f̂‖₁ ≤ (1 + 2ε) · Err_k(f)

Proof of Lemma

  • Let S be the k indices corresponding to the largest values of f̃, and T the k indices of the largest values of f.

  • For a vector f and a set S, denote by f_S the vector formed by zeroing out all entries of f except for those in S (and S̄ the complement of S).

    ‖f − f̃_S‖₁ ≤ ‖f_{S̄}‖₁ + Σ_{i ∈ S} |f̃_i − f_i| ≤ ‖f_{S̄}‖₁ + ε · Err_k(f)

    ‖f_{S̄}‖₁ ≤ ‖f_{T̄}‖₁ + ε · Err_k(f), so overall ‖f − f̃_S‖₁ ≤ Err_k(f) + 2ε · Err_k(f)
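Putting the pieces together, a toy run of the recovery procedure (Count-Min estimates, keep the k largest); the width, row count, and frequency vector are all illustrative choices:

```python
import random

random.seed(11)
n, k = 1000, 3
width, rows = 100, 5            # width ~ 2k/eps for an illustrative eps
P = 2**31 - 1
hs = [(random.randrange(1, P), random.randrange(P)) for _ in range(rows)]
table = [[0] * width for _ in range(rows)]

f = [0] * n
f[7], f[42], f[555] = 90, 70, 50          # heavy hitters
for i in range(100, 160):
    f[i] = 1                              # light tail noise

# build the Count-Min table
for i, v in enumerate(f):
    if v:
        for s, (a, b) in enumerate(hs):
            table[s][((a * i + b) % P) % width] += v

# point-query every index, then keep the k largest estimates
est = [min(table[s][((a * i + b) % P) % width] for s, (a, b) in enumerate(hs))
       for i in range(n)]
top_k = sorted(range(n), key=lambda i: est[i], reverse=True)[:k]
assert set(top_k) == {7, 42, 555}
```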


Thank you!

  • Questions?
