Space-Efficient Online Computation of Quantile Summaries


Space-Efficient Online Computation of Quantile Summaries. SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery. Outline. Introduction The summary data structure Operation and algorithm Tree representation Analysis and experimental result Conclusion. Introduction.

Presentation Transcript

Space-Efficient Online Computation of Quantile Summaries

SIGMOD 01

Michael Greenwald & Sanjeev Khanna

Presented by ellery

Outline
  • Introduction
  • The summary data structure
  • Operation and algorithm
  • Tree representation
  • Analysis and experimental result
  • Conclusion
Introduction
  • Space-efficient computation of quantile summaries of very large data sets in a single pass.
  • Quantile queries: given a quantile ψ, return the value whose rank is ⌈ψN⌉.
Example (N = 16)

After sorting: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12

The 0.5 quantile returns the element ranked 8 (0.5 × 16), which is 8.

The 0.75 quantile returns the element ranked 12 (0.75 × 16), which is 10.
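
The exact (offline) version of this query can be sketched by sorting, which is the baseline the online summary tries to approximate. The function name `exact_quantile` and the shuffled input list are illustrative assumptions; only the sorted sequence comes from the slide.

```python
import math

def exact_quantile(values, psi):
    """Return the element whose 1-based rank is ceil(psi * N) after sorting."""
    ordered = sorted(values)
    rank = max(1, math.ceil(psi * len(ordered)))
    return ordered[rank - 1]

# The 16 values from the slide, in an arbitrary (assumed) arrival order.
data = [10, 3, 11, 1, 8, 12, 5, 10, 2, 11, 7, 4, 10, 9, 6, 11]

print(exact_quantile(data, 0.5))   # element ranked 8  -> 8
print(exact_quantile(data, 0.75))  # element ranked 12 -> 10
```

Sorting needs all N values in memory, which is exactly what a single-pass summary avoids.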

Requirements
  • Explicit & tunable a priori guarantees on the precision of the approximation
  • As small a memory footprint as possible
  • Online: single pass over the data
  • Data Independent Performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
  • Data Independent Setup: no a priori knowledge required about data set (size, range, distribution, order).
ε-approximate
  • A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN].

Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40, 60].
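
As a rough illustration of the definition, the check below tests whether a returned value is an acceptable ε-approximate answer. The helper names are hypothetical, and treating a duplicated value as occupying a range of ranks is an assumption of this sketch, not something the slides specify.

```python
import bisect
import math

def rank_window(n, psi, eps):
    """Allowed rank interval [r - eps*n, r + eps*n] for the psi-quantile."""
    r = math.ceil(psi * n)
    return r - eps * n, r + eps * n

def is_eps_approximate_answer(sorted_values, answer, psi, eps):
    """True if some rank occupied by `answer` lies inside the allowed window."""
    lo, hi = rank_window(len(sorted_values), psi, eps)
    first = bisect.bisect_left(sorted_values, answer) + 1   # lowest 1-based rank
    last = bisect.bisect_right(sorted_values, answer)       # highest 1-based rank
    return first <= last and first <= hi and last >= lo

stream = list(range(1, 101))  # 100 distinct values, so rank(v) == v
print(rank_window(100, 0.5, 0.1))                        # (40.0, 60.0)
print(is_eps_approximate_answer(stream, 40, 0.5, 0.1))   # True
print(is_eps_approximate_answer(stream, 61, 0.5, 0.1))   # False
```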

The Summary Data Structure
  • Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v.
  • The summary S is a sorted list of tuples ti = (vi, gi, Δi), where gi = rmin(vi) − rmin(vi−1) and Δi = rmax(vi) − rmin(vi).
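Example

ε = .01, N = 1750

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {28, 7}, [529, 536]
  • 204: {10, 1}, [539, 540]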
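
The rank ranges in this example follow from the tuples by a running sum, assuming the Greenwald-Khanna definitions gi = rmin(vi) − rmin(vi−1) and Δi = rmax(vi) − rmin(vi). The sketch below pairs each value with the tuple that makes the arithmetic consistent; the function name and the `rmin_before` offset (the rmin of the tuple just left of this fragment, inferred as 501 − 15 = 486) are assumptions for illustration.

```python
def rank_bounds(tuples, rmin_before=0):
    """Turn (v, g, delta) tuples into (v, rmin, rmax) using a running sum of g."""
    bounds = []
    rmin = rmin_before
    for v, g, delta in tuples:
        rmin += g                      # rmin(v_i) = rmin(v_{i-1}) + g_i
        bounds.append((v, rmin, rmin + delta))  # rmax(v_i) = rmin(v_i) + delta_i
    return bounds

fragment = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(rank_bounds(fragment, rmin_before=486))
# [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
```

Storing g and Δ instead of absolute ranks means an insertion only touches one tuple, since all ranks to the right shift implicitly.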

Query
  • Sketch S is ε-approximate; that is, for each ψ ∈ (0,1], with r = ⌈ψn⌉, there is a (vi, rmin(vi), rmax(vi)) in S such that r − rmin(vi) ≤ εn and rmax(vi) − r ≤ εn.
  • vi is our answer for the ψ-quantile.
Corollary
  • If at any time n, the summary S(n) satisfies the property that maxi (gi + Δi) ≤ 2εn, then we can answer any ψ-quantile query to within an εn precision.
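
A query over such a summary can be sketched as a linear scan for the first entry whose rank bounds both lie within εn of the target rank r = ⌈ψn⌉. The function name and the list-of-triples representation are assumptions of this sketch; the three entries are the fragment used in the slides' running example (ε = .01, N = 1800).

```python
import math

def query(summary, psi, n, eps):
    """summary: list of (v, rmin, rmax) sorted by v.

    Returns the first v with r - rmin <= eps*n and rmax - r <= eps*n,
    or None if no entry in this (possibly partial) summary qualifies.
    """
    r = math.ceil(psi * n)
    tol = eps * n
    for v, rmin, rmax in summary:
        if r - rmin <= tol and rmax - r <= tol:
            return v
    return None

fragment = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
print(query(fragment, 0.29, 1800, 0.01))  # 201
```

Here 192 is rejected because its rmin of 501 is more than εN = 18 below the target rank, while 201 has both bounds within 18 of it.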

Overview of Summary Data Structure

ε = .01, N = 1800

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {28, 7}, [529, 536]
  • 204: {10, 1}, [539, 540]

  • Quantile ψ = .29? Compute r = ψN = 522 and choose the best vi.

Overview of Summary Data Structure

ε = .01, N = 1800, 2εN = 36

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {28, 7}, [529, 536]
  • 204: {10, 1}, [539, 540]

  • If (rmax(vi+1) − rmin(vi)) ≤ 2εN for every i, then S is an ε-approximate summary.
  • Our goal: always maintain this property.
  • Tuple formulation of this rule: gi + Δi ≤ 2εN

Overview of Summary Data Structure

ε = .01, N = 1800, 2εN = 36

New observation arriving: 197

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {28, 7}, [529, 536]
  • 204: {10, 1}, [539, 540]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary

Overview of Summary Data Structure

ε = .01, N = 1800, 2εN = 36

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 197 (new, tuple not yet assigned): [502, 536]
  • 201: {28, 7}, [529, 536]
  • 204: {10, 1}, [539, 540]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary

Overview of Summary Data Structure

ε = .01, N = 1801, 2εN = 36.02

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 197: {1, 34}, [502, 536]
  • 201: {28, 7}, [530, 537]
  • 204: {10, 1}, [540, 541]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary
    • Insert tuple before the ith tuple: gnew = 1; Δnew = gi + Δi − 1

Overview of Summary Data Structure

ε = .01, N = 1801, 2εN = 36.02

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 197: {1, 34}, [502, 536]
  • 201: {28, 7}, [530, 537]
  • 204: {10, 1}, [540, 541]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary
  • Delete all “superfluous” entries.

Overview of Summary Data Structure

ε = .01, N = 1801, 2εN = 36.02

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 197: {1, 34}, superfluous entry being deleted
  • 201: {28, 7}, [530, 537]
  • 204: {10, 1}, [540, 541]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary
  • Delete all “superfluous” entries.

Overview of Summary Data Structure

ε = .01, N = 1801, 2εN = 36.02

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {29, 7}, [530, 537]
  • 204: {10, 1}, [540, 541]

  • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN
  • Insert new observations into summary
  • Delete all “superfluous” entries: gi = gi + gi−1

Overview of Summary Data Structure

ε = .01, N = 1801, 2εN = 36.02

Each value vi is shown with its tuple {gi, Δi} and its rank range [rmin(vi), rmax(vi)]:

  • 192: {15, 2}, [501, 503]
  • 201: {29, 7}, [530, 537]
  • 204: {10, 1}, [540, 541]

  • Insert: gnew = 1; Δnew = gi + Δi − 1
  • Delete: gi = gi + gi−1
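
The insert and delete rules can be replayed on the running example as a small sketch. The function names and the list-of-tuples layout are assumptions; giving a new minimum or maximum Δ = 0 follows the usual Greenwald-Khanna convention and is not shown on the slides.

```python
import bisect

def insert(summary, v):
    """Insert v before the first tuple with a larger value.

    summary is a list of (value, g, delta) sorted by value.
    """
    i = bisect.bisect_right([t[0] for t in summary], v)
    if i == 0 or i == len(summary):
        new = (v, 1, 0)                  # assumed convention for a new min/max
    else:
        _, g_i, d_i = summary[i]
        new = (v, 1, g_i + d_i - 1)      # g_new = 1, delta_new = g_i + delta_i - 1
    summary.insert(i, new)

def delete(summary, i):
    """Delete tuple i by folding its g into the right-hand neighbour."""
    _, g_i, _ = summary[i]
    v_next, g_next, d_next = summary[i + 1]
    summary[i:i + 2] = [(v_next, g_i + g_next, d_next)]

s = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
insert(s, 197)
print(s)  # [(192, 15, 2), (197, 1, 34), (201, 28, 7), (204, 10, 1)]
delete(s, 1)
print(s)  # [(192, 15, 2), (201, 29, 7), (204, 10, 1)]
```

The merged tuple keeps the right neighbour's value and Δ, so the stored rank bounds of every surviving value remain valid.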

Terminology
  • Full tuple: a tuple is full if gi + Δi = ⌊2εN⌋.
  • Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one.
  • Capacity: the number of observations that can still be counted by gi before the tuple becomes full (= ⌊2εN⌋ − Δi).

The general strategy is to delete tuples with small capacity and preserve tuples with large capacity.
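
These three definitions can be written as small predicates. The sketch below assumes the threshold is ⌊2εN⌋ and uses hypothetical function names; the numbers are taken from the running example at N = 1801.

```python
import math

def threshold(eps, n):
    """Largest allowed g + delta, assumed here to be floor(2 * eps * n)."""
    return math.floor(2 * eps * n)

def capacity(delta, eps, n):
    """Capacity of a tuple: threshold minus its delta."""
    return threshold(eps, n) - delta

def is_full(g, delta, eps, n):
    """A tuple is full when g + delta has reached the threshold."""
    return g + delta == threshold(eps, n)

def is_full_pair(left, right, eps, n):
    """A pair is full if merging left into right would overfill right."""
    _, g_left, _ = left
    _, g_right, d_right = right
    return g_left + g_right + d_right > threshold(eps, n)

eps, n = 0.01, 1801
print(threshold(eps, n))                                   # 36
print(capacity(7, eps, n), capacity(34, eps, n))           # 29 2
print(is_full_pair((197, 1, 34), (201, 28, 7), eps, n))    # False: 36 does not exceed 36
print(is_full(29, 7, eps, n))                              # True: the merged 201 tuple is full
```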

Operations
  • Insert(v): find the smallest i such that vi−1 ≤ v < vi, and insert the tuple (v, 1, gi + Δi − 1) between ti−1 and ti.
  • Delete(vi): to delete ti from S, replace ti and ti+1 by the new tuple (vi+1, gi + gi+1, Δi+1).
  • Compress(): from right to left, merge all mergeable pairs.
GK Algorithm

To add the (n+1)st observation, v, to summary S(n):

  • If n ≡ 0 mod 1/(2ε): yes → COMPRESS(), then INSERT(v); no → INSERT(v).
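
The overall loop can be sketched as below. This is a simplified variant under stated assumptions, not the paper's full algorithm: COMPRESS here merges any adjacent pair that stays within ⌊2εn⌋ and ignores the band and subtree conditions, new tuples use Δ = gi + Δi − 1 (0 for a new minimum or maximum), and the class and method names are hypothetical. The ε-approximation guarantee still follows from keeping gi + Δi ≤ 2εn, but the space bound of the full algorithm is not claimed for this version.

```python
import bisect
import math
import random

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # entries [v, g, delta], sorted by v

    def insert(self, v):
        period = max(1, round(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:   # n = 0 mod 1/(2*eps)
            self.compress()
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        if i == 0 or i == len(self.tuples):
            delta = 0                             # new minimum or maximum
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2
        while i >= 1:                             # right to left, keep the minimum
            g_i = self.tuples[i][1]
            _, g_next, d_next = self.tuples[i + 1]
            if g_i + g_next + d_next <= limit:    # merge would not overfill
                self.tuples[i + 1][1] = g_i + g_next
                del self.tuples[i]
            i -= 1

    def query(self, psi):
        r = math.ceil(psi * self.n)
        tol = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= tol and rmin + delta - r <= tol:
                return v
        return self.tuples[-1][0]

values = list(range(1, 2001))
random.Random(0).shuffle(values)
summary = GKSummary(eps=0.01)
for x in values:
    summary.insert(x)
print(len(summary.tuples), summary.query(0.5))
```

Because the demo stream is a permutation of 1..2000, the true rank of any answer equals its value, so the reported median can be checked directly against 1000 ± 20.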

Tree Representation

ε = .001, N = 7,000, 2εN = 14

Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0

  • Group tuples with similar capacities into bands
  • First (least index) node to the right with higher capacity band becomes parent.

[Figure: the tuples drawn as a row of nodes, each labeled with its band (0 to 3), linked step by step into a tree under a root R.]

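
The band table and the parent rule can be sketched as follows. The formula band = ⌊log2(2εN − Δ + 1)⌋ is only an assumption read off the table for 2εN = 14; the paper's exact band definition may differ at band boundaries. The band sequence in the usage lines is made up for illustration and is not the one in the slide's figure.

```python
def band(delta, p):
    """Band of a tuple with the given delta when 2*eps*N = p (integers).

    Assumed rule matching the table: floor(log2(p - delta + 1)).
    """
    return (p - delta + 1).bit_length() - 1

def parents(bands):
    """parent[i] = least index j > i with a higher band; None means the root R."""
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

print([band(d, 14) for d in (0, 7, 8, 11, 12, 13, 14)])  # [3, 3, 2, 2, 1, 1, 0]
print(parents([0, 1, 0, 2, 1, 3]))                       # [1, 3, 3, 5, 5, None]
```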
Operation (compress)

General strategy: delete tuples with small capacity and preserve tuples with large capacity.

1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees

2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.

3) Deletion cannot create an over-full tuple (i.e., one with g + Δ > ⌊2εN⌋)
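
Conditions 2 and 3 can be combined into one predicate, sketched here under assumptions: `g_subtree` is the total g of the tuple and all of its descendants (so that condition 1 is respected by deleting the whole subtree), the bands are supplied by the caller, and "not over-full" is taken as g + Δ ≤ ⌊2εN⌋ as worded on this slide. The function name and the band numbers in the usage lines are hypothetical.

```python
def can_merge(g_subtree, band_i, g_next, delta_next, band_next, limit):
    """True if the subtree rooted at tuple i may be merged into its right neighbour.

    band_i <= band_next                    : merge only into similar or larger capacity
    g_subtree + g_next + delta_next <= limit : result must not be over-full
    """
    return band_i <= band_next and g_subtree + g_next + delta_next <= limit

limit = 36  # floor(2 * 0.01 * 1801) from the running example
print(can_merge(1, 1, 28, 7, 4, limit))    # True: 1 + 28 + 7 = 36 fits
print(can_merge(15, 5, 29, 7, 4, limit))   # False: higher band, and 51 > 36
print(can_merge(1, 5, 1, 34, 1, limit))    # False: would merge into a smaller-capacity band
```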

Analysis
  • Theorem

At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).

Experimental Result
  • Measurement:
    • |S|
    • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
    • Optimal max observed ε
  • Compared 3 algorithms
    • MRL
    • Preallocated (one third as many stored observations as MRL)
    • Adaptive: allocate a new quantile only when the observed error is about to exceed the desired ε
Conclusion
  • Better worst-case behavior than previous algorithms
  • It does not require a priori knowledge of the parameter N