
Maintaining Variance and k-Medians over Data Stream Windows


Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan

Stanford University

- Streaming data model
- Useful for applications with high data volumes, timeliness requirements
- Data processed in single pass
- Limited memory (sublinear in stream size)

- Sliding window model
- Variation of streaming data model
- Only recent data matters
- Parameterized by window size N
- Limited memory (sublinear in window size)

[Figure: a bit stream ….1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1… with time increasing left to right; a sliding window of size N = 7 covers the most recent elements up to the current time.]

- Variance: Σi (xi – μ)², where μ = (1/N) Σi xi
- k-median clustering:
- Given: N points (x1, …, xN) in a metric space
- Find k medians C = {c1, c2, …, ck} that minimize the assignment distance Σi d(xi, C), where d(xi, C) = minj d(xi, cj)
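As a concrete reference for the two objectives, here is a minimal Python sketch, assuming one-dimensional points with absolute difference as the metric (an illustrative simplification; the paper allows any metric space):

```python
def variance(xs):
    # Variance as defined above: sum of squared deviations from the mean.
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def assignment_distance(xs, centers):
    # k-median objective: each point pays its distance to the nearest median.
    return sum(min(abs(x - c) for c in centers) for x in xs)

points = [1.0, 2.0, 4.0, 8.0]
print(variance(points))                          # 28.75
print(assignment_distance(points, [2.0, 8.0]))   # 3.0
```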

- Count of non-zero elements / sum of positive integers [DGIM’02]
- (1 ± ε) approximation
- Space: Θ((1/ε) log N) words, i.e. Θ((1/ε) log² N) bits
- Update time: Θ(log N) worst case, Θ(1) amortized
- Improved to Θ(1) worst case by [GT’02]

- Exponential Histogram (EH) data structure

- Generalized SW model [CS’03] (previous talk)

- Variance (this work):
- (1 ± ε) approximation
- Space: O((1/ε²) log N) words
- Update time: O(1) amortized, O((1/ε²) log N) worst case

- k-medians (this work):
- 2^O(1/τ) approximation of assignment distance (0 < τ < ½)
- Space: O((k/τ⁴) N^(2τ))
- Update time: O(k) amortized, O((k²/τ³) N^(2τ)) worst case
- Query time: O((k²/τ³) N^(2τ))

- Overview of Exponential Histogram
- Where EH fails and how to fix it
- Algorithm for Variance
- Main ideas in k-medians algorithm
- Open problems

- Main difficulty: discount expiring data
- As each element arrives, one element expires
- Value of expiring element can’t be known exactly
- How do we update our data structure?

- One solution: Use histograms

….1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 …

Bucket sums = {3, 2, 1, 2}; after the oldest bucket expires: {2, 1, 2}

- Error comes from last bucket
- Need to ensure that contribution of last bucket is not too big

- Bad example:

… 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0…

Bucket sums = {4, 4, 4}; after the older buckets expire, bucket sums = {4}, and the entire estimate rests on one bucket whose error is large

- Exponential Histogram algorithm:
- Initially buckets contain 1 item each
- Merge adjacent buckets once the sum of later buckets is large enough

Example (stream ….1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1…):

Bucket sums = {4, 2, 2, 1} → a new 1 arrives → {4, 2, 2, 1, 1} → another 1 arrives → {4, 2, 2, 1, 1, 1} → merge the two oldest 1-buckets → {4, 2, 2, 2, 1} → merge the two oldest 2-buckets → {4, 4, 2, 1}
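The merging rule can be made concrete. Below is a simplified Python sketch of a DGIM-style Exponential Histogram for counting 1s: each bucket is a (timestamp, size) pair with power-of-2 sizes, at most k + 1 buckets of each size are allowed (k = 2 here is an illustrative choice), and the estimate counts every bucket fully except half of the oldest:

```python
def eh_update(buckets, t, bit, k=2):
    # buckets: list of (timestamp, size) pairs, newest first.
    # A bucket's timestamp is the arrival time of its most recent 1.
    if bit:
        buckets.insert(0, (t, 1))
        size = 1
        # Merge the two oldest buckets of a size once there are more than k+1.
        while sum(1 for _, s in buckets if s == size) > k + 1:
            idxs = [i for i, (_, s) in enumerate(buckets) if s == size]
            i, j = idxs[-2], idxs[-1]               # the two oldest of this size
            buckets[i] = (buckets[i][0], 2 * size)  # keep the newer timestamp
            del buckets[j]
            size *= 2

def eh_expire(buckets, t, N):
    # Drop buckets whose timestamp has fallen out of the window (t-N, t].
    while buckets and buckets[-1][0] <= t - N:
        buckets.pop()

def eh_estimate(buckets):
    # Count all buckets fully, except only half of the (partially expired)
    # oldest bucket -- that half-bucket is the source of the estimation error.
    if not buckets:
        return 0
    return sum(s for _, s in buckets) - buckets[-1][1] // 2

buckets = []
for t in range(8):           # a stream of eight 1s
    eh_update(buckets, t, 1)
print(eh_estimate(buckets))  # 7 (true count is 8)
```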

- [DGIM’02]: Can estimate any function f defined over windows that satisfies:
- Positive: f(X) ≥ 0
- Polynomially bounded: f(X) ≤ poly(|X|)
- Composable: can compute f(X + Y) from f(X), f(Y), and a little additional information
- Weakly additive: f(X) + f(Y) ≤ f(X + Y) ≤ c (f(X) + f(Y)) for some constant c

- “Weakly Additive” condition not valid for variance, k-medians
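A two-line check shows why variance fails weak additivity: two buckets can each have zero variance while their union has arbitrarily large variance, so no constant c can satisfy f(X + Y) ≤ c (f(X) + f(Y)):

```python
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

X = [0.0] * 4      # variance 0
Y = [100.0] * 4    # variance 0
print(variance(X) + variance(Y))  # 0.0
print(variance(X + Y))            # 20000.0, unbounded relative to the sum
```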

[Figure: the current window, size = N, partitioned into buckets Bm, Bm−1, …, B2, B1, with Bm the oldest and B1 the most recent.]
- Vi = variance of the ith bucket
- ni = number of elements in the ith bucket
- μi = mean of the ith bucket
- Bi,j = concatenation of buckets i and j

[Figure: values plotted over time. The variance within each individual bucket is small, yet the variance of the combined bucket is large, so the contribution of the last bucket cannot be neglected.]

- More careful estimation of last bucket’s contribution
- Decompose variance into two parts
- “Internal” variance: within bucket
- “External” variance: between buckets
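This decomposition has a clean closed form (the standard two-sample variance identity, stated here in terms of the per-bucket summaries ni, μi, Vi): Vi,j = Vi + Vj + (ni nj / (ni + nj)) (μi − μj)². A small Python check:

```python
def stats(xs):
    # (count, mean, variance) summary of a bucket.
    n = len(xs)
    mu = sum(xs) / n
    return n, mu, sum((x - mu) ** 2 for x in xs)

def combine(n1, mu1, v1, n2, mu2, v2):
    # Internal variances add; the external term charges the gap in means.
    external = n1 * n2 * (mu1 - mu2) ** 2 / (n1 + n2)
    n = n1 + n2
    return n, (n1 * mu1 + n2 * mu2) / n, v1 + v2 + external

b1, b2 = [1.0, 2.0, 3.0], [10.0, 14.0]
_, _, v_combined = combine(*stats(b1), *stats(b2))
print(v_combined, stats(b1 + b2)[2])  # both 130.0
```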

[Figure: two buckets i and j; each contributes its internal variance, and the gap between their means contributes the external variance.]

- Exact decomposition: Vi,j = Vi + Vj + (ni nj / (ni + nj)) (μi – μj)²

- When estimating contribution of last bucket:
- Internal variance charged evenly to each point
- External variance:
- Pretend each point is at the average for its bucket
- Variance for bucket is small ⇒ points aren’t too far from the average
- Points aren’t far from the average ⇒ average is a good approx. for each point

- Spread is small ⇒ external variance is small
- Spread is large ⇒ error from “bucket averaging” is insignificant

[Figure: values over time; the “spread” is the gap between bucket means.]

- Theorem: relative error ≤ ε, provided Vm ≤ (ε²/9) Vm*
- Aim: maintain Vm ≤ (ε²/9) Vm* using as few buckets as possible

[Figure: the current window, size = N, split into buckets Bm, Bm−1, …, B2, B1; Bm* denotes the suffix bucket obtained by combining all buckets newer than Bm.]

- EH algorithm for variance:
- Initially buckets contain 1 item each
- Merge adjacent buckets i, i−1 whenever (9/ε²) Vi,i−1 ≤ V*i−1 (i.e., the variance of the merged bucket is small compared to the combined variance of later buckets)

- Invariant 1: (9/ε²) Vi ≤ Vi*
- Ensures that relative error is ≤ ε

- Invariant 2: (9/ε²) Vi,i−1 > V*i−1
- Ensures that number of buckets = O((1/ε²) log N)
- Each bucket requires O(1) space

- Query time: O(1)
- We maintain the n, V and μ values for Bm and the suffix bucket Bm*

- Update time: O((1/ε²) log N) worst case
- Time to check and combine buckets
- Can be made O(1) amortized by merging buckets periodically instead of after each new data element
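To make the merge rule concrete, here is a simplified Python sketch of the bucket maintenance for variance. Each bucket is an (n, μ, V) triple, oldest first; a new point becomes its own bucket, and adjacent buckets merge while the merged variance is at most (ε²/9) times the variance of the strictly newer suffix. This sketch omits expiry of the oldest bucket and recomputes suffix statistics naively, so it illustrates the invariant rather than the paper’s O(1)-amortized implementation:

```python
def merge(b1, b2):
    # Combine two (n, mu, V) summaries; V gains the external variance term.
    n1, mu1, v1 = b1
    n2, mu2, v2 = b2
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    return n, mu, v1 + v2 + n1 * n2 * (mu1 - mu2) ** 2 / n

def suffix_stats(buckets, start):
    # Summary of buckets[start:] (naive recomputation, fine for a sketch).
    s = buckets[start]
    for b in buckets[start + 1:]:
        s = merge(s, b)
    return s

def insert_point(buckets, x, eps):
    buckets.append((1, x, 0.0))  # newest bucket holds just the new point
    merged_something = True
    while merged_something:
        merged_something = False
        for i in range(len(buckets) - 2):
            candidate = merge(buckets[i], buckets[i + 1])
            # Merge condition: merged bucket's variance is small relative to
            # the combined variance of all strictly newer buckets.
            if (9 / eps ** 2) * candidate[2] <= suffix_stats(buckets, i + 2)[2]:
                buckets[i:i + 2] = [candidate]
                merged_something = True
                break
```

On a constant stream every merge condition holds with equality (0 ≤ 0), so the structure collapses to just two buckets; on varied data, buckets whose merger would create too much variance stay separate.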

- Assignment distance substitutes for variance
- Assignment distance obtained from an approximate clustering of points in the bucket
- Use hierarchical clustering algorithm [GMMO’00]
- Original points cluster to give level-1 medians
- Level-i medians cluster to give level-(i+1) medians
- Medians weighted by count of assigned points

- Each bucket maintains a collection of medians at various levels

- Merging buckets:
- Combine medians from each level i
- If they exceed N^τ in number, cluster them to get level-(i+1) medians
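A toy sketch of this bucket-merging step, with an exhaustive weighted k-median solver standing in for the [GMMO’00] clustering routine (the names `cluster` and `merge_buckets` and the `cap` parameter are illustrative; the paper’s cap is N^τ):

```python
from itertools import combinations

def cluster(weighted_points, k):
    # Toy weighted k-median by exhaustive search over candidate centers.
    # Only viable for tiny inputs; a stand-in for the real clustering step.
    pts = [p for p, _ in weighted_points]
    best, best_cost = None, float("inf")
    for centers in combinations(pts, k):
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in weighted_points)
        if cost < best_cost:
            best, best_cost = centers, cost
    # Weight each median by the total weight of the points assigned to it.
    weights = {c: 0.0 for c in best}
    for p, w in weighted_points:
        weights[min(best, key=lambda c: abs(p - c))] += w
    return [(c, weights[c]) for c in best]

def merge_buckets(b1, b2, k, cap):
    # A bucket is a list of levels; each level is a list of (median, weight).
    levels = []
    for i in range(max(len(b1), len(b2))):
        l1 = b1[i] if i < len(b1) else []
        l2 = b2[i] if i < len(b2) else []
        levels.append(l1 + l2)
    i = 0
    while i < len(levels):
        if len(levels[i]) > cap:  # too many medians: promote to next level
            promoted = cluster(levels[i], k)
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1] += promoted
            levels[i] = []
        i += 1
    return levels

b1 = [[(0.0, 1.0), (1.0, 1.0)]]
b2 = [[(10.0, 1.0), (11.0, 1.0)]]
print(merge_buckets(b1, b2, k=2, cap=3))
```

Total weight is conserved across promotions, which is what lets the estimation procedure treat medians as proxies for their assigned points.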

- Estimation procedure
- Weighted clustering of all medians from all buckets to produce k overall medians

- Estimating contribution of last bucket
- Pretend each point is at the closest median
- Relies on approximate counts of active points assigned to each median

- See paper for details!

- Variance:
- Close the gap between the lower and upper space bounds ((1/ε) log N vs. (1/ε²) log N)
- Improve update time from O(1) amortized to O(1) worst-case

- k-median clustering:
- [COP’03] give polylog N space approx. algorithm in streaming data model
- Can a similar result be obtained in the sliding window model?

- Algorithms to approximately maintain variance and k-median clustering in sliding window model
- Previous results using Exponential Histograms required “weak additivity”
- Not satisfied by variance or k-median clustering

- Adapted EHs for variance and k-median
- Techniques may be useful for other statistics that violate “weak additivity”