
Maintaining Variance over Data Stream Windows



Presentation Transcript


  1. Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan, Stanford University ACM Symp. on Principles of Database Systems (PODS 2003), June 2003 Presented by C.-L. Lin

  2. Data streams
  • Traditional DBMS (Database Management System): data stored in finite and persistent data sets, one-time queries, relatively low update rate.
  • Data Stream Management System (DSMS): data arrives as continuous, possibly infinite data streams, and queries are continuous, etc.
  • An example of a continuous query: in a telecom company, we are interested in finding all outgoing calls longer than 2 minutes (a sketch follows this slide).
  • New applications: sensor networks, network monitoring and traffic engineering, telecom call records, network security, financial applications, manufacturing processes, web logs and clickstreams, massive data sets.
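
As an illustration of a continuous query (not part of the original slides), here is a minimal Python sketch of a standing filter over a call-record stream; the CallRecord fields and the 2-minute threshold are assumptions made for the example.

    from dataclasses import dataclass
    from typing import Iterable, Iterator

    @dataclass
    class CallRecord:
        caller: str
        direction: str          # "outgoing" or "incoming"
        duration_seconds: int

    def long_outgoing_calls(stream: Iterable[CallRecord],
                            min_seconds: int = 120) -> Iterator[CallRecord]:
        """Continuous query: emit every outgoing call longer than min_seconds."""
        for record in stream:   # the stream may be unbounded
            if record.direction == "outgoing" and record.duration_seconds > min_seconds:
                yield record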

  3. Data Stream Management System (DSMS)
  [Figure: a stream of values (e.g., 32, 13, 16, 5, 7, ...) flows into a Stream Query Processor that has only limited memory and/or disk; the processor keeps summary data (sum, count, variance, ...) and returns query results to the user/application issuing the query.]

  4. Sliding Window Model
  [Figure: a stream with timestamps increasing over time; a window of size N over the most recent elements separates expired data from current and future data.]
  • When N is large (many hours, days, or months), we cannot buffer the entire sliding window in memory: O(N log R) bits of memory would be required, where R is an upper bound on the absolute value of the data.
  • So we cannot compute the sum, count, and variance exactly at every instant.
  • Goal: approximately compute the variance over the sliding window, using as little memory as possible (a naive exact baseline is sketched below for contrast).
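
For contrast with the approximate algorithm, here is a naive exact baseline (my addition, not from the slides): it buffers the whole window, which is exactly the O(N) storage the slide says we cannot afford. Variance here means the sum of squared deviations from the mean, as defined on the next slide.

    from collections import deque

    class ExactWindowVariance:
        """Exact variance of the last n elements; needs O(n) memory."""
        def __init__(self, n: int):
            self.window = deque(maxlen=n)    # stores every element in the window

        def add(self, x: float) -> None:
            self.window.append(x)            # the oldest element is evicted automatically

        def variance(self) -> float:
            if not self.window:
                return 0.0
            mu = sum(self.window) / len(self.window)
            return sum((x - mu) ** 2 for x in self.window)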

  5. Review
  (1) Mean: μ = (1/n) Σ xi
  (2) Variance: V = Σ (xi − μ)², the sum of squared deviations from the mean
  (3) Relative estimation error: |Vest − V| / V
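
A direct Python translation of these three definitions (helper code added for this transcript, not from the slides):

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        """Variance as used here: the sum of squared deviations from the mean."""
        mu = mean(xs)
        return sum((x - mu) ** 2 for x in xs)

    def relative_error(estimate, actual):
        return abs(estimate - actual) / actual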

  6. The Concept of Buckets (1/2)
  [Figure: the stream elements are partitioned by arrival order into buckets B4, B3, B2, B1; each bucket is identified by the timestamp of its most recent element (here 9, 6, 3, 1). The suffix buckets Bi* summarize the union of all buckets more recent than Bi.]

  7. The Concept of Buckets (2/2)
  [Figure: buckets Bm, Bm-1, ..., B2, B1 cover the window of size N; Bm* denotes the suffix bucket formed by combining Bm-1, ..., B1.]
  • For each bucket Bi, maintain its summary statistics: the element count ni, the mean μi, and the variance Vi (the same statistics are kept for the suffix bucket Bm*).
  • These statistics can be combined without revisiting the raw data: proof later (Lemma 1).
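
A minimal container for these per-bucket statistics (my own sketch; the field names are assumptions):

    from dataclasses import dataclass

    @dataclass
    class Bucket:
        timestamp: int   # arrival index of the bucket's most recent element
        n: int           # number of elements summarized by the bucket
        mean: float      # mean of those elements
        var: float       # variance: sum of squared deviations from the mean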

  8. Estimated Variance
  [Figure: the window of size N covers all of Bm-1, ..., B1 but only part of the oldest bucket Bm; Bm* denotes the combination of Bm-1, ..., B1.]
  • The variance over the window is estimated by combining the suffix bucket Bm* with the non-expired portion of Bm.
  • Error! Only summary statistics for Bm as a whole are kept, so the mean and variance of its non-expired elements are unknown; this is the only source of estimation error.

  9. Lemma 1 (combining two buckets)
  Let Bi,j denote the bucket obtained by combining buckets Bi and Bj. Then
    ni,j = ni + nj
    μi,j = (ni·μi + nj·μj) / ni,j
    Vi,j = Vi + Vj + (ni·nj / ni,j)·(μi − μj)²
  Proof. Define δi = μi − μi,j and δj = μj − μi,j, and expand the squared deviations of the elements of Bi and Bj around the combined mean μi,j.
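
The combination rule of Lemma 1 translates directly into a small helper (a sketch built on the Bucket class above):

    def combine(bi: Bucket, bj: Bucket) -> Bucket:
        """Lemma 1: statistics of Bi ∪ Bj from the statistics of Bi and Bj alone."""
        n = bi.n + bj.n
        mean = (bi.n * bi.mean + bj.n * bj.mean) / n
        var = bi.var + bj.var + (bi.n * bj.n / n) * (bi.mean - bj.mean) ** 2
        # the union's timestamp is that of its more recent constituent
        return Bucket(timestamp=max(bi.timestamp, bj.timestamp), n=n, mean=mean, var=var)

As a sanity check, combining a bucket holding {1, 1} (n=2, mean=1, V=0) with one holding {4} (n=1, mean=4, V=0) gives n=3, mean=2, V=6, which matches the direct computation on {1, 1, 4}.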

  10. When a new element xt arrives...
  • Create a new bucket for xt. The new bucket becomes B1 with V1 = 0, μ1 = xt, n1 = 1; each old bucket Bi becomes Bi+1.
  • If the timestamp tm of the oldest bucket Bm has fallen more than N steps behind the current time (so every element of Bm has expired), delete the bucket. Bucket Bm-1 becomes the new oldest bucket, and Bm-1* is updated accordingly. (A sketch of this step follows.)
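
A sketch of the arrival step under the assumptions above (using the Bucket class from earlier; buckets[0] is the newest bucket B1):

    def arrive(buckets: list, x: float, t: int, N: int) -> None:
        """Handle the arrival of element x at time t over a window of the last N elements."""
        # The new element becomes its own bucket B1; older buckets shift to Bi+1.
        buckets.insert(0, Bucket(timestamp=t, n=1, mean=x, var=0.0))
        # Drop the oldest bucket once even its most recent element has expired.
        if len(buckets) > 1 and t - buckets[-1].timestamp >= N:
            buckets.pop()
            # (the maintained suffix statistics for the new oldest bucket, Bm-1*,
            #  would be updated here)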

  11. Bucket Merge
  • Invariant 1: for every bucket Bi, the variance Vi is at most a small (ε-dependent) fraction of Vi*, the variance of its suffix bucket. This ensures that the relative error is ≤ ε.
  • Invariant 2: for each i > 1, the variance of Bi combined with its newer neighbor is at least a comparable (ε-dependent) fraction of Vi*. This invariant ensures that the total number of buckets is small: O((1/ε²) log NR²).
  • When an arrival violates Invariant 2, adjacent buckets are merged using Lemma 1 (a sketch follows).
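
A hedged sketch of the merge step. The exact inequality behind Invariant 2 is not reproduced in this transcript, so it appears below only as an abstract predicate, violates_invariant_2; everything else reuses the Bucket class and the combine helper sketched earlier.

    from functools import reduce

    def suffix_variance(buckets: list, i: int) -> float:
        """Variance Vi* of the union of all buckets newer than buckets[i] (via Lemma 1)."""
        newer = buckets[:i]                  # buckets[0] is the newest bucket, B1
        return reduce(combine, newer).var if newer else 0.0

    def enforce_invariant_2(buckets: list, violates_invariant_2) -> None:
        """Merge adjacent buckets (Lemma 1) wherever the placeholder predicate reports
        that Invariant 2 is violated, scanning from the oldest bucket to the newest."""
        i = len(buckets) - 1
        while i > 0:
            if violates_invariant_2(buckets[i], buckets[i - 1], suffix_variance(buckets, i)):
                buckets[i - 1] = combine(buckets[i], buckets[i - 1])
                del buckets[i]
            i -= 1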

  12. Number of Buckets
  • Lemma 2: the number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²) log NR²), where R is an upper bound on the absolute value of the data elements.
  <proof>
  • From the merge rule: the variance of the union of two buckets is no less than the sum of the individual variances.
  • By Invariant 2, the variance of the suffix bucket Bi* doubles after every O(1/ε²) buckets.
  • Total number of buckets: no more than O((1/ε²) log V), where V is the variance of the last N points. Since V is at most NR², this is O((1/ε²) log NR²). (A numeric illustration follows.)
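
To get a feel for the bound, here is a small back-of-the-envelope computation; the concrete values of ε, N, and R are arbitrary choices for illustration only.

    import math

    eps, N, R = 0.1, 1_000_000, 100                 # illustrative parameters (assumed)
    bucket_bound = (1 / eps**2) * math.log2(N * R**2)
    print(f"(1/eps^2) * log2(N * R^2) ≈ {bucket_bound:.0f} buckets")
    # ≈ 100 * 33.2 ≈ 3300 buckets, versus the N = 1,000,000 raw values in the window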

  13. Space Complexity
  (1) By Lemma 2, the number of buckets maintained at any point in time is O((1/ε²) log NR²).
  (2) If each bucket required constant space, the overall memory would be O((1/ε²) log NR²).
  But... each bucket actually stores:
  • Timestamp: O(log N) bits
  • Bucket size (count): O(log N) bits
  • Mean: O(log NR) bits (e.g., by storing the sum of the bucket's elements)
  • Variance: O(log V) = O(log NR²) bits
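
Continuing the illustrative numbers from the previous sketch, a rough per-bucket bit count; the breakdown follows the slide, but the concrete constants are my assumptions.

    import math

    eps, N, R = 0.1, 1_000_000, 100                      # same illustrative parameters
    bits_timestamp = math.ceil(math.log2(N))             # O(log N)
    bits_count     = math.ceil(math.log2(N))             # O(log N)
    bits_sum       = math.ceil(math.log2(N * R))         # mean kept as a sum: O(log NR)
    bits_variance  = math.ceil(math.log2(N * R**2))      # O(log V) = O(log NR^2)
    bits_per_bucket = bits_timestamp + bits_count + bits_sum + bits_variance
    n_buckets = (1 / eps**2) * math.log2(N * R**2)
    print(f"≈{bits_per_bucket} bits per bucket, ≈{n_buckets * bits_per_bucket / 8000:.0f} KB in total")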

  14. Estimation Error
  • Estimated variance: obtained by applying Lemma 1 to combine the suffix bucket Bm* with estimated statistics for the non-expired portion of the oldest bucket Bm.
  • Actual variance: the same combination, but using the true (unknown) statistics of that portion.
  • Error: the difference depends only on how well Bm's stored statistics approximate its non-expired portion; Invariant 1 bounds this, giving relative error ≤ ε. (A hedged sketch of the estimate follows.)
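
The formulas on this slide are images and are not reproduced in the transcript, so the following is only a hedged sketch of one natural way to form the estimate, not the paper's verified procedure: the non-expired part of Bm is treated as a pseudo-bucket with the known number of surviving elements, mean μm, and half of Vm. The surviving count is exact for a window of the last N elements; taking the full mean and half the variance are assumptions to be checked against the paper. The sketch reuses the Bucket class and combine helper from earlier.

    from functools import reduce

    def estimate_window_variance(buckets: list, N: int) -> float:
        """Hedged sketch: estimate the variance of the last N elements from the buckets.

        buckets[0] is the newest bucket; buckets[-1] = Bm may be partially expired.
        Treating Bm's surviving part as having mean Bm.mean and variance Bm.var / 2
        is an illustrative assumption, not the paper's exact formula."""
        if len(buckets) < 2:
            return buckets[0].var if buckets else 0.0
        suffix = reduce(combine, buckets[:-1])       # Bm* = Bm-1 ∪ ... ∪ B1
        oldest = buckets[-1]
        surviving = max(1, N - suffix.n)             # number of Bm's non-expired elements
        oldest_est = Bucket(timestamp=oldest.timestamp, n=surviving,
                            mean=oldest.mean, var=oldest.var / 2.0)
        return combine(oldest_est, suffix).var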
