Statistical Approaches to Mining Multivariate Data Streams

Eric Vance, Duke University

Department of Statistical Science

David Banks, Duke University

Tamraparni Dasu, AT&T Labs - Research

JSM July 31, 2007

Salt Lake City, Utah


Data Streams

  • Huge amounts of complex data

  • Rapid rate of accumulation

  • One-time access to raw data


Change Detection

  • Problem: Change detection in complicated data streams

  • Three criteria

    • Nonparametric

    • Fast

    • Statistical guarantees


E-Commerce Server Data

  • Data description

    • One server in a network of servers

    • Data polled every 5 minutes

    • 27 variables

    • 5 week time period

  • Quality issues

    • Many variables unimportant or unchanging

    • Missing data


E-Commerce Server Data

  • Variable Elimination

    • Constant: several variables never change (e.g., Total Swap)

    • Predictable and non-informative (e.g., Used SysDisk Space)

    • Correlated (e.g., CPU Used% and CPU User%)

  • 6 Variables Selected

    • CPU Used%

    • Number of Procs

    • Number of Threads

    • Ping Latency

    • Used Swap

    • Log Packets = log(In Packets + Out Packets)


Daily Boxplots

[Figures: daily boxplots of CPU, Procs, Threads, Ping, Swap, and Packets]


Our Approach

  • Partitioning scheme (Multivariate Histogram)

    • Data depth: Rank each point by its Mahalanobis distance from the center of the data

    • Data pyramid: Determine the direction in which each point is most extreme

  • Profile based comparison

    • Identify changes in profiles over time (week to week)
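
One way to read the profile-based comparison is as a comparison of bin proportions between two weeks. The sketch below assumes each observation has already been labeled with a (depth layer, pyramid) bin. The function names and the toy labels are hypothetical and are not taken from the presentation.

```python
from collections import Counter

def profile(bin_labels):
    """Fraction of points falling in each bin."""
    counts = Counter(bin_labels)
    total = len(bin_labels)
    return {b: c / total for b, c in counts.items()}

def profile_shift(baseline, current):
    """Change in bin proportion (current minus baseline) for every bin seen."""
    bins = set(baseline) | set(current)
    return {b: current.get(b, 0.0) - baseline.get(b, 0.0) for b in bins}

# Made-up labels: one (depth layer, pyramid) pair per polled observation.
week1 = [(1, "+CPU")] * 6 + [(2, "-Swap")] * 4
week2 = [(1, "+CPU")] * 3 + [(2, "-Swap")] * 4 + [(5, "+Threads")] * 3

shift = profile_shift(profile(week1), profile(week2))

# List bins from largest to smallest absolute change in proportion.
for b, delta in sorted(shift.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(b, round(delta, 3))
```

Bins whose proportion moves the most point to the depth layer and variable where the stream has changed; whether a given shift is statistically significant is a separate question.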


Partitioning Example in 2D

  • 5 “center-outward” depth layers

  • 4 pyramids

  • Labeled points in the figure:

    • A: depth 4, pyramid +y

    • B: depth 3, pyramid -x

    • C: depth 1, pyramid +x


Identify Depth and Direction

  • Compute center of comparison Data Sphere

  • Calculate Mahalanobis distance for each point

  • Determine which of the 5 depth quantiles each point falls in

  • Determine the direction of greatest variation for each point
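
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: layer 1 is taken to be the innermost depth quantile, and a point's pyramid is taken to be the variable with the largest absolute standardized deviation from the center, signed by direction. All function names are hypothetical.

```python
import random

def mean_vector(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def covariance_matrix(points, center):
    n, d = len(points), len(center)
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        dev = [x - c for x, c in zip(p, center)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(d):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][d] / m[i][i] for i in range(d)]

def mahalanobis(p, center, cov):
    dev = [x - c for x, c in zip(p, center)]
    return sum(u * v for u, v in zip(dev, solve(cov, dev))) ** 0.5

def assign_bins(points, n_layers=5):
    """Return (depth layer, signed axis) per point; layer 1 = innermost (assumed)."""
    center = mean_vector(points)
    cov = covariance_matrix(points, center)
    dists = [mahalanobis(p, center, cov) for p in points]
    ranked = sorted(dists)
    n = len(points)
    # Upper distance boundary of each of the first n_layers - 1 quantiles.
    cuts = [ranked[(k * n) // n_layers - 1] for k in range(1, n_layers)]
    bins = []
    for p, dist in zip(points, dists):
        layer = 1 + sum(dist > c for c in cuts)
        # Standardized deviation of each variable from the center.
        z = [(x - c) / cov[i][i] ** 0.5 for i, (x, c) in enumerate(zip(p, center))]
        axis = max(range(len(z)), key=lambda i: abs(z[i]))
        sign = "+" if z[axis] >= 0 else "-"
        bins.append((layer, sign + str(axis)))
    return bins

# Toy 2D data: 99 standard-normal points plus one extreme point along +x.
random.seed(0)
points = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(99)]
points.append([50.0, 0.0])
bins = assign_bins(points)
```

With d variables this yields 5 layers times 2d pyramids, which gives the 4 pyramids of the 2D example and 12 pyramids for the 6 selected variables.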


Data Partition in 6 Dimensions


Partitioning Into Bins

[Figure, built up over successive slides; labels shown: CPU, Procs, Threads, Ping, Swap, Packets]


Partitioning Into Bins: Weeks 1 and 2

[Figure, built up over successive slides; labels shown: Swap, Threads]


6 Variables Over 5 Weeks

[Figure, repeated between the weekly results; labels shown: Threads, Packets]


Week 1 Results

[Figure]


Week 2 Results

[Figure]


Week 3 Results

[Figure; label shown: Threads]


Week 4 Results

[Figure; labels shown: Procs, Threads]


Week 5 Results

[Figure; label shown: Packets]


Trimmed Mean and Covariance

  • Apply the partitioning method using a trimmed mean and trimmed covariance matrix

    • Use the 90% most central data points

    • Recompute the center

    • Recompute Mahalanobis distances using the new center and new covariance matrix

  • Bins are more uniformly filled

    • More points appear closer to the trimmed center

    • More variables become “most extreme”
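
The trimming procedure can be illustrated with a minimal sketch. For brevity it assumes only two variables, so the covariance inverse has a closed form; the function names and the toy data (95 ordinary points plus 5 gross outliers) are hypothetical.

```python
import random

def center_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, center, cov):
    """Mahalanobis distance via the closed-form inverse of a 2x2 covariance."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5

def trimmed_center_and_cov(points, keep=0.90):
    """Re-estimate center and covariance from the `keep` most central points."""
    center, cov = center_and_cov(points)
    ranked = sorted(points, key=lambda p: mahalanobis(p, center, cov))
    central = ranked[: round(keep * len(points))]
    return center_and_cov(central)

# Toy data: 95 standard-normal points and 5 identical gross outliers.
random.seed(1)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(95)]
points += [(30.0, 30.0)] * 5

raw_center, raw_cov = center_and_cov(points)
trim_center, trim_cov = trimmed_center_and_cov(points)

# Distances for ALL points, recomputed against the trimmed estimates.
new_dists = [mahalanobis(p, trim_center, trim_cov) for p in points]
```

In this toy data the outliers pull the untrimmed center away from the bulk of the points, while the trimmed center stays near it; depth layers and pyramids would then be assigned from `new_dists` and the trimmed estimates.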


Trimmed Comparison: Week 1

[Figure]


Trimmed Results: Weeks 1 through 5

[Figures: one slide per week]


Multivariate Tests of Statistical Significance

  • Multinomial test

    • Chi-squared test, G-test

  • Bootstrap and resampling

  • Bayesian methods
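
As a hedged illustration of the multinomial option, the sketch below computes the Pearson chi-squared and G statistics for one week's bin counts against a baseline week. It assumes the baseline proportions are treated as fixed and that every baseline bin is non-empty; the counts are made up and the function name is hypothetical.

```python
import math

def chi2_and_g(observed, baseline):
    """Pearson chi-squared and G statistics for observed vs. baseline bin counts."""
    total_obs = sum(observed)
    total_base = sum(baseline)
    chi2 = 0.0
    g = 0.0
    for o, b in zip(observed, baseline):
        # Expected count: baseline proportion scaled to the observed total.
        e = b * total_obs / total_base
        chi2 += (o - e) ** 2 / e
        if o > 0:  # an empty observed bin contributes 0 to G
            g += 2.0 * o * math.log(o / e)
    return chi2, g

# Made-up counts for 4 bins.
baseline_week = [25, 25, 25, 25]
current_week = [30, 20, 25, 25]

chi2, g = chi2_and_g(current_week, baseline_week)

# With k bins and fixed baseline proportions, both statistics are compared
# to a chi-squared distribution with k - 1 degrees of freedom.
# For k = 4 (3 degrees of freedom) the 5% critical value is 7.815.
change_flagged = chi2 > 7.815
```

Here the chi-squared statistic is 2.0 and G is slightly larger, so neither exceeds the critical value and no change is flagged. With many sparsely filled bins the chi-squared approximation becomes unreliable, which is one reason to consider the resampling and Bayesian alternatives listed above.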


Conclusion

  • Quickly categorize data points nonparametrically

  • Compare bins

  • Identify which variables change most

  • Questions or comments to ervance @ stat.duke.edu

