Beyond Streams and Graphs: Dynamic Tensor Analysis

Presentation Transcript


  1. Beyond Streams and Graphs: Dynamic Tensor Analysis. Dacheng Tao, Christos Faloutsos, Jimeng Sun. Speaker: Jimeng Sun

  2. Motivation • Goal: incremental pattern discovery on streaming applications • Streams: • E1: Environmental sensor networks • E2: Cluster/data center monitoring • Graphs: • E3: Social network analysis • Tensors: • E4: Network forensics • E5: Financial auditing • E6: fMRI: Brain image analysis • How to summarize streaming data effectively and incrementally?

  3. E3: Social network analysis • Traditionally, people focus on static networks and find community structures • We plan to monitor the change of the community structure over time and identify abnormal individuals

  4. E4: Network forensics • Directional network flows • A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004] • 450 GB/hour with compression • Task: identify abnormal traffic patterns and find their cause [Figure: source-by-destination traffic plots for abnormal and normal traffic] Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

  5. Static Data model • For a timestamp, the stream measurements can be modeled using a tensor • Dimension: a single stream • E.g., <Christos, “graph”> • Mode: a group of dimensions of the same kind • E.g., Source, Destination, Port [Figure: source-by-destination matrix at Time = 0]

  6. Static Data model (cont.) • Tensor • Formally, an Mth-order tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$ • Generalization of matrices • Represented as a multi-dimensional array (data cube)

  7. Dynamic Data model (our focus) • Streams come with structure • (time, source, destination, port) • (time, author, keyword) [Figure: source-by-destination matrices stacked along the time axis]

  8. Dynamic Data model (cont.) • Tensor streams • A sequence of Mth-order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, where n is increasing over time [Figure: author-by-keyword tensors arriving along the time axis]

  9. Dynamic tensor analysis [Figure: old tensors and a new tensor over Source and Destination, with projection matrices $U_{Source}$ and $U_{Destination}$ and the old cores]

  10. Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion

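  11. Background – Singular value decomposition (SVD) • SVD: $A \approx U \Sigma V^{T}$ • Best rank-k approximation in L2 • PCA is an important application of SVD [Figure: an m × n matrix A approximated by rank-k factors U and $V^{T}$, and the projection Y obtained with U]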
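The rank-k approximation mentioned on this slide can be sketched with NumPy's SVD routine. This is a minimal illustration, not code from the talk; the function name `best_rank_k` and the random test matrix are assumptions made here.

```python
import numpy as np

def best_rank_k(A, k):
    """Return the rank-k reconstruction of A from its truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))      # hypothetical m x n data matrix
A2 = best_rank_k(A, 2)

# The Frobenius error of the truncation equals the energy
# left in the discarded singular values.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2)
```

The discarded singular values give the approximation error directly, which is why the later slides can use reconstruction error as a quality measure.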

  12. Latent semantic indexing (LSI) • Singular vectors are useful for clustering or correlation detection [Figure: a matrix over the terms cluster, cache, frequent, query, pattern factored as document-concept × concept-association × concept-term, with DB and DM concepts]

  13. Tensor Operation: Matricize $X_{(d)}$ • Unfold a tensor into a matrix [Figure: a tensor with entries 1 to 8 unfolded into a matrix] Acknowledgment to Tammy Kolda for this slide
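A minimal sketch of matricizing in NumPy follows. The layout is an assumption chosen here: mode d becomes the columns and all other modes are flattened into rows, so that $X_{(d)}^{T} X_{(d)}$ is an $N_d \times N_d$ matrix. Other references place mode d on the rows instead, and the ordering of the flattened modes is also a convention.

```python
import numpy as np

def matricize(X, d):
    """Unfold tensor X along mode d.

    Rows index every mode except d (flattened); columns index mode d.
    """
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

X = np.arange(1, 9).reshape(2, 2, 2)   # small example tensor with entries 1..8
X0 = matricize(X, 0)                   # shape (4, 2)
X2 = matricize(X, 2)                   # shape (4, 2)

# The Gram matrix of the unfolding sums products over all other modes.
gram0 = X0.T @ X0
```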

  14. Tensor Operation: Mode-product • Multiply a tensor with a matrix [Figure: a source × destination × port tensor multiplied along the source mode by a group × source matrix]
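The mode-product can be sketched with `np.tensordot`. The shape convention is an assumption made for this example: the matrix has shape (new size, size of mode d), so a group × source matrix applied to the source mode yields a group × destination × port tensor. The sizes below are hypothetical.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U along mode d.

    U has shape (new_size, X.shape[d]); mode d of the result has new_size entries.
    """
    # tensordot contracts mode d of X with the columns of U and
    # appends the new axis last, so move it back to position d.
    Y = np.tensordot(X, U, axes=([d], [1]))
    return np.moveaxis(Y, -1, d)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3, 2))   # source x destination x port
G = rng.standard_normal((2, 4))      # group x source
Y = mode_product(X, G, 0)            # group x destination x port
```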

  15. Related work [Table: related approaches compared with “Our Work”]

  16. Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion

  17. Tensor analysis • Given a sequence of tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, find the projection matrices (one per mode) such that the reconstruction error e is minimized • Note that this is a generalization of PCA when n is a constant
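The formula on this slide is not preserved in the transcript. One common way to write such an objective, given here as an assumption rather than as the slide's exact expression, takes projection matrices $U_d \in \mathbb{R}^{N_d \times R_d}$ with orthonormal columns and measures how much of each tensor is lost when every mode is projected onto the span of $U_d$:

```latex
e = \sum_{t=1}^{n} \left\| \mathcal{X}_t
    - \mathcal{X}_t \times_1 \left(U_1 U_1^{T}\right) \times_2 \cdots \times_M \left(U_M U_M^{T}\right)
    \right\|_F^2
```

Here $\times_d$ denotes the mode-d product and $\|\cdot\|_F$ the Frobenius norm. For M = 1 this reduces to the usual PCA residual.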

  18. Why do we care? • Anomaly detection • Reconstruction error driven • Multiple resolution • Multiway latent semantic indexing (LSI) [Figure: authors (Philip Yu, Michael Stonebraker) and keywords (Pattern, Query) over time]

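  19. 1st order DTA - problem • Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^{N}$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small • Note that $Y = XU$ [Figure: the n × N data matrix X (time by sensors, indoor and outdoor) projected by U onto the n × R matrix Y]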

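  20. 1st order DTA
      Input: new data vector $x \in \mathbb{R}^{N}$, old variance matrix $C \in \mathbb{R}^{N \times N}$
      Output: new projection matrix $U \in \mathbb{R}^{N \times R}$
      Algorithm:
      1. Update the variance matrix: $C_{new} = x^{T} x + C$ (x taken as a row vector)
      2. Diagonalize: $U \Lambda U^{T} = C_{new}$
      3. Determine the rank R and return U
      Diagonalization has to be done for every new x!
      [Figure: old data X over time, the new vector x, and the update from C to $C_{new}$ and U]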
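The three steps above can be sketched as follows. This is an illustrative sketch only: the function name `dta_update` is hypothetical, and the rank R is assumed to be supplied by the caller rather than determined from the eigenvalues.

```python
import numpy as np

def dta_update(C, x, R):
    """One 1st-order DTA step: update the variance matrix, then re-diagonalize."""
    x = np.asarray(x, dtype=float).reshape(1, -1)  # treat x as a 1 x N row vector
    C_new = x.T @ x + C                            # step 1: C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)       # step 2: symmetric eigendecomposition
    top = np.argsort(eigvals)[::-1][:R]            # step 3: keep the R largest eigenvalues
    return C_new, eigvecs[:, top]

# Stream of points lying on the line spanned by (3, 4).
C = np.zeros((2, 2))
for a in [1.0, 2.0, 3.0, 4.0, 5.0]:
    C, U = dta_update(C, a * np.array([3.0, 4.0]), R=1)
```

Each call performs a full eigendecomposition, which is the per-point cost the slide warns about.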

  21. 1st order STA
      • Adjust U smoothly when new data arrive, without diagonalization [VLDB05]
      • For each new point x:
        • Project onto the current line
        • Estimate the error
        • Rotate the line in the direction of the error and in proportion to its magnitude
      For each new point x and for i = 1, …, k:
        • $y_i := U_i^{T} x$ (projection onto $U_i$)
        • $d_i \leftarrow d_i + y_i^{2}$ (energy, proportional to the i-th eigenvalue)
        • $e_i := x - y_i U_i$ (error)
        • $U_i \leftarrow U_i + (1/d_i)\, y_i e_i$ (update estimate)
        • $x \leftarrow x - y_i U_i$ (repeat with remainder)
      [Figure: points in the Sensor 1 / Sensor 2 plane with the direction U and the error]
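The per-point update listed above can be sketched as below. This follows the five steps as written on the slide; the function name `sta_update`, the starting direction, and the small initial energy are assumptions made for the example.

```python
import numpy as np

def sta_update(U, d, x):
    """Apply one streaming update to the columns of U (N x k) and energies d (length k)."""
    x = np.array(x, dtype=float)       # copy: the remainder is overwritten below
    for i in range(U.shape[1]):
        y = U[:, i] @ x                # projection onto U_i
        d[i] = d[i] + y * y            # accumulated energy
        e = x - y * U[:, i]            # error of the projection
        U[:, i] += (y / d[i]) * e      # rotate U_i toward the error
        x = x - y * U[:, i]            # remainder for the next component
    return U, d

# Points on the line spanned by (0.6, 0.8), starting from a wrong direction.
U = np.array([[1.0], [0.0]])
d = np.array([1e-3])
for t in range(50):
    scale = float(t % 5 + 1)
    U, d = sta_update(U, d, scale * np.array([0.6, 0.8]))

# Cosine between the tracked direction and the true one.
cos = abs(U[:, 0] @ np.array([0.6, 0.8])) / np.linalg.norm(U[:, 0])
```

No eigendecomposition is performed; each point costs a few vector operations per tracked component.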

  22. Mth order DTA

  23. Mth order DTA – complexity • Storage: $O(\prod_i N_i)$, i.e., the size of an input tensor at a single timestamp • Computation: $\sum_i N_i^{3}$ (or $\sum_i N_i^{2}$) for the diagonalization of C, plus $\sum_i N_i \prod_j N_j$ for the matrix multiplication $X_{(d)}^{T} X_{(d)}$ • For low-order tensors (< 3), diagonalization is the main cost • For high-order tensors, matrix multiplication is the main cost
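The two cost terms above correspond to two operations per mode: forming $X_{(d)}^{T} X_{(d)}$ and diagonalizing the updated variance matrix. The sketch below is inferred from this slide rather than taken from the talk; the names `matricize` and `dta_step`, the unfolding layout, and the fixed ranks are assumptions.

```python
import numpy as np

def matricize(X, d):
    """Unfold X along mode d: rows are all other modes, columns are mode d."""
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

def dta_step(X, C_list, ranks):
    """Update one variance matrix per mode with the new tensor X, then re-diagonalize."""
    U_list = []
    for d, R in enumerate(ranks):
        Xd = matricize(X, d)
        C_list[d] = C_list[d] + Xd.T @ Xd          # matrix multiplication cost
        w, V = np.linalg.eigh(C_list[d])           # diagonalization cost
        U_list.append(V[:, np.argsort(w)[::-1][:R]])
    return C_list, U_list

rng = np.random.default_rng(2)
shape, ranks = (5, 4, 3), (2, 2, 2)                # hypothetical sizes
C_list = [np.zeros((n, n)) for n in shape]
for _ in range(3):
    C_list, U_list = dta_step(rng.standard_normal(shape), C_list, ranks)
```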

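  24. Mth order STA • Run 1st order STA along each mode • Complexity: • Storage: $O(\prod_i N_i)$ • Computation: $\sum_i R_i \prod_j N_j$, which is smaller than DTA [Figure: a point x, its projection $y_1$ and error $e_1$, and $U_1$ before and after the update]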

  25. Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion

  26. Experiment • Objectives • Computational efficiency • Accurate approximation • Real applications • Anomaly detection • Clustering

  27. Data set 1: Network data • TCP flows collected at CMU backbone • Raw data 500 GB with compression • Construct 3rd order tensors with hourly windows with <source, destination, port> • Each tensor: 500 × 500 × 100 • 1200 timestamps (hours) [Figure: tensor values for 10 AM to 11 AM on 01/06/2005, showing sparse data and a power-law distribution]

  28. Data set 2: Bibliographic data (DBLP) • Papers from VLDB and KDD conferences • Construct 2nd order tensors with yearly windows with <author, keywords> • Each tensor: 4584 × 3741 • 11 timestamps (years)

  29. Computational cost • OTA is the offline tensor analysis • Performance metric: CPU time (sec) • Observations: • DTA and STA are orders of magnitude faster than OTA • The slight upward trend in DBLP is due to the increasing number of papers each year (the data become denser over time) [Figures: 3rd order network tensor; 2nd order DBLP tensor]

  30. Accuracy comparison • Performance metric: the ratio of reconstruction error between DTA/STA and OTA, with the error of OTA fixed at 20% • Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes [Figures: 3rd order network tensor; 2nd order DBLP tensor]

  31. Network anomaly detection • Reconstruction error gives an indication of anomalies • The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin) [Figure: reconstruction error over time, with panels for normal traffic and abnormal traffic]
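A reconstruction-error check of this kind can be sketched as follows. The relative Frobenius error used here is an assumption, since the slide does not define the metric; the helper names and the tiny 3 × 2 example are hypothetical.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U (new_size x X.shape[d]) along mode d."""
    return np.moveaxis(np.tensordot(X, U, axes=([d], [1])), -1, d)

def reconstruction_error(X, U_list):
    """Relative error after projecting every mode of X onto the span of U_d."""
    X_hat = X
    for d, U in enumerate(U_list):
        X_hat = mode_product(X_hat, U @ U.T, d)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Projection matrices assumed to have been learned from normal traffic.
U_list = [np.array([[0.6], [0.8], [0.0]]), np.array([[0.0], [1.0]])]

normal = 2.0 * np.outer([0.6, 0.8, 0.0], [0.0, 1.0])   # lies in the learned subspace
abnormal = np.outer([-0.8, 0.6, 0.0], [0.0, 1.0])      # orthogonal source pattern

err_normal = reconstruction_error(normal, U_list)
err_abnormal = reconstruction_error(abnormal, U_list)
```

A tensor that fits the learned subspaces reconstructs almost exactly, while one with a new pattern leaves a large residual, which is what makes the error usable as an anomaly score.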

  32. Multiway LSI • Two groups are correctly identified: Databases (DB) and Data mining (DM) • People and concepts are drifting over time

  33. Conclusion • Tensor stream is a general data model • DTA/STA incrementally decompose tensors into core tensors and projection matrices • The result of DTA/STA can be used in other applications • Anomaly detection • Multiway LSI

  34. Final word: Think structurally! • The world is not flat, neither should data mining be. Contact: Jimeng Sun, jimeng@cs.cmu.edu
