
Incremental Pattern Discovery on Streams, Graphs and Tensors



Presentation Transcript


  1. Ph.D. Thesis Proposal Incremental Pattern Discovery on Streams, Graphs and Tensors Jimeng Sun May 15, 2006

  2. Thesis Committee • Christos Faloutsos (Chair) • Tom Mitchell • Hui Zhang • David Steier, PricewaterhouseCoopers • Philip Yu, IBM Watson Research Center

  3. Thesis Proposal • Goal: incremental pattern discovery on streaming applications • Streams: • E1: Environmental sensor networks • E2: Cluster/data center monitoring • Graphs: • E3: Social network analysis • Tensors: • E4: Network forensics • E5: Financial auditing • E6: fMRI: Brain image analysis • How to summarize streaming data efficiently and incrementally?

  4. E1: Environmental Sensor Monitoring • Water distribution network monitored with chlorine-concentration sensors, in collaboration with Prof. Jeanne M. VanBriesen (CMU Civil Engineering) • [Figure: chlorine concentrations over Phases 1–3 under normal operation, for sensors near the leak and sensors away from the leak] • May have hundreds of measurements, and they are often related!

  5. E1: Environmental Sensor Monitoring • Same setting as above, now with a major leak • [Figure: chlorine concentrations over Phases 1–3; during the major leak, sensors near the leak diverge from sensors away from the leak] • May have hundreds of measurements, and they are often related!

  6. E1: Environmental Sensor Monitoring • SPIRIT summarizes the n actual measurement streams with k hidden variables (here k = 1–2) • [Figure: the chlorine-concentration streams over Phases 1–3, and the k hidden variables recovered by SPIRIT] • We would like to discover a few “hidden (latent) variables” that summarize the key trends

  7. E3: Social network analysis • Traditionally, people focus on static networks and find community structures • We plan to monitor the change of the community structure over time and identify abnormal individuals

  8. E4: Network forensics • Directional network flows (source × destination) • A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004] • 450 GB/hour with compression • Task: identify abnormal traffic patterns and find their cause • [Figure: source × destination traffic matrices for normal vs. abnormal traffic] • Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

  9. Commonality of all • Data: continuously arriving • Large volume • Multi-dimensional • Unlabeled • Task: incremental pattern discovery • Main trends • Anomalies

  10. Thesis statement • Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.

  11. Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion

  12. Static Data model • Tensor • Formally, an Mth-order tensor X ∈ R^(N1×N2×…×NM) • Generalization of matrices • Represented as a multi-array / data cube.
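
For concreteness, a tiny numpy sketch (sizes are illustrative only) of a 3rd-order tensor held as a multi-array:

```python
import numpy as np

# A 3rd-order tensor: source x destination x port (illustrative sizes only)
X = np.zeros((500, 500, 100))
X[3, 42, 80] += 1.0   # record one flow from source 3 to destination 42 on port 80
```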

  13. Dynamic Data model (our focus) • Tensor Streams • A sequence of Mth-order tensors X1, …, Xn, where n is increasing over time • [Figure: author × keyword tensors arriving one per timestamp]

  14. Our framework for incremental pattern discovery • Mining flow: Data Streams → Preprocessing → Tensor Streams → Tensor Analysis → Projections and Core tensors → Application Modules • Application Modules: Anomaly Detection, Clustering, Prediction

  15. Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion

  16. Related work • [Figure: landscape of related work, and where our work fits in it]

  17. Background – Singular value decomposition (SVD) • SVD: A ≈ U Σ VT, where A is m×n, U is m×k, Σ is k×k, and V is n×k; Y = A V is the projection of the data • Best rank-k approximation in L2 • PCA is an important application of SVD • Note that U and V are dense and may have negative entries
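
A minimal numpy sketch of the rank-k truncation above; the matrix, the rank, and the helper name best_rank_k are illustrative assumptions:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the L2 (Frobenius) sense via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]       # keep only the top-k singular triplets

A = np.random.rand(100, 50)
U, s, Vt = best_rank_k(A, k=5)
A_approx = U @ np.diag(s) @ Vt              # dense and possibly negative, as noted above
Y = U @ np.diag(s)                          # low-dimensional representation of the rows of A
```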

  18. Background – Latent semantic indexing (LSI) • Singular vectors are useful for clustering • [Figure: document-term matrix ≈ document-concept × concept-association × concept-term, with terms such as cluster, cache, frequent, query, pattern grouped into DB and DM concepts]

  19. Background: Tensor Operations • Matricizing: unfold a tensor into a matrix • Example: a source × destination × port tensor is unfolded along the source mode into a source × (destination*port) matrix
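
A small sketch of matricizing in numpy; the unfolding convention (which mode becomes the rows, how the remaining modes are ordered) is one common choice, not necessarily the exact one used in the thesis:

```python
import numpy as np

def unfold(X, mode):
    """Matricize: make `mode` the rows and flatten all remaining modes into the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.random.rand(4, 3, 2)   # source x destination x port (toy sizes)
X0 = unfold(X, 0)             # 4 x 6 matrix: source x (destination*port)
```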

  20. Background: Tensor Operations • Mode-product: multiply a tensor with a matrix along one mode • Example: multiplying a source × destination × port tensor along the source mode with a matrix that maps sources to “groups” yields a group × destination × port tensor
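
A sketch of the mode-product in the same spirit; the matrix U below plays the role of the source-to-“group” mapping in the slide, and the shapes are made up:

```python
import numpy as np

def mode_product(X, U, mode):
    """Mode product: contract dimension `mode` of X with the columns of U,
    so that dimension is replaced by U.shape[0]."""
    Xm = np.moveaxis(X, mode, 0)                 # bring the chosen mode to the front
    Y = np.tensordot(U, Xm, axes=([1], [0]))     # (groups x N_mode) times (N_mode x ...)
    return np.moveaxis(Y, 0, mode)               # put the new dimension back in place

X = np.random.rand(4, 3, 2)     # source x destination x port
U = np.random.rand(2, 4)        # maps 4 sources into 2 "groups"
Y = mode_product(X, U, mode=0)  # 2 x 3 x 2: group x destination x port
```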

  21. Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion

  22. Methodology map • [Figure: our methods organized by data type and tensor order]

  23. Tensor analysis • Given a sequence of tensors X1, …, Xn, find the projection matrices U(1), …, U(M) such that the reconstruction error e is minimized: e = Σt || Xt − Xt ×1 (U(1)U(1)T) ×2 … ×M (U(M)U(M)T) ||² • Note that this is a generalization of PCA when n is a constant
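
The following numpy sketch evaluates that reconstruction error for a given set of projection matrices; the helper names and the exact form of the projection (multiplying by U(d)U(d)T along each mode) are my reading of the objective above:

```python
import numpy as np

def mode_product(X, U, mode):
    Xm = np.moveaxis(X, mode, 0)
    return np.moveaxis(np.tensordot(U, Xm, axes=([1], [0])), 0, mode)

def reconstruction_error(tensors, Us):
    """e = sum_t || X_t - X_t x_1 (U1 U1^T) ... x_M (UM UM^T) ||_F^2"""
    err = 0.0
    for X in tensors:
        Xhat = X
        for d, U in enumerate(Us):
            Xhat = mode_product(Xhat, U @ U.T, d)  # project onto span(U) along mode d
        err += np.linalg.norm(X - Xhat) ** 2
    return err

tensors = [np.random.rand(4, 3, 2) for _ in range(5)]
Us = [np.linalg.qr(np.random.rand(n, 2))[0] for n in (4, 3, 2)]
e = reconstruction_error(tensors, Us)
```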

  24. Why do we care? • Anomaly detection • Reconstruction error driven • Multiple resolution • Multiway latent semantic indexing (LSI) • [Figure: author × keyword × time example with authors such as Philip Yu and Michael Stonebraker, and pattern/query use cases over time]

  25. 1st order DTA – problem • Given x1 … xn where each xi ∈ R^N, find U ∈ R^(N×R) such that the error e is small • Note that Y = XU • [Figure: n timestamps of N sensor streams (indoor and outdoor sensors) projected onto R hidden variables]

  26. 1st order DTA • Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N×N) • Output: new projection matrix U ∈ R^(N×R) • Algorithm: 1. update the variance matrix: Cnew = xT x + C 2. diagonalize: Cnew = U Λ UT 3. determine the rank R and return U • Diagonalization has to be done for every new x!
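
A minimal sketch of this update in numpy, with a fixed rank R instead of the rank-determination step in the slide; the function and variable names are mine:

```python
import numpy as np

def dta_1st_order(x, C, R):
    """One 1st-order DTA step: update the variance matrix with the new vector,
    re-diagonalize, and return the top-R eigenvectors as the projection matrix."""
    C_new = C + np.outer(x, x)                 # step 1: Cnew = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)   # step 2: diagonalize Cnew = U Lambda U^T
    order = np.argsort(eigvals)[::-1]
    U = eigvecs[:, order[:R]]                  # step 3 (simplified): keep the R strongest directions
    return U, C_new

N, R = 10, 3
C = np.zeros((N, N))
for x in np.random.rand(100, N):               # stream of data vectors
    U, C = dta_1st_order(x, C, R)
y = x @ U                                      # low-dimensional summary, Y = XU row by row
```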

  27. 1st order STA: SPIRIT • Adjust U smoothly when new data arrive, without diagonalization • For each new point x: project onto the current line, estimate the error, and rotate the line in the direction of the error, in proportion to its magnitude • For each new point x and for i = 1, …, k: • yi := UiT x (projection onto Ui) • di ← di + yi² (energy ≈ i-th eigenvalue) • ei := x − yi Ui (error) • Ui ← Ui + (1/di) yi ei (update estimate) • x ← x − yi Ui (repeat with the remainder) • [Figure: U rotating toward a new point in the Sensor 1 vs. Sensor 2 plane]
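
A sketch of the SPIRIT update loop in numpy; the forgetting factor lam and the initialization are additions of mine (the slide's update uses di ← di + yi² directly):

```python
import numpy as np

def spirit_update(x, U, d, lam=0.96):
    """One SPIRIT step over k directions: no diagonalization, just gradient-style rotation.
    U: N x k projection matrix, d: length-k energy estimates, lam: forgetting factor (assumed)."""
    x = x.copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                 # project onto the current direction
        d[i] = lam * d[i] + y * y       # energy estimate (~ i-th eigenvalue)
        e = x - y * U[:, i]             # error of that projection
        U[:, i] += (y / d[i]) * e       # rotate toward the error, proportionally to it
        x = x - y * U[:, i]             # pass the remainder on to the next direction
    return U, d

N, k = 10, 2
U = np.eye(N)[:, :k]                    # arbitrary orthonormal starting directions
d = np.full(k, 1e-3)                    # small initial energies
for x in np.random.rand(200, N):
    U, d = spirit_update(x, U, d)
```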

  28. Mth order DTA

  29. Mth order DTA – complexity • Storage: O(Π Ni), i.e., the size of an input tensor at a single timestamp • Computation: Σ Ni³ (or Σ Ni²) for diagonalizing the variance matrices C(d), plus Σ Ni · Π Ni for the matrix multiplications X(d)T X(d) • For low-order tensors (< 3), diagonalization is the main cost • For high-order tensors, matrix multiplication is the main cost
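
A compact numpy sketch of the Mth-order update implied by this cost breakdown (per-mode variance update, then diagonalization); the fixed ranks and all names are simplifications of mine:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def dta_step(X, Cs, ranks):
    """One Mth-order DTA step: for each mode d, fold the unfolding X(d) into the
    variance matrix C(d), diagonalize it, and keep the top R_d eigenvectors."""
    Us = []
    for d in range(X.ndim):
        Xd = unfold(X, d)                         # N_d x (product of the other N_i)
        Cs[d] = Cs[d] + Xd @ Xd.T                 # the matrix-multiplication cost
        eigvals, eigvecs = np.linalg.eigh(Cs[d])  # the diagonalization cost
        order = np.argsort(eigvals)[::-1]
        Us.append(eigvecs[:, order[:ranks[d]]])
    return Us, Cs

shape, ranks = (20, 15, 10), (3, 3, 2)
Cs = [np.zeros((n, n)) for n in shape]
for _ in range(5):                                # a short stream of 3rd-order tensors
    X = np.random.rand(*shape)
    Us, Cs = dta_step(X, Cs, ranks)
```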

  30. Streaming tensor analysis (STA) • Run SPIRIT along each mode • Complexity: • Storage: O(Π Ni) • Computation: Σ Ri Π Ni, which is smaller than DTA • [Figure: one SPIRIT update along mode 1, using the new data x, its projection y1, and the error e1 to update U1]

  31. Experiment • Goal • Computation efficiency • Accurate approximation • Real applications • Anomaly detection • Clustering

  32. Data set 1: Network data • TCP flows collected at the CMU backbone • Raw data: 500 GB with compression • Construct 2nd- or 3rd-order tensors over hourly windows with <source, destination, value> or <source, destination, port, value> • Each tensor: 500×500 or 500×500×100, biased-sampled from over 22k hosts • 1200 timestamps (hours) • Sparse data with a power-law distribution • [Figure: example hourly tensor, 10AM to 11AM on 01/06/2005]

  33. Data set 2: Bibliographic data (DBLP) • Papers from VLDB and KDD conferences • Construct 2nd-order tensors over yearly windows with <author, keyword, num> • Each tensor: 4584×3741 • 11 timestamps (years)

  34. Computational cost • OTA is the offline tensor analysis • Performance metric: CPU time (sec) • Observations: • DTA and STA are orders of magnitude faster than OTA • The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time) • [Figure: CPU time on the 3rd-order network tensor and the 2nd-order DBLP tensor]

  35. Accuracy comparison • Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20% • Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes • [Figure: accuracy on the 3rd-order network tensor and the 2nd-order DBLP tensor]

  36. Network anomaly detection • The reconstruction error gives an indication of anomalies • The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin) • [Figure: reconstruction error over time for normal vs. abnormal traffic]

  37. Multiway LSI • Two groups are correctly identified: Databases (DB) and Data Mining (DM) • People and concepts are drifting over time

  38. Quick summary of DTA/STA • Tensor stream is a general data model • DTA/STA incrementally decompose tensors into core tensors and projection matrices • The result of DTA/STA can be used in other applications • Anomaly detection • Multiway LSI • Incremental computation!

  39. Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion

  40. Methodology map • [Figure: our methods organized by data type and tensor order]

  41. Disadvantage of orthogonal projection on sparse data • Real data are often (very) sparse • Orthogonal projection does not preserve the sparsity in the data • it takes more space than the original data • it incurs a large computational cost

  42. Interpretability problem of orthogonal projection • Each column of the projection matrix Ui is a linear combination of all dimensions along a certain mode, e.g., Ui(:,1) = [0.5; -0.5; 0.5; 0.5] • All the data are projected onto the span of Ui • It is hard to interpret the projections

  43. Compact matrix decomposition (CMD) • Example-based projection: use actual rows and columns to specify the subspace • Given a matrix A ∈ R^(m×n), find three matrices C ∈ R^(m×c), U ∈ R^(c×r), R ∈ R^(r×n), such that ||A − CUR|| is small • [Figure: example-based vs. orthogonal projection; U is the pseudo-inverse of X, the intersection of C and R]

  44. CMD algorithm (high level) • [Figure: CMU campus from 4,000 feet, a metaphor for the bird's-eye view of the algorithm]

  45. CMD algorithm (high level) • Biased sampling with replacement of columns and rows from A • Remove duplicates with proper scaling • Construct U from C and R (pseudo-inverse of the intersection of C and R) • [Figure: a sparse 0/1 matrix A, its sampled columns Cd and rows Rd, and the deduplicated, rescaled C and R]

  46. CMD algorithm (low level) • [Figure: CMU campus from 3 feet, a metaphor for the detailed view of the algorithm]

  47. CMD algorithm (low level) • Remove duplicates with proper scaling: C′i = ui^(1/2) · Ci and R′i = vi · Ri, where ui and vi are the numbers of occurrences of Ci and Ri • Theorem: matrices C and Cd have the same singular values and left singular vectors • Proof: see [Sun06]
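
A rough numpy sketch of the CMD pipeline described on the last few slides: biased sampling, duplicate removal with the scaling above, and U as the pseudo-inverse of the intersection. The sampling probabilities and constants are simplified relative to [Sun06], and every name here is mine:

```python
import numpy as np

def sample_indices(norms_sq, k, rng):
    """Biased sampling with replacement, with probability proportional to squared norm."""
    p = norms_sq / norms_sq.sum()
    return rng.choice(len(p), size=k, replace=True, p=p)

def cmd(A, c, r, seed=0):
    """Compact Matrix Decomposition sketch: A ~= C U R using actual columns/rows of A."""
    rng = np.random.default_rng(seed)
    col_idx = sample_indices((A ** 2).sum(axis=0), c, rng)
    row_idx = sample_indices((A ** 2).sum(axis=1), r, rng)

    cols, col_cnt = np.unique(col_idx, return_counts=True)   # remove duplicate columns
    rows, row_cnt = np.unique(row_idx, return_counts=True)   # remove duplicate rows

    C = A[:, cols] * np.sqrt(col_cnt)        # C'_i = u_i^(1/2) * C_i
    R = A[rows, :] * row_cnt[:, None]        # R'_i = v_i * R_i
    W = A[np.ix_(rows, cols)]                # intersection of the sampled rows and columns
    U = np.linalg.pinv(W)                    # U = pseudo-inverse of the intersection
    return C, U, R

A = np.random.rand(200, 300)
C, U, R = cmd(A, c=40, r=40)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```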

  48. Experiment • Datasets • Performance metrics • Space ratio to the original data • CPU time (sec) • Accuracy = 1 – reconstruction error

  49. Space efficiency • CMD uses much smaller space to achieve the same accuracy • CUR limitation: duplicate columns and rows • SVD limitation: orthogonal projection densifies the data • [Figure: space vs. accuracy on the Network and DBLP datasets]

  50. Computational efficiency • CMD is the fastest among all three • CMD and CUR require SVD only on the sampled columns • CUR is much worse than CMD due to duplicate columns • SVD is the slowest since it operates on the entire data • [Figure: CPU time on the DBLP and Network datasets]
