
Sensor and Graph Mining

Sensor and Graph Mining. Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos. Joint work with. Anthony Brockwell (CMU/Stat) Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Chenxi Wang (CMU) Yang Wang (CMU). Outline. Introduction - motivation

Presentation Transcript


  1. Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos

  2. Joint work with • Anthony Brockwell (CMU/Stat) • Deepayan Chakrabarti (CMU) • Spiros Papadimitriou (CMU) • Chenxi Wang (CMU) • Yang Wang (CMU) C. Faloutsos

  3. Outline • Introduction - motivation • Problem #1: Stream Mining (Motivation; Main idea; Experimental results) • Problem #2: Graphs & Virus propagation • Conclusions

  4. Introduction • Sensor devices: temperature, weather measurements; road traffic data; geological observations; patient physiological data • Embedded devices: network routers; intelligent (active) disks

  5. Introduction • Limited resources: memory, bandwidth, power, CPU • Remote environments: no human intervention

  6. Introduction – problem definition • Given a semi-infinite stream of values (time series) x1, x2, …, xt, … • Find patterns, forecasts, outliers…

  7. Introduction • E.g., [Figure: example time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)”, and “Noise”??]

  8. Introduction [Figure: time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)”, and “Noise”??] • Can we capture these patterns • automatically • with limited resources?

  9. Related work – Statistics: Time series forecasting • Main problem: “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91] • Typically: • Resource intensive • Cannot update online • AR(I)MA and seasonal variants • ARFIMA, GARCH, …

  10. Related work – Databases: Continuous Queries • Typically, different focus: • “Compression” • Not generative models • Largely orthogonal problem… • Gilbert, Guha, Indyk et al. (STOC 2002) • Garofalakis, Gibbons (SIGMOD 2002) • Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003) • Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002) • Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002) • Madden+ [SIGMOD02], [SIGMOD03]

  11. Goals • Adapt and handle arbitrary periodic components • No human intervention/tuning Also: • Single pass over the data • Limited memory (logarithmic) • Constant-time update

  12. Outline • Introduction - motivation • Problem #1: Stream Mining (Motivation; Main idea; Experimental results) • Problem #2: Graphs & Virus propagation • Conclusions

  13. Wavelets – “Straight” signal [Figure: signal xt vs. time t, split into intervals I1 … I8]

  14. Wavelets – Introduction – Haar [Figure: Haar decomposition of xt on time/frequency axes, with coefficients W1,1 … W1,4; W2,1, W2,2; W3,1; V4,1]

  15. Wavelets • So? • Wavelets compress many real signals well… (image compression and processing; vision; astronomy, seismology, …) • Wavelet coefficients can be updated as new points arrive [Kotidis+]
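
The Haar decomposition named on the preceding slides can be sketched in a few lines. This is a minimal batch version under stated assumptions: the signal length is a power of two, the orthonormal 1/√2 scaling is used, and the sign convention (first half minus second half) is a choice made here, not taken from the slides. The function name `haar_decompose` is hypothetical. It does not show the incremental update that the slide attributes to [Kotidis+].

```python
import math


def haar_decompose(x):
    """Return (details, smooth) for a signal whose length is a power of two.

    details[l - 1] holds the level-l detail coefficients (W_{l,1}, W_{l,2}, ...),
    finest level first; smooth is the single remaining scaling coefficient.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    root2 = math.sqrt(2.0)
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail: scaled difference of neighbours (sign convention assumed).
        details.append([(a - b) / root2 for a, b in pairs])
        # Approximation: scaled sum, passed on to the next (coarser) level.
        approx = [(a + b) / root2 for a, b in pairs]
    return details, approx[0]


if __name__ == "__main__":
    details, smooth = haar_decompose([4, 2, 5, 5])
    print(details, smooth)
```

With the orthonormal scaling, the sum of squared coefficients equals the sum of squared signal values, which is one reason a few large coefficients can summarize a signal well.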

  16. Wavelets – Correlations [Figure: Haar coefficients of xt (W1,1 … W1,4; W2,1, W2,2; W3,1; V4,1) on time/frequency axes, including an “=” annotation]

  17. Wavelets – Correlations [Figure: Haar coefficients of xt (W1,1 … W1,4; W2,1, W2,2; W3,1; V4,1) on time/frequency axes]

  18. Main idea – Correlations • Wavelets are good… • …we can do even better • One number… • …and the fact that they are equal/correlated

  19. Proposed method • W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + … • W_{l’,t’} ≈ β_{l’,1} W_{l’,t’-1} + β_{l’,2} W_{l’,t’-2} + … • Small windows suffice… (k~4) [Figure: coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} and W_{l’,t’-2}, W_{l’,t’-1}, W_{l’,t’}]

  20. More details… • Update of wavelet coefficients (incremental) • Update of linear models (incremental; RLS) • Feature selection (single-pass) • Not all correlations are significant • Throw away the insignificant ones • very important!! [see paper]
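
The incremental linear-model update can be illustrated with textbook recursive least squares (RLS). The sketch below is not the authors' implementation: the class name `RLS`, the helper `fit_ar`, the initial scale of the matrix `P`, and the absence of a forgetting factor are all assumptions made for illustration, and the feature-selection step is not shown. It fits one model that predicts a value from its k predecessors in a single pass, at O(k²) cost per update.

```python
import math


class RLS:
    """Recursive least squares for y ~ beta . x with k regressors."""

    def __init__(self, k, init_scale=1e6):
        self.k = k
        self.beta = [0.0] * k
        # P tracks the inverse of the (lightly regularised) matrix X^T X.
        self.P = [[init_scale if i == j else 0.0 for j in range(k)]
                  for i in range(k)]

    def predict(self, x):
        return sum(b * xi for b, xi in zip(self.beta, x))

    def update(self, x, y):
        """Absorb one (regressors, target) pair in O(k^2) time."""
        k, P = self.k, self.P
        Px = [sum(P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # Rank-one downdate of P; P is symmetric, so x^T P equals (P x)^T.
        self.P = [[P[i][j] - gain[i] * Px[j] for j in range(k)]
                  for i in range(k)]


def fit_ar(series, k):
    """Single pass: regress each value on its k predecessors."""
    model = RLS(k)
    for t in range(k, len(series)):
        window = [series[t - j] for j in range(1, k + 1)]
        model.update(window, series[t])
    return model


if __name__ == "__main__":
    # A pure sinusoid with period 8 obeys an exact 2-term linear recurrence.
    wave = [math.sin(2 * math.pi * t / 8) for t in range(40)]
    print(fit_ar(wave, 2).beta)
```

Only `beta` and the k-by-k matrix `P` are stored per model, so memory does not grow with the number of points seen.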

  21. SKIP Complexity • Model update • Space: O(lg N + mk²) ≈ O(lg N) • Time: O(k²) ≈ O(1) • Where • N: number of points (so far) • k: number of regression coefficients; fixed • m: number of linear models; O(lg N) [see paper]

  22. Outline • Introduction - motivation • Problem #1: Stream Mining (Motivation; Main idea; Experimental results) • Problem #2: Graphs & Virus propagation • Conclusions

  23. Setup • First half used for model estimation • Models applied forward to forecast entire second half • AR, Seasonal AR (SAR): estimated in R • Simplest possible estimation – no maximum likelihood estimation (MLE), etc. • … vs. Python scripts

  24. Results – Synthetic data – Triangle pulse • Triangle pulse • AR captures wrong trend (or none) • Seasonal AR (SAR) estimation fails

  25. Results – Synthetic data – Mix • Mix (sine + square pulse) • AR captures wrong trend (or none) • Seasonal AR estimation fails

  26. Results – Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales (filtered)

  27. Results – Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AR fails to capture any trend (average) • Seasonal AR estimation fails

  28. Results – Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AWSOM spots periodicities, automatically

  29. Results – Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • Generation with identified noise

  30. Results – Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • AR captures wrong trend (average) • Seasonal ARIMA • Captures immediate wrong downward trend • Requires human to determine seasonal component period (fixed)

  31. Results – Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • Estimation: 40 minutes (R) vs. 9 seconds (Python)

  32. ~ 1 hour SKIP Variance • Variance (log-power) vs. scale: • “Noise” diagnostic (if decreasing linear…) • Can use to estimate noise parameters ~Hurst exponent C. Faloutsos

  33. Running time [Figure: time (t) vs. stream size (N)]

  34. Space requirements • Equal total number of model parameters

  35. Conclusion • Adapt and handle arbitrary periodic components • No human intervention/tuning • Single pass over the data • Limited memory (logarithmic) • Constant-time update

  36. Conclusion • No human: • Adapt and handle arbitrary periodic components • No human intervention/tuning • Limited resources: • Single pass over the data • Limited memory (logarithmic) • Constant-time update

  37. Outline • Introduction - motivation • Problem #1: Streams • Problem #2: Graphs & Virus propagation (Motivation & problem definition; Related work; Main idea; Experiments) • Conclusions

  38. Introduction ► Graphs are ubiquitous: • Protein Interactions [genomebiology.com] • Internet Map [lumeta.com] • Food Web [Martinez ’91] • Friendship Network [Moody ’01]

  39. Introduction • What can we do with graph analysis? • Immunization • Information dissemination • Network value of a customer [Domingos+] [Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], annotated “bridges”]

  40. Problem definition • Q1: How does a virus spread across an arbitrary network? • Q2: Will it create an epidemic? • (In a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)

  41. Framework • Susceptible-Infected-Susceptible (SIS) model • Cured nodes immediately become susceptible [Diagram: Susceptible/healthy → (infected by neighbor) → Infected & infectious → (cured internally) → Susceptible/healthy]

  42. The model • (virus) Birth rate β: probability that an infected neighbor attacks • (virus) Death rate δ: probability that an infected node heals [Figure: nodes N, N1, N2, N3 labeled Healthy/Infected; edges labeled Prob. β (attack) and Prob. δ (heal)]
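
A minimal discrete-time simulation of the SIS model above might look as follows. Several details are assumptions made for this sketch and are not stated on the slides: time advances in synchronous steps, each infected node attacks each healthy neighbor independently with probability β per step, and a node that heals in a step cannot be re-infected until the next step. The function name `simulate_sis` and the adjacency-dict graph format are hypothetical.

```python
import random


def simulate_sis(adj, beta, delta, infected, steps, seed=0):
    """Return the number of infected nodes after each step (index 0 = start).

    adj maps each node to a list of its neighbours (undirected graph).
    """
    rng = random.Random(seed)
    infected = set(infected)
    history = [len(infected)]
    for _ in range(steps):
        nxt = set()
        for node in infected:
            # The node heals with prob. delta; otherwise it stays infected.
            if rng.random() >= delta:
                nxt.add(node)
            # It attacks each currently healthy neighbour with prob. beta.
            for nb in adj[node]:
                if nb not in infected and rng.random() < beta:
                    nxt.add(nb)
        infected = nxt
        history.append(len(infected))
    return history


if __name__ == "__main__":
    ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
    print(simulate_sis(ring, beta=0.3, delta=0.2, infected=[0], steps=30))
```

Running it for several β/δ ratios on the same graph gives an empirical feel for whether an infection dies out or persists.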

  43. Epidemic threshold τ • Defined as the value of τ such that, if β/δ < τ, an epidemic cannot happen • Thus: • given a graph • compute its epidemic threshold

  44. Epidemic threshold τ • What should τ depend on? • avg. degree? and/or highest degree? • and/or variance of degree? • and/or determinant of the adjacency matrix?

  45. Basic Homogeneous Model • Homogeneous graphs [Kephart-White ’91, ’93] • Epidemic threshold = 1/<k> • Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)

  46. Power-law Networks • Model for Barabási-Albert networks • [Pastor-Satorras & Vespignani, ’01, ’02] • Epidemic threshold = <k> / <k²> • for BA type networks, with only γ = 3 (γ = slope of the power law, i.e., its exponent)

  47. Epidemic threshold • Homogeneous graphs: 1/<k> • BA (γ = 3): <k> / <k²> • more complicated graphs? • arbitrary, REAL graphs? • how many parameters??
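
The two degree-based thresholds listed above depend only on the first and second moments of the degree sequence, so they are easy to compute. The sketch below assumes an undirected graph stored as an adjacency dict; the function names are hypothetical. On a graph where every node has the same degree k, both formulas give 1/k.

```python
def degree_moments(adj):
    """Return (<k>, <k^2>): mean degree and mean squared degree."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    n = len(degrees)
    return sum(degrees) / n, sum(d * d for d in degrees) / n


def homogeneous_threshold(adj):
    """1/<k>, the homogeneous-graph threshold."""
    k1, _ = degree_moments(adj)
    return 1.0 / k1


def power_law_threshold(adj):
    """<k>/<k^2>, the threshold quoted for BA-type networks."""
    k1, k2 = degree_moments(adj)
    return k1 / k2


if __name__ == "__main__":
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    for name, g in (("ring", ring), ("star", star)):
        print(name, homogeneous_threshold(g), power_law_threshold(g))
```

The star graph shows how a single high-degree hub pulls the second formula below the first, since the hub inflates <k²> much more than <k>.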

  48. Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ_{1,A}

  49. Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ_{1,A}, where β = attack prob., δ = recovery prob., τ = epidemic threshold, and λ_{1,A} = largest eigenvalue of the adjacency matrix A • Proof: [Wang+03]
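
To make the theorem concrete, the largest adjacency eigenvalue of a small graph can be estimated by power iteration. This is a sketch under assumptions: the graph is non-empty, undirected (symmetric adjacency lists), and has at least one edge; the iteration multiplies by A + I rather than A, a choice made here so that bipartite graphs (whose spectrum contains both λ and -λ) do not cause oscillation. All function names are hypothetical, and a numerical library would be the usual choice for large graphs.

```python
import math


def largest_eigenvalue(adj, iters=1000, tol=1e-12):
    """Estimate the largest eigenvalue of an undirected graph's adjacency matrix."""
    nodes = list(adj)
    v = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(iters):
        # One multiplication by (A + I), followed by normalisation.
        w = {u: v[u] + sum(v[nb] for nb in adj[u]) for u in nodes}
        norm = math.sqrt(sum(x * x for x in w.values()))
        w = {u: x / norm for u, x in w.items()}
        # Rayleigh quotient of A itself for the current unit vector.
        new_lam = sum(w[u] * sum(w[nb] for nb in adj[u]) for u in nodes)
        converged = abs(new_lam - lam) < tol
        lam, v = new_lam, w
        if converged:
            break
    return lam


def epidemic_threshold(adj):
    """tau = 1 / lambda_{1,A}."""
    return 1.0 / largest_eigenvalue(adj)


def below_threshold(adj, beta, delta):
    """True when beta/delta < tau, the theorem's no-epidemic condition."""
    return beta / delta < epidemic_threshold(adj)


if __name__ == "__main__":
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(largest_eigenvalue(star), epidemic_threshold(star))
    print(below_threshold(star, beta=0.1, delta=0.5))
```

A star with four leaves has largest eigenvalue 2 (the square root of the number of leaves), so its threshold is 1/2.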

  50. Epidemic threshold for various networks • Sanity checks / older results: • Homogeneous networks • λ_{1,A} = <k>; τ = 1/<k> • where <k> = average degree • This is the same result as that of Kephart & White!
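
One way to see why the homogeneous case reduces to 1/<k>: for any undirected graph, the largest adjacency eigenvalue lies between the average degree and the maximum degree. When all nodes have the same degree, the two bounds coincide, so λ_{1,A} = <k>. The sketch below only computes these bounds; the function names and the adjacency-dict format are assumptions for illustration.

```python
def ring(n):
    """Cycle on n nodes: every node has degree 2."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def lambda1_bounds(adj):
    """Return (average degree, maximum degree), which bracket lambda_{1,A}."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    return sum(degrees) / len(degrees), max(degrees)


if __name__ == "__main__":
    low, high = lambda1_bounds(ring(6))
    # Equal bounds pin the eigenvalue down exactly for a regular graph.
    print(low, high)
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(lambda1_bounds(star))
```

For a non-regular graph such as the star, the bounds differ, and the degree-based formula 1/<k> no longer matches 1/λ_{1,A}.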
