Explore stream mining and graph analysis using sensor data with limited resources, offering automated pattern detection and forecasting. Discover innovative techniques for detecting periodicity and correlations in time series data. Benefit from advanced wavelet methods for efficient data compression and predictive modeling.
Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos C. Faloutsos
Joint work with • Anthony Brockwell (CMU/Stat) • Deepayan Chakrabarti (CMU) • Spiros Papadimitriou (CMU) • Chenxi Wang (CMU) • Yang Wang (CMU)
Outline • Introduction - motivation • Problem #1: Stream Mining • Motivation • Main idea • Experimental results • Problem #2: Graphs & Virus propagation • Conclusions
Introduction • Sensor devices • Temperature, weather measurements • Road traffic data • Geological observations • Patient physiological data • Embedded devices • Network routers • Intelligent (active) disks
Introduction • Limited resources • Memory • Bandwidth • Power • CPU • Remote environments • No human intervention
Introduction – problem dfn • Given a semi-infinite stream of values (time series) x1, x2, …, xt, … • Find patterns, forecasts, outliers…
Introduction • E.g., [Figure: example time series, annotated: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]
Introduction • Can we capture these patterns • automatically • with limited resources? [Figure: time series, annotated: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]
Related work – Statistics: Time series forecasting • Main problem: “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91] • Typically: • Resource intensive • Cannot update online • AR(I)MA and seasonal variants • ARFIMA, GARCH, …
Related work – Databases: Continuous Queries • Typically, different focus: • “Compression” • Not generative models • Largely orthogonal problem… • Gilbert, Guha, Indyk et al. (STOC 2002) • Garofalakis, Gibbons (SIGMOD 2002) • Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003) • Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002) • Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002) • Madden+ [SIGMOD02], [SIGMOD03]
Goals • Adapt and handle arbitrary periodic components • No human intervention/tuning Also: • Single pass over the data • Limited memory (logarithmic) • Constant-time update
Wavelets: “Straight” signal [Figure: signal xt plotted against time t, split into intervals I1 … I8]
Wavelets: Introduction – Haar [Figure: Haar decomposition of xt into coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1; axes: time, frequency]
Wavelets • So? • Wavelets compress many real signals well… • Image compression and processing • Vision; Astronomy, seismology, … • Wavelet coefficients can be updated as new points arrive [Kotidis+]
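To make the Haar decomposition concrete, here is a minimal batch sketch in Python. It is an illustration only, not the incremental update cited from [Kotidis+]: the function name `haar_decompose` and the orthonormal scaling by √2 are assumptions made for this example.

```python
import math

def haar_decompose(signal):
    """Batch Haar transform of a signal whose length is a power of two.

    Returns (details, smooth): details[l-1] lists the level-l detail
    coefficients W_{l,1}, W_{l,2}, ...; smooth is the single coarsest
    scaling (average) coefficient.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(signal)
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        # Detail = scaled difference of each pair; smooth = scaled sum.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        current = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, current[0]

# A step signal: every fine-scale detail is zero, so only the coarsest
# detail and the smooth coefficient are needed to describe it.
signal = [4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0]
details, smooth = haar_decompose(signal)
for level, coeffs in enumerate(details, start=1):
    print(f"level {level}: {coeffs}")
print("smooth:", smooth)
```

With this scaling the transform preserves energy, which is one way to see why dropping near-zero coefficients compresses a signal with little loss.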
Wavelets: Correlations [Figure: Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1 of xt, with equal (“=”) coefficients marked; axes: time, frequency]
Main idea: Correlations • Wavelets are good… • …we can do even better • One number… • …and the fact that they are equal/correlated
Proposed method • Wl,t ≈ βl,1 Wl,t-1 + βl,2 Wl,t-2 + … • Wl’,t’ ≈ βl’,1 Wl’,t’-1 + βl’,2 Wl’,t’-2 + … • Small windows suffice… (k~4) [Figure: coefficients Wl,t-2, Wl,t-1, Wl,t and Wl’,t’-2, Wl’,t’-1, Wl’,t’ on the time/frequency grid]
More details… • Update of wavelet coefficients (incremental) • Update of linear models (incremental; RLS) • Feature selection (single-pass) • Not all correlations are significant • Throw away the insignificant ones • very important!! [see paper]
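The incremental update of a linear model via RLS (recursive least squares) can be sketched as follows. This is a generic textbook RLS, not the authors' code: the class name `RecursiveLeastSquares`, the initial value `delta`, and the sinusoidal test series are assumptions for illustration. In the setting above, the inputs would be the previous k wavelet coefficients at one level.

```python
import math
import numpy as np

class RecursiveLeastSquares:
    """Online least squares for y ~ b . x, updated one sample at a time."""

    def __init__(self, k, delta=1e4):
        self.b = np.zeros(k)        # current regression coefficients
        self.P = np.eye(k) * delta  # running estimate of inverse(X^T X)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (1.0 + x @ Px)
        self.b += gain * (y - x @ self.b)  # correct by the prediction error
        self.P -= np.outer(gain, Px)       # rank-one downdate, O(k^2) work
        return self.b

# A pure sinusoid with period 8 satisfies
#   w_t = 2*cos(omega) * w_{t-1} - w_{t-2},
# so a window of k = 2 past values should recover those two coefficients.
omega = 2 * math.pi / 8
series = [math.sin(omega * t) for t in range(60)]
model = RecursiveLeastSquares(k=2)
for t in range(2, len(series)):
    model.update([series[t - 1], series[t - 2]], series[t])
print(model.b)  # close to [2*cos(omega), -1]
```

Each update touches only the k-by-k matrix `P`, so its cost does not grow with the length of the stream.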
SKIP Complexity • Model update • Space: O(lg N + mk²) ≈ O(lg N) • Time: O(k²) ≈ O(1) Where • N: number of points (so far) • k: number of regression coefficients; fixed • m: number of linear models; O(lg N) [see paper]
Setup • First half used for model estimation • Models applied forward to forecast entire second half • AR, Seasonal AR (SAR): R • Simplest possible estimation – no maximum likelihood estimation (MLE), etc. • … vs. Python scripts
Results: Synthetic data – Triangle pulse • Triangle pulse • AR captures wrong trend (or none) • Seasonal AR (SAR) estimation fails
Results: Synthetic data – Mix • Mix (sine + square pulse) • AR captures wrong trend (or none) • Seasonal AR estimation fails
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales (filtered)
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AR fails to capture any trend (average) • Seasonal AR estimation fails
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AWSOM spots periodicities, automatically
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • Generation with identified noise
Results: Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • AR captures wrong trend (average) • Seasonal ARIMA • Captures immediate wrong downward trend • Requires human to determine seasonal component period (fixed)
Results: Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • Estimation: 40 minutes (R) vs. 9 seconds (Python)
~ 1 hour SKIP Variance • Variance (log-power) vs. scale: • “Noise” diagnostic (if decreasing linear…) • Can use to estimate noise parameters ~Hurst exponent
Running time [Figure: time (t) vs. stream size (N)]
Space requirements • Equal total number of model parameters
Conclusion • No human: • Adapt and handle arbitrary periodic components • No human intervention/tuning • Limited resources: • Single pass over the data • Limited memory (logarithmic) • Constant-time update
Outline • Introduction - motivation • Problem #1: Streams • Problem #2: Graphs & Virus propagation • Motivation & problem definition • Related work • Main idea • Experiments • Conclusions
Introduction ► Graphs are ubiquitous • Protein Interactions [genomebiology.com] • Internet Map [lumeta.com] • Food Web [Martinez ’91] • Friendship Network [Moody ’01]
Introduction • What can we do with graph analysis? • Immunization • Information Dissemination • network value of a customer [Domingos+] [Figure: “Needle exchange” networks of drug users, with “bridges” marked [Weeks et al. 2002]]
Problem definition • Q1: How does a virus spread across an arbitrary network? • Q2: Will it create an epidemic? • (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)
Framework • Susceptible-Infected-Susceptible (SIS) model • Cured nodes immediately become susceptible [Figure: state diagram: Susceptible/healthy → (infected by neighbor) → Infected & infectious → (cured internally) → Susceptible/healthy]
The model • (virus) Birth rate β: probability that an infected neighbor attacks • (virus) Death rate δ: probability that an infected node heals [Figure: healthy node N with neighbors N1, N2, N3; infected neighbors each attack with prob. β; an infected node heals with prob. δ]
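The SIS model with birth rate β and death rate δ can be simulated in a few lines. The sketch below is one possible discrete-time reading of the slide, not the authors' simulator: the function names `sis_step` and `simulate`, the adjacency-list format, and the rule that a node healed in a step is not re-infected within that same step are all assumptions.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS time step; returns the new set of infected nodes."""
    next_infected = set()
    for node, neighbors in adj.items():
        if node in infected:
            # An infected node heals with probability delta.
            if rng.random() >= delta:
                next_infected.add(node)
        else:
            # Each infected neighbor attacks independently with prob. beta.
            for nb in neighbors:
                if nb in infected and rng.random() < beta:
                    next_infected.add(node)
                    break
    return next_infected

def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Run the model and return the number of infected nodes per step."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history

# Hypothetical 4-node path graph 0 - 1 - 2 - 3, infection seeded at node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(simulate(path, {0}, beta=0.3, delta=0.2, steps=20))
```

Sweeping β/δ in such a simulation is a simple way to observe whether the infection dies out or persists on a given graph.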
Epidemic threshold τ • Defined as the value of τ such that, if β/δ < τ, an epidemic cannot happen. Thus, • given a graph • compute its epidemic threshold
Epidemic threshold τ • What should τ depend on? • avg. degree? and/or highest degree? • and/or variance of degree? • and/or determinant of the adjacency matrix?
Basic Homogeneous Model • Homogeneous graphs [Kephart-White ’91, ’93] • Epidemic threshold = 1/<k> • Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)
Power-law Networks • Model for Barabási-Albert networks • [Pastor-Satorras & Vespignani, ’01, ’02] • Epidemic threshold = <k> / <k²> • for BA type networks, only with γ = 3 (γ = power-law exponent)
Epidemic threshold • Homogeneous graphs: 1/<k> • BA (γ = 3): <k> / <k²> • more complicated graphs? • arbitrary, REAL graphs? • how many parameters??
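Both degree-based thresholds listed above depend only on the first two moments of the degree sequence. A small sketch follows; the helper name `degree_thresholds` and the example degree sequences are assumptions chosen for illustration.

```python
def degree_thresholds(degrees):
    """Return (1/<k>, <k>/<k^2>) for a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                   # <k>
    mean_k2 = sum(d * d for d in degrees) / n   # <k^2>
    return 1.0 / mean_k, mean_k / mean_k2

# When every node has the same degree, the two formulas coincide.
regular = degree_thresholds([3, 3, 3, 3])

# A star (one hub with four leaves): the hub inflates <k^2>,
# so the second formula gives a lower threshold than the first.
star = degree_thresholds([4, 1, 1, 1, 1])

print(regular, star)
```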
Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ1,A
Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ1,A • β: attack prob. • δ: recovery prob. • τ: epidemic threshold • λ1,A: largest eigenvalue of adj. matrix A • Proof: [Wang+03]
Epidemic threshold for various networks • sanity checks / older results: • Homogeneous networks • λ1,A = <k>; τ = 1/<k> • where <k> = average degree • This is the same result as that of Kephart & White!
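The threshold τ = 1/λ1,A is easy to compute for a small undirected graph. The sketch below uses NumPy's symmetric eigenvalue routine; the function name `epidemic_threshold` and the two example graphs are assumptions made for this illustration.

```python
import numpy as np

def epidemic_threshold(adjacency):
    """Return 1 / (largest eigenvalue) of a symmetric adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    if not np.allclose(A, A.T):
        raise ValueError("expected an undirected (symmetric) graph")
    lambda_1 = np.linalg.eigvalsh(A)[-1]  # eigvalsh sorts in ascending order
    return 1.0 / lambda_1

# 4-node cycle: every node has degree 2, so lambda_1 = <k> = 2.
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]

# Star with one hub (node 0) and four leaves: average degree is 1.6,
# but lambda_1 = sqrt(4) = 2.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]

tau_cycle = epidemic_threshold(cycle)
tau_star = epidemic_threshold(star)
print(tau_cycle, tau_star)
```

On the cycle the result agrees with 1/<k>, matching the sanity check above. On the star it differs from 1/<k> = 0.625, which illustrates why average degree alone does not determine the threshold for non-homogeneous graphs.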