Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics

This research investigates the problem of finding missing infected nodes in large-scale graph epidemics using an approach based on the minimum description length (MDL) principle. The study focuses on identifying patient zeros and correcting missing data, with applications in epidemiology and public health surveillance.


Presentation Transcript


  1. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics • Shashidhar Sundareisan (Virginia Tech), Jilles Vreeken (Max Planck Institute), B. Aditya Prakash (Virginia Tech) • SDM, Vancouver, May 1, 2015

  2. Contagions • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • Localized effects: riots… Sundareisan, Vreeken, Prakash 2015

  3. Virus Propagation • Susceptible-Infected (SI) Model [AJPH 2007] with infectivity β • CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts • Diseases over contact networks
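A minimal sketch of the SI process named on slide 3, assuming a discrete-time variant in which every infected node independently infects each susceptible neighbor with probability β per step and never recovers. The function name `simulate_si` and the adjacency-dict graph representation are illustrative choices, not taken from the slides.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI cascade and return the final infected set.

    adj   : dict mapping each node to the set of its neighbors
    seeds : iterable of initially infected nodes
    beta  : per-step, per-edge infection probability (assumed model)
    steps : number of time steps to simulate
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Infected nodes stay infected; only susceptible ones can flip.
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        infected |= newly_infected
    return infected


# Path graph 0 - 1 - 2 - 3 with the epidemic started at node 0.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(simulate_si(path, {0}, beta=0.5, steps=3))
```

With β = 1 the infection advances exactly one hop per step, which is a convenient sanity check for the loop.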

  4. Motivation: Culprits • Patient zeros • Who started the epidemic? • Rumors • Who started the rumor?

  5. But: Real data is noisy! We don’t know exactly who is infected • Epidemiology • Public-health surveillance • Figure: Surveillance Pyramid [Nishiura+, PLoS ONE 2011], with labels CDC, Lab, Hospital, CNN headlines • Each level has a certain probability of missing some truly infected people

  6. Real data is noisy! Correcting missing data is by itself very important • Social Media • Twitter: because of uniform sampling [Morstatter+, ICWSM 2013], the relevant ‘infected’ tweets may be missed • Figure: sampling the full set of tweets leaves some tweets missing

  7. Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion

  8. The Problem • GIVEN: • Graph G(V, E) from historical data • Infected set D ⊆ V, sampled (p%) and incomplete • Infectivity β of the virus (assumed to follow the SI model) • FIND: • Seed set S, i.e. patient zeros/culprits • Set C- (the missing infected nodes) • Ripple R (the order of infections)
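To make the input concrete, here is a small sketch of how an incomplete infected set D could arise from a true infected set. It assumes each truly infected node is observed independently with probability p; the slides only say the set is "sampled (p%)", so the independence assumption and the name `sample_infected` are hypothetical.

```python
import random


def sample_infected(true_infected, p, rng=None):
    """Split the true infected set into observed nodes D and missed nodes C-.

    Each node is kept with probability p (assumed independent sampling),
    so it is missed with probability 1 - p.
    """
    rng = rng or random.Random(0)
    observed, missing = set(), set()
    # Sort for a reproducible draw order regardless of set hashing.
    for node in sorted(true_infected):
        if rng.random() < p:
            observed.add(node)
        else:
            missing.add(node)
    return observed, missing


observed, missing = sample_infected({0, 1, 2, 3, 4, 5}, p=0.8)
print("D =", observed, "C- =", missing)
```

The two returned sets always partition the true infected set, which matches the slides' later use of D ∪ C- as the complete infected set.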

  9. Related Work – Culprits (Partial) • Shah and Zaman, IEEE TIT, 2011 • One seed • Provably finds MLE seed for d-regular trees • SI process • Lappas et al., KDD, 2010 • k seeds (takes k as input) • Infected graph assumed to be in steady state • IC model • Prakash et al., ICDM, 2012 (NetSleuth) • Finds number of seeds automatically • Assumes no mistakes in infected set D

  10. Related Work – Missing Nodes (Partial) • Costenbader and Valente 2003; Kossinets 2006; Borgatti et al. 2006 • Study the effect of sampling on macro-level network statistics • Adiga et al. 2013 • Sensitivity of total infections to noise in network structure • Sadikov et al., WSDM, 2011 • Correct for sampling for macro-level cascade statistics

  11. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion

  12. MDL: Minimum Description Length Principle • Occam’s Razor • Simplest model is the best model • “Induction by Compression” • Related to Bayesian approaches • MDL cost in bits = Model cost + Data cost • Best model = least cost in bits • Figure: a sender transmits model + data to a receiver over a channel

  13. MDL Encoding For Our Problem • The model: Seeds (S), Ripple (R) • Data given model: Missing nodes (C-) • Sender knows: Graph G(V, E), Infectivity (β), Sampling (p), Seeds (S), Infected set (D ∪ C-), Ripple (R), Missing nodes (C-) • Receiver knows: Graph G(V, E), Infectivity (β), Sampling (p)

  14. Model (S, R) Cost • Scoring the seed set (S): encoding the integer |S|, plus the number of possible |S|-sized sets • Scoring the ripple?
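The seed-set formula itself is not preserved in the transcript. The sketch below shows one way the two named ingredients could be turned into bits, assuming Rissanen's universal code for positive integers for |S| and a uniform code over all |S|-sized subsets of V. Both function names are illustrative, and the exact form used in the paper may differ.

```python
import math


def universal_integer_bits(z):
    """Rissanen's universal code length for an integer z >= 1, in bits.

    L_N(z) = log2(c0) + log2(z) + log2(log2(z)) + ...  (positive terms only),
    with c0 ~ 2.865064.
    """
    if z < 1:
        raise ValueError("z must be a positive integer")
    bits = math.log2(2.865064)
    term = math.log2(z)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits


def seed_set_bits(num_nodes, num_seeds):
    """Bits to say how many seeds there are, then which subset of V they form."""
    choose_bits = math.log2(math.comb(num_nodes, num_seeds))
    return universal_integer_bits(num_seeds) + choose_bits


# Two seeds in a 1000-node graph cost more bits than one seed.
print(seed_set_bits(1000, 1), seed_set_bits(1000, 2))
```

Because the cost grows with |S|, a score of this shape penalizes explaining the infections with many seeds, which is how an MDL formulation can pick the number of seeds automatically.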

  15. Model (S, R) Cost • Scoring a ripple (R) • Figure panels: Original Graph, Infected Snapshot, Ripple R1, Ripple R2

  16. Model (S, R) Cost • Ripple cost for a ripple R: how long the ripple is, plus how the ‘frontier’ advances

  17. Cost of the data (C-) • We have to transmit the missed nodes C- (green nodes) • So that the receiver can recover D • Detail: γ = 1 - p, i.e. the probability that a node is truly missing

  18. Total MDL Cost • Finally: total cost = model cost for (S, R) + data cost for C- • Our problem is now: find S, R, C- that minimize it

  19. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion

  20. Our Approach: Decoupling • The two problems are • Finding seeds/ripple (S, R) • Finding missing nodes (C-) • Can we decouple them?

  21. Decoupling the problems (contd.) • Finding seeds depends on missing nodes • Figure panels: NetSleuth with no missing nodes as input; NetSleuth with the correct missing nodes filled in as input • Legend: missing nodes, seed, infected node

  22. Decoupling the problems (contd.) • Finding missing nodes also depends on seeds • Figure: seed S and nodes A and B, labeled infected / not infected; most probably A was missed

  23. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion

  24. Finding culprits (S) given missing nodes (C-) • Suppose an oracle gives us the missing nodes (C-) • We then have the complete infected set (D ∪ C-) • Apply NetSleuth directly [Prakash et al., ICDM 2012] • NO SAMPLING INVOLVED • This gives us the seed set • Figure: applying NetSleuth on the oracle’s answer • Legend: missing nodes, seed, infected node

  25. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion

  26. Missing Nodes (C-) given (S) • Oracle gives us S, find C- • Naïve approach? • Enumerate all possible C- • Pick the best one according to MDL • Infeasible: exponentially many candidate sets

  27. Our Approach • Sub-problem 1: |Seeds| = 1 • |Missing nodes| = 1 • Sub-problem 2: Finding the right number of missing nodes • Sub-problem 3: |Seeds| > 1

  28. Sub-Problem 1: Best hidden hazard given one seed • The best node is the one that makes the seed s more likely • We use empirical risk as the measure • Sanity check: ideally the risk should be 0 • So the best hidden hazard is the node whose addition gives the lowest risk

  29. Sub-Problem 1: Best Hidden Hazard • Using some results in Prakash et al. 2012 (see details in paper), we can rewrite it in terms of u1, the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D

  30. Detour: Laplacian Submatrix • Laplacian = Deg(G) - A(G), i.e. degree matrix minus adjacency matrix • L_D = take only the rows and columns for nodes in D (Laplacian submatrix) • u1 = eigenvector of the smallest eigenvalue λ of L_D
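A small pure-Python sketch of this detour, assuming the submatrix keeps both the rows and the columns of the infected nodes, with degrees counted in the full graph. The helper names are illustrative, and the shifted power iteration is just one simple way to get the bottom eigenvector; a production implementation would more likely use a sparse eigensolver.

```python
import math


def laplacian_submatrix(adj, infected):
    """Return (node order, L_D), where L_D restricts Deg - A to infected nodes."""
    nodes = sorted(infected)
    index = {v: i for i, v in enumerate(nodes)}
    sub = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        i = index[v]
        sub[i][i] = float(len(adj[v]))  # degree in the full graph
        for w in adj[v]:
            if w in index:
                sub[i][index[w]] = -1.0  # edge between two infected nodes
    return nodes, sub


def smallest_eigenvector(matrix, iterations=500):
    """Power iteration on (c*I - M): its top eigenvector is M's bottom one."""
    n = len(matrix)
    # c exceeds every eigenvalue of M (Gershgorin bound plus one).
    c = max(sum(abs(x) for x in row) for row in matrix) + 1.0
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [c * vec[i] - sum(matrix[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Path graph 0 - 1 - 2 - 3 with infected set D = {1, 2}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
nodes, l_d = laplacian_submatrix(path, {1, 2})
print(nodes, l_d, smallest_eigenvector(l_d))
```

On this toy input the two infected nodes are symmetric, so both entries of u1 come out equal.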

  31. Okay • How to solve this quickly? • Proof omitted: see paper

  32. Best hidden hazard • Choose the node n* with the best score • The score measures how connected a node n is to centrally located infected nodes w.r.t. s in D • It depends on the seed as well as the structure

  33. Sub-Problem 2: How many missing nodes? • Use MDL: add nodes in order of Z-score until the MDL cost increases • The MDL cost is not convex, but it shows convex-like behavior
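The stopping rule on slide 33 can be sketched as a greedy loop. Here `ranked_candidates` stands for nodes already sorted by Z-score and `cost` for a function returning the total MDL cost of a candidate set; both are hypothetical placeholders, since the slides do not preserve the actual scoring formulas.

```python
def grow_until_cost_rises(ranked_candidates, cost):
    """Add candidates in rank order and stop at the first cost increase.

    Because the cost is only convex-like, not convex, stopping at the
    first increase is a heuristic and may miss a later, lower minimum.
    """
    chosen = []
    best = cost(chosen)
    for node in ranked_candidates:
        trial = cost(chosen + [node])
        if trial > best:
            break
        chosen.append(node)
        best = trial
    return chosen, best


# Toy cost with its minimum at two added nodes.
toy_cost = lambda nodes: (len(nodes) - 2) ** 2
print(grow_until_cost_rises(["a", "b", "c", "d"], toy_cost))
```

The toy cost function only illustrates the control flow; it says nothing about how the real MDL cost behaves.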

  34. Sub-Problem 3: What if |Seeds| > 1? • Using Z-scores as is: missing nodes are found near only one seed • Ideal: missing nodes near both seeds

  35. Sub-Problem 3: What if |Seeds| > 1? • Exonerate previous seeds • Make previous seeds uninfected and calculate u1 • The blame is transferred to the locality of the older seed • Complete Z-score(n) = max over all seeds of Z-score(n) • Maximum, as we need high-quality missing nodes • Take nodes with top-k complete Z-scores
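The combination step on slide 35 amounts to a max over seeds followed by a top-k selection. The sketch below assumes the per-seed Z-scores have already been computed and are supplied as plain dictionaries; the function names and the alphabetical tie-break are illustrative choices.

```python
def complete_z_scores(per_seed_scores):
    """Combine per-seed scores by taking, for each node, the max over seeds.

    per_seed_scores : dict mapping seed -> {node: z_score}
    """
    combined = {}
    for scores in per_seed_scores.values():
        for node, z in scores.items():
            combined[node] = max(z, combined.get(node, float("-inf")))
    return combined


def top_k(combined, k):
    """Return the k nodes with the highest complete Z-score (ties by node id)."""
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:k]]


per_seed = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"b": 0.8, "c": 0.3}}
print(top_k(complete_z_scores(per_seed), 2))
```

Taking the max rather than a sum or mean means a node needs to score well for only one seed to be selected, which lets candidates near every seed compete.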

  36. Finding missing nodes given seeds • Phew!

  37. The complete algorithm – NetFill (Outline) • Running time: sub-quadratic in practice

  38. Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion

  39. Datasets • Real and synthetic graphs • Real and simulated cascades • Graphs • GRID • AS-OREGON • FLIXSTER • A friendship network with movie ratings • Cascade: the same movie rating from friends • MEME-TRACKER • hl-mt and hl-hl [Gomez-Rodriguez et al. 2010]

  40. Baselines • NETSLEUTH • Simulation • Simulate the SI process until we reach D • Seeds = input • Missing nodes = I \ D, where I is the simulated infected set • Frontier • Nodes “next in line” to be infected • At the boundary (frontier) of the infected set • Seeds = find seeds given missing nodes (NetSleuth on Frontier + data D)
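The Frontier baseline's candidate set can be sketched directly from its description: the uninfected nodes with at least one infected neighbor. The function name and the adjacency-dict representation are illustrative assumptions.

```python
def frontier(adj, infected):
    """Return the uninfected nodes adjacent to at least one infected node."""
    infected = set(infected)
    return {v for u in infected for v in adj[u] if v not in infected}


# Path graph 0 - 1 - 2 - 3 with observed infected set D = {1, 2}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(frontier(path, {1, 2}))
```

Per the slide, the baseline then runs NetSleuth on these frontier nodes together with D to obtain the seeds.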

  41. Visualizing Performance (Grid-connected) • Figure panels: seeds and missing nodes found by NetSleuth, Simulation, Frontier, and NetFill • Legend: Correct, FP, FN, Seeds, Infected

  42. MDL on Grid: Finding the correct number of missing nodes (automatically)

  43. Evaluation Metrics (Subtleties) • For the accuracy of C- (missing nodes) • Jaccard, precision, recall, and F-measure do not consider TN • MCC: Matthews correlation coefficient, computed from the confusion matrix • -1 <= MCC <= 1; the closer to 1, the better
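For reference, MCC follows the standard confusion-matrix formula (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below applies it to predicted versus true missing nodes; treating `candidates` as the set of nodes that could have been missed, and returning 0 when the denominator vanishes, are assumptions of this example.

```python
import math


def mcc(predicted, actual, candidates):
    """Matthews correlation coefficient for a predicted set of missing nodes.

    predicted  : nodes the algorithm reports as missing
    actual     : nodes that are truly missing
    candidates : every node that could be missing (assumed universe)
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(set(candidates) - predicted - actual)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(mcc({1, 2}, {1, 2}, {1, 2, 3, 4}))
```

Unlike precision or recall, the TN term makes the score sensitive to how many non-missing nodes were correctly left alone, which matters when the candidate pool is much larger than C-.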

  44. Evaluation Metrics (contd.) • For seeds (S) and ripple (R) • Q = MDL(algorithm) / MDL(true) • From literature (see paper for details) • Again, the closer to 1, the better

  45. Grid-connected (Synthetic Graph, Synthetic Cascades) • The closer to 1, the better

  46. AS-Oregon (Real Graph, Synthetic Cascades) • The closer to 1, the better

  47. Meme-Tracker HL-MT (Real Graph, Real Cascades) • The closer to 1, the better • See the paper for more experiments, e.g. scalability and robustness

  48. Meme-Tracker: case study • 96,000-node graph for the meme “State of the economy” • Found missing websites like “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts

  49. Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion

  50. Conclusions • Given: graph and sampled infections • Find: missing infections and culprits • Formulated the problem using MDL • Two-stage alternating optimization • Find best seeds given missing nodes • Find best missing nodes given seeds • NetFill • Subquadratic (near-linear in many cases) • Outperforms baselines on real and synthetic data • Figure: NetFill on a grid
