1 / 118

Large Graph Mining – Patterns, Tools and Cascade analysis

Large Graph Mining – Patterns, Tools and Cascade analysis. Christos Faloutsos CMU. Thank you!. Michael Katehakis Hui Xiong Spiros Papadimitriou Luz Kosar. Roadmap. Introduction – Motivation Why ‘big data’ Why (big) graphs? Problem#1: Patterns in graphs Problem#2: Tools

haines
Download Presentation

Large Graph Mining – Patterns, Tools and Cascade analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Large Graph Mining –Patterns, Tools and Cascade analysis Christos Faloutsos CMU

  2. Thank you! • Michael Katehakis • HuiXiong • Spiros Papadimitriou • Luz Kosar C. Faloutsos (CMU)

  3. Roadmap • Introduction – Motivation • Why ‘big data’ • Why (big) graphs? • Problem#1: Patterns in graphs • Problem#2: Tools • Problem#3: Scalability • Conclusions C. Faloutsos (CMU)

  4. Why ‘big data’ • Why? • What is the problem definition? C. Faloutsos (CMU)

  5. Main message:Big data: often > experts • ‘Super Crunchers’ Why Thinking-By-Numbers is the New Way To Be Smart by Ian Ayres, 2008 • Google won the machine translation competition 2005 • http://www.itl.nist.gov/iad/mig//tests/mt/2005/doc/mt05eval_official_results_release_20050801_v3.html C. Faloutsos (CMU)

  6. Problem definition – big picture Tera/Peta-byte data Analytics Insights, outliers C. Faloutsos (CMU)

  7. Problem definition – big picture Tera/Peta-byte data Analytics Insights, outliers Main emphasis in this talk C. Faloutsos (CMU)

  8. Roadmap • Introduction – Motivation • Why ‘big data’ • Why (big) graphs? • Problem#1: Patterns in graphs • Problem#2: Tools • Problem#3: Scalability • Conclusions C. Faloutsos (CMU)

  9. Graphs - why should we care? Food Web [Martinez ’91] >$10B revenue >0.5B users Internet Map [lumeta.com] C. Faloutsos (CMU)

  10. T1 D1 ... ... DN TM Graphs - why should we care? • IR: bi-partite graphs (doc-terms) • web: hyper-text graph • ... and more: C. Faloutsos (CMU)

  11. Graphs - why should we care? • web-log (‘blog’) news propagation • computer network security: email/IP traffic and anomaly detection • ‘viral’ marketing • Supplier-supply business chains (-> instabilities) • .... • Subject-verb-object -> graph • Many-to-many db relationship -> graph C. Faloutsos (CMU)

  12. Outline • Introduction – Motivation • Problem#1: Patterns in graphs • Static graphs • Weighted graphs • Time evolving graphs • Problem#2: Tools • Problem#3: Scalability • Conclusions C. Faloutsos (CMU)

  13. Problem #1 - network and graph mining • What does the Internet look like? • What does FaceBook look like? • What is ‘normal’/‘abnormal’? • which patterns/laws hold? C. Faloutsos (CMU)

  14. Problem #1 - network and graph mining • What does the Internet look like? • What does FaceBook look like? • What is ‘normal’/‘abnormal’? • which patterns/laws hold? • To spot anomalies (rarities), we have to discover patterns C. Faloutsos (CMU)

  15. Problem #1 - network and graph mining • What does the Internet look like? • What does FaceBook look like? • What is ‘normal’/‘abnormal’? • which patterns/laws hold? • To spot anomalies (rarities), we have to discover patterns • Large datasets reveal patterns/anomalies that may be invisible otherwise… C. Faloutsos (CMU)

  16. Graph mining • Are real graphs random? C. Faloutsos (CMU)

  17. Laws and patterns • Are real graphs random? • A: NO!! • Diameter • in- and out- degree distributions • other (surprising) patterns • So, let’s look at the data C. Faloutsos (CMU)

  18. Solution# S.1 • Power law in the degree distribution [SIGCOMM99] internet domains att.com log(degree) ibm.com log(rank) C. Faloutsos (CMU)

  19. -0.82 Solution# S.1 • Power law in the degree distribution [SIGCOMM99] internet domains att.com log(degree) ibm.com log(rank) C. Faloutsos (CMU)

  20. Solution# S.2: Eigen Exponent E • A2: power law in the eigenvalues of the adjacency matrix Eigenvalue Exponent = slope E = -0.48 May 2001 Rank of decreasing eigenvalue C. Faloutsos (CMU)

  21. Many more power laws • # of sexual contacts • Income [Pareto] –’80-20 distribution’ • Duration of downloads [Bestavros+] • Duration of UNIX jobs (‘mice and elephants’) • Size of files of a user • … • ‘Black swans’ C. Faloutsos (CMU)

  22. Roadmap • Introduction – Motivation • Problem#1: Patterns in graphs • Static graphs • degree, diameter, eigen, • triangles • cliques • Weighted graphs • Time evolving graphs • Problem#2: Tools C. Faloutsos (CMU)

  23. Solution# S.3: Triangle ‘Laws’ Real social networks have a lot of triangles C. Faloutsos (CMU)

  24. Solution# S.3: Triangle ‘Laws’ Real social networks have a lot of triangles Friends of friends are friends Any patterns? C. Faloutsos (CMU)

  25. Triangle Law: #S.3 [Tsourakakis ICDM 2008] Reuters SN X-axis: degree Y-axis: mean # triangles n friends -> ~n1.6 triangles Epinions C. Faloutsos (CMU)

  26. Triangle Law: Computations [Tsourakakis ICDM 2008] details But: triangles are expensive to compute (3-way join; several approx. algos) – O(dmax2) Q: Can we do that quickly? A: C. Faloutsos (CMU)

  27. Triangle Law: Computations [Tsourakakis ICDM 2008] details But: triangles are expensive to compute (3-way join; several approx. algos) – O(dmax2) Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum ( li3 ) (and, because of skewness (S2) , we only need the top few eigenvalues! - O(E) C. Faloutsos (CMU)

  28. Triangle Law: Computations [Tsourakakis ICDM 2008] details 1000x+ speed-up, >90% accuracy C. Faloutsos (CMU)

  29. Triangle counting for large graphs? ? ? ? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11] C. Faloutsos (CMU) 29

  30. Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11] C. Faloutsos (CMU) 30

  31. Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11] C. Faloutsos (CMU) 31

  32. Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11] C. Faloutsos (CMU) 32

  33. Roadmap • Introduction – Motivation • Problem#1: Patterns in graphs • Static graphs • Weighted graphs • Time evolving graphs • Problem#2: Tools • … C. Faloutsos (CMU)

  34. Problem: Time evolution • with Jure Leskovec (CMU -> Stanford) • and Jon Kleinberg (Cornell – sabb. @ CMU) C. Faloutsos (CMU)

  35. T.1 Evolution of the Diameter • Prior work on Power Law graphs hints atslowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data? C. Faloutsos (CMU)

  36. T.1 Evolution of the Diameter • Prior work on Power Law graphs hints atslowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data? • Diameter shrinks over time C. Faloutsos (CMU)

  37. T.1 Diameter – “Patents” • Patent citation network • 25 years of data • @1999 • 2.9 M nodes • 16.5 M edges diameter time [years] C. Faloutsos (CMU)

  38. T.2 Temporal Evolution of the Graphs • N(t) … nodes at time t • E(t) … edges at time t • Suppose that N(t+1) = 2 * N(t) • Q: what is your guess for E(t+1) =? 2 * E(t) C. Faloutsos (CMU)

  39. T.2 Temporal Evolution of the Graphs • N(t) … nodes at time t • E(t) … edges at time t • Suppose that N(t+1) = 2 * N(t) • Q: what is your guess for E(t+1) =? 2 * E(t) • A: over-doubled! • But obeying the ``Densification Power Law’’ C. Faloutsos (CMU)

  40. T.2 Densification – Patent Citations • Citations among patents granted • @1999 • 2.9 M nodes • 16.5 M edges • Each year is a datapoint E(t) 1.66 N(t) C. Faloutsos (CMU)

  41. Roadmap • Introduction – Motivation • Problem#1: Patterns in graphs • Static graphs • Weighted graphs • Time evolving graphs • Problem#2: Tools • … C. Faloutsos (CMU)

  42. T.3 : popularity over time # in links lag: days after post 3 2 1 Post popularity drops-off – exponentially? @t @t + lag C. Faloutsos (CMU)

  43. T.3 : popularity over time # in links (log) days after post (log) Post popularity drops-off – exponentially? POWER LAW! Exponent? C. Faloutsos (CMU)

  44. T.3 : popularity over time # in links (log) -1.6 days after post (log) • Post popularity drops-off – exponentially? • POWER LAW! • Exponent? -1.6 • close to -1.5: Barabasi’s stack model • and like the zero-crossings of a random walk C. Faloutsos (CMU)

  45. -1.5 slope J. G. Oliveira & A.-L. Barabási Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature437, 1251 (2005) . [PDF] Prob(RT > x) (log) Response time (log) C. Faloutsos (CMU)

  46. -1.5 slope J. G. Oliveira & A.-L. Barabási Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature437, 1251 (2005) . [PDF] C. Faloutsos (CMU)

  47. Roadmap • Introduction – Motivation • Problem#1: Patterns in graphs • Problem#2: Tools • Belief Propagation • Tensors • Spike analysis • Problem#3: Scalability • Conclusions C. Faloutsos (CMU)

  48. E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU [www’07] C. Faloutsos (CMU)

  49. E-bay Fraud detection C. Faloutsos (CMU)

  50. E-bay Fraud detection C. Faloutsos (CMU)

More Related