Overview

Goal: scalable algorithms to find patterns and anomalies on graphs. Talks: Mining Large Graphs: Algorithms, Inference, and Discoveries; Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation; Patterns on the Connected Components of Terabyte-Scale Graphs.

betty_james

Presentation Transcript


  1. Overview • Goal: scalable algorithms to find patterns and anomalies on graphs • Mining Large Graphs: Algorithms, Inference, and Discoveries • Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation • Patterns on the Connected Components of Terabyte-Scale Graphs • PI: Christos Faloutsos (CMU) • Students: Leman Akoglu, Polo Chau, U Kang

  2. Mining Large Graphs: Algorithms, Inference, and Discoveries • U Kang, Duen Horng Chau, Christos Faloutsos • School of Computer Science, Carnegie Mellon University

  3. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  4. Motivation • Inference on graphs: “guilt by association” • Adult sites tend to be connected to adult sites, while edu. sites are connected to educational ones • Given labels (adult or edu.) on a subset of the nodes, infer the labels of the remaining unlabeled nodes in the graph • Tool: Belief Propagation (BP) [Figure: blue nodes connected to blue nodes; red nodes connected to red nodes]

  5. Belief Propagation • Belief computation [equation labels: node belief; prior prob.; messages from neighbors] • Message computation [equation labels: propagation matrix; message from node i to node j; prior prob.; messages from neighbors]
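
The two computations named on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' implementation. It assumes two states, the five-node example graph suggested by the message names on the "Main Idea" slide (edges 0-1, 1-2, 1-3, 2-4), and made-up values for the priors and the propagation matrix `psi`. A node's belief is its prior times all incoming messages; a message from i to j sums, over i's states, the prior times the propagation matrix entry times the messages into i from every neighbor except j.

```python
from math import prod

# Assumed example graph (edges inferred from the message names m01, m12, m13, m24).
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4)]
STATES = (0, 1)  # illustrative: 0 = "adult", 1 = "edu"


def normalize(vec):
    total = sum(vec)
    return [x / total for x in vec]


def belief_propagation(edges, priors, psi, iterations=10):
    """Synchronous sum-product BP; returns {node: [belief per state]}."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    # One message per directed edge, initialised to uniform.
    messages = {(i, j): [0.5, 0.5] for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        new_messages = {}
        for (i, j) in messages:
            out = []
            for xj in STATES:
                total = 0.0
                for xi in STATES:
                    # Messages into i from all neighbors except the recipient j.
                    incoming = prod(
                        messages[(k, i)][xi] for k in neighbors[i] if k != j
                    )
                    total += priors[i][xi] * psi[xi][xj] * incoming
                out.append(total)
            new_messages[(i, j)] = normalize(out)
        messages = new_messages

    beliefs = {}
    for i in neighbors:
        raw = [
            priors[i][x] * prod(messages[(k, i)][x] for k in neighbors[i])
            for x in STATES
        ]
        beliefs[i] = normalize(raw)
    return beliefs


# Hypothetical inputs: node 0 is labeled (mostly state 0), the rest are unknown.
priors = {n: [0.5, 0.5] for n in range(5)}
priors[0] = [0.9, 0.1]
psi = [[0.9, 0.1], [0.1, 0.9]]  # homophily: neighbors tend to share a state

beliefs = belief_propagation(EDGES, priors, psi)
```

With these inputs the label on node 0 spreads outward and weakens with distance, which is the "guilt by association" effect described on the Motivation slide.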

  6. A Challenge in BP • Scalability! • Existing work assumes that all the nodes (and/or edges) of the input graph fit in memory • Problem: what if the graph is too large to fit in memory? • Challenge: scaling up the inference algorithm for very large graphs whose nodes do not fit in memory

  7. Problem Definition • How can we scale up the BP algorithm to very large graphs? • Goal • Scalability: to billions of nodes and edges • Efficiency: fast algorithm

  8. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  9. Main Idea • Our approach • Use Hadoop to scale up BP • Challenge • How can we formulate BP using a simple, efficient operation supported by Hadoop?

  10. Main Idea • Key observation • BP message update equation = local message exchange • A message is updated from its neighboring messages. For example, m12 is updated from m01 and m31 [Figure: example graph with messages m01, m10, m12, m21, m13, m31, m24, m42]

  11. Main Idea • BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G • Nodes in L(G) are edges in G • Two nodes in L(G) are connected if the corresponding edges are adjacent in G
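
A small sketch of how such a line graph could be built for BP messages, in Python. It treats each directed message (i, j) as a node of L(G) and links it to the messages (k, i) that feed it. Excluding the reverse message (j, i) is an assumption taken from the earlier example, where m12 is updated from m01 and m31 but not from m21. The function name and the edge list are illustrative.

```python
def directed_line_graph(edges):
    """Map each message (i, j) to the messages (k, i), k != j, it is updated from."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    depends_on = {}
    for i in neighbors:
        for j in neighbors[i]:
            depends_on[(i, j)] = sorted((k, i) for k in neighbors[i] if k != j)
    return depends_on


# Assumed example graph (edges inferred from the message names m01, m12, m13, m24).
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4)]
line_graph = directed_line_graph(EDGES)

print(line_graph[(1, 2)])  # the messages that m12 is updated from
```

Each undirected edge of G yields two nodes in this directed L(G), so the example has eight message nodes.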

  12. Proposed: HA-LFP algorithm • BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G [Figure: new message vector = line graph of G combined with old message vector by generalized m-v multiplication; multiply repeatedly until convergence]
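
To make "generalized matrix-vector multiplication" concrete, here is a hedged Python sketch in which the multiply and sum steps of an ordinary product are replaced by caller-supplied functions. The names `combine2` and `combine_all` and the sparse dictionary layout are assumptions for illustration; the slides do not give the interface used in HA-LFP.

```python
from math import prod


def generalized_matvec(entries, vec, combine2, combine_all):
    """entries: {(row, col): value}. Returns the new vector as {row: value}.

    combine2 pairs a matrix entry with a vector element;
    combine_all merges the per-row partial results.
    """
    partial = {}
    for (row, col), value in entries.items():
        partial.setdefault(row, []).append(combine2(value, vec[col]))
    return {row: combine_all(items) for row, items in partial.items()}


# Hypothetical 2x2 sparse matrix and vector.
entries = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
vec = {0: 1.0, 1: 2.0}

# Ordinary matrix-vector product: multiply, then sum.
ordinary = generalized_matvec(entries, vec, lambda m, v: m * v, sum)

# A BP-flavoured variant: take each linked element, then multiply them together.
product_form = generalized_matvec(entries, vec, lambda m, v: v, prod)
```

Because each row only needs its own partial results, the two steps map naturally onto a map phase and a reduce phase, which is presumably why this formulation suits Hadoop.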

  13. Complexity • One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G • Time: O((V + E) / M) • Space: O(V + E) • V: # of nodes • E: # of edges • M: # of machines

  14. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  15. Questions • Q1: How fast is HA-LFP? • Q2: How does HA-LFP scale up? • Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?

  16. Running Time Q1: How fast is HA-LFP? [10 iterations]

  17. Scale Up Q2: How does HA-LFP scale up? • Linear in the number of machines and edges

  18. Advantage of HA-LFP • Scalability • The only solution when the node information cannot fit in memory • Near-linear scale-up • Running Time • Faster than the single-machine version for large graphs • Fault Tolerance

  19. Analysis of Web Graph Q3: How can we find ‘good’ and ‘bad’ sites in a web graph? • Pages whose goodness scores < 0.9 are likely to be adult pages

  20. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  21. Conclusion • HA-LFP: Belief Propagation for billion-scale graphs on Hadoop • Near-linear scalability in # of machines and edges • Many applications • Finding ‘good’ and ‘bad’ web sites • Fraud detection • …

  22. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation • U Kang, Brendan Meeder, Christos Faloutsos • School of Computer Science, Carnegie Mellon University

  23. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  24. Problem Definition • Eigensolver • Computes top-k eigenvalues and eigenvectors • Applications: SVD, triangle counting, spectral clustering, … • Existing eigensolvers • Can handle up to millions of nodes • How can we scale up eigensolvers to billion-scale graphs?

  25. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  26. Main Idea • HEigen algorithm (Hadoop Eigen-solver) • Selectively parallelize the ‘Lanczos’ algorithm • Expensive operations: on Hadoop for scalability • Inexpensive operations: on a single machine for accuracy • Block encoding • Apply block encoding, then do matrix-vector multiplication • Exploiting skewness in matrix-matrix mult. • Applies when one matrix is very large and the other is very small
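
The split between expensive and inexpensive operations is easier to see in code. Below is a plain, single-machine Lanczos sketch in Python, without the reorthogonalization or block encoding a production solver would need. It is a textbook version, not HEigen itself. The only step that touches the whole matrix is the `matvec` call, which is the kind of operation one would push to Hadoop; the remaining steps are short vector updates. The 4-node path graph and the start vector are made-up test inputs.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lanczos(matvec, v0, steps):
    """Return the diagonal (alphas) and off-diagonal (betas) of the tridiagonal matrix."""
    norm = sqrt(dot(v0, v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v0)
    beta = 0.0
    alphas, betas = [], []
    for step in range(steps):
        w = matvec(v)  # expensive: the only step that reads the whole matrix
        alpha = dot(w, v)  # inexpensive: vector operations from here on
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = sqrt(dot(w, w))
        if step == steps - 1 or beta < 1e-12:
            break
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas


# Hypothetical input: adjacency matrix of a 4-node path graph.
PATH = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def path_matvec(v):
    return [dot(row, v) for row in PATH]


alphas, betas = lanczos(path_matvec, [1.0, 2.0, 3.0, 4.0], steps=4)
```

The eigenvalues of the small tridiagonal matrix built from `alphas` and `betas` approximate those of the original matrix, so that final step fits comfortably on one machine.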

  27. Application of HEigen • Triangle Counting • Real social networks have a lot of triangles • Friends of friends are friends • But: triangles are expensive to compute (3-way join; several approx. algos) • Q: Can we do that quickly? • A: Yes! • #triangles = 1/6 Σ_i λ_i^3 (and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
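
The eigenvalue formula can be checked on a tiny graph. The Python sketch below uses power iteration with deflation as a stand-in eigensolver; this is not the Lanczos-based method of the talk, and it can fail when two eigenvalues share the same magnitude with opposite signs. It compares the eigenvalue estimate with a brute-force count. The complete graph K4 is an assumed test input.

```python
import random
from itertools import combinations
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, v):
    return [dot(row, v) for row in A]


def top_eigenvalues(A, k, iterations=300, seed=0):
    """Largest-magnitude eigenvalues of a symmetric matrix via deflated power iteration."""
    rng = random.Random(seed)
    found = []  # list of (eigenvalue, unit eigenvector)
    for _ in range(k):
        v = [rng.random() - 0.5 for _ in range(len(A))]
        for _ in range(iterations):
            # Project out the eigenvectors found so far.
            for _, u in found:
                d = dot(u, v)
                v = [x - d * y for x, y in zip(v, u)]
            w = matvec(A, v)
            norm = sqrt(dot(w, w))
            v = [x / norm for x in w]
        found.append((dot(v, matvec(A, v)), v))
    return [lam for lam, _ in found]


def count_triangles_exact(A):
    n = len(A)
    return sum(
        1 for i, j, l in combinations(range(n), 3) if A[i][j] and A[j][l] and A[i][l]
    )


# Hypothetical input: the complete graph on 4 nodes.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

eigs = top_eigenvalues(K4, k=4)
estimate = sum(lam ** 3 for lam in eigs) / 6
exact = count_triangles_exact(K4)
```

Using only the single largest eigenvalue here gives 27 / 6 = 4.5 against an exact count of 4, which hints at why a few top eigenvalues can be enough when the spectrum is skewed.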

  28. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  29. Questions • Q1: How does HEigen scale up? • Q2: Which matrix-matrix multiplication algorithm runs the fastest? • Q3: How can we find anomalous sites in a web graph?

  30. Running Time Q1: How does HEigen scale up? • HEigen-BLOCK is faster than the PLAIN version • Linear in the number of machines and edges

  31. Scale Up Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Cache-based MM runs the fastest!

  32. Results Q3: How can we find anomalous sites in a web graph? • Triangle counting on Twitter social network [Twitter 2009; ~ 3 billion edges] • U.S. politicians: moderate number of triangles vs. degree • Adult sites: very large number of triangles vs. degree

  33. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  34. Conclusion • HEigen: eigensolver for billion-scale graphs on Hadoop • Near-linear scalability in # of machines and edges • Cache-based matrix-matrix multiplication: fastest! • Anomalies in triangle counts • Many applications • Triangle counting • SVD • …

  35. Patterns on the Connected Components of Terabyte-Scale Graphs • U Kang*, Mary McGlohon*†, Leman Akoglu*, Christos Faloutsos* • (*) SCS, Carnegie Mellon University • (†) Google

  36. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion

  37. Problem Definition • A large graph is composed of many connected components • YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes • Q1: static patterns? • Q2: evolution patterns? • Q3: model? [Figure: count vs. size of connected components]
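
For readers who want to see what the count-versus-size plot is built from, here is a small in-memory Python sketch using union-find. It is purely illustrative: a graph of the size quoted above would not be processed this way, and the toy edge list is made up.

```python
from collections import Counter


def component_sizes(num_nodes, edges):
    """Return a Counter mapping each component's root node to its size."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    return Counter(find(node) for node in range(num_nodes))


# Hypothetical toy graph: 9 nodes, node 8 is isolated.
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (6, 7)]
sizes = component_sizes(9, edges)

# Histogram: how many components have each size.
size_histogram = Counter(sizes.values())
```

Plotting `size_histogram` on log-log axes gives the kind of count-versus-size view referred to on the slide.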

  38. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion

  39. Q1: Static Patterns • What are the regularities in the connected components of a static graph? • What do they look like? • Do the GCC and the other connected components look similar? Chain? Clique? • Idea: use ‘density’ and ‘radius’ to find patterns

  40. Density of Connected Component • What is a good metric for the density of a connected component? • A candidate: |E| / |V| (“average degree”) • Problem: it increases over time [Figure: number of edges vs. number of nodes]

  41. Density of Connected Component • We want a metric that can measure the ‘intrinsic’ density of a component • Proposed: Graph Fractal Dimension (GFD) • log |E| / log |V| [Leskovec+ KDD05] [Figure: number of edges vs. number of nodes, two panels]

  42. Density of Connected Component • Graph Fractal Dimension (GFD) • log |E| / log |V| • Chain: GFD ~1 • Star: GFD ~1 • Bipartite Core: 1 < GFD < 2 • Clique: GFD ~2
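
The four reference shapes are easy to verify numerically. This Python sketch evaluates log |E| / log |V| for each, assuming 1000 nodes and, for the bipartite core, a complete bipartite graph with two equal halves (both choices are illustrative).

```python
from math import log


def gfd(num_nodes, num_edges):
    """Graph Fractal Dimension: log |E| / log |V|."""
    return log(num_edges) / log(num_nodes)


n = 1000  # assumed component size

chain = gfd(n, n - 1)                    # path: n - 1 edges
star = gfd(n, n - 1)                     # hub plus n - 1 spokes
bipartite_core = gfd(n, (n // 2) ** 2)   # complete bipartite, equal halves
clique = gfd(n, n * (n - 1) // 2)        # every pair connected
```

For a clique the value approaches 2 only as the component grows, since |E| is about |V|^2 / 2 rather than |V|^2.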

  43. Density of Connected Component • What are the GFDs of connected components in a large, real graph?

  44. Density of Connected Component • GFDs of CCs in YahooWeb graph • GFDs of CCs are constant on average (slope = 1.08) • CCs are slightly denser than a tree [Figure: number of edges vs. number of nodes, two panels]
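
A slope like the one quoted here is presumably obtained by fitting a line to log |E| against log |V| across components. The Python sketch below shows an ordinary least-squares version of such a fit. The data points are synthetic, generated to follow |E| = |V|^1.08 exactly, and are not the YahooWeb measurements.

```python
from math import log


def loglog_slope(points):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [log(nodes) for nodes, _ in points]
    ys = [log(edge_count) for _, edge_count in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic (nodes, edges) pairs, for illustration only.
synthetic = [(v, v ** 1.08) for v in (10, 100, 1000, 10000)]
slope = loglog_slope(synthetic)
```

A slope near 1 corresponds to tree-like components, so a value slightly above 1 matches the "slightly denser than a tree" reading.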

  45. Radius of Connected Component • Q1.1: What does the GCC look like? • Q1.2: What do the rest of the CCs look like? (What are the GFDs?)

  46. Radius of Connected Component • What are the patterns of radii in connected components? • A1.1: GCC looks like a ‘kite’ • A1.2: Chain-like disconnected components [Figure labels: Slope = 1.38; Avg.; Max. Radius; Max.; Core; Chain; Average Radius]

  47. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion

  48. Q2: Evolution Patterns • How do the connected components evolve? • Do the largest connected components grow at the same rate? • How often does a newcomer join the disconnected components? [Figure: newcomer node]

  49. Gelling Point • Gelling Point [McGlohon+ KDD08] • Diameter starts to shrink

  50. Growth of Connected Component • GFDs of Top 3 CC’s over time • Before “gelling point”: GFDs of Top 3 CC’s stay constant, “tree”-like • After “deviation point”: GFD of GCC takes off, becomes denser
