overview n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Overview PowerPoint Presentation
Download Presentation
Overview

Loading in 2 Seconds...

play fullscreen
1 / 63
betty_james

Overview - PowerPoint PPT Presentation

151 Views
Download Presentation
Overview
An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Overview Goal: scalable algorithms to find patterns and anomalies on graphs Mining Large Graphs: Algorithms, Inference, and Discoveries Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation Patterns on the Connected Components of Terabyte-Scale Graphs PI: Christos Faloutsos (CMU) Students: Leman Akoglu, Polo Chau, U Kang 1

  2. Mining Large Graphs: Algorithms, Inference, and Discoveries U Kang Duen Horng Chau Christos Faloutsos School of Computer Science Carnegie Mellon University

  3. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  4. Motivation • Inference on graph: “guilt by association” • Adult sites tend to be connected to adult sites, while edu. sites are connected to educational ones • Given labels(adult or edu) on a subset of the nodes, infer the labels of other unlabeled nodes on graph • Tool: Belief Propagation(BP) blue nodes connected to blue nodes red nodes connected to red nodes 4

  5. Belief Propagation Belief computation Node belief Prior prob Messages from neighbors Message computation Propagation matrix Messsage from node i to node j Prior prob ~Messages from neighbors 5

  6. A Challenge in BP • Scalability! • Existing works assume that all the nodes (and/or edges) of the input graph fit in memory • Problem: what if the graph is too large to fit in memory? • Challenge: Scaling up the inference algorithm for very large graphs whose nodes do not fit in memory 6

  7. Problem Definition • How can we scale up the BP algorithm to very large graphs? • Goal • Scalability: to billions of nodes and edges • Efficiency: fast algorithm 7

  8. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  9. Main Idea • Our approach • Use Hadoop to scale-up BP • Challenge • How can we formulate BP using a simple, efficient operation supported by Hadoop? 9

  10. m01 m12 m24 m10 m21 m42 m13 m31 Main Idea • Key observation • BP message update equation = local message exchange A message is updated from its neighboring messages. For example, m12 is updated from m01 and m31 10

  11. Main Idea • BP message update can be expressed by a generalized matrix-vector multiplication on a line graphL(G) induced from the original graph G • Nodes in L(G) are edges in G • Two nodes in L(G) are connected if they are adjacent in G 11

  12. Proposed: HA-LFP algorithm • BP message update can be expressed by a generalized matrix-vector multiplication on a line graphL(G) induced from the original graph G Line graph of G Generalized m-v multiplication Multiply repeatedly until convergence New message vector Old message vector 12

  13. Complexity One Iteration of HA-LFP on L(G) One Matrix Vector Multiplication on G = Time : O((V+E) / M) Space: O(V + E) V : # of nodes E : # of nodes M : # of machines 13

  14. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  15. Questions Q1: How fast is HA-LFP? Q2: How does HA-LFP scale-up? Q3: How can we find `good’ and `bad’ sites in a web graph?

  16. Running Time Q1: How fast is HA-LFP? [10 iteration] 16

  17. Scale Up Q2: How does HA-LFP scale-up? Linear on the number of machines, edges 17

  18. Advantage of HA-LFP • Scalability • The only solution when the node information cannot fit in memory. • Near-linear scale up • Running Time • Faster than the single-machine, for large graphs • Fault Tolerance 18

  19. Analysis of Web Graph Q3: How can we find `good’ and `bad’ sites in a web graph? Pages whose goodness scores < 0.9 are likely to be adult pages 19

  20. Outline • Problem Definition • Proposed Method • Experiment • Conclusion

  21. Conclusion HA-LFP Belief Propgation for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges Many applications Finding `good’ and `bad’ web sites Fraud detection … 21

  22. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation U Kang Brendan Meeder Christos Faloutsos School of Computer Science Carnegie Mellon University 22

  23. Outline Problem Definition Proposed Method Experiment Conclusion 23

  24. Problem Definition • Eigensolver • Computes top-k eigenvalues and eigenvectors • Application: • SVD, triangle counting, spectral clustering, … • Existing eigensolver • Can handle up to millions of nodes • How can we scale up eigensolvers to billion-scale graphs? 24

  25. Outline Problem Definition Proposed Method Experiment Conclusion 25

  26. Main Idea • HEigen algorithm (Hadoop Eigen-solver) • Selective parallelize ‘Lanczos’ algorithm • Expensive operation: on Hadoop for scalability • Inexpensive operation: on a single-machine for accuracy • Block encoding • Block encoding, and then do matrix-vector multiplication • Exploiting skewness in matrix-matrix mult. • In matrix-matrix multiplication when a matrix is very large and the other is very small 26

  27. Application of HEigen • Triangle Counting • Real social networks have a lot of triangles • Friends of friends are friends • But: triangles are expensive to compute • (3-way join; several approx. algos) • Q: Can we do that quickly? • A: Yes! • #triangles = 1/6 Sum ( λi3 ) • (and, because of skewness in eigenvalues, • we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]

  28. Outline Problem Definition Proposed Method Experiment Conclusion 28

  29. Questions Q1: How does HEigen scale-up? Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Q3: How can we find anomalous sites in a web graph? 29

  30. Running Time Q1: How does HEigen scale-up? Heigen-BLOCK is faster than PLAIN ver. Linear on the number of machines, edges

  31. Scale Up Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Cache-based MM runs the fastest!

  32. Results Q3: How can we find anomalous sites in a web graph? • Triangle counting on Twitter social network [Twitter 2009; ~ 3 billion edges] • U.S. politicians: moderate number of triangles vs. degree • Adult sites: very large number of triangles vs. degree 32

  33. Outline Problem Definition Proposed Method Experiment Conclusion 33

  34. Conclusion HEigen Eigensolver for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges Cache-based Matrix-Matrix multiplication: fastest! Anomalies in triangle counts Many applications Triangle counting SVD … 34

  35. Patterns on the Connected Components of Terabyte-Scale Graphs U Kang* Mary McGlohon*† Leman Akoglu* Christos Faloutsos* (*) SCS, Carnegie Mellon University (†) Google 35

  36. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 36

  37. A large graph is composed of many connected components Problem Definition YahooWeb graph |V| = 1.4 billion |E| = 6.7 billion 120 GBytes Count Size Q1: static patterns? Q2: evolution patterns? Q3: model? 37

  38. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 38

  39. Q1: Static Patterns • What are the regularities in the connected components of a static graph? • How do they look like? • Do the GCC and the other connected components look similar? Chain? Clique? Idea: use ‘density’ and ‘radius’ to find patterns 39

  40. Density of Connected Component • What is a good metric for the density of a connected component? • A candidate: |E| / |V| (“average degree”) • Problem: it increases over time Number of Edges Number of Nodes 40

  41. Density of Connected Component • We want a metric that can measure the ‘intrinsic’ density of a component • Proposed: Graph Fractal Dimension(GFD) • log |E| / log |V| Number of Edges Number of Edges [Leskovec+ KDD05] Number of Nodes Number of Nodes 41

  42. Density of Connected Component • Graph Fractal Dimension(GFD) • log |E| / log |V| Chain: GFD ~1 Star: GFD ~1 Bipartite Core: 1 < GFD < 2 Clique: GFD ~2 42

  43. Density of Connected Component What are the GFDs of connected components in a large, real graph? 43

  44. Density of Connected Component • GFDs of CCs in YahooWeb graph Number of Edges Number of Edges Slope= 1.08 Number of Nodes Number of Nodes GFDs of CCs are constant on average GFDs of CCs are slightly denser than the tree 44

  45. Radius of Connected Component Q1.1: What does the GCC look like? Q1.2: What do the rest CC’s look like? ( What are the GFDs?) 45

  46. Slope= 1.38 Radius of Connected Component • What are the patterns of radii in connected components? Avg. Max. Radius Max. Core Chain Average Radius A1.1: GCC looks like a ‘kite’ A1.2: Chain-like disconnected components 46

  47. Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 47

  48. Q2: Evolution Patterns • How do the connected components evolve? • Do largest connected components grow with the same rate? • How often does a newcomer join the disconnected components? newcomer ? ? 48

  49. Gelling Point • Gelling Point [McGlohon+ KDD08] • Diameter starts to shrink 49

  50. Growth of Connected Component • GFDs of Top 3 CC’s over time Before “gelling point”: GFDs of Top 3 CC’s stay constant, “tree” like. After “deviation point”: GFD of GCC takes off, becomes denser. 50