
**Overview**
• Goal: scalable algorithms to find patterns and anomalies on graphs
• Mining Large Graphs: Algorithms, Inference, and Discoveries
• Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
• Patterns on the Connected Components of Terabyte-Scale Graphs
• PI: Christos Faloutsos (CMU)
• Students: Leman Akoglu, Polo Chau, U Kang

**Mining Large Graphs: Algorithms, Inference, and Discoveries**
U Kang, Duen Horng Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

**Outline**
• Problem Definition
• Proposed Method
• Experiment
• Conclusion

**Motivation**
• Inference on graphs: “guilt by association”
• Adult sites tend to be connected to adult sites, while educational sites tend to be connected to educational ones
• Given labels (adult or edu) on a subset of the nodes, infer the labels of the unlabeled nodes in the graph
• Tool: Belief Propagation (BP)
[Figure: blue nodes connected to blue nodes; red nodes connected to red nodes]

**Belief Propagation**
• Belief computation: the node belief is computed from the node's prior probability and the messages from its neighbors
• Message computation: the message from node i to node j is computed from the propagation matrix, the prior probability, and the messages from neighbors

**A Challenge in BP**
• Scalability!
• Existing work assumes that all the nodes (and/or edges) of the input graph fit in memory
• Problem: what if the graph is too large to fit in memory?
• Challenge: scaling up the inference algorithm to very large graphs whose nodes do not fit in memory

**Problem Definition**
• How can we scale up the BP algorithm to very large graphs?
• Goal
  • Scalability: to billions of nodes and edges
  • Efficiency: fast algorithm

**Main Idea**
• Our approach
  • Use Hadoop to scale up BP
• Challenge
  • How can we formulate BP using a simple, efficient operation supported by Hadoop?

**Main Idea**
[Figure: example graph with messages m01, m10, m12, m21, m13, m31, m24, m42]
• Key observation
  • BP message update equation = local message exchange
  • A message is updated from its neighboring messages.
For example, m12 is updated from m01 and m31.

**Main Idea**
• BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
• Nodes in L(G) are edges in G
• Two nodes in L(G) are connected if the corresponding edges are adjacent in G

**Proposed: HA-LFP algorithm**
• BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
• The new message vector is obtained by a generalized matrix-vector multiplication of the line graph of G with the old message vector
• Multiply repeatedly until convergence

**Complexity**
• One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G
• Time: O((V + E) / M)
• Space: O(V + E)
• V: # of nodes, E: # of edges, M: # of machines

**Questions**
Q1: How fast is HA-LFP?
Q2: How does HA-LFP scale up?
Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?

**Running Time**
Q1: How fast is HA-LFP?
[Figure: running time for 10 iterations]

**Scale Up**
Q2: How does HA-LFP scale up?
Linear in the number of machines and edges.

**Advantage of HA-LFP**
• Scalability
  • The only solution when the node information cannot fit in memory
  • Near-linear scale-up
• Running Time
  • Faster than the single-machine version for large graphs
• Fault Tolerance

**Analysis of Web Graph**
Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?
Pages whose goodness scores are < 0.9 are likely to be adult pages.

**Conclusion**
• HA-LFP: Belief Propagation for billion-scale graphs on Hadoop
• Near-linear scalability on # of machines, edges
• Many applications
  • Finding ‘good’ and ‘bad’ web sites
  • Fraud detection
  • …

**Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation**
U Kang, Brendan Meeder, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

**Outline**
• Problem Definition
• Proposed Method
• Experiment
• Conclusion

**Problem Definition**
• Eigensolver
  • Computes top-k eigenvalues and eigenvectors
  • Applications: SVD, triangle counting, spectral clustering, …
• Existing eigensolvers
  • Can handle up to millions of nodes
• How can we scale up eigensolvers to billion-scale graphs?

**Main Idea**
• HEigen algorithm (Hadoop Eigen-solver)
• Selectively parallelize the ‘Lanczos’ algorithm
  • Expensive operations: on Hadoop for scalability
  • Inexpensive operations: on a single machine for accuracy
• Block encoding
  • Block-encode the matrix, then do matrix-vector multiplication
• Exploiting skewness in matrix-matrix multiplication
  • Applies when one matrix is very large and the other is very small

**Application of HEigen**
• Triangle Counting
  • Real social networks have a lot of triangles
    • Friends of friends are friends
  • But: triangles are expensive to compute (3-way join; several approximation algorithms)
  • Q: Can we do that quickly?
  • A: Yes!
    • $\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$
    • (and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]

**Questions**
Q1: How does HEigen scale up?
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Q3: How can we find anomalous sites in a web graph?

**Running Time**
Q1: How does HEigen scale up?
HEigen-BLOCK is faster than the PLAIN version. Linear in the number of machines and edges.

**Scale Up**
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Cache-based MM runs the fastest!

**Results**
Q3: How can we find anomalous sites in a web graph?
• Triangle counting on Twitter social network [Twitter 2009; ~3 billion edges]
• U.S. politicians: moderate number of triangles vs. degree
• Adult sites: very large number of triangles vs. degree

**Conclusion**
• HEigen: eigensolver for billion-scale graphs on Hadoop
• Near-linear scalability on # of machines, edges
• Cache-based matrix-matrix multiplication: fastest!
• Anomalies in triangle counts
• Many applications
  • Triangle counting
  • SVD
  • …

**Patterns on the Connected Components of Terabyte-Scale Graphs**
U Kang*, Mary McGlohon*†, Leman Akoglu*, Christos Faloutsos*
(*) SCS, Carnegie Mellon University; (†) Google

**Outline**
• Problem Definition
• Static Patterns
• Evolution Patterns
• Model
• Conclusion

**Problem Definition**
• A large graph is composed of many connected components
• YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes
[Figure: count vs. size of connected components]
• Q1: static patterns?
• Q2: evolution patterns?
• Q3: model?

**Q1: Static Patterns**
• What are the regularities in the connected components of a static graph?
• What do they look like?
• Do the GCC (giant connected component) and the other connected components look similar? Chain? Clique?
• Idea: use ‘density’ and ‘radius’ to find patterns

**Density of Connected Component**
• What is a good metric for the density of a connected component?
• A candidate: |E| / |V| (“average degree”)
• Problem: it increases over time
[Figure: number of edges vs. number of nodes]

**Density of Connected Component**
• We want a metric that can measure the ‘intrinsic’ density of a component
• Proposed: Graph Fractal Dimension (GFD)
  • log |E| / log |V|
[Figure: number of edges vs. number of nodes, two panels] [Leskovec+ KDD05]

**Density of Connected Component**
• Graph Fractal Dimension (GFD): log |E| / log |V|
• Chain: GFD ~1
• Star: GFD ~1
• Bipartite core: 1 < GFD < 2
• Clique: GFD ~2

**Density of Connected Component**
What are the GFDs of connected components in a large, real graph?

**Density of Connected Component**
• GFDs of CCs in YahooWeb graph
[Figure: number of edges vs. number of nodes, two panels; slope = 1.08]
• GFDs of CCs are constant on average
• CCs are slightly denser than a tree

**Radius of Connected Component**
Q1.1: What does the GCC look like?
Q1.2: What do the remaining CCs look like? (What are the GFDs?)

**Radius of Connected Component**
• What are the patterns of radii in connected components?
[Figure: radii of connected components (labels: Max. Radius, Average Radius; annotations: Core, Chain; slope = 1.38)]
• A1.1: GCC looks like a ‘kite’
• A1.2: Chain-like disconnected components

**Q2: Evolution Patterns**
• How do the connected components evolve?
• Do the largest connected components grow at the same rate?
• How often does a newcomer join the disconnected components?
[Figure: a newcomer node and candidate components, marked “?”]

**Gelling Point**
• Gelling Point [McGlohon+ KDD08]
• Diameter starts to shrink

**Growth of Connected Component**
• GFDs of Top 3 CCs over time
• Before “gelling point”: GFDs of Top 3 CCs stay constant, “tree”-like.
• After “deviation point”: GFD of GCC takes off, becomes denser.
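
The Graph Fractal Dimension tracked in these slides is log |E| / log |V|. As a minimal sketch of how that metric behaves on the reference shapes named earlier (chain, star, bipartite core, clique), the Python below computes it from node and edge counts. The function names are hypothetical, and counting each undirected edge once is an assumption; the slides do not say how edges are counted.

```python
import math


def graph_fractal_dimension(num_nodes: int, num_edges: int) -> float:
    """Return log|E| / log|V| for one connected component.

    Assumption: needs at least 2 nodes and 1 edge so both logs are usable.
    """
    if num_nodes < 2 or num_edges < 1:
        raise ValueError("need at least 2 nodes and 1 edge")
    return math.log(num_edges) / math.log(num_nodes)


# Edge counts for idealized shapes on n nodes (undirected, an assumption).
def chain_edges(n: int) -> int:
    return n - 1                      # path: each node linked to the next


def star_edges(n: int) -> int:
    return n - 1                      # one hub linked to n - 1 leaves


def complete_bipartite_edges(a: int, b: int) -> int:
    return a * b                      # every left node linked to every right node


def clique_edges(n: int) -> int:
    return n * (n - 1) // 2           # every pair of nodes linked


if __name__ == "__main__":
    n = 1000
    shapes = {
        "chain": chain_edges(n),
        "star": star_edges(n),
        "bipartite core (500 x 500)": complete_bipartite_edges(500, 500),
        "clique": clique_edges(n),
    }
    for name, edges in shapes.items():
        print(f"{name}: GFD = {graph_fractal_dimension(n, edges):.3f}")
```

With 1000 nodes, the chain and star come out just under 1, the bipartite core falls between 1 and 2, and the clique approaches 2 from below, which matches the ranges listed on the density slide. Applying the same function to each component at successive snapshots would give the kind of GFD-over-time curve the last slide describes.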