1 / 44

Techniques for Graph Analytics on Big Data

Techniques for Graph Analytics on Big Data. Usman Nisar , Arash Fard , John A. Miller Computer Science Department University of Georgia, Athens, GA. Outline. Introduction Subgraph Pattern Matching Types of Subgraph Pattern Matching Models of Computation Distributed Algorithms

ronny
Download Presentation

Techniques for Graph Analytics on Big Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Techniques for Graph Analytics on Big Data UsmanNisar, ArashFard, John A. Miller Computer Science Department University of Georgia, Athens, GA

  2. Outline • Introduction • Subgraph Pattern Matching • Types of Subgraph Pattern Matching • Models of Computation • Distributed Algorithms • Performance Evaluation • Graph Partitioning • Results Analysis • Conclusion

  3. Introduction • Directed graph G(V, E, l) • V is a set ofvertices • E ⊆ V x V is a set of directed edges • l: V →𝓝 is a function mapping vertices to labels • Versatility and expressivity • Social networks, web search engines, genome sequencing, etc. • Data sizes are growing rapidly • Facebook: 1 billion + users, average degree of 140 • Twitter: 200 million + users, 400 million tweets each day • Bigger datasets mean bigger graphs

  4. Subgraph Pattern Matching • A graphG’(V’, E’, l’) is a subgraph of G(V, E, l) if: • V’ ⊆ V • E’ ⊆ V’ x V’ ⊆ E • ∀v ∈ V’ , l’(v) = l(v) • A pattern (or query) is a graph that we want to find in a bigger graph • Subgraph pattern matching: Given a query graph, find all subgraphs of another graph (the datagraph) that are similar to the query based on certain criteria

  5. Types of Subgraph Pattern Matching • Exact • SubgraphIsomorphism • Heuristic • Graph Simulation • Dual Simulation • Strong Simulation

  6. Subgraph Isomorphism Exact matching from the pattern to data graph Labels must be the same Ullmann’salgorithm, VF2 NP-hard problem in general case Current solutions not practical for large graphs

  7. Example: SubgraphIsomorphism Matches: {4, 8, 6, 7}, {5, 8, 6, 7}

  8. Graph Simulation [1]  M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995, pp. 453–462. • A vertex in a data graph G matches a vertex in query graph Q via graph simulation iff • both have the same label • a subset of its children match all the children of its corresponding vertex in the query graph • HHK algorithm[1]

  9. Graph Simulation

  10. Dual Simulation • Adds a duality condition to graph simulation • A vertex in a data graph becomes a match with a vertex in the query graph iff • Both have the same label • A subset of its children matches all the children of its corresponding vertex in the data graph • A subset of its parents matches all the parents of its corresponding vertex in the data graph • O((|Vq| + |Eq|) (|V| + |E|)) • More restrictive than graph simulation

  11. Example: Graph  Dual Simulation

  12. Example: Graph  Dual Simulation

  13. Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB

  14. Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB

  15. Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB • O(|V|(|V| + (|Vq| + |Eq|) (|V| + |E|)) • Query graph Q(Vq,Eq,l) matches data graph G(V, E, l) via strong simulation if there exists a vertex v ∈ Vs.t. • Q matches G[v, dq] via dual simulation with match relation Rb • v is contained in Rb

  16. Example: Dual  Strong Simulation

  17. Example: Dual  Strong Simulation

  18. Example: Dual  Strong Simulation

  19. Example: Dual  Strong Simulation

  20. Models of Computation

  21. Distributed Graph Simulation

  22. Distributed Graph Simulation

  23. Other Distributed Graph Algorithms • Dual simulation • Similar to graph simulation • In addition to storing the match sets of its children, a vertex also stores the match sets of its parents • During match set evaluation, a vertex takes both of these into account • Strong simulation (optimized) • First run dual simulation to obtain match relation R • For each matching vertex v in R: • Create a ball centered at v with radius dqcontaining only vertices in R • Perform dual simulation on the ball

  24. Performance Evaluation • Experimental setup • Cluster with 12 machines • Each with two 2Ghz Intel Xeon E5-2620 CPUs, each with six cores • Ethernet: 1 Gb/s • Set up HDFS/GPS on all of the machines

  25. Runtime Evaluation Synthesized, |V|= 108

  26. Runtime Evaluation uk-2005, |V|= 3.9 x 107

  27. Runtime Evaluation enwiki-2013, |V|= 4.2 x 106

  28. Speedup Evaluation Synthesized, |V|= 108

  29. Speedup Evaluation uk-2005, |V|= 3.9 x 107

  30. Speedup Evaluation enwiki-2013, |V|= 4.2 x 106

  31. Conclusions • The three algorithms exhibit excellent scalable behavior in terms of speedup and efficiency as we increase the number of workers • Distributed implementations (GPS): • Graph simulation • Dual simulation • Strong simulation • Ongoing work: • Distributed implementations with message passing in Akka • Our own sequential/distributed isomorphism algorithm

  32. Questions?

  33. Thanks

  34. Depth-First Ball Algorithm works in a depth-first fashion. Message is generated at the center which is then propagated through the system for ballSizesupersteps. The approach results in an exponential number of messages that slows down the whole system and renders the approach impractical

  35. Breadth-First Ball Works on a simple ping-reply model. Center vertex starts of by sending a ping message to all of its adjacent nodes in the first superstep. In the second superstep, all the recipient nodes reply back with their label and the ids of their children and parents. Center vertex upon receiving this information in the third superstep, saves the returned labels and then ping the boundary nodes. This process is repeated till we have a ball of size dq.

  36. Breadth-First Ball

  37. Efficiency • Calculated as Efficiencyk = Speedupk/k Synthesized, |V|=108 uk-2005, |V|=3.9x107 enwiki-2013, |V|=4.2x106

  38. Graph Partitioning 1 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • By default, the data graph is partitioned in a round-robin fashion among workers • The goals of min-cut partitioning are two-fold: • to create well-balanced partitions • to reduce the inter-partition edges • Number of other algorithms have been shown to use min-cut graph partitioning successfully for speed-ups • METIS1 – graph partitioning tool • Written in C • Takes an undirected graph and outputs the partitions

  39. Performance Evaluation • Experimental Setup • Cluster with 5 machines • Each with two 2Ghz Intel Xeon E5-2620 CPU, each with six cores • Ethernet: 1 Gb/sec • Setup HDFS/GPS on all the machines • Datasets: • Labels(l) = 200

  40. Results - Runtime Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107, = 1.2

  41. Results - Runtime Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107

  42. Results – Network I/O Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107,

  43. Results – Network I/O Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107

  44. Planned Pattern Matching Implementations

More Related