
Parallel Subgraph Listing in a Large-Scale Graph

Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu. School of EECS, Peking University; Hong Kong University of Science and Technology. Outline: Subgraph listing operation; Related work; PSgL framework; Evaluation.

Presentation Transcript


  1. Parallel Subgraph Listing in a Large-Scale Graph. Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu. School of EECS, Peking University; Hong Kong University of Science and Technology

  2. Outline • Subgraph listing operation • Related work • PSgL framework • Evaluation • Conclusion

  3. Motivation • Triangle counting in SN • Cascades counting in RN • Motif detection in bioinformatics

  4. Problem Definition • Subgraph listing operation • Input: a pattern graph and a data graph (both undirected) • Output: all occurrences of the pattern graph in the data graph • Goal of our work • Efficiently list subgraphs in a large-scale graph (Figure: a pattern graph and a data graph)
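To make the definition concrete, here is a brute-force sketch in Python. It is only an illustration of the input and output, not the method proposed in the talk; the function name and the adjacency-dictionary representation are assumptions made for this example.

```python
from itertools import permutations

def list_occurrences(pattern_edges, num_pattern_vertices, data_adj):
    """Brute force: try every injective mapping of pattern vertices 0..n-1
    to data vertices and keep those that preserve all pattern edges."""
    results = []
    for image in permutations(data_adj, num_pattern_vertices):
        if all(image[b] in data_adj[image[a]] for a, b in pattern_edges):
            results.append(image)
    return results

# Pattern graph: a triangle on pattern vertices 0, 1, 2.
triangle = [(0, 1), (1, 2), (0, 2)]
# Data graph: triangle 1-2-3 plus a pendant edge 3-4 (undirected adjacency sets).
data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
matches = list_occurrences(triangle, 3, data)
```

Because no symmetry breaking is applied, the single triangle in this data graph is reported once per ordering of its vertices.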

  5. Related Work • Centralized algorithms • Enumerate one by one [Chiba ’85, Wernicke ’06, Grochow ’07] • Streaming algorithms • Only count occurrences, and the results are inaccurate [Buriol ’06, Bordino ’08, Zhao ’10] • MapReduce-based parallel algorithms • Decompose pattern graph + explicit join operation [Afrati ’13] • Fixed exploration plan + implicit join operation [Plantenga ’13] • Other efficient algorithms for specific pattern graphs • Triangle [Suri ’11, Chu ’11, Hu ’13]

  6. Drawbacks in existing parallel solutions • MapReduce is not well suited to graph processing. • The join operation is expensive. • They do not balance the distribution of data: • Data graph • Intermediate results. The novel PSgL framework lists subgraphs via graph traversal on a native graph stored in memory.

  7. Contributions • We propose an efficient parallel subgraph listing framework, PSgL. • We introduce a cost model for subgraph listing in PSgL. • We propose a simple but effective workload-aware distribution strategy, which helps PSgL achieve good workload balance. • We design three independent mechanisms to reduce the size of intermediate results.

  8. Partial subgraph instance • A data structure that records the mapping between the pattern graph and the data graph. • Denoted by Gpsi. • Assuming the vertices of the pattern graph are numbered from 1 to |Vp|, we simply write a Gpsi as {map(1), map(2), ..., map(|Vp|)}. Examples: {?,?,?,?}, {2,3,4,5}, {1,5,6,?}
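A minimal Python sketch of this data structure, assuming an unmapped pattern vertex (shown as "?" on the slide) is stored as None and that indices are 0-based. The helper names are hypothetical.

```python
from typing import List, Optional

# A Gpsi is a list indexed by pattern vertex; entry i is the data vertex
# mapped to pattern vertex i + 1, or None if it is not mapped yet ("?").
Gpsi = List[Optional[int]]

def new_gpsi(num_pattern_vertices: int) -> Gpsi:
    """Create an empty instance in which no pattern vertex is mapped."""
    return [None] * num_pattern_vertices

def is_complete(gpsi: Gpsi) -> bool:
    """True once every pattern vertex has a mapped data vertex."""
    return all(v is not None for v in gpsi)

def show(gpsi: Gpsi) -> str:
    """Render an instance in the slide's notation, e.g. {1,5,6,?}."""
    return "{" + ",".join("?" if v is None else str(v) for v in gpsi) + "}"
```

With this representation, {2,3,4,5} is complete, while {1,5,6,?} still needs its fourth pattern vertex mapped.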

  9. Independence Property • Gpsi tree • A node is a Gpsi. • The children of a node are derived by expanding one mapped data vertex in the node. • Characteristics • A Gpsi encodes a set of results. • Gpsi's are independent of each other, except for the ones on the same generation path. (Figure: Gpsi tree)

  10. PSgL: Parallel Subgraph Listing Framework • PSgL follows the popular graph processing paradigm • vertex-centric model • BSP model • PSgL iteratively generates Gpsi's in parallel; • Each Gpsi is expanded by a data vertex.

  11. Algorithm of Expanding a Gpsi - I • The partial pattern graph encodes • the pattern graph, • the Gpsi, • the progress state. • Three types of vertices • A BLACK vertex has been expanded. • A GRAY vertex has a mapped data vertex but has not been expanded. • A WHITE vertex has not been mapped to any data vertex.

  12. Algorithm of Expanding a Gpsi - II • Main logic • Changes one GRAY vertex into BLACK; • Validates the expanding vertex’s GRAY neighbors; • Turns the expanding vertex’s WHITE neighbors GRAY. • Two observations • In each expansion, at least one pattern vertex is processed. • All GRAY vertices are valid candidates for the next expansion. Example: expanding vertex <4, 2>
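The three steps of the main logic can be sketched in Python as follows. This is a sequential illustration under several assumptions: the pattern graph is connected, graphs are adjacency dictionaries, the mapping is a dictionary from pattern vertex to data vertex, and expansion continues until no GRAY vertex remains. The names `expand` and `list_all` are hypothetical, and no symmetry breaking or pruning is applied.

```python
from itertools import product

WHITE, GRAY, BLACK = "WHITE", "GRAY", "BLACK"

def expand(pattern, data, mapping, color, vp):
    """Expand GRAY pattern vertex vp; return the child (mapping, color) pairs."""
    vd = mapping[vp]
    color = dict(color)
    color[vp] = BLACK                       # step 1: GRAY -> BLACK
    whites = []
    for u in pattern[vp]:
        if color[u] == GRAY and mapping[u] not in data[vd]:
            return []                       # step 2: a GRAY neighbor is invalid
        if color[u] == WHITE:
            whites.append(u)
    used = set(mapping.values())
    candidates = [v for v in data[vd] if v not in used]
    children = []
    for choice in product(candidates, repeat=len(whites)):
        if len(set(choice)) < len(choice):
            continue                        # keep the mapping injective
        child_map, child_color = dict(mapping), dict(color)
        for u, v in zip(whites, choice):
            child_map[u] = v
            child_color[u] = GRAY           # step 3: WHITE -> GRAY
        children.append((child_map, child_color))
    return children

def list_all(pattern, data, start):
    """Iteratively expand all instances, starting from pattern vertex `start`."""
    frontier = []
    for v in data:
        color = {u: WHITE for u in pattern}
        color[start] = GRAY
        frontier.append(({start: v}, color))
    results = []
    while frontier:
        next_frontier = []
        for mapping, color in frontier:
            grays = [u for u in pattern if color[u] == GRAY]
            if not grays:                   # every pattern vertex is BLACK
                results.append(mapping)
                continue
            next_frontier.extend(expand(pattern, data, mapping, color, grays[0]))
        frontier = next_frontier
    return results

pattern = {1: [2, 3], 2: [1, 3], 3: [1, 2]}             # triangle
data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}     # triangle plus edge 3-4
found = list_all(pattern, data, start=1)
```

An edge between two WHITE neighbors is not checked when they are mapped; it is validated later, when one of them is expanded and the other is its GRAY neighbor.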

  13. Efficiency of PSgL • Total cost formula (legend): # of iterations S; # of workers K; # of Gpsi's processed by worker k; cost load() of processing a Gpsi • Three metrics: • The number of iterations. • S is bounded by |MVC| ≤ S ≤ |Vp| - 1, where MVC is a minimum vertex cover of the pattern graph. • Workload balance. • Required by the max function. • The number of Gpsi's. *Refer to the paper for the details of estimating load().
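From the labels on this slide, the total cost plausibly has the shape below: a sum over iterations of the load on the most heavily loaded worker. The symbol N_k^{(s)} (the number of Gpsi's processed by worker k in iteration s) and the exact indexing are assumptions made for this sketch, not notation taken from the slide.

```latex
\text{Total cost} \;=\; \sum_{s=1}^{S} \; \max_{1 \le k \le K} \; \sum_{j=1}^{N_k^{(s)}} \mathrm{load}\!\left(G_{psi}^{(k,s,j)}\right)
```

Read this way, the three metrics map onto the formula directly: S is the outer sum's range, workload balance controls the max over the K workers, and the number of Gpsi's controls the size of the inner sum.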

  14. Workload balance - I • Partial subgraph instance distribution problem • There are N Gpsi's to be processed by K workers; the goal is to find a distribution strategy that balances the workload across the workers. • NP-hard problem! • Naive solutions • Random distribution strategy • Roulette wheel distribution strategy • A Gpsi has a higher probability of being expanded by a data vertex with a smaller degree.
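The two naive strategies can be sketched in Python as below. The slide only says that smaller-degree vertices are more likely to be chosen; weighting each candidate by 1/degree is an assumption made for this sketch, and the function names are hypothetical.

```python
import random

def random_pick(candidates, rng=random):
    """Random strategy: every candidate data vertex is equally likely."""
    return rng.choice(candidates)

def roulette_pick(candidates, degree, rng=random):
    """Roulette wheel strategy: pick a candidate with probability
    proportional to 1/degree (assumed weighting), favoring small degrees."""
    weights = [1.0 / max(degree[v], 1) for v in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Example: vertex 10 has degree 1, vertex 20 has degree 99.
rng = random.Random(0)
picks = [roulette_pick([10, 20], {10: 1, 20: 99}, rng) for _ in range(1000)]
```

Here `candidates` stands for the data vertices that could expand a given Gpsi, which determines the worker that receives it.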

  15. Workload balance - II • Workload-aware distribution strategy • A general greedy-based heuristic rule. All three strategies have the same worst-case bound, K*|OPT|. In practice, however, α = 0.5 performs best.
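The slide does not spell out the heuristic rule or the role of α, so the Python sketch below swaps in a plain greedy stand-in: each Gpsi goes to whichever of its candidate workers currently has the smallest accumulated load. The function name, the input format, and the load estimates are all assumptions for illustration.

```python
def greedy_assign(items, num_workers):
    """items: list of (estimated_load, candidate_workers) pairs.
    Assign each item to its least-loaded candidate worker.
    Returns (assignment, loads)."""
    loads = [0.0] * num_workers
    assignment = []
    for estimated_load, candidates in items:
        k = min(candidates, key=lambda w: loads[w])  # ties go to the first candidate
        loads[k] += estimated_load
        assignment.append(k)
    return assignment, loads

items = [(3, [0, 1]), (2, [0, 1]), (2, [1]), (1, [0, 1])]
assignment, loads = greedy_assign(items, num_workers=2)
```

In this example both workers end up with a load of 4, even though the third item can only go to worker 1.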

  16. Comparison among various approaches (Figure panels: Random, Roulette, and the workload-aware strategy for different values of α)

  17. Partial subgraph instance reduction - I • Pattern graph automorphism breaking • Use DFS to find the equivalent vertex groups • Assign a partial order to each equivalent vertex group • Initial pattern vertex selection • Introduce a cost model • General pattern graph • Enumerate all possible selections based on the cost model • Cycle and clique • The vertex with the lowest rank is the best one. (Figure labels: Automorphism Breaking; Initial Pattern Vertex Selection based on cost model; Cost Model; Best Initial Pattern Vertex)

  18. Partial subgraph instance reduction - II • Online pruning of invalid Gpsi's • Filter by the partial order and degree restriction • Prune with the help of a light-weight global edge index • Use a bloom filter to index the ends of an edge (Figure labels: PG1, PG4, PG5)
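A minimal Python sketch of a bloom-filter edge index of the kind this slide describes. The class name, bit-array size, number of hash functions, and the use of SHA-256 are assumptions for illustration; the slide does not specify how the index is built.

```python
import hashlib

class EdgeBloomFilter:
    """Approximate membership test for undirected edges.
    A negative answer is always correct; a positive answer may be a false positive."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, u, v):
        a, b = (u, v) if u <= v else (v, u)   # canonical order for undirected edges
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{a}:{b}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, u, v):
        for p in self._positions(u, v):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, u, v):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(u, v))

index = EdgeBloomFilter()
for u, v in [(1, 2), (2, 3), (3, 4)]:
    index.add(u, v)
```

Such an index lets a worker discard a Gpsi as soon as a required edge is definitely absent, without fetching the adjacency list of a remote vertex; edges reported as possibly present still need exact validation during expansion.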

  19. Evaluation - Comparing to MR solutions • Afrati and SGIA-MR are the state-of-the-art MapReduce solutions. • Ratios exceeding 100 times are not visualized. (Figure annotations: PSgL: 4302s; Afrati: 7291s)

  20. Evaluation - Comparing to GraphLab * using a different traversal order.

  21. Conclusion • Subgraph listing is a fundamental operation for massive graph analysis. • We propose an efficient parallel subgraph listing framework, PSgL. • Various distribution strategies • Cost model • Light-weight global edge index • The workload-aware distribution strategy can be extended to other balance problems. • A new execution engine is required for larger pattern graphs.

  22. Thanks!

  23. Backup Expr. – Scalability of PSgL (Figure: Performance vs. Worker Number)

  24. Backup Expr. – Initial pattern vertex selection (Figure panels: Random graph; LiveJournal. Caption: Influence of the Initial Pattern Vertex on Various Data Graphs)
