
Streaming Graph Partitioning for Large Distributed Graphs


Presentation Transcript


  1. Streaming Graph Partitioning for Large Distributed Graphs Isabelle Stanton, UC Berkeley Gabriel Kliot, Microsoft Research XCG

  2. Motivation • Modern graph datasets are huge • The web graph had over a trillion links in 2011. Now? • Facebook has “more than 901 million users with average degree 130” • Protein networks

  3. Motivation • We still need to perform computations, so we have to deal with large data • PageRank (and other matrix-multiply problems) • Broadcasting status updates • Database queries • And on and on and on… • The graph has to be distributed across a cluster of machines!

  4. Motivation • Edges cut correspond (approximately) to communication volume required • Too expensive to move data on the network • Interprocessor communication: nanoseconds • Network communication: microseconds • The data has to be loaded onto the cluster at some point… • Can we partition while we load the data?
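The link between edge cut and communication can be made concrete: in a distributed computation, each cut edge costs one network message per superstep. A minimal sketch (the 4-cycle graph and the two assignments are made-up toy data):

```python
def edge_cut(edges, assignment):
    """Count edges whose endpoints were placed on different machines.

    Each such edge costs a network message per superstep of a
    distributed computation, so this approximates communication volume.
    """
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Toy example: a 4-cycle split across 2 machines.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = {0: 0, 1: 0, 2: 1, 3: 1}  # adjacent vertices grouped together
bad = {0: 0, 1: 1, 2: 0, 3: 1}   # adjacent vertices separated
print(edge_cut(edges, good))  # 2
print(edge_cut(edges, bad))   # 4
```

Both assignments are perfectly balanced; only the edge cut, and hence the network traffic, differs.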

  5. High Level Background • Graph partitioning is NP-hard on a good day • But then we made it harder: • Graphs like social networks are notoriously difficult to partition (expander-like) • Large data sets drastically reduce the amount of computation that is feasible – O(n) or less • The partitioning algorithms need to be parallel and distributed

  6. The Streaming Model • The graph arrives as a stream of vertices, each with its adjacency list; the partitioner may keep a buffer of bounded size • Each of the k machines holds roughly n/k nodes • The stream is ordered: Random, Breadth-First Search, or Depth-First Search • Goal: generate an approximately balanced k-partitioning
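The streaming model can be sketched as a loop: vertices arrive one at a time with their adjacency lists, and each is assigned irrevocably to one of k parts by a pluggable heuristic. A minimal single-machine sketch, with the simple "balanced" heuristic (always pick the least-loaded part) as the example; the toy stream is made-up data:

```python
def stream_partition(stream, k, heuristic):
    """Assign each arriving vertex to one of k parts, irrevocably.

    `stream` yields (vertex, neighbor_list) pairs; `heuristic`
    scores a candidate part given its current contents and the
    new vertex's neighbors.
    """
    parts = [set() for _ in range(k)]
    assignment = {}
    for v, neighbors in stream:
        i = max(range(k), key=lambda i: heuristic(parts[i], neighbors))
        parts[i].add(v)  # the decision is never revisited
        assignment[v] = i
    return assignment

# "Balanced" heuristic: ignore edges, prefer the least-loaded part.
balanced = lambda part, nbrs: -len(part)

stream = [(0, [1]), (1, [0, 2]), (2, [1, 3]), (3, [2])]
assignment = stream_partition(stream, 2, balanced)
print(assignment)
```

Smarter heuristics only change the scoring function; the one-pass, irrevocable structure of the loop is what makes the algorithm "streaming".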

  7. Lower Bounds on Orderings • Adversarial ordering: give every other vertex first, so no edges are seen until half the stream has arrived; greedy can't compete with the best balanced k-partition • DFS ordering: the stream is connected, and greedy will do optimally • Random ordering: by the birthday paradox, essentially no edges appear among the first ~√n vertices; greedy still can't compete • Theory says these types of algorithms can't do well

  8. Current Approach in Real Systems • Totally ignore edges and hash the vertex ID • Pro • Fast to locate data • Doesn't require a complex DHT or synchronization • Con • Hashing the vertex ID cuts an expected (1 - 1/k) fraction of the edges, for any order • A great, simple approximation for MAX-CUT

  9. Our Approach • Evaluate 16 natural heuristics on 21 datasets with each of the three orderings with varying numbers of partitions • Find out which heuristics work on each graph • Compare these with the results of • Random Hashing to get worst case • METIS to get ‘best’ offline performance

  10. Caveats • METIS is a heuristic, not a true lower bound • Does fine in practice • Available online for reproducing results • Used publicly available datasets • Public graph datasets tend to be much smaller than what companies have • Using meta-data for partitioning can be good • Partitioning the web graph by URL • Using geographic location for social network users

  11. Heuristics • Balanced • Chunking • Hashing • (weighted) Deterministic Greedy • (weighted) Randomized Greedy • Triangles • Buffered heuristics, which use a buffer of bounded size: Balance Big, Prefer Big, Avoid Big, Greedy EvoCut • Weight functions: unweighted, linearly weighted, exponentially weighted
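One reading of the linear-weighted deterministic greedy heuristic: assign each arriving vertex to the part containing the most of its neighbors, discounted linearly by how full that part already is. A minimal single-machine sketch under that reading; the least-loaded tie-break and the toy two-triangle stream are assumptions for illustration:

```python
def ldg_assign(parts, neighbors, capacity):
    """Linear-weighted deterministic greedy: score each part by
    (neighbors already in the part) * (1 - load / capacity).
    Ties are broken toward the least-loaded part (an assumption)."""
    nbrs = set(neighbors)
    def key(i):
        part = parts[i]
        return (len(part & nbrs) * (1.0 - len(part) / capacity), -len(part))
    return max(range(len(parts)), key=key)

def ldg_partition(stream, k, capacity):
    parts = [set() for _ in range(k)]
    assignment = {}
    for v, neighbors in stream:
        i = ldg_assign(parts, neighbors, capacity)
        parts[i].add(v)
        assignment[v] = i
    return assignment

# Two triangles joined by the edge (2, 3); each triangle stays together.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]),
          (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4])]
assignment = ldg_partition(stream, k=2, capacity=3)
print(assignment)
```

The capacity penalty is what keeps the greedy choice from piling every vertex onto one part: once a part is full, its score drops to zero no matter how many neighbors it holds.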

  12. Datasets • Includes finite element meshes, citation networks, social networks, web graphs, protein networks and synthetically generated graphs • Sizes: 297 vertices to 41.7 million vertices • Synthetic graph models • Barabasi-Albert (Preferential Attachment) • RMAT (Kronecker) • Watts-Strogatz • Power law-Clustered • Biggest graphs: LiveJournal and Twitter

  13. Experimental Method • For each graph, heuristic, and ordering, partition into 2, 4, 8, and 16 pieces • Compare with a random cut (upper bound) • Compare with METIS (lower bound) • Performance was measured by the fraction of edges cut relative to these baselines

  14. Heuristic Results • The best heuristic, Linear Deterministic Greedy (LDG), gets an average improvement of 76% over all datasets! [Figure: results under BFS, DFS, and Random orderings on synthetic, finite element mesh, and social network graphs, with Hash and METIS as baselines]

  15. Scaling in the Size of Graphs: Exploiting Synthetic Graphs [Figure: edge cut as graph size grows, comparing Hash, LDG, and METIS]

  16. More Observations • BFS is a superior ordering for all algorithms • Avoid Big does 46% WORSE on average than Random Cut • Further experiments showed Linear Det. Greedy has identical performance to Det. Greedy with load-based tie breaking.

  17. Results on a Real System • Compared the streamed partitioning with random hashing on SPARK, a distributed cluster computation system (http://www.spark-project.org/) • Used 2 datasets • 4.6 million users, 77 million edges • 41.7 million users, 1.468 billion edges • Computed the PageRank of each graph
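As a single-machine stand-in for the SPARK job (not the actual distributed implementation), plain power-iteration PageRank shows the computation being run; in the distributed setting every edge carries a message per iteration, which is why a smaller edge cut means less network traffic. A minimal sketch on a made-up 3-cycle:

```python
def pagerank(adj, iters=20, d=0.85):
    """Power-iteration PageRank over an adjacency dict {node: [out-nbrs]}.

    Each iteration sends rank mass along every edge; across machines,
    the cut edges are the ones that turn into network messages.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in adj}
        for v, outs in adj.items():
            if outs:
                share = d * rank[v] / len(outs)
                for u in outs:
                    nxt[u] += share
        rank = nxt
    return rank

# Toy 3-cycle: by symmetry every node converges to rank 1/3.
adj = {0: [1], 1: [2], 2: [0]}
r = pagerank(adj)
```

The "Combiner" variant mentioned on the next slide aggregates the per-edge messages bound for the same machine before sending, which is why it narrows, but does not erase, the gap between hashed and streamed partitions.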

  18. Results on SPARK • LiveJournal: 4.6 million users, 77 million edges • Twitter: 41.7 million users, 1.468 billion edges • LJ improvement: Naïve 38.7%, Combiner 28.8% • Twitter improvement: Naïve 19.1%, Combiner 18.8%

  19. Streaming graph partitioning is a really nice, simple, effective preprocessing step.

  20. isabelle@eecs.berkeley.edu Where to now? • Can we explain theoretically why the greedy algorithm performs so well?* • What heuristics work better? • What heuristics are optimal for different classes of graphs? • Use multiple parallel streams! • Implement in real systems! *Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs

  21. Acknowledgements • At MSR: David B. Wecker, Burton Smith, Reid Andersen, Nikhil Devanur, Sameh Elkinety, Sreenivas Gollapudi, Yuxiong He, Rina Panigrahy, Yuval Peres • At Berkeley (CS and Statistics): Satish Rao, Virginia Vassilevska Williams, Alexandre Stauffer, Ngoc Mai Tran, Miklos Racz, Matei Zaharia • Supported by NSF and NDSEG fellowships, NSF grant CCF-0830797, and an internship at Microsoft Research’s eXtreme Computing Group.
