If you were plowing a field, which would you rather use: two oxen, or 1024 chickens? (attributed to Seymour Cray)
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Graph Processing Challenges [Hong, S. 2011]
• Poor locality
• Low compute-to-memory access ratio
• Large memory footprint
CPUs: caches + summary data structures; large memory (>128 GB)
GPUs: massive hardware multithreading; limited memory (6 GB)
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (4 billion edges)
Methodology
• Performance Model
– Predicts speedup
– Intuitive
• Totem
– A graph processing engine for hybrid systems
– Applies algorithm-agnostic optimizations
• Evaluation
– Predicted vs. achieved speedup
– Hybrid vs. symmetric systems
The Performance Model (I)
• Predicts the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)
• α = fraction of the edges offloaded to the GPU partition
• β = fraction of the edges that cross the partition boundary (communication volume)
• r_cpu = host processing rate (edges per second)
• c = b/m = communication rate over the bus: bandwidth b over per-edge state m (edges per second)
The Performance Model (II)
• Assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes => c ≈ 1 billion EPS
• r_cpu ≈ 0.5 BEPS (best reported single-node BFS performance [Agarwal, V. 2010])
• Worst-case β = 20% (e.g., a bipartite graph with |V| = 32M, |E| = 1B)
• It is beneficial to process the graph on a hybrid system if the communication overhead is kept low
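A minimal numeric sketch of how such a model can be evaluated. The speedup formula below is a hedged reconstruction, assuming the host partition dominates computation and communication is not overlapped with compute; it is not taken verbatim from the slides. α, β, r_cpu, and c are as defined above.

```python
# Sketch of the hybrid-offloading performance model (reconstructed, hedged).
# Assumption: the GPU processes its partition at least as fast as the host
# processes the remainder, so total time = host compute + communication.

def predicted_speedup(alpha, beta, r_cpu, c):
    """alpha: fraction of edges offloaded to the GPU
    beta:  fraction of edges that cross the partition (communicated)
    r_cpu: host processing rate (edges/second)
    c:     communication rate over the bus (edges/second), c = b / m
    """
    t_host_only = 1.0 / r_cpu                    # time per edge, host only
    t_hybrid = (1.0 - alpha) / r_cpu + beta / c  # host share + transfer
    return t_host_only / t_hybrid

# Numbers from the slides: b = 4 GB/s, m = 4 bytes => c = 1e9 edges/s,
# r_cpu = 0.5e9 edges/s, worst-case beta = 20%.
c = 4e9 / 4
r_cpu = 0.5e9
print(predicted_speedup(alpha=0.5, beta=0.20, r_cpu=r_cpu, c=c))  # ~1.67
print(predicted_speedup(alpha=0.5, beta=0.02, r_cpu=r_cpu, c=c))  # ~1.96
```

Dropping β from the 20% worst case to the 2% that aggregation achieves later in the deck moves the predicted speedup for the same α from roughly 1.7x toward the 2x headline number.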
Totem: Programming Model
Bulk Synchronous Parallel (BSP)
• Rounds of computation and communication phases
• Updates to remote vertices are delivered in the next round
• Partitions vote to terminate execution
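A minimal sketch of one BSP superstep loop, assuming hypothetical partition objects with compute/flush/merge/vote hooks. The names are illustrative, not Totem's actual API.

```python
# Minimal BSP superstep loop (illustrative; not Totem's actual API).
def bsp_run(partitions):
    rounds = 0
    while not all(p.votes_to_halt() for p in partitions):
        for p in partitions:
            p.compute()        # computation phase: update local state only
        for p in partitions:
            p.flush_outbox()   # communication phase: push buffered updates
        for p in partitions:
            p.merge_inbox()    # remote updates become visible next round
        rounds += 1
    return rounds
```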
Totem: A BSP-based Engine
• Compressed sparse row (CSR) representation
• Computation: kernel manipulates local state
• Communication (1): transfer outbox buffer to remote inbox buffer
• Communication (2): merge with local state
• Updates to remote vertices are aggregated locally
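For concreteness, a tiny sketch of the compressed sparse row layout. The arrays are illustrative, not Totem's internal structures.

```python
# Tiny CSR example (illustrative arrays, not Totem's internals).
# Vertex v's neighbors are columns[row_offsets[v] : row_offsets[v + 1]].
row_offsets = [0, 2, 5, 6]        # 3 local vertices
columns     = [1, 2, 0, 2, 3, 0]  # id 3 lives in a remote partition

def neighbors(v):
    return columns[row_offsets[v]:row_offsets[v + 1]]

print(neighbors(1))  # [0, 2, 3] -- the edge to 3 goes through the outbox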
The Aggregation Opportunity
• Real-world graphs are mostly scale-free: skewed degree distribution
• Random sparse graph (|E| = 512 million): ~5x reduction in communication
• A denser graph has a better opportunity for aggregation: ~50x reduction
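A small sketch of why skew matters: messages destined to the same remote vertex collapse into one aggregated update, so the reduction factor is the number of boundary edges divided by the number of distinct remote targets. The numbers below are toy values for illustration, not from the paper.

```python
# Reduction factor = (# boundary edges) / (# distinct remote vertices).
def reduction_factor(boundary_edges):
    """boundary_edges: list of remote destination vertex ids."""
    return len(boundary_edges) / len(set(boundary_edges))

# A skewed destination distribution aggregates far better than a uniform
# one: many edges point at the same few high-degree vertices.
skewed  = [0] * 90 + list(range(1, 11))  # 100 edges -> 11 targets
uniform = list(range(100))               # 100 edges -> 100 targets
print(reduction_factor(skewed))   # ~9.1x reduction
print(reduction_factor(uniform))  # 1.0x (no reduction)
```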
Evaluation Setup
• Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
• Algorithms: Breadth-First Search, PageRank
• Metric: speedup compared to processing on the host only
• Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB
Predicted vs. Achieved Speedup
• Speedup grows linearly with the offloaded part, until the GPU partition fills GPU memory
• After aggregation, β = 2%; such a low value is critical for BFS
Breakdown of Execution Time
• Aggregation significantly reduces communication overhead
• PageRank is dominated by the compute phase
• The GPU is >5x faster than the host
Effect of Graph Density
• Sparser graphs have higher β
• The model deviates because it does not incorporate pre- and post-transfer overheads
Contributions
• Performance modeling
– Simple
– Useful for initial system provisioning
• Totem
– Generic graph processing framework
– Algorithm-agnostic optimizations
• Evaluation (Graph500 scale-28)
– 2x speedup over a symmetric system
– 1.13 billion TEPS on a dual-socket, dual-GPU system
Questions?
• Code available at: netsyslab.ece.ubc.ca