If you were plowing a field, which would you rather use: two oxen, or 1024 chickens? (attributed to Seymour Cray)
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Graph Processing Challenges [Hong, S. 2011]
• Poor locality
• Low compute-to-memory access ratio
• Large memory footprint
CPUs: caches + summary data structures; large memory (>128 GB)
GPUs: massive hardware multithreading; limited memory (6 GB)
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (4 billion edges)
Methodology
• Performance Model
– Predicts speedup
– Intuitive
• Totem
– A graph processing engine for hybrid systems
– Applies algorithm-agnostic optimizations
• Evaluation
– Predicted vs. achieved speedup
– Hybrid vs. symmetric systems
The Performance Model (I)
• Predicts the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)
• α = fraction of the edges offloaded to the GPU partition
• β = fraction of the edges that cross the partition boundary (communication volume)
• r_cpu = host processing rate (edges per second)
• c = b/m = communication rate over the bus: bandwidth b over per-edge state m (edges per second)
The Performance Model (II)
• Assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes => c ≈ 1 billion EPS
• r_cpu ≈ 0.5 BEPS (best reported single-node BFS performance [Agarwal, V. 2010])
• Worst-case β = 20% (e.g., a bipartite graph with |V| = 32M, |E| = 1B)
• It is beneficial to process the graph on a hybrid system if the communication overhead is kept low
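A minimal numeric sketch of how such a model can be evaluated. The speedup formula below is a hedged reconstruction, assuming the host partition dominates computation and communication is not overlapped with compute; it is not taken verbatim from the slides. α, β, r_cpu, and c are as defined above.

```python
# Sketch of the hybrid-offloading performance model (reconstructed, hedged).
# Assumption: the GPU processes its partition at least as fast as the host
# processes the remainder, so total time = host compute + communication.

def predicted_speedup(alpha, beta, r_cpu, c):
    """alpha: fraction of edges offloaded to the GPU
    beta:  fraction of edges that cross the partition (communicated)
    r_cpu: host processing rate (edges/second)
    c:     communication rate over the bus (edges/second), c = b / m
    """
    t_host_only = 1.0 / r_cpu                    # time per edge, host only
    t_hybrid = (1.0 - alpha) / r_cpu + beta / c  # host share + transfer
    return t_host_only / t_hybrid

# Numbers from the slides: b = 4 GB/s, m = 4 bytes => c = 1e9 edges/s,
# r_cpu = 0.5e9 edges/s, worst-case beta = 20%.
c = 4e9 / 4
r_cpu = 0.5e9
print(predicted_speedup(alpha=0.5, beta=0.20, r_cpu=r_cpu, c=c))  # ~1.67
print(predicted_speedup(alpha=0.5, beta=0.02, r_cpu=r_cpu, c=c))  # ~1.96
```

Dropping β from the 20% worst case to the 2% that aggregation achieves later in the deck moves the predicted speedup for the same α from roughly 1.7x toward the 2x headline number.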
Totem: Programming Model
Bulk Synchronous Parallel (BSP)
• Rounds of computation and communication phases
• Updates to remote vertices are delivered in the next round
• Partitions vote to terminate execution
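A minimal sketch of one BSP superstep loop, assuming hypothetical partition objects with compute/flush/merge/vote hooks. The names are illustrative, not Totem's actual API.

```python
# Minimal BSP superstep loop (illustrative; not Totem's actual API).
def bsp_run(partitions):
    rounds = 0
    while not all(p.votes_to_halt() for p in partitions):
        for p in partitions:
            p.compute()        # computation phase: update local state only
        for p in partitions:
            p.flush_outbox()   # communication phase: push buffered updates
        for p in partitions:
            p.merge_inbox()    # remote updates become visible next round
        rounds += 1
    return rounds
```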
Totem: A BSP-based Engine
• Compressed sparse row (CSR) representation
• Computation: kernel manipulates local state
• Communication (1): transfer outbox buffer to remote inbox buffer
• Communication (2): merge with local state
• Updates to remote vertices are aggregated locally
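For concreteness, a tiny sketch of the compressed sparse row layout. The arrays are illustrative, not Totem's internal structures.

```python
# Tiny CSR example (illustrative arrays, not Totem's internals).
# Vertex v's neighbors are columns[row_offsets[v] : row_offsets[v + 1]].
row_offsets = [0, 2, 5, 6]        # 3 local vertices
columns     = [1, 2, 0, 2, 3, 0]  # id 3 lives in a remote partition

def neighbors(v):
    return columns[row_offsets[v]:row_offsets[v + 1]]

print(neighbors(1))  # [0, 2, 3] -- the edge to 3 goes through the outbox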
The Aggregation Opportunity
• Real-world graphs are mostly scale-free: skewed degree distribution
• Random sparse graph (|E| = 512 million): ~5x reduction in communication
• A denser graph has a better opportunity for aggregation: ~50x reduction
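A small sketch of why skew matters: messages destined to the same remote vertex collapse into one aggregated update, so the reduction factor is the number of boundary edges divided by the number of distinct remote targets. The numbers below are toy values for illustration, not from the paper.

```python
# Reduction factor = (# boundary edges) / (# distinct remote vertices).
def reduction_factor(boundary_edges):
    """boundary_edges: list of remote destination vertex ids."""
    return len(boundary_edges) / len(set(boundary_edges))

# A skewed destination distribution aggregates far better than a uniform
# one: many edges point at the same few high-degree vertices.
skewed  = [0] * 90 + list(range(1, 11))  # 100 edges -> 11 targets
uniform = list(range(100))               # 100 edges -> 100 targets
print(reduction_factor(skewed))   # ~9.1x reduction
print(reduction_factor(uniform))  # 1.0x (no reduction)
```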
Evaluation Setup
• Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
• Algorithms: Breadth-First Search, PageRank
• Metric: speedup compared to processing on the host only
• Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB
Predicted vs. Achieved Speedup
• Speedup grows linearly with the offloaded part, until the GPU partition fills GPU memory
• After aggregation, β = 2%; such a low value is critical for BFS
Breakdown of Execution Time
• Aggregation significantly reduces communication overhead
• PageRank is dominated by the compute phase
• The GPU is >5x faster than the host
Effect of Graph Density
• Sparser graphs have higher β
• The model deviates because it does not incorporate pre- and post-transfer overheads
Contributions
• Performance modeling
– Simple
– Useful for initial system provisioning
• Totem
– Generic graph processing framework
– Algorithm-agnostic optimizations
• Evaluation (Graph500 scale-28)
– 2x speedup over a symmetric system
– 1.13 billion TEPS on a dual-socket, dual-GPU system
Questions?
• Code available at: netsyslab.ece.ubc.ca