
Parallel Graph Algorithms



  1. Parallel Graph Algorithms Kamesh Madduri KMadduri@lbl.gov Lawrence Berkeley National Laboratory CS267, Spring 2011 April 7, 2011

  2. Lecture Outline • Applications • Designing parallel graph algorithms, performance on current systems • Case studies: Graph traversal-based problems, parallel algorithms • Breadth-First Search • Single-source Shortest paths • Betweenness Centrality

  3. Routing in transportation networks • Road networks, point-to-point shortest paths: 15 seconds (naïve) → 10 microseconds H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science, 27 April 2007.

  4. Internet and the WWW • The World Wide Web can be represented as a directed graph • Web search and crawl: traversal • Link analysis, ranking: PageRank and HITS • Document classification and clustering • Internet topologies (router networks) are naturally modeled as graphs

  5. Scientific Computing • Reorderings for sparse solvers • Fill reducing orderings • Partitioning, eigenvectors • Heavy diagonal to reduce pivoting (matching) • Data structures for efficient exploitation of sparsity • Derivative computations for optimization • graph colorings, spanning trees • Preconditioning • Incomplete Factorizations • Partitioning for domain decomposition • Graph techniques in algebraic multigrid • Independent sets, matchings, etc. • Support Theory • Spanning trees & graph embedding techniques Image source: Yifan Hu, “A gallery of large graphs” Image source: Tim Davis, UF Sparse Matrix Collection. B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

  6. Large-scale data analysis • Graph abstractions are very useful for analyzing complex data sets. • Sources of data: petascale simulations, experimental devices, the Internet, sensor networks • Challenges: data size, heterogeneity, uncertainty, data quality • Social informatics: new analytics challenges, data uncertainty • Astrophysics: massive datasets, temporal variations • Bioinformatics: data quality, heterogeneity Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

  7. Data Analysis and Graph Algorithms in Systems Biology • Study of the interactions between various components in a biological system • Graph-theoretic formulations are pervasive: • Predicting new interactions: modeling • Functional annotation of novel proteins: matching, clustering • Identifying metabolic pathways: paths, clustering • Identifying new protein complexes: clustering, centrality Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.

  8. Graph-theoretic problems in social networks • Targeted advertising: clustering and centrality • Studying the spread of information Image Source: Nexus (Facebook application)

  9. Network Analysis for Intelligence and Surveillance • [Krebs ’04] Post-9/11 terrorist network analysis from public domain information • Plot masterminds correctly identified from interaction patterns: centrality • A global view of entities is often more insightful • Detect anomalous activities by exact/approximate subgraph isomorphism Image Source: http://www.orgnet.com/hijackers.html Image Source: T. Coffman, S. Greenblatt, S. Marcus, “Graph-based technologies for intelligence analysis”, CACM, 47(3), March 2004, pp. 45-47

  10. Research in Parallel Graph Algorithms • Application areas: social network analysis (find central entities, community detection, network dynamics, marketing, social search), WWW, computational biology (gene regulation, metabolic pathways, genomics), scientific computing, engineering (VLSI CAD, route planning) • Methods/problems: traversal, shortest paths, connectivity, max flow, graph partitioning, matching, coloring, … • Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds • Axes of difficulty: data size and problem complexity

  11. Characterizing graph-theoretic computations • Factors that influence the choice of algorithm: • Input data: graph sparsity (m/n ratio), static/dynamic nature, weighted/unweighted and weight distribution, vertex degree distribution, directed/undirected, simple/multi/hyper graph, problem size, granularity of computation at nodes/edges, domain-specific characteristics • Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, … • Problem: find paths, clusters, partitions, matchings, patterns, orderings • Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations

  12. Lecture Outline • Applications • Designing parallel graph algorithms, performance on current systems • Case studies: Graph traversal-based problems, parallel algorithms • Breadth-First Search • Single-source Shortest paths • Betweenness Centrality

  13. History • 1735: “Seven Bridges of Königsberg” problem, resolved by Euler, one of the first graph theory results. • … • 1966: Flynn’s Taxonomy. • 1968: Batcher’s “sorting networks” • 1969: Harary’s “Graph Theory” • … • 1972: Tarjan’s “Depth-first search and linear graph algorithms” • 1975: Reghbati and Corneil, Parallel Connected Components • 1982: Misra and Chandy, distributed graph algorithms. • 1984: Quinn and Deo’s survey paper on “parallel graph algorithms” • …

  14. The PRAM model • Idealized parallel shared memory system model • Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead • EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write) • Measuring performance: space and time complexity; total number of operations (work)

  15. PRAM Pros and Cons • Pros • Simple and clean semantics. • The majority of theoretical parallel algorithms are designed using the PRAM model. • Independent of the communication network topology. • Cons • Not realistic, too powerful communication model. • Algorithm designer is misled to use IPC without hesitation. • Synchronized processors. • No local memory. • Big-O notation is often misleading.

  16. PRAM Algorithms for Connected Components • Reghbati and Corneil [1975]: O(log² n) time and O(n³) processors • Wyllie [1979]: O(log² n) time and O(m) processors • Shiloach and Vishkin [1982]: O(log n) time and O(m) processors • Reif and Spirakis [1982]: O(log log n) time and O(n) processors (expected)

  17. Building blocks of classical PRAM graph algorithms • Prefix sums • Symmetry breaking • Pointer jumping • List ranking • Euler tours • Vertex collapse • Tree contraction

  18. Data structures: graph representation • Static case • Dense graphs (m = O(n²)): adjacency matrix commonly used • Sparse graphs: adjacency lists • Dynamic case • Representation depends on the common-case query • Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates? • Graph update rate • Queries: connectivity, paths, flow, etc. • Optimizing for locality is a key design consideration.

  19. Graph representation • Compressed Sparse Row (CSR)-like: per-vertex degrees and adjacencies, with the adjacency arrays flattened into one contiguous array • Index into adjacency array: size n+1 • Adjacencies array: size 2*m (each undirected edge stored in both directions) [Figure: example 10-vertex graph and its flattened CSR arrays]
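To make the CSR-like layout concrete, here is a minimal sketch (not from the slides) that builds the index and adjacency arrays for an undirected graph given as an edge list; the names `build_csr`, `index`, and `adj` are illustrative.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Build a CSR-like representation of an undirected graph.
// index: size n+1; index[v]..index[v+1] delimits v's adjacencies.
// adj:   size 2*m; the flattened adjacency arrays.
void build_csr(int n, const std::vector<std::pair<int,int>>& edges,
               std::vector<int>& index, std::vector<int>& adj) {
    index.assign(n + 1, 0);
    for (const auto& e : edges) {      // count degrees
        ++index[e.first + 1];
        ++index[e.second + 1];
    }
    for (int v = 0; v < n; ++v)        // prefix sum -> offsets
        index[v + 1] += index[v];
    adj.resize(2 * edges.size());
    std::vector<int> pos(index.begin(), index.end() - 1);
    for (const auto& e : edges) {      // place each edge in both directions
        adj[pos[e.first]++]  = e.second;
        adj[pos[e.second]++] = e.first;
    }
}

int main() {
    std::vector<std::pair<int,int>> edges = {{0,1},{0,2},{1,2},{2,3}};
    std::vector<int> index, adj;
    build_csr(4, edges, index, adj);
    for (int v = 0; v < 4; ++v) {
        std::printf("vertex %d, degree %d:", v, index[v+1] - index[v]);
        for (int i = index[v]; i < index[v+1]; ++i)
            std::printf(" %d", adj[i]);
        std::printf("\n");
    }
}
```

Note that the vertex degree is recovered as index[v+1] - index[v], so no separate degree array is needed.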

  20. Distributed Graph representation • Each processor stores the entire graph (“full replication”) • Each processor stores n/p vertices and all adjacencies out of these vertices (“1D partitioning”) • How to create these “p” vertex partitions? • Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of smaller partition) • Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same

  21. 2D graph partitioning • Consider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graph • Assign each processor a sub-matrix (i.e., the edges within the sub-matrix) • Example: 9 vertices, 9 processors, 3x3 processor grid; the sparse matrix is flattened into a per-processor local graph representation
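The two distributed layouts (slides 20-21) can be summarized as owner functions; a hedged sketch, assuming contiguous block partitions and 0-indexed vertices (all names illustrative):

```cpp
// 1D partitioning: processor owner_1d(v) stores vertex v
// and all edges incident to it.
int owner_1d(int v, int n, int p) {
    int block = (n + p - 1) / p;       // ceil(n/p) vertices per processor
    return v / block;
}

// 2D partitioning: edge (u, v) viewed as nonzero A(u, v) of the
// adjacency matrix; logical processor grid is pr x pc with pr*pc = p.
int owner_2d(int u, int v, int n, int pr, int pc) {
    int rblock = (n + pr - 1) / pr;    // matrix rows per processor row
    int cblock = (n + pc - 1) / pc;    // matrix cols per processor column
    return (u / rblock) * pc + (v / cblock);
}
```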

  22. Data structures in (parallel) graph algorithms • A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree • Implementations are typically array-based for performance considerations. • Key data structure considerations in parallel graph algorithm design • Practical parallel priority queues • Space-efficiency • Parallel set/multiset operations, e.g., union, intersection, etc.

  23. Graph Algorithms on current systems • Concurrency • Simulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronization • Locality • Try to improve cache locality, but avoid “too much” superfluous computation • Work-efficiency • Is (Parallel time) * (# of processors) = (Serial work)?

  24. The locality challenge: “Large memory footprint, low spatial and temporal locality impede performance” • Serial performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache) • Input: synthetic R-MAT graphs (# of edges m = 8n) [Figure: performance drops ~5X between the regime with no last-level cache (LLC) misses and the regime with O(m) LLC misses]

  25. The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems” • Graph topology assumptions in classical algorithms do not match real-world datasets • Parallelization strategies at loggerheads with techniques for enhancing memory locality • Classical “work-efficient” graph algorithms may not fully exploit new architectural features • Increasing complexity of memory hierarchy, processor heterogeneity, wide SIMD. • Tuning implementation to minimize parallel overhead is non-trivial • Shared memory: minimizing overhead of locks, barriers. • Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication w/ computation.

  26. Lecture Outline • Applications • Designing parallel graph algorithms, performance on current systems • Case studies: Graph traversal-based problems, parallel algorithms • Breadth-First Search • Single-source Shortest paths • Betweenness Centrality

  27. Graph traversal (BFS) problem definition • Input: a graph and a source vertex • Output: distance of every vertex from the source vertex [Figure: example 10-vertex graph with BFS distances 1-4 from the source] • Memory requirements (# of machine words): • Sparse graph representation: m+n • Stack of visited vertices: n • Distance array: n
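For reference, a minimal serial BFS over the CSR arrays sketched earlier, producing the distance array described above; this is a baseline sketch, not the optimized code discussed on the following slides.

```cpp
#include <queue>
#include <vector>

// BFS from source s over a CSR graph; returns the distance from s
// for every vertex (-1 if unreachable).
std::vector<int> bfs(int n, const std::vector<int>& index,
                     const std::vector<int>& adj, int s) {
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    dist[s] = 0;
    q.push(s);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int i = index[u]; i < index[u + 1]; ++i) {
            int v = adj[i];
            if (dist[v] == -1) {       // first visit settles the distance
                dist[v] = dist[u] + 1;
                q.push(v);
            }
        }
    }
    return dist;
}
```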

  28. Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributions • Test data: real social network crawls (Mislove et al., IMC 2007) and synthetic R-MAT networks

  29. Parallel BFS Strategies • 1. Expand current frontier (level-synchronous approach, suited for low-diameter graphs): O(D) parallel steps; adjacencies of all vertices in the current frontier are visited in parallel • 2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs): path-limited searches from “super vertices”; APSP between “super vertices”
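A possible rendering of strategy 1 in OpenMP: each level's frontier is expanded in parallel, and an atomic compare-and-swap ensures exactly one thread claims each newly discovered vertex. This is an illustrative baseline sketch, not the tuned implementation evaluated later; all names are ours.

```cpp
#include <atomic>
#include <vector>

std::vector<int> bfs_level_sync(int n, const std::vector<int>& index,
                                const std::vector<int>& adj, int s) {
    std::vector<std::atomic<int>> dist(n);
    for (int v = 0; v < n; ++v) dist[v].store(-1);
    dist[s].store(0);
    std::vector<int> frontier = {s}, next;
    for (int level = 0; !frontier.empty(); ++level) {
        next.clear();
        #pragma omp parallel
        {
            std::vector<int> local;            // per-thread next frontier
            #pragma omp for nowait
            for (int i = 0; i < (int)frontier.size(); ++i) {
                int u = frontier[i];
                for (int j = index[u]; j < index[u + 1]; ++j) {
                    int v = adj[j], unvisited = -1;
                    // CAS: exactly one thread claims each vertex
                    if (dist[v].compare_exchange_strong(unvisited, level + 1))
                        local.push_back(v);
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
    }
    std::vector<int> out(n);
    for (int v = 0; v < n; ++v) out[v] = dist[v].load();
    return out;
}
```

The O(D) parallel steps correspond to the iterations of the outer loop; per-thread buffers merged under a critical section avoid contention on the shared next-frontier array.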

  30. A deeper dive into the “level synchronous” strategy • Locality (where are the random accesses originating from?) • 1. Ordering of vertices in the “current frontier” array, i.e., accesses to the adjacency indexing array; cumulative accesses O(n). • 2. Ordering of the adjacency list of each vertex; cumulative O(m). • 3. Sifting through adjacencies to check whether visited or not; cumulative accesses O(m). [Figure: for an example frontier, access pattern 1 into the index array is 53, 31, 74, 26; access patterns 2 and 3 into the d array are 0, 84, 0, 84, 93, 44, 63, 0, 0, 11]

  31. Performance Observations [Figures: BFS behavior on the Flickr and Youtube social networks, broken down into graph expansion and edge filtering]

  32. Improving locality: Vertex relabeling • Well-studied problem, slight differences in problem formulations • Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks • Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression • NP-hard problem, several known heuristics • We require fast, linear-work approaches • Existing ones: BFS- or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploiting overlap in adjacency lists, dimensionality reduction [Figure: sparse matrix nonzero pattern before and after relabeling]

  33. Improving locality: Optimizations • Recall: potential O(m) non-contiguous memory references in edge traversal (to check if a vertex is visited) • e.g., access order: 53, 31, 31, 26, 74, 84, 0, … • Objective: reduce TLB misses and private cache misses, exploit the shared cache • Optimizations (the first is sketched below): • Sort the adjacency lists of each vertex – helps order memory accesses, reduce TLB misses • Permute vertex labels – enhance spatial locality • Cache-blocked edge visits – exploit temporal locality
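A minimal sketch of the first optimization over the CSR arrays from before (names illustrative): sorting each adjacency list in place makes the subsequent visited-checks scan the d array in increasing vertex order.

```cpp
#include <algorithm>
#include <vector>

// Sort each vertex's adjacency list so that edge traversal
// touches the visited/d array in increasing vertex order.
void sort_adjacencies(int n, const std::vector<int>& index,
                      std::vector<int>& adj) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int v = 0; v < n; ++v)
        std::sort(adj.begin() + index[v], adj.begin() + index[v + 1]);
}
```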

  34. Improving locality: Cache blocking • Instead of processing the adjacencies of each vertex serially, exploit the sorted adjacency list structure with blocked accesses • Requires multiple passes through the frontier array, and tuning for the optimal block size (tune to the L2 cache size) • Metadata denotes the blocking pattern; high-degree vertices are processed separately • Note: the frontier array size may be O(n) [Figure: linear processing vs. the new cache-blocked approach over the frontier and adjacency (d) arrays]

  35. Vertex relabeling heuristic Similar to older heuristics, but tuned for small-world networks: • High percentage of vertices with (out) degrees 0, 1, and 2 in social and information networks => store adjacencies explicitly (in indexing data structure). • Augment the adjacency indexing data structure (with two additional words) and frontier array (with one bit) • Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking. • Form dense blocks around high-degree vertices • Reverse Cuthill-McKee, removing degree 1 and degree 2 vertices

  36. Architecture-specific Optimizations 1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight) • Speculative loads of the index array and adjacencies of frontier vertices will reduce compulsory cache misses (see the sketch after this list). 2. Aligning adjacency lists to optimize memory accesses • 16-byte aligned loads and stores are faster. • Alignment helps reduce cache misses due to fragmentation. • 16-byte aligned non-temporal stores (during creation of the new frontier) are fast. 3. SIMD SSE integer intrinsics to process “high-degree vertex” adjacencies. 4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsics have very low overhead) 5. Huge page support (significant TLB miss reduction) 6. NUMA-aware memory allocation exploiting the first-touch policy
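As an illustration of optimization 1, a hedged sketch of software prefetching in the BFS frontier-expansion loop using the GCC/Clang builtin `__builtin_prefetch`; the function name and the look-ahead distance `DIST` are hypothetical, and the right distance must be tuned per platform.

```cpp
#include <vector>

// Expand vertex u's adjacency list with software prefetching:
// speculatively load the distance-array entry of the adjacency
// DIST iterations ahead, so the demand load later hits in cache.
void expand_with_prefetch(int u, const std::vector<int>& index,
                          const std::vector<int>& adj,
                          std::vector<int>& dist, int level,
                          std::vector<int>& next) {
    const int DIST = 8;  // look-ahead distance (hypothetical; tune per platform)
    for (int j = index[u]; j < index[u + 1]; ++j) {
        if (j + DIST < index[u + 1])
            __builtin_prefetch(&dist[adj[j + DIST]], /*rw=*/0, /*locality=*/1);
        int v = adj[j];
        if (dist[v] == -1) {
            dist[v] = level + 1;
            next.push_back(v);
        }
    }
}
```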

  37. Experimental Setup • Intel Xeon 5560 (Core i7, “Nehalem”) • 2 sockets x 4 cores x 2-way SMT • 12 GB DRAM, 8 MB shared L3 • 51.2 GBytes/sec peak bandwidth • 2.66 GHz proc. Performance averaged over 10 different source vertices, 3 runs each.

  38. Impact of optimization strategies * Optimization speedup (performance on 4 cores) w.r.t. the baseline parallel approach, on a synthetic R-MAT graph (n = 2²³, m = 2²⁶)

  39. Cache locality improvement Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words) Theoretical count of the number of non-contiguous memory accesses: m+3n

  40. Parallel performance (Orkut graph) Execution time: 0.28 seconds (8 threads) Parallel speedup: 4.9 Speedup over baseline: 2.9 Graph: 3.07 million vertices, 220 million edges Single socket of Intel Xeon 5560 (Core i7)

  41. Performance Analysis • How well does my implementation match theoretical bounds? • When can I stop optimizing my code? • Begin with asymptotic analysis • Express work performed in terms of machine-independent performance counts • Add input graph characteristics in analysis • Use simple kernels or micro-benchmarks to provide an estimate of achievable peak performance • Relate observed performance to machine characteristics

  42. Graph 500 “Search” Benchmark (graph500.org) • BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16. • Evaluation criteria: largest problem size that can be solved on a system, minimum execution time. • Reference MPI, shared memory implementations provided. • NERSC Franklin system is ranked #2 on current list (Nov 2010). • BFS using 500 nodes of Franklin

  43. Lecture Outline • Applications • Designing parallel graph algorithms, performance on current systems • Case studies: Graph traversal-based problems, parallel algorithms • Breadth-First Search • Single-source Shortest paths • Betweenness Centrality

  44. Parallel Single-source Shortest Paths (SSSP) algorithms • Edge weights: concurrency is the primary challenge! • No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work • Parallel priority queues: relaxed heaps [DGST88], [BTZ98] • Ullman-Yannakakis randomized approach [UY90] • Meyer and Sanders, ∆-stepping algorithm [MS03] • Distributed memory implementations based on graph partitioning • Heuristics for load balancing and termination detection K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

  45. ∆-stepping algorithm [MS03] • Label-correcting algorithm: can relax edges from unsettled vertices as well • ∆-stepping: “approximate bucket implementation of Dijkstra’s algorithm” • ∆: bucket width • Vertices are ordered using buckets representing priority ranges of size ∆ • Each bucket may be processed in parallel
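A compact sequential sketch of ∆-stepping following the description above (assuming non-negative edge weights; all names ours). The body of the while loop is the "one parallel phase" that the next slides illustrate; a parallel version would relax the requests in R concurrently.

```cpp
#include <limits>
#include <utility>
#include <vector>

struct Edge { int to; double w; };

std::vector<double> delta_stepping(const std::vector<std::vector<Edge>>& g,
                                   int s, double delta) {
    const double INF = std::numeric_limits<double>::infinity();
    int n = (int)g.size();
    std::vector<double> d(n, INF);
    std::vector<std::vector<int>> buckets(1);
    auto relax = [&](int v, double dist) {
        if (dist < d[v]) {
            d[v] = dist;
            size_t b = (size_t)(dist / delta);      // target bucket
            if (b >= buckets.size()) buckets.resize(b + 1);
            buckets[b].push_back(v);                // stale copies skipped later
        }
    };
    relax(s, 0.0);
    for (size_t i = 0; i < buckets.size(); ++i) {
        std::vector<int> S;                         // deleted (settled) vertices
        while (!buckets[i].empty()) {               // one "parallel phase"
            std::vector<int> cur;
            cur.swap(buckets[i]);                   // clear the current bucket
            std::vector<std::pair<int,double>> R;   // light-edge requests
            for (int u : cur) {
                if ((size_t)(d[u] / delta) != i) continue;  // stale entry
                S.push_back(u);                     // remember deleted vertices
                for (const Edge& e : g[u])
                    if (e.w <= delta)               // inspect light edges
                        R.push_back({e.to, d[u] + e.w});
            }
            for (auto& r : R) relax(r.first, r.second);     // relax requests
        }
        for (int u : S)                             // heavy requests, once per bucket
            for (const Edge& e : g[u])
                if (e.w > delta)
                    relax(e.to, d[u] + e.w);
    }
    return d;
}
```

Light edges (weight ≤ ∆) may be re-relaxed within a phase, while heavy edges are relaxed once per bucket from the settled set S, exactly as in the step list on the next slide.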

  46. ∆-stepping algorithm: illustration (∆ = 0.1, say) • One parallel phase: • while (bucket is non-empty) • Inspect light edges • Construct a set of “requests” (R) • Clear the current bucket • Remember deleted vertices (S) • Relax request pairs in R • Relax heavy request pairs (from S) • Go on to the next bucket [Figure: 7-vertex example graph with edge weights between 0.01 and 0.56; d array entries for vertices 0-6 all ∞; all buckets empty]

  47. ∆-stepping algorithm: illustration (continued; same per-phase steps as above) [Figure: initialization inserts the source s into bucket 0 and sets d(s) = 0]

  48. ∆-stepping algorithm: illustration (continued) [Figure: inspecting the light edges out of bucket 0 produces the request R = {(2, 0.01)}; the bucket is cleared and vertex 0 is remembered in S]

  49. ∆-stepping algorithm: illustration (continued) [Figure: the request pair (2, 0.01) in R is about to be relaxed; S = {0}]

  50. ∆-stepping algorithm: illustration (continued) [Figure: relaxing the request sets d(2) = 0.01 and inserts vertex 2 into bucket 0]
