
Parallel Graph Algorithms

Presentation Transcript


  1. Parallel Graph Algorithms Sathish Vadhiyar

  2. Graph Traversal • Graph search plays an important role in analyzing large data sets • Relationships between data objects are represented in the form of graphs • Breadth-first search is used to find shortest paths or sets of paths

  3. Level-synchronized algorithm • Proceeds level-by-level starting with the source vertex • Level of a vertex – its graph distance from the source • How to decompose the graph (vertices, edges and adjacency matrix) among processors?
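A minimal sequential sketch of the level-synchronized pattern; the adjacency-list representation and names are illustrative, not from the slides:

```python
def bfs_levels(adj, source):
    """Level-synchronized BFS: expand the graph one frontier at a time.

    adj    : dict mapping each vertex to an iterable of its neighbors
    source : starting vertex
    returns: dict mapping vertex -> level (graph distance from source)
    """
    level = {source: 0}
    frontier = [source]              # vertices discovered in the previous level
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []           # neighbors not yet assigned a level
        for v in frontier:
            for w in adj[v]:
                if w not in level:
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier     # synchronize: advance to the next level
    return level

# Example
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))            # {0: 0, 1: 1, 2: 1, 3: 2}
```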

  4. Distributed BFS with 1D Partitioning • Each vertex and the edges emanating from it are owned by one processor • 1-D partitioning of the adjacency matrix • The edges emanating from vertex v form its edge list – the list of vertex indices in row v of the adjacency matrix A

  5. 1-D Partitioning • At each level, each processor holds a set F of the frontier vertices it owns • The edge lists of the vertices in F are merged to form a set of neighboring vertices, N • Some vertices of N are owned by the same processor, while others are owned by other processors • Messages are sent to those processors to add these vertices to their frontier set for the next level
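A hedged sketch of one way this step could look with mpi4py; the cyclic ownership rule owner(v) = v % nprocs and the function names are illustrative assumptions, not part of the original formulation:

```python
from mpi4py import MPI

def bfs_1d(adj_local, source, comm):
    """Level-synchronized BFS with 1-D vertex partitioning.

    adj_local : dict holding the edge lists of the vertices owned by this rank
    Ownership is assumed to be owner(v) = v % nprocs (an illustrative choice).
    Returns the levels of the locally owned vertices.
    """
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    owner = lambda v: v % nprocs
    level, frontier = {}, []
    if owner(source) == rank:
        level[source] = 0
        frontier = [source]
    depth = 0
    while comm.allreduce(len(frontier), op=MPI.SUM) > 0:
        depth += 1
        # Merge the edge lists of the local frontier F into per-owner buckets
        outgoing = [[] for _ in range(nprocs)]
        for v in frontier:
            for w in adj_local[v]:
                outgoing[owner(w)].append(w)
        # Send neighbor vertices to their owners; receive the ones we own
        incoming = comm.alltoall(outgoing)
        frontier = []
        for w in set().union(*incoming):
            if w not in level:       # unvisited: joins the next frontier
                level[w] = depth
                frontier.append(w)
    return level
```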

  6. Lvs(v) – level of v, i.e., its graph distance from the source vertex vs

  7. 2D Partitioning • P = R×C processor mesh • The adjacency matrix is divided into R·C block rows and C block columns • A(i,j)(*) denotes a block owned by processor (i,j); each processor owns C blocks

  8. 2D Partitioning • Processor (i,j) owns the vertices belonging to block row (j-1)·R + i • Thus a processor stores some edges incident on its vertices, and some edges that are not

  9. 2D Partitioning • Assume that the edge list for a given vertex is the column of the adjacency matrix • Each block in the 2D partitioning contains partial edge lists • Each processor has a frontier set F of vertices owned by the processor

  10. 2D Partitioning – Expand Operation • Consider v in F • The owner of v sends messages to the other processors in its processor column announcing that v is in the frontier, since any of these processors may hold a partial edge list of v

  11. 2D Partitioning – Fold Operation • The partial edge lists on each processor are merged to form N – the potential vertices of the next frontier • Vertices in N are sent to their owners to form the new frontier set F on those processors • These owner processors are in the same processor row • This communication step is referred to as the fold operation

  12. Analysis • Advantage of 2D over 1D – processor-column and processor-row communications involve only R and C processors respectively, rather than all P processors as in the 1D case

  13. BFS on GPUs

  14. BFS on GPUs • One GPU thread per vertex • In each iteration, each vertex looks at its entry in the frontier array • If the entry is true, the vertex's thread adds its neighbors to the next frontier • Severe load imbalance among the threads • Scope for improvement
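A sequential stand-in for the per-vertex kernel, assuming a CSR (row_ptr/col_idx) graph; the outer loop plays the role of the GPU thread grid, so iteration v is the work thread v would do (names are illustrative):

```python
def bfs_gpu_iteration(row_ptr, col_idx, frontier, visited, cost):
    """One iteration of the one-thread-per-vertex BFS scheme.

    frontier, visited : boolean arrays of length n
    cost              : cost[source] = 0 before the first call
    Returns True if any vertex joined the next frontier.
    """
    n = len(frontier)
    next_frontier = [False] * n
    for v in range(n):                       # "thread v"
        if frontier[v]:                      # check my entry in the frontier array
            for e in range(row_ptr[v], row_ptr[v + 1]):
                w = col_idx[e]
                if not visited[w]:
                    cost[w] = cost[v] + 1
                    visited[w] = True
                    next_frontier[w] = True  # w is a frontier vertex next time
    frontier[:] = next_frontier
    return any(next_frontier)
```

The load imbalance mentioned above shows up here as the inner loop: a thread assigned a high-degree vertex does far more work than the other threads of its iteration.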

  15. Parallel Depth First Search • Easy to parallelize • Left subtree can be searched in parallel with the right subtree • Statically assign a node to a processor – the whole subtree rooted at that node can be searched independently. • Can lead to load imbalance; Load imbalance increases with the number of processors

  16. Dynamic Load Balancing (DLB) • Difficult to estimate the size of the search space beforehand • Need to balance the search space among processors dynamically • In DLB, when a processor runs out of work, it gets work from another processor

  17. Maintaining Search Space • Each processor searches the space depth-first • Unexplored states saved as stack; each processor maintains its own local stack • Initially, the entire search space assigned to one processor

  18. Work Splitting • When a processor receives a work request, it splits its search space • Half-split: the stack space is divided into two equal pieces – may result in load imbalance • Giving away stack space near the bottom of the stack can mean giving away big trees • Stack space near the top of the stack tends to hold small trees • To avoid sending very small amounts of work, nodes beyond a specified stack depth – the cutoff depth – are not given away

  19. Strategies • 1. Send nodes near the bottom of the stack • 2. Send nodes near the cutoff depth • 3. Send half the nodes between the bottom of the stack and the cutoff depth • Example: Figures 11.5(a) and 11.9
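A hedged sketch of strategy 3, where the stack holds (node, depth) pairs with the bottom of the stack at index 0; the alternating pick is one simple way to hand over roughly half of the donatable nodes (all names are illustrative):

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack to answer a work request (strategy 3).

    stack        : list of (node, depth) pairs, bottom of the stack first
    cutoff_depth : nodes at or beyond this depth are never given away,
                   since they tend to root very small subtrees
    Returns the donated entries and shrinks `stack` in place.
    """
    donatable = [i for i, (_, d) in enumerate(stack) if d < cutoff_depth]
    give = set(donatable[::2])                 # every other donatable node
    donated = [stack[i] for i in sorted(give)]
    stack[:] = [s for i, s in enumerate(stack) if i not in give]
    return donated
```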

  20. Load Balancing Strategies • Asynchronous round-robin: each processor maintains its own target processor to request work from; the target value is incremented modulo the number of processors after each request • Global round-robin: a single target variable is shared by all processors • Random polling: a donor is selected at random
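A small sketch of the donor-selection rules (illustrative names; the global round-robin variant would replace the per-process pointer with a single shared counter, which is why it needs extra synchronization):

```python
import random

class AsyncRoundRobin:
    """Asynchronous round-robin: each process keeps its own target pointer."""
    def __init__(self, my_rank, nprocs):
        assert nprocs > 1
        self.my_rank, self.nprocs = my_rank, nprocs
        self.target = (my_rank + 1) % nprocs
    def next_donor(self):
        while True:
            donor = self.target
            self.target = (self.target + 1) % self.nprocs  # increment modulo p
            if donor != self.my_rank:                      # never ask yourself
                return donor

def random_polling_donor(my_rank, nprocs):
    """Random polling: pick a donor uniformly among the other processes."""
    donor = random.randrange(nprocs - 1)
    return donor if donor < my_rank else donor + 1
```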

  21. Termination Detection • Dijkstra's Token Termination Detection Algorithm • Based on passing a token around a logical ring; P0 initiates the token when it is idle; a processor holds the token until it has completed its work, and then passes it to the next processor; when P0 receives the token again, all processors have completed • However, a processor may receive more work after becoming idle

  22. Algorithm Continued…. • This case is handled using white and black tokens • Initially the token is white; processor j becomes black if it sends work to a processor i < j • When a black processor j completes its work, it colors the token black before passing it to the next processor, and then turns itself white • When P0 receives a black token, it reinitiates the ring

  23. Tree Based Termination Detection • Uses weights • Initially processor 0 has weight 1 • When a processor transfers work to another processor, its weight is split in half between the two processors • When a processor finishes its work, it returns its weight • Termination is detected when processor 0 gets back a total weight of 1 • Piggybacks on the DFS work transfers; no separate communication steps • Figure 11.10
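A sequential sketch of the weight bookkeeping (illustrative; exact fractions stand in for the fixed-point or integer weights a real implementation would use to avoid underflow from repeated halving):

```python
from fractions import Fraction

class WeightTermination:
    """Weight-based termination detection, simulated in one process."""
    def __init__(self, nprocs):
        self.weight = [Fraction(0)] * nprocs
        self.weight[0] = Fraction(1)          # processor 0 starts with weight 1
        self.returned = Fraction(0)           # weight collected back at processor 0

    def transfer_work(self, src, dst):
        half = self.weight[src] / 2           # halve the sender's weight
        self.weight[src] -= half
        self.weight[dst] += half

    def finish(self, rank):
        self.returned += self.weight[rank]    # return the weight on completion
        self.weight[rank] = Fraction(0)

    def terminated(self):
        return self.returned == 1             # all weight is back: terminate

# Example: 0 gives work to 1, 1 gives part of it to 2, then all finish
t = WeightTermination(3)
t.transfer_work(0, 1); t.transfer_work(1, 2)
for r in range(3):
    t.finish(r)
print(t.terminated())                         # True
```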

  24. Minimal Spanning Tree, Single-Source and All-pairs Shortest Paths

  25. Minimal Spanning Tree – Prim's Algorithm • Spanning tree of a graph G(V,E) – a tree containing all vertices of G • MST – the spanning tree with the minimum sum of edge weights • Vertices are added to a set Vt that holds the vertices of the MST; initially Vt contains an arbitrary vertex r as the root vertex

  26. Minimal Spanning Tree – Prim's Algorithm • An array d such that d[v], for v in (V - Vt), holds the weight of the least-weight edge between v and any vertex in Vt; initially d[v] = w[r,v] • Find the vertex u with the minimum d[u] and add it to Vt • Update d • Time complexity – O(n²)
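A sequential sketch of the O(n²) formulation on an adjacency (weight) matrix, assuming a connected graph (names are illustrative):

```python
import math

def prim_mst(w, r=0):
    """Prim's MST on an n x n weight matrix w (math.inf means no edge).

    d[v] holds the weight of the lightest edge between v and the tree Vt;
    returns the parent array describing the MST. O(n^2) overall.
    """
    n = len(w)
    in_tree = [False] * n
    in_tree[r] = True                           # Vt = {r}, an arbitrary root
    d = [w[r][v] for v in range(n)]
    parent = [r if w[r][v] < math.inf else -1 for v in range(n)]
    parent[r] = -1
    for _ in range(n - 1):
        # pick the vertex outside the tree with the minimum d value
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        for v in range(n):                      # update d for the remaining vertices
            if not in_tree[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
                parent[v] = u
    return parent

# Example
INF = math.inf
w = [[0, 1, 3, INF],
     [1, 0, 1, 4],
     [3, 1, 0, 2],
     [INF, 4, 2, 0]]
print(prim_mst(w))   # [-1, 0, 1, 2]
```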

  27. Parallelization • The vertex set V and the d array are partitioned across p processors • Each processor finds the local minimum in its part of d • Then the global minimum over all of d is found by a reduction onto one processor • That processor determines the next vertex u and broadcasts it to all processors

  28. Parallelization • All processors update d; the processor owning u marks u as belonging to Vt • The process responsible for v must know w[u,v] to update d[v]; hence a 1-D block mapping of the adjacency matrix is used • Complexity – O(n²/p) for computation plus O(n log p) for communication
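A hedged mpi4py sketch of the selection step only (the reduction and broadcast); an allgather followed by a local min stands in for an MPI_MINLOC reduction, and all names are illustrative:

```python
from mpi4py import MPI

def next_mst_vertex(d_local, candidates, comm):
    """One selection step of parallel Prim's algorithm.

    d_local    : dict vertex -> current d[] value for the vertices this rank owns
    candidates : owned vertices that are not yet in Vt
    Every rank returns the same globally minimal vertex u.
    """
    if candidates:
        v = min(candidates, key=lambda x: d_local[x])
        local_best = (d_local[v], v)
    else:
        local_best = (float("inf"), -1)
    # allgather + min plays the role of an MPI_MINLOC reduction plus broadcast
    _, u = min(comm.allgather(local_best))
    return u
```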

  29. Single Source Shortest Path – Dijkstra's Algorithm • Finds the shortest path from the source vertex to all vertices • Follows a similar structure to Prim's algorithm • Instead of the d array, an array l holding the current shortest path lengths is maintained • The parallelization scheme is the same
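The only substantive change from Prim's update is the relaxation rule; a hedged sketch of that one step (names are illustrative):

```python
def dijkstra_relax(l, w, u, done):
    """Relaxation step of Dijkstra's SSSP after vertex u is finalized.

    Unlike Prim's d[v] = min(d[v], w[u][v]), the length l[v] accounts for
    the whole path through u, not just the edge (u, v).
    """
    for v in range(len(l)):
        if not done[v] and l[u] + w[u][v] < l[v]:
            l[v] = l[u] + w[u][v]
```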

  30. Single Source Shortest Path on GPUs

  31. SSSP on GPUs • A single kernel is not enough, since the cost array Ca cannot be updated while it is being accessed • Hence updated costs are written to a temporary array Ua
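A sequential stand-in for the two kernels of this scheme, assuming a CSR graph with per-edge weights; Ca and Ua are the cost and updating-cost arrays and mask marks vertices whose cost changed (a sketch, not the paper's exact kernels):

```python
def sssp_gpu_iteration(row_ptr, col_idx, weight, mask, Ca, Ua):
    """One SSSP iteration with the two-array (Ca/Ua) scheme.

    Before the first call: Ca[source] = 0, all other entries infinity,
    Ua a copy of Ca, and mask True only for the source.
    Returns True if any cost improved (i.e., another iteration is needed).
    """
    n = len(Ca)
    # "Kernel 1": each thread v relaxes its edges into Ua, never into Ca,
    # which other threads may still be reading in the same pass
    for v in range(n):
        if mask[v]:
            mask[v] = False
            for e in range(row_ptr[v], row_ptr[v + 1]):
                w = col_idx[e]
                Ua[w] = min(Ua[w], Ca[v] + weight[e])
    # "Kernel 2": commit the improvements and rebuild the mask
    changed = False
    for v in range(n):
        if Ua[v] < Ca[v]:
            Ca[v] = Ua[v]
            mask[v] = True
            changed = True
        Ua[v] = Ca[v]
    return changed
```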

  32. All-Pairs Shortest Paths • To find the shortest paths between all pairs of vertices • Dijkstra's algorithm for single-source shortest paths can be run from every vertex • Two approaches

  33. All-Pairs Shortest Paths • Source-partitioned formulation: partition the vertices across processors • Works well if p <= n; no communication • Can at best use only n processors • Time complexity? • Source-parallel formulation: parallelize SSSP for each vertex across a subset of processors, and do this for all vertices with different subsets of processors • A hierarchical formulation that exploits more parallelism • Time complexity?

  34. All-Pairs Shortest Paths – Floyd's Algorithm • Consider a subset S = {v_1, v_2, …, v_k} of vertices for some k <= n • Consider finding the shortest path between v_i and v_j • Consider all paths from v_i to v_j whose intermediate vertices belong to the set S; let p_{i,j}^{(k)} be the minimum-weight path among them, with weight d_{i,j}^{(k)}

  35. All-Pairs Shortest Paths – Floyd's Algorithm • If v_k is not in the shortest path, then p_{i,j}^{(k)} = p_{i,j}^{(k-1)} • If v_k is in the shortest path, then the path is broken into two parts – from v_i to v_k, and from v_k to v_j • So d_{i,j}^{(k)} = min{ d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} } • The length of the shortest path from v_i to v_j is given by d_{i,j}^{(n)} • In general, the solution is the matrix D^{(n)}
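A direct sequential sketch of the recurrence; the weight matrix uses infinity for missing edges and zeros on the diagonal (names are illustrative):

```python
def floyd_apsp(w):
    """Floyd's all-pairs shortest paths on an n x n weight matrix.

    After iteration k, d[i][j] equals d_{i,j}^{(k)}: the weight of the
    shortest i -> j path whose intermediate vertices lie in {v_1, ..., v_k}.
    """
    n = len(w)
    d = [row[:] for row in w]                 # D^(0) is the weight matrix itself
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d_{i,j}^{(k)} = min(d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)})
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d                                  # D^(n)
```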

  36. Parallel Formulation – 2-D Block Mapping • Processors laid out in a 2D mesh • During the kth iteration, each process P_{i,j} needs certain segments of the kth row and kth column of the D^{(k-1)} matrix • For d_{l,r}^{(k)}, the following are needed: • d_{l,k}^{(k-1)} (from a process along the same process row) • d_{k,r}^{(k-1)} (from a process along the same process column) • Figure 10.8

  37. Parallel Formulation – 2D Block Mapping • During the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 other processes in its column • Similarly for the kth column along process rows • Figure 10.8 • Time complexity?

  38. APSP on GPUs • The space complexity of Floyd's algorithm is O(V²) – this limits the graph sizes that fit on GPUs • Uses V² threads • A single O(V) operation looping over O(V²) threads can exhibit slowdown due to high context-switching overhead between threads • Use Dijkstra's instead – run the SSSP algorithm from every vertex in the graph • Requires only the final output to be of size O(V²) • Intermediate outputs on the GPU can be O(V) and can be copied to CPU memory

  39. APSP on GPUs

  40. Sources/References • Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. • Paper: Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007.

  41. Speedup Anomalies in DFS • The overall work (space searched) in parallel DFS can be smaller or larger than in sequential DFS • Can cause superlinear or sublinear speedups • Figures 11.18, 11.19

  42. Parallel Formulation – Pipelining • In the 2D formulation, the kth iteration in all processes starts only after the (k-1)th iteration completes in all processes • A process can start working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D^{(k-1)} matrix • Example: Figure 10.9 • Time complexity?
