
Parallel Graph Algorithms

Presentation Transcript


  1. Parallel Graph Algorithms Sathish Vadhiyar

  2. Graph Traversal • Graph search plays an important role in analyzing large data sets • Relationships between data objects are represented in the form of graphs • Breadth-first search is used to find shortest paths or sets of paths

  3. Level-synchronized algorithm • Proceeds level-by-level starting with the source vertex • Level of a vertex – its graph distance from the source • How to decompose the graph (vertices, edges and adjacency matrix) among processors?
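A minimal sequential sketch of the level-synchronized pattern; the adjacency-list representation and names are illustrative, not from the slides:

```python
def bfs_levels(adj, source):
    """Level-synchronized BFS: expand the graph one frontier at a time.

    adj    : dict mapping each vertex to an iterable of its neighbors
    source : starting vertex
    returns: dict mapping vertex -> level (graph distance from source)
    """
    level = {source: 0}
    frontier = [source]              # vertices discovered in the previous level
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []           # neighbors not yet assigned a level
        for v in frontier:
            for w in adj[v]:
                if w not in level:
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier     # synchronize: advance to the next level
    return level

# Example
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))            # {0: 0, 1: 1, 2: 1, 3: 2}
```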

  4. Distributed BFS with 1D Partitioning • Each vertex and the edges emanating from it are owned by one processor • 1-D partitioning of the adjacency matrix • The edges emanating from vertex v form its edge list – the list of vertex indices in row v of the adjacency matrix A

  5. 1-D Partitioning • At each level, each processor holds a set F of the frontier vertices it owns • The edge lists of the vertices in F are merged to form a set of neighboring vertices, N • Some vertices of N are owned by the same processor, while others are owned by other processors • Messages are sent to those processors to add these vertices to their frontier set for the next level
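A hedged sketch of one way this step could look with mpi4py; the cyclic ownership rule owner(v) = v % nprocs and the function names are illustrative assumptions, not part of the original formulation:

```python
from mpi4py import MPI

def bfs_1d(adj_local, source, comm):
    """Level-synchronized BFS with 1-D vertex partitioning.

    adj_local : dict holding the edge lists of the vertices owned by this rank
    Ownership is assumed to be owner(v) = v % nprocs (an illustrative choice).
    Returns the levels of the locally owned vertices.
    """
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    owner = lambda v: v % nprocs
    level, frontier = {}, []
    if owner(source) == rank:
        level[source] = 0
        frontier = [source]
    depth = 0
    while comm.allreduce(len(frontier), op=MPI.SUM) > 0:
        depth += 1
        # Merge the edge lists of the local frontier F into per-owner buckets
        outgoing = [[] for _ in range(nprocs)]
        for v in frontier:
            for w in adj_local[v]:
                outgoing[owner(w)].append(w)
        # Send neighbor vertices to their owners; receive the ones we own
        incoming = comm.alltoall(outgoing)
        frontier = []
        for w in set().union(*incoming):
            if w not in level:       # unvisited: joins the next frontier
                level[w] = depth
                frontier.append(w)
    return level
```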

  6. Lvs(v) – level of v, i.e., its graph distance from the source vertex vs

  7. 2D Partitioning • P = R×C processor mesh • The adjacency matrix is divided into R·C block rows and C block columns • A(i,j)(*) denotes a block owned by processor (i,j); each processor owns C blocks

  8. 2D Partitioning • Processor (i,j) owns the vertices belonging to block row (j-1)·R + i • Thus a processor stores some edges incident on its vertices, and some edges that are not

  9. 2D Partitioning • Assume that the edge list for a given vertex is the column of the adjacency matrix • Each block in the 2D partitioning contains partial edge lists • Each processor has a frontier set F of vertices owned by the processor

  10. 2D Partitioning – Expand Operation • Consider v in F • The owner of v sends messages to the other processors in its processor column announcing that v is in the frontier, since any of these processors may hold a partial edge list of v

  11. 2D Partitioning – Fold Operation • The partial edge lists on each processor are merged to form N – the potential vertices of the next frontier • Vertices in N are sent to their owners to form the new frontier set F on those processors • These owner processors are in the same processor row • This communication step is referred to as the fold operation

  12. Analysis • Advantage of 2D over 1D – processor-column and processor-row communications involve only R and C processors respectively, rather than all P processors as in the 1D case

  13. BFS on GPUs

  14. BFS on GPUs • One GPU thread per vertex • In each iteration, each vertex looks at its entry in the frontier array • If the entry is true, the vertex's thread adds its neighbors to the next frontier • Severe load imbalance among the threads • Scope for improvement
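A sequential stand-in for the per-vertex kernel, assuming a CSR (row_ptr/col_idx) graph; the outer loop plays the role of the GPU thread grid, so iteration v is the work thread v would do (names are illustrative):

```python
def bfs_gpu_iteration(row_ptr, col_idx, frontier, visited, cost):
    """One iteration of the one-thread-per-vertex BFS scheme.

    frontier, visited : boolean arrays of length n
    cost              : cost[source] = 0 before the first call
    Returns True if any vertex joined the next frontier.
    """
    n = len(frontier)
    next_frontier = [False] * n
    for v in range(n):                       # "thread v"
        if frontier[v]:                      # check my entry in the frontier array
            for e in range(row_ptr[v], row_ptr[v + 1]):
                w = col_idx[e]
                if not visited[w]:
                    cost[w] = cost[v] + 1
                    visited[w] = True
                    next_frontier[w] = True  # w is a frontier vertex next time
    frontier[:] = next_frontier
    return any(next_frontier)
```

The load imbalance mentioned above shows up here as the inner loop: a thread assigned a high-degree vertex does far more work than the other threads of its iteration.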

  15. Parallel Depth First Search • Easy to parallelize • Left subtree can be searched in parallel with the right subtree • Statically assign a node to a processor – the whole subtree rooted at that node can be searched independently. • Can lead to load imbalance; Load imbalance increases with the number of processors

  16. Dynamic Load Balancing (DLB) • Difficult to estimate the size of the search space beforehand • Need to balance the search space among processors dynamically • In DLB, when a processor runs out of work, it gets work from another processor

  17. Maintaining Search Space • Each processor searches the space depth-first • Unexplored states saved as stack; each processor maintains its own local stack • Initially, the entire search space assigned to one processor

  18. Work Splitting • When a processor receives a work request, it splits its search space • Half-split: the stack space is divided into two equal pieces – may result in load imbalance • Giving away stack space near the bottom of the stack can mean giving away big trees • Stack space near the top of the stack tends to hold small trees • To avoid sending very small amounts of work, nodes beyond a specified stack depth – the cutoff depth – are not given away

  19. Strategies • 1. Send nodes near the bottom of the stack • 2. Send nodes near the cutoff depth • 3. Send half the nodes between the bottom of the stack and the cutoff depth • Example: Figures 11.5(a) and 11.9
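A hedged sketch of strategy 3, where the stack holds (node, depth) pairs with the bottom of the stack at index 0; the alternating pick is one simple way to hand over roughly half of the donatable nodes (all names are illustrative):

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack to answer a work request (strategy 3).

    stack        : list of (node, depth) pairs, bottom of the stack first
    cutoff_depth : nodes at or beyond this depth are never given away,
                   since they tend to root very small subtrees
    Returns the donated entries and shrinks `stack` in place.
    """
    donatable = [i for i, (_, d) in enumerate(stack) if d < cutoff_depth]
    give = set(donatable[::2])                 # every other donatable node
    donated = [stack[i] for i in sorted(give)]
    stack[:] = [s for i, s in enumerate(stack) if i not in give]
    return donated
```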

  20. Load Balancing Strategies • Asynchronous round-robin: each processor maintains its own target processor to request work from; the target value is incremented modulo the number of processors after each request • Global round-robin: a single target variable is shared by all processors • Random polling: a donor is selected at random
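A small sketch of the donor-selection rules (illustrative names; the global round-robin variant would replace the per-process pointer with a single shared counter, which is why it needs extra synchronization):

```python
import random

class AsyncRoundRobin:
    """Asynchronous round-robin: each process keeps its own target pointer."""
    def __init__(self, my_rank, nprocs):
        assert nprocs > 1
        self.my_rank, self.nprocs = my_rank, nprocs
        self.target = (my_rank + 1) % nprocs
    def next_donor(self):
        while True:
            donor = self.target
            self.target = (self.target + 1) % self.nprocs  # increment modulo p
            if donor != self.my_rank:                      # never ask yourself
                return donor

def random_polling_donor(my_rank, nprocs):
    """Random polling: pick a donor uniformly among the other processes."""
    donor = random.randrange(nprocs - 1)
    return donor if donor < my_rank else donor + 1
```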

  21. Termination Detection • Dijkstra's Token Termination Detection Algorithm • Based on passing a token around a logical ring; P0 initiates the token when it is idle; a processor holds the token until it has completed its work, and then passes it to the next processor; when P0 receives the token again, all processors have completed • However, a processor may receive more work after becoming idle

  22. Algorithm Continued…. • This case is handled using white and black tokens • Initially the token is white; processor j becomes black if it sends work to a processor i < j • When a black processor j completes its work, it colors the token black before passing it to the next processor, and then turns itself white • When P0 receives a black token, it reinitiates the ring

  23. Tree Based Termination Detection • Uses weights • Initially processor 0 has weight 1 • When a processor transfers work to another processor, its weight is split in half between the two processors • When a processor finishes its work, it returns its weight • Termination is detected when processor 0 gets back a total weight of 1 • Piggybacks on the DFS work transfers; no separate communication steps • Figure 11.10
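A sequential sketch of the weight bookkeeping (illustrative; exact fractions stand in for the fixed-point or integer weights a real implementation would use to avoid underflow from repeated halving):

```python
from fractions import Fraction

class WeightTermination:
    """Weight-based termination detection, simulated in one process."""
    def __init__(self, nprocs):
        self.weight = [Fraction(0)] * nprocs
        self.weight[0] = Fraction(1)          # processor 0 starts with weight 1
        self.returned = Fraction(0)           # weight collected back at processor 0

    def transfer_work(self, src, dst):
        half = self.weight[src] / 2           # halve the sender's weight
        self.weight[src] -= half
        self.weight[dst] += half

    def finish(self, rank):
        self.returned += self.weight[rank]    # return the weight on completion
        self.weight[rank] = Fraction(0)

    def terminated(self):
        return self.returned == 1             # all weight is back: terminate

# Example: 0 gives work to 1, 1 gives part of it to 2, then all finish
t = WeightTermination(3)
t.transfer_work(0, 1); t.transfer_work(1, 2)
for r in range(3):
    t.finish(r)
print(t.terminated())                         # True
```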

  24. Minimal Spanning Tree, Single-Source and All-pairs Shortest Paths

  25. Minimal Spanning Tree – Prim's Algorithm • Spanning tree of a graph G(V,E) – a tree containing all vertices of G • MST – the spanning tree with the minimum sum of edge weights • Vertices are added to a set Vt that holds the vertices of the MST; initially Vt contains an arbitrary vertex r as the root vertex

  26. Minimal Spanning Tree – Prim's Algorithm • An array d such that d[v], for v in (V - Vt), holds the weight of the least-weight edge between v and any vertex in Vt; initially d[v] = w[r,v] • Find the vertex u with the minimum d[u] and add it to Vt • Update d • Time complexity – O(n²)
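A sequential sketch of the O(n²) formulation on an adjacency (weight) matrix, assuming a connected graph (names are illustrative):

```python
import math

def prim_mst(w, r=0):
    """Prim's MST on an n x n weight matrix w (math.inf means no edge).

    d[v] holds the weight of the lightest edge between v and the tree Vt;
    returns the parent array describing the MST. O(n^2) overall.
    """
    n = len(w)
    in_tree = [False] * n
    in_tree[r] = True                           # Vt = {r}, an arbitrary root
    d = [w[r][v] for v in range(n)]
    parent = [r if w[r][v] < math.inf else -1 for v in range(n)]
    parent[r] = -1
    for _ in range(n - 1):
        # pick the vertex outside the tree with the minimum d value
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        for v in range(n):                      # update d for the remaining vertices
            if not in_tree[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
                parent[v] = u
    return parent

# Example
INF = math.inf
w = [[0, 1, 3, INF],
     [1, 0, 1, 4],
     [3, 1, 0, 2],
     [INF, 4, 2, 0]]
print(prim_mst(w))   # [-1, 0, 1, 2]
```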

  27. Parallelization • The vertex set V and the d array are partitioned across p processors • Each processor finds the local minimum in its part of d • Then the global minimum over all of d is found by a reduction onto one processor • That processor determines the next vertex u and broadcasts it to all processors

  28. Parallelization • All processors update d; the processor owning u marks u as belonging to Vt • The process responsible for v must know w[u,v] to update d[v]; hence a 1-D block mapping of the adjacency matrix is used • Complexity – O(n²/p) for computation plus O(n log p) for communication
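A hedged mpi4py sketch of the selection step only (the reduction and broadcast); an allgather followed by a local min stands in for an MPI_MINLOC reduction, and all names are illustrative:

```python
from mpi4py import MPI

def next_mst_vertex(d_local, candidates, comm):
    """One selection step of parallel Prim's algorithm.

    d_local    : dict vertex -> current d[] value for the vertices this rank owns
    candidates : owned vertices that are not yet in Vt
    Every rank returns the same globally minimal vertex u.
    """
    if candidates:
        v = min(candidates, key=lambda x: d_local[x])
        local_best = (d_local[v], v)
    else:
        local_best = (float("inf"), -1)
    # allgather + min plays the role of an MPI_MINLOC reduction plus broadcast
    _, u = min(comm.allgather(local_best))
    return u
```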

  29. Single Source Shortest Path – Dijkstra's Algorithm • Finds the shortest path from the source vertex to all vertices • Follows a similar structure to Prim's algorithm • Instead of the d array, an array l holding the current shortest path lengths is maintained • The parallelization scheme is the same
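The only substantive change from Prim's update is the relaxation rule; a hedged sketch of that one step (names are illustrative):

```python
def dijkstra_relax(l, w, u, done):
    """Relaxation step of Dijkstra's SSSP after vertex u is finalized.

    Unlike Prim's d[v] = min(d[v], w[u][v]), the length l[v] accounts for
    the whole path through u, not just the edge (u, v).
    """
    for v in range(len(l)):
        if not done[v] and l[u] + w[u][v] < l[v]:
            l[v] = l[u] + w[u][v]
```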

  30. Single Source Shortest Path on GPUs

  31. SSSP on GPUs • A single kernel is not enough, since the cost array Ca cannot be updated while it is being accessed • Hence updated costs are written to a temporary array Ua
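A sequential stand-in for the two kernels of this scheme, assuming a CSR graph with per-edge weights; Ca and Ua are the cost and updating-cost arrays and mask marks vertices whose cost changed (a sketch, not the paper's exact kernels):

```python
def sssp_gpu_iteration(row_ptr, col_idx, weight, mask, Ca, Ua):
    """One SSSP iteration with the two-array (Ca/Ua) scheme.

    Before the first call: Ca[source] = 0, all other entries infinity,
    Ua a copy of Ca, and mask True only for the source.
    Returns True if any cost improved (i.e., another iteration is needed).
    """
    n = len(Ca)
    # "Kernel 1": each thread v relaxes its edges into Ua, never into Ca,
    # which other threads may still be reading in the same pass
    for v in range(n):
        if mask[v]:
            mask[v] = False
            for e in range(row_ptr[v], row_ptr[v + 1]):
                w = col_idx[e]
                Ua[w] = min(Ua[w], Ca[v] + weight[e])
    # "Kernel 2": commit the improvements and rebuild the mask
    changed = False
    for v in range(n):
        if Ua[v] < Ca[v]:
            Ca[v] = Ua[v]
            mask[v] = True
            changed = True
        Ua[v] = Ca[v]
    return changed
```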

  32. All-Pairs Shortest Paths • To find the shortest paths between all pairs of vertices • Dijkstra's algorithm for single-source shortest paths can be run from every vertex • Two approaches

  33. All-Pairs Shortest Paths • Source-partitioned formulation: partition the vertices across processors • Works well if p <= n; no communication • Can at best use only n processors • Time complexity? • Source-parallel formulation: parallelize SSSP for each vertex across a subset of processors, and do this for all vertices with different subsets of processors • A hierarchical formulation that exploits more parallelism • Time complexity?

  34. All-Pairs Shortest Paths – Floyd's Algorithm • Consider a subset S = {v_1, v_2, …, v_k} of vertices for some k <= n • Consider finding the shortest path between v_i and v_j • Consider all paths from v_i to v_j whose intermediate vertices belong to the set S; let p_{i,j}^{(k)} be the minimum-weight path among them, with weight d_{i,j}^{(k)}

  35. All-Pairs Shortest Paths – Floyd's Algorithm • If v_k is not in the shortest path, then p_{i,j}^{(k)} = p_{i,j}^{(k-1)} • If v_k is in the shortest path, then the path is broken into two parts – from v_i to v_k, and from v_k to v_j • So d_{i,j}^{(k)} = min{ d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} } • The length of the shortest path from v_i to v_j is given by d_{i,j}^{(n)} • In general, the solution is the matrix D^{(n)}
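A direct sequential sketch of the recurrence; the weight matrix uses infinity for missing edges and zeros on the diagonal (names are illustrative):

```python
def floyd_apsp(w):
    """Floyd's all-pairs shortest paths on an n x n weight matrix.

    After iteration k, d[i][j] equals d_{i,j}^{(k)}: the weight of the
    shortest i -> j path whose intermediate vertices lie in {v_1, ..., v_k}.
    """
    n = len(w)
    d = [row[:] for row in w]                 # D^(0) is the weight matrix itself
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d_{i,j}^{(k)} = min(d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)})
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d                                  # D^(n)
```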

  36. Parallel Formulation – 2-D Block Mapping • Processors laid out in a 2D mesh • During the kth iteration, each process P_{i,j} needs certain segments of the kth row and kth column of the D^{(k-1)} matrix • For d_{l,r}^{(k)}, the following are needed: • d_{l,k}^{(k-1)} (from a process along the same process row) • d_{k,r}^{(k-1)} (from a process along the same process column) • Figure 10.8

  37. Parallel Formulation – 2D Block Mapping • During the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 other processes in its column • Similarly for the kth column along process rows • Figure 10.8 • Time complexity?

  38. APSP on GPUs • The space complexity of Floyd's algorithm is O(V²) – this limits the graph sizes that fit on GPUs • Uses V² threads • A single O(V) operation looping over O(V²) threads can exhibit slowdown due to high context-switching overhead between threads • Use Dijkstra's instead – run the SSSP algorithm from every vertex in the graph • Requires only the final output to be of size O(V²) • Intermediate outputs on the GPU can be O(V) and can be copied to CPU memory

  39. APSP on GPUs

  40. Sources/References • Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. • Paper: Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007.

  41. Speedup Anomalies in DFS • The overall work (space searched) in parallel DFS can be smaller or larger than in sequential DFS • Can cause superlinear or sublinear speedups • Figures 11.18, 11.19

  42. Parallel Formulation – Pipelining • In the 2D formulation, the kth iteration in all processes starts only after the (k-1)th iteration completes in all processes • A process can start working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D^{(k-1)} matrix • Example: Figure 10.9 • Time complexity?
