
On Optimizing Collective Communication



  1. On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de Geijn ScicomP 10, August 9-13 Austin, TX

  2. Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work

  3. Model of Parallel Computation • Target Architectures • distributed memory parallel architectures • Indexing • p nodes • indexed 0 … p – 1 • each node has one computational processor

  4. • often logically viewed as a linear array (diagram: nodes 0 1 2 3 4 5 6 7 8 in a row)

  5. Model of Parallel Computation • Logically Fully Connected • a node can send directly to any other node • Communicating Between Nodes • a node can simultaneously receive and send • Network Conflicts • sending over a path between two nodes that is completely occupied

  6. Model of Parallel Computation • Cost of Communication • sending a message of length n between any two nodes costs α + nβ • α is the startup cost (latency) • β is the per-item transmission cost (bandwidth) • Cost of Computation • cost to perform an arithmetic operation is γ • reduction operations • sum • prod • min • max
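The α + nβ (+ nγ) cost model above can be packaged as two small helper functions. This is a sketch for illustration only (Python used throughout; the numeric constants are made-up placeholders, not measurements):

```python
# Illustrative constants for the cost model; the actual values depend on
# the machine and are NOT taken from the slides.
ALPHA = 1.0e-6   # startup cost (latency) per message, seconds (assumed)
BETA = 1.0e-9    # transmission cost per item, seconds (assumed)
GAMMA = 0.5e-9   # cost per arithmetic (reduction) operation, seconds (assumed)

def send_cost(n, alpha=ALPHA, beta=BETA):
    """Cost of sending one message of length n between any two nodes: alpha + n*beta."""
    return alpha + n * beta

def compute_cost(n, gamma=GAMMA):
    """Cost of n arithmetic operations (e.g. the sum/prod/min/max reductions)."""
    return n * gamma
```
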

  7. Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work

  8. Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce

  9. Lower Bounds (Latency) • Broadcast: ⌈log2 p⌉ α • Reduce(-to-one): ⌈log2 p⌉ α • Scatter/Gather: ⌈log2 p⌉ α • Allgather: ⌈log2 p⌉ α • Reduce-scatter: ⌈log2 p⌉ α • Allreduce: ⌈log2 p⌉ α

  10. Lower Bounds (Bandwidth) • Broadcast: nβ • Reduce(-to-one): nβ • Scatter/Gather: ((p-1)/p) nβ • Allgather: ((p-1)/p) nβ • Reduce-scatter: ((p-1)/p) nβ • Allreduce: 2((p-1)/p) nβ
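These bounds can be tabulated with a small helper. The ⌈log2 p⌉ latency bound and the (p-1)/p bandwidth factors are the standard ones for these collectives; the function layout below is my own sketch for illustration:

```python
import math

def latency_lower_bound(p):
    """Minimum number of communication steps (each costing alpha) for all
    seven collectives: information must reach/leave p nodes, so at least
    ceil(log2 p) steps are needed."""
    return math.ceil(math.log2(p))

def bandwidth_lower_bound(op, p, n, beta=1.0):
    """Lower bound on the bandwidth term, in units of beta, for a vector
    of n items on p nodes."""
    frac = (p - 1) / p
    coeff = {
        "broadcast": 1.0,
        "reduce": 1.0,
        "scatter": frac,
        "gather": frac,
        "allgather": frac,
        "reduce-scatter": frac,
        "allreduce": 2 * frac,
    }[op]
    return coeff * n * beta
```
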

  11. Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work

  12. Motivating Example • We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
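Before turning to algorithms, the semantics of Reduce-scatter can be pinned down with a tiny sequential simulation (Python purely for illustration; no actual communication takes place): every node contributes a vector, the element-wise sum is formed, and node i ends up with block i of the result.

```python
def reduce_scatter(contributions, p):
    """Simulate Reduce-scatter: contributions[i] is node i's input vector;
    returns a list where entry i is the block of the reduced vector that
    node i owns afterwards. Assumes the vector length is divisible by p."""
    n = len(contributions[0])
    assert all(len(c) == n for c in contributions) and n % p == 0
    total = [sum(col) for col in zip(*contributions)]  # element-wise reduce
    block = n // p
    return [total[i * block:(i + 1) * block] for i in range(p)]  # scatter blocks
```
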

  13. A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms

  14. Short-vector case • Primary concern: • algorithms must have low latency cost • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes • algorithms should avoid network conflicts • not absolutely necessary, but nice if possible

  15. Minimum-Spanning Tree based algorithms • We will show how the following building blocks: • broadcast/reduce • scatter/gather • can be implemented using minimum spanning trees embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts

  16. General principles • message starts on one processor

  17. General principles • divide logical linear array in half

  18. General principles • send message to the half of the network that does not contain the current node (root) that holds the message

  19. General principles • send message to the half of the network that does not contain the current node (root) that holds the message

  20. General principles • continue recursively in each of the two halves

  21. General principles • The demonstrated technique directly applies to • broadcast • scatter • The technique can be applied in reverse to • reduce • gather

  22. General principles • This technique can be used to implement the following building blocks: • broadcast/reduce • scatter/gather • Using a minimum spanning tree embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts? • Yes, on linear arrays
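The recursive-halving schedule described on the preceding slides can be sketched as follows (Python for illustration; which node in the far half is chosen as its new root is an assumption here, since any node in that half would do):

```python
def mst_broadcast_schedule(left, right, root, steps=None):
    """Return a list of (src, dst) sends that broadcast a message from
    `root` to every node in the inclusive range [left, right]:
    divide the linear array in half, send the message to the half that
    does not contain the current root, then recurse in both halves."""
    if steps is None:
        steps = []
    if left == right:
        return steps
    mid = (left + right) // 2
    if root <= mid:
        dst = right  # representative node in the far (upper) half
        steps.append((root, dst))
        mst_broadcast_schedule(left, mid, root, steps)
        mst_broadcast_schedule(mid + 1, right, dst, steps)
    else:
        dst = left   # representative node in the far (lower) half
        steps.append((root, dst))
        mst_broadcast_schedule(mid + 1, right, root, steps)
        mst_broadcast_schedule(left, mid, dst, steps)
    return steps
```

Running the schedule in reverse yields the corresponding reduce/gather pattern, as noted on slide 21.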

  23. Reduce-scatter (short vector)

  24. Reduce-scatter (short vector) Reduce

  25. Reduce-scatter (short vector) Scatter

  26. Reduce • Before (diagram: each node holds its own vector of contributions to be summed element-wise)

  27. Reduce • After (diagram: the root holds the element-wise sum of all contributions)

  28. Cost of Minimum-Spanning Tree Reduce • number of steps: ⌈log2 p⌉ • cost per step: α + nβ + nγ • total: ⌈log2 p⌉(α + nβ + nγ)

  29. Cost of Minimum-Spanning Tree Reduce • number of steps: ⌈log2 p⌉ • cost per step: α + nβ + nγ • total: ⌈log2 p⌉(α + nβ + nγ) Notice: attains lower bound for latency component

  30. Scatter • (diagram: Before, the root holds all p blocks; After, node i holds block i)

  31. Cost of Minimum-Spanning Tree Scatter • Assumption: power-of-two number of nodes • total: log2(p) α + ((p-1)/p) nβ

  32. Cost of Minimum-Spanning Tree Scatter • Assumption: power-of-two number of nodes • total: log2(p) α + ((p-1)/p) nβ Notice: attains lower bound for latency and bandwidth components

  33. Cost of Reduce/Scatter Reduce-scatter • Assumption: power-of-two number of nodes • reduce: log2(p)(α + nβ + nγ) • scatter: log2(p) α + ((p-1)/p) nβ

  34. Cost of Reduce/Scatter Reduce-scatter • Assumption: power-of-two number of nodes • reduce: log2(p)(α + nβ + nγ) • scatter: log2(p) α + ((p-1)/p) nβ • total: 2 log2(p) α + (log2(p) + (p-1)/p) nβ + log2(p) nγ Notice: does not attain lower bound for latency or bandwidth components
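For concreteness, the two cost terms and their sum can be evaluated in the α-β-γ model. This is a sketch assuming power-of-two p; the values are model estimates mirroring the slide formulas, not measurements:

```python
import math

def mst_reduce_cost(p, n, alpha, beta, gamma):
    """MST reduce: log2(p) steps, each costing alpha + n*beta + n*gamma."""
    return math.log2(p) * (alpha + n * beta + n * gamma)

def mst_scatter_cost(p, n, alpha, beta):
    """MST scatter: log2(p) startups plus ((p-1)/p)*n items transmitted."""
    return math.log2(p) * alpha + (p - 1) / p * n * beta

def reduce_then_scatter_cost(p, n, alpha, beta, gamma):
    """Reduce-scatter implemented as an MST reduce followed by an MST
    scatter; its latency term is 2*log2(p)*alpha, above the ceil(log2 p)*alpha
    lower bound."""
    return mst_reduce_cost(p, n, alpha, beta, gamma) + mst_scatter_cost(p, n, alpha, beta)
```
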

  35. Recap • Reduce • Reduce-scatter • Scatter • Allreduce • Gather • Allgather • Broadcast

  36. A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms

  37. Long-vector case • Primary concern: • algorithms must have low cost due to vector length • algorithms must avoid network conflicts • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes

  38. Long-vector building blocks • We will show how the following building blocks: • allgather/reduce-scatter • Can be implemented using “bucket” algorithms while attaining • minimal cost due to length of vectors • implementation for arbitrary numbers of nodes • no network conflicts
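A sequential sketch of the "bucket" (ring) allgather (Python for illustration; no real communication): in each of p-1 steps every node passes one block to its right neighbor, so only neighboring links of the linear array are used and no network conflicts arise.

```python
def bucket_allgather(blocks):
    """Simulate the bucket allgather: blocks[i] is node i's contribution;
    returns, for each node, the full list of all p blocks. Each of the
    p-1 steps costs alpha + (n/p)*beta, for a total of
    (p-1)*alpha + ((p-1)/p)*n*beta."""
    p = len(blocks)
    have = [{i: blocks[i]} for i in range(p)]  # blocks each node currently holds
    for step in range(p - 1):
        sends = []
        for i in range(p):
            k = (i - step) % p  # node i forwards the block it obtained `step` steps ago
            sends.append(((i + 1) % p, k, have[i][k]))
        for dst, k, blk in sends:  # all sends in a step happen simultaneously
            have[dst][k] = blk
    return [[have[i][k] for k in range(p)] for i in range(p)]
```
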
