Automatic Tuning of Collective Communications in MPI


Presentation Transcript


  1. Automatic Tuning of Collective Communications in MPI
Rajesh Nishtala, Kushal Chakrabarti, Neil Patel, Kaushal Sanghavi
Computer Science Division, University of California at Berkeley

• No single implementation of the MPI collectives is optimal across all environmental variables (e.g., node architecture and network load).
• Our Probabilistic Algorithm Selection System (PASS) accounts for this and optimizes the collectives by learning from the implementations' past performance: it most frequently chooses top-performing implementations, but will occasionally search for better ones (a selection sketch appears after the figure captions below).
• PASS optimizes operations above the level of the underlying MPI point-to-point operations, making it cluster- and implementation-independent and thus adaptive and extensible (see the tree-broadcast sketch below).
• Our new tuned implementations yield up to 10x speedups through the use of pipelining (see the pipelined-chain sketch below).
• Performance results for four different implementations of the MPI collectives were gathered while varying the following parameters: cluster interconnect, node architecture, number of nodes, and pipeline segment size.

Figure 1: Trees (chain tree, binary tree, binomial tree, sequential tree).

Figure 2: The best implementation varies across clusters and the number of nodes. For instance, the chain tree is always the best implementation on Seaborg, while it is best only for large numbers of nodes on Millennium. Because the binary tree is better for smaller numbers of nodes and the base implementation is suboptimal, there is room for performance tuning.

Figure 3: Because there is no definitive winning implementation of gather on CITRIS, transient cluster conditions could cause a superior implementation to emerge. PASS dynamically accounts for these changes and selects the optimal implementation.

Figure 4: Pipelined transmission of messages yields significant improvements. Although performance is very sensitive to the unit of transmission (segment size), PASS will discover the optimal size.

Figure 5: PASS accounts for the varying performance times shown above by usually choosing the optimal implementations. Averaging over multiple trials, the difference between the PASS mean and median lines illustrates the effect of exploration. Future work will focus on fine-tuning PASS so that the negative effects of exploration are negligible.

http://www.cs.berkeley.edu/~rajeshn/mpi_opt.pdf
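
The poster states that the tuned collectives sit above the level of MPI point-to-point operations. As one concrete illustration of what a Figure 1 topology looks like at that level, here is a minimal sketch of a binomial-tree broadcast written only with MPI_Send/MPI_Recv; the function name and structure are our own illustration, not the authors' code.

```c
#include <mpi.h>

/* binomial_bcast: hypothetical broadcast over a binomial tree, built purely
 * from point-to-point calls.  Ranks are renumbered so the root is virtual
 * rank 0; each rank receives once from its parent and then forwards to its
 * children, farthest subtree first. */
static void binomial_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
{
    int rank, size, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* virtual rank, root == 0 */

    /* Phase 1: wait for the message from the parent.  The root (vrank 0)
     * never matches a bit, so it skips the receive. */
    for (mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask) {
            int src = (rank - mask + size) % size;
            MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }

    /* Phase 2: forward to children at distances below the bit where we
     * received (all distances, for the root). */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (vrank + mask < size) {
            int dst = (rank + mask) % size;
            MPI_Send(buf, count, type, dst, 0, comm);
        }
    }
}
```

Because the sketch uses only point-to-point calls, it runs unchanged on any MPI implementation, which is what makes this style of collective cluster- and implementation-independent.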
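
Figure 4's point about segment size can be made concrete with a pipelined chain broadcast. The sketch below (assumed names and interface, not the authors' code) splits the payload into seg_bytes-sized pieces so that each link of the chain can forward segment i while the previous link is already delivering segment i+1; seg_bytes is the "pipeline segment size" parameter the poster says PASS searches over. A production version would use nonblocking MPI_Isend/MPI_Irecv to overlap the receive and the forward explicitly.

```c
#include <mpi.h>

/* chain_pipelined_bcast: hypothetical chain-tree broadcast that pipelines
 * the message in fixed-size segments.  With small segments, each hop is
 * only one segment behind its predecessor instead of waiting for the whole
 * message, which is where the pipelining win comes from. */
static void chain_pipelined_bcast(char *buf, int nbytes, int seg_bytes,
                                  int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (seg_bytes <= 0)
        seg_bytes = nbytes;                     /* fall back to one segment */

    int vrank = (rank - root + size) % size;    /* position in the chain   */
    int prev  = (rank - 1 + size) % size;       /* predecessor in chain    */
    int next  = (rank + 1) % size;              /* successor in chain      */

    for (int off = 0; off < nbytes; off += seg_bytes) {
        int len = (nbytes - off < seg_bytes) ? nbytes - off : seg_bytes;

        /* Everyone but the root receives the segment from its predecessor. */
        if (vrank != 0)
            MPI_Recv(buf + off, len, MPI_BYTE, prev, 0, comm,
                     MPI_STATUS_IGNORE);

        /* Everyone but the last link forwards it down the chain. */
        if (vrank != size - 1)
            MPI_Send(buf + off, len, MPI_BYTE, next, 0, comm);
    }
}
```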
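
The poster describes PASS only at a high level: it mostly reuses the best-performing implementation and occasionally explores alternatives. The selection logic below is a hypothetical epsilon-greedy reading of that behaviour over running-average timings; the struct, the function names, and the exploration rule are assumptions for illustration, not the published algorithm.

```c
#include <stdlib.h>

#define NUM_IMPLS 4     /* e.g., chain, binary, binomial, sequential trees */

typedef struct {
    double total_time[NUM_IMPLS];   /* accumulated wall-clock time per impl */
    int    trials[NUM_IMPLS];       /* how often each impl has been timed   */
    double epsilon;                 /* probability of exploring             */
} pass_state;

/* Pick the implementation to use for the next collective call. */
static int pass_select(const pass_state *s)
{
    /* Occasionally explore a random implementation... */
    if ((double)rand() / RAND_MAX < s->epsilon)
        return rand() % NUM_IMPLS;

    /* ...otherwise exploit the one with the best average time so far. */
    int best = 0;
    double best_avg = 1e300;
    for (int i = 0; i < NUM_IMPLS; i++) {
        if (s->trials[i] == 0)
            return i;               /* make sure everything is tried once */
        double avg = s->total_time[i] / s->trials[i];
        if (avg < best_avg) { best_avg = avg; best = i; }
    }
    return best;
}

/* Record the measured time after the chosen implementation has run. */
static void pass_update(pass_state *s, int impl, double elapsed)
{
    s->total_time[impl] += elapsed;
    s->trials[impl]     += 1;
}
```

Decaying the exploration probability over time would be one way to shrink the exploration penalty that Figure 5 highlights.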
