ChaMPIon/Pro™: A High Performance Multithreaded Portable MPI-2 Implementation for ASCI Terascale Platforms and Linux Clusters

Presentation Transcript

  1. ChaMPIon/Pro™: A High Performance Multithreaded Portable MPI-2 Implementation for ASCI Terascale Platforms and Linux Clusters. Rossen Dimitrov, Anthony Skjellum, Kumaran Rajaram, Weiyi Chen, Dave Leimbach, Srigurunath Chakravarthi, and Jothi P. Neelamegam, MPI Software Technology, Inc.; Ronald Brightwell, Sandia National Laboratories; Bronis de Supinski and Terry Jones, Lawrence Livermore National Laboratory; Gary Grider and Marydell Nochumson, Los Alamos National Laboratory

  2. Outline • Overview • Objectives • MPI-2 Features • ChaMPIon/Pro • Performance Results • Summary

  3. Overview of ChaMPIon/Pro • ChaMPIon/Pro™ is the first commercial MPI-2.1 version available for Linux. • ChaMPIon/Pro is a robust, scalable, high-performance, commercial MPI-2.1 implementation from MPI Software Technology, Inc. • MercutIO™, a high-performance, portable MPI-IO implementation that currently supports NFS, GPFS, and PVFS, is included with ChaMPIon/Pro. • ChaMPIon/Pro works to retain system scalability for applications while balancing performance criteria (such as latency vs. overhead) and resource utilization.

  4. DOE Collaborators’ Key Contributions • ASCI-relevant input and requirements • Review of designs specialized for ASCI systems • Validating performance on systems; advice and feedback • Attracting production users • Co-design of PERUSE; leading the PERUSE forum • Test suite requirements, ideas, and advice

  5. ChaMPIon/Pro’s Performance and Scalability Objectives • Scaling to thousands and tens of thousands of processors and beyond • Multi-device support • Topology awareness • Thread safety • Optimized collective operations • Optimized derived datatypes • Efficient memory (and NIC resource) usage

  6. ChaMPIon/Pro’s Functionality and Usability Objectives • Integration with schedulers and resource managers • Integration with debuggers and profilers • Functionality controlled by tunable parameters • Documentation • Reflect user feedback

  7. Major New Functionality in MPI-2 • Parallel I/O • One-sided communication • Dynamic process management • Extended collective operations • Improved error handling • Info object • External interfaces

  8. ChaMPIon/Pro Technology Evolution [diagram]: DOE Tri-lab Ultrascale requirements, MSTI's commercial baseline products (MPI/Pro, CMPI), and new ideas, know-how, and software feed into ChaMPIon/Pro and its ASCI solutions; ongoing R&D covers target platforms, communication devices, I/O devices, and tool support.

  9. Architecture (Baseline) [diagram]: point-to-point matching and ordering; progress scheduling; collectives; multi-device support, I/O, etc.; datatypes; groups and communicators; virtual topologies; error handling; cached attributes; tool support. All layers sit on a common low-level messaging and I/O domain: Portals, LAPI, TCP/IP, VIA, GM, RACE, SMP, BAFS, etc.

  10. Architecture (Morphable) [diagram]: the same layers (point-to-point matching and ordering, progress scheduling, collectives, multi-device support, I/O, datatypes, groups and communicators, virtual topologies, error handling, cached attributes, tool support) over a common low-level messaging and I/O domain, with exploitable semantics pushed down from the middleware into hardware/firmware.

  11. Collective Operations [diagram]: multi-hierarchy operations (Bcast, Reduce, Gather, Scatter) organized into three classes across hierarchy levels 0 through 2.
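To make the multi-hierarchy idea concrete, here is a minimal, implementation-agnostic sketch (not ChaMPIon/Pro internals) of a two-level broadcast: data moves once between node leaders and then within each node. The helper name, the node_color argument, and the assumption that the root is world rank 0 are illustrative.

```c
#include <mpi.h>

/* Two-level broadcast sketch: assumes the data originates at world rank 0
 * and that node_color has the same value for all ranks sharing a node. */
void hierarchical_bcast(void *buf, int count, MPI_Datatype type, int node_color)
{
    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Level 0: communicator of the ranks that share a node. */
    MPI_Comm_split(MPI_COMM_WORLD, node_color, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 1: one leader (node-local rank 0) per node. */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leader_comm);  /* across nodes  */
        MPI_Comm_free(&leader_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);        /* within a node */
    MPI_Comm_free(&node_comm);
}
```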

  12. Main Characteristics • Independent Message Progress • Multi-Device support • Fully multithreaded MPI-1, MPI I/O (MercutIO) • One-sided communication • Low CPU Overhead • Overlap of communication, computation, I/O • Thread Safe and Thread Aware [MPI_THREAD_MULTIPLE]. Works fully with OpenMP
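As a reference point for the thread-safety bullet, this is a minimal sketch of the standard way an application requests MPI_THREAD_MULTIPLE; it is generic MPI-2 usage rather than anything specific to ChaMPIon/Pro.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full thread support so multiple threads (e.g., OpenMP
     * threads) may call MPI concurrently; the library reports the level
     * it actually grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("Full thread support not available (level %d granted)\n", provided);

    /* ... OpenMP parallel regions whose threads make MPI calls ... */

    MPI_Finalize();
    return 0;
}
```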

  13. Platform Support • LLNL ASCI Blue (PPC 603e; IBM AIX; SP Switch/LAPI) • LLNL ASCI White (IBM Power 3; IBM AIX; SP Switch/LAPI) • Sandia Cplant (HP/Compaq Alpha; Linux; Myrinet/Portals) • COTS Clusters (Intel IA-32; Linux; TCP/IP; Myrinet/GM, InfiniBand/VAPI, Quadrics/ELAN)

  14. Communication Support • Portals • SP Switch/ LAPI • InfiniBand/VAPI • Quadrics/ELAN • Myrinet/GM1 and GM2 • TCP/IP • SMP

  15. MercutIO • The MPI-IO Component of ChaMPIon/Pro • Distributed File Systems: NFS, ENFS • Parallel File Systems: PVFS, GPFS • Cluster File Systems: Lustre, Panasas • Design and Optimizations (SCICOMP6)
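For readers unfamiliar with MPI-IO, the following is a minimal sketch of the standard MPI-2 file interface that MercutIO implements; the file name and sizes are illustrative, and nothing here is MercutIO-specific.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[1024];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1024; i++) buf[i] = rank;

    /* Collectively open (or create) one shared file. */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a disjoint, contiguous block at its own offset. */
    offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```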

  16. Integration with Tools and Resource Managers • Schedulers/resource managers: LLNL - GangLL (LoadLeveler); LANL - LSF, BPROC; Sandia - Cplant’s yod and yod2 • Etnus TotalView parallel debugger • Pallas Vampir performance profiler

  17. Miscellaneous Support • C and C++ Language Bindings; ISO FORTRAN 90 upcoming. • PERUSE Support (SCICOMP6) • Improved error handling • Extensive performance and correctness test suites • Customizable

  18. Performance Numbers GM version 2.0.2

  19. HPL Performance Numbers: 64-node cluster, 3 GHz Xeon, 1 process per node

  20. Cross-Box Latency (Snow)

  21. Cross-Box Bandwidth (Snow)

  22. ChaMPIon/Pro Differentiation • ChaMPIon/Pro is the only MPI-2 implementation on Linux to offer all of the functionality of the MPI-2 standard, and it does so efficiently (http://www.lam-mpi.org/mpi/implementations/display.php?id=32) • MercutIO is more efficient than other MPI I/O systems in key performance benchmarks (http://www.spscicomp.org/ScicomP6/Presentations/Rajaram/MercutIO.ppt) • ChaMPIon/Pro enables the “shortest time-to-solution” for real-world applications.

  23. Summary • ChaMPIon/Pro offers all of the robustness, scalability, and performance of MPI/Pro plus all MPI-2 features. • Support for a number of target platforms, communication devices, file systems, and performance monitoring and debugging tools.

  24. Questions? This work was supported in part by Small Business Innovation Research Phase I, II, and IIb awards from the National Science Foundation, under Contracts DMI-9860997, DMI-9983413, and DMP-0222804, respectively. Further work was performed under Contract W-7405-Eng-48 with the University of California, as subcontract B510240, under the Department of Energy’s ASCI Pathforward Ultrascale Tools Initiative.

  25. Selected References, I • Rossen Dimitrov and Anthony Skjellum. A Theoretical Framework for Overlapping of Communication and Computation and Early Binding, Part I: BOUM Model and Overlapping Metrics. Submitted to Parallel Computing, February 2003. • Rossen Dimitrov and Anthony Skjellum. A Theoretical Framework for Overlapping of Communication and Computation and Early Binding, Part II: Early Binding. Submitted to Parallel Computing, June 2003. • Kumaran Rajaram, Anthony Skjellum, Rossen P. Dimitrov, Purushotham V. Bangalore, Vijay Velusamy, and David Leimbach. Design, Implementation, and Evaluation of a High Performance Portable Implementation of the MPI-2 I/O Standard API. Submitted to Parallel Computing, November 2002. • W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789--828, September 1996.

  26. Selected References, II • Dimitrov, Rossen. 2001. Overlapping of communication and computation and early binding: Fundamental mechanisms for improving parallel performance on clusters of workstations. Ph.D. dissertation, Mississippi State University. http://library.msstate.edu/etd/show.asp?etd=etd-04092001-231941. • Dimitrov R. and A. Skjellum. 1999. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-enabled Cluster Computing. In Proc. MPIDC'99, Message Passing Interface Developer's and User's Conference, pages 15--24, Atlanta, GA, March 1999. • Kumaran Rajaram. Principal design criteria influencing the performance of a portable high performance parallel I/O implementation. M.S. Thesis, Dept of Computer Science, Mississippi State University, May 2002. http://library.msstate.edu/etd/show.asp?etd=etd-04052002-105711 • William Gropp, Ewing Lusk, and Rajeev Thakur. 1999. Using MPI-2: Advanced features of the message-passing interface. Cambridge, MA: The MIT Press.

  27. Contacts • Dr. Anthony Skjellum, CTO, 662-320-4300 x15, tony@mpi-softtech.com • Kumaran Rajaram, Senior Software Engineer, 662-320-4300 x18, kums@mpi-softtech.com • Dr. Rossen Dimitrov, Principal Software Engineer, 603-891-4766, rossen@mpi-softtech.com

  28. Main Characteristics • Highly optimized datatype management • Software engineering processes: SRSs, HLDs, and DDs before implementation (with feedback from end users) • Collectives with topology awareness • Optimized persistent mode of communication (see the sketch below)
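As context for the persistent-mode bullet, here is a minimal, generic sketch of MPI persistent communication (the message parameters are bound once, then the prepared operation is restarted each iteration); the function and parameter names are illustrative, and the matching persistent receive on the peer is omitted.

```c
#include <mpi.h>

void repeated_send(double *buf, int n, int peer, int iterations)
{
    MPI_Request req;

    /* Bind buffer, count, datatype, destination, and tag once ("early binding"). */
    MPI_Send_init(buf, n, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < iterations; i++) {
        /* ... update buf ... */
        MPI_Start(&req);                    /* re-launch the prepared send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```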

  29. One-sided Communication • Complete implementation, including passive synchronization, accumulate operations, and non-contiguous PUT and GET • Independent progress thread provides a true one-sided effect for all operations • Supported over TCP, GM, and InfiniBand
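The passive-target mode mentioned above is the one that most depends on an independent progress thread, because the target makes no matching call. Below is a minimal, generic MPI-2 sketch of a passive-target MPI_Put; the window size and values are illustrative.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one int through the window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int payload = 42;
        /* Passive target: only the origin calls lock/unlock; rank 1 makes
         * no matching call, which is where the progress thread helps. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /*target=*/1, 0, win);
        MPI_Put(&payload, 1, MPI_INT, /*target=*/1, /*disp=*/0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```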

  30. Dynamic Process Creation • Spawning handled by mpirun; no additional resource manager or daemon process running on each node is required • Dynamic device initialization within the multi-device architecture • Dynamic connection establishment, compatible with the MPI-1 static model
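For reference, this is a minimal, generic sketch of the MPI-2 spawn interface that the dynamic process creation support exposes; the worker executable name and process count are illustrative.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Launch 4 copies of "worker"; the resulting intercommunicator
     * connects the parent's processes to the spawned group. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   /*root=*/0, MPI_COMM_WORLD, &children, errcodes);

    /* ... communicate with the children over the intercommunicator ... */

    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
}
```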

  31. MercutIO vs. ROMIO • Hardware Configuration • Linux Cluster • 500 MHz Pentium II Processor, 512 MB RAM • 8 Nodes interconnected by 100 Mbps Fast Ethernet • Software Configuration • PVFS 1.5.4 • MPICH 1.2.4 • Access Pattern: Contiguous

  32. MercutIO vs. ROMIO (contd.)

  33. MercutIO vs. ROMIO (contd.)

  34. MercutIO vs IBM MPI-IO Implementation over GPFS

  35. MercutIO vs IBM MPI-IO Implementation • Hardware Configuration • IBM SP Cluster • 280 Nodes • 4 Processors per node (PowerPC 604e processors) • Total Memory: 512 GB • Total Disk Space: 16 TB GPFS, 3 TB local space. • Software Configuration • GPFS • Access Pattern: Strided and Segmented

  36. MercutIO vs. IBM MPI-IO Implementation: Strided Access Performance • Platform = blue • Geometry = 4 nodes, 2 tasks per node • Iterations = 3 • Transfer size = 4 MB • Block size = 4 MB • Stride count = 100 • Access pattern = strided • File size = 12.5 GB • Collective = false
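The benchmark code itself is not shown in the slides; the sketch below only illustrates, under assumed details, what a strided MPI-IO write of 4 MB blocks through a vector file view can look like. The helper name and the use of MPI_Type_vector here are illustrative, not the benchmark's actual implementation.

```c
#include <mpi.h>

#define BLOCK (4 * 1024 * 1024)   /* 4 MB block/transfer size, per the slide */

void strided_write(MPI_File fh, int rank, int nprocs, char *buf, int nblocks)
{
    MPI_Datatype filetype;

    /* nblocks blocks of BLOCK bytes each, separated by nprocs*BLOCK bytes. */
    MPI_Type_vector(nblocks, BLOCK, nprocs * BLOCK, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    /* Each rank's view starts at its own block within the first stride, so
     * successive writes skip over the blocks owned by the other ranks. */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    MPI_File_write(fh, buf, nblocks * BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
}
```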

  37. MercutIO vs. IBM MPI-IO Implementation: Strided Access Performance (contd.)

  API                         Write Bandwidth (MB/sec)   Read Bandwidth (MB/sec)
  POSIX                       181                        243
  IBM MPI (w/o large block)   127                        202
  IBM MPI (w/ large block)    159                        247
  MercutIO                    224                        397

  38. MercutIO vs. IBM MPI-IO Implementation: Segmented Access Performance • Platform = blue • Geometry = 4 nodes, 2 tasks per node • Iterations = 3 • Transfer size = 256 KB • Block size = 128 MB • Stride count = 1 • Access pattern = segmented • File size = 12.5 GB • Collective = false

  39. MercutIO vs. IBM MPI-IO Implementation: Segmented Access Performance (contd.)

  API                         Write Bandwidth (MB/sec)   Read Bandwidth (MB/sec)
  POSIX                       269                        245
  IBM MPI (w/o large block)   122                        156
  IBM MPI (w/ large block)    335                        240
  MercutIO                    221                        449

  40. PERUSE • PERUSE provides a level of detail and accuracy of MPI performance data that is not possible through PMPI. • PERUSE helps in investigating hard performance and scalability issues. • PERUSE can be used to study the behavior of the MPI middleware as well as the behavior of the hardware in greater detail. • PERUSE can be used to complement the performance data accessible through PMPI. • MPI profiling tools can utilize PERUSE to provide additional services for performance analysis to MPI developers.
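To clarify the comparison with PMPI: the standard profiling interface works by interposing wrappers like the generic one sketched below (shown with the MPI-2-era non-const prototype), which can only observe entry to and exit from each call; PERUSE's events expose what happens inside the library between those two points.

```c
#include <mpi.h>
#include <stdio.h>

/* A PMPI interposition wrapper: the tool's MPI_Send times the call and
 * forwards to the real implementation via the PMPI_ entry point. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();

    printf("MPI_Send: %d elements to rank %d in %g s\n", count, dest, t1 - t0);
    return rc;
}
```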
