
Message Passing Vs. Shared Address Space on a Cluster of SMPs






Presentation Transcript


  1. Message Passing Vs. Shared Address Space on a Cluster of SMPs
  Leonid Oliker, NERSC/LBNL, www.nersc.gov/~oliker
  Hongzhang Shan, Jaswinder Pal Singh, Princeton University
  Rupak Biswas, NASA Ames Research Center

  2. Overview
  • Scalable computing using clusters of PCs has become an attractive platform for high-end scientific computing
  • Currently, message passing (MP) and shared address space (SAS) are the leading programming paradigms
  • MPI is more mature and provides performance and portability; however, code development can be very difficult
  • SAS provides substantial ease of programming, but performance may suffer due to poor spatial locality and protocol overhead
  • We compare the performance of the MP and SAS models using the best implementations available to us (MPI/Pro and GeNIMA SVM)
  • We also examine hybrid programming (MPI + SAS)
  • Platform: eight 4-way 200 MHz Pentium Pro SMPs (32 processors)
  • Applications: regular (LU, OCEAN) and irregular (RADIX, N-BODY)
  • Propose and investigate improved collective communication on SMP clusters

  3. Architectural Platform
  • 32-processor Pentium Pro system built from 4-way SMP nodes
  • Each processor: 200 MHz, 8 KB L1 cache, 512 KB L2 cache; 512 MB memory
  • Giganet or Myrinet interconnect with a single crossbar switch
  • Network interface with a 33 MHz processor
  • Node-to-network bandwidth constrained by the 133 MB/s PCI bus

  4. Comparison of Programming Models
  [Diagram: In SAS, P0 and P1 load/store into a single shared array A. In MPI, P0 holds A0 and P1 holds A1; the copy A1 = A0 is expressed as an explicit send/receive pair through the communication library. A code sketch of both models follows below.]
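To make the diagram concrete, here is a minimal C sketch of the same copy in both models. The MPI half uses standard MPI_Send/MPI_Recv; the SAS half assumes a hypothetical svm_barrier() call standing in for the SVM library's synchronization primitive, and the array size N is illustrative.

```c
/* Minimal sketch of the copy A1 = A0 between two processes in each model.
 * svm_barrier() is a hypothetical SVM synchronization call; N is an
 * illustrative array size. */
#include <mpi.h>
#include <string.h>

#define N 1024

extern void svm_barrier(void);   /* hypothetical SVM synchronization call */

/* MPI: explicit send/receive pair through the communication library. */
void copy_mpi(double *a, int rank)
{
    if (rank == 0)
        MPI_Send(a, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(a, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* SAS: both processes address one shared array A; P1 simply loads what
 * P0 stored, with a barrier providing the ordering. */
void copy_sas(double *shared_a, const double *src, double *dst, int rank)
{
    if (rank == 0)
        memcpy(shared_a, src, N * sizeof(double));   /* ordinary stores */
    svm_barrier();
    if (rank == 1)
        memcpy(dst, shared_a, N * sizeof(double));   /* ordinary loads  */
}
```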

  5. SAS Programming
  • SAS in software: page-based shared virtual memory (SVM)
  • Uses the GeNIMA protocol built with VMMC on a Myrinet network
  • VMMC: Virtual Memory-Mapped Communication
  • Protected, reliable user-level communication with variable-size packets
  • Allows data transfer directly between two virtual memory address spaces
  • Single 16-way Myrinet crossbar switch
  • High-speed system area network with point-to-point links
  • Each NI connects a node to the network with two unidirectional links of 160 MB/s peak bandwidth
  • Question: what is the SVM overhead compared with a hardware-supported cache-coherent system (Origin2000)?

  6. GeNIMA Protocol
  • GeNIMA (GEneral-purpose NI support in a shared Memory Abstraction): synchronous, home-based lazy release consistency
  • Uses the virtual memory management system for page-level coherence
  • Most current systems use asynchronous interrupts for both data exchange and protocol handling
  • Asynchronous message handling on the network interface (NI) eliminates the need to interrupt the receiving host processor
  • Uses general-purpose NI mechanisms to move data between the network and user-level memory, and for mutual exclusion
  • Protocol handling occurs on the host processor at "synchronous" points, i.e., when a process is sending or receiving messages
  • Processes can modify local page copies until synchronization (see the sketch below)
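A short C sketch of the programmer-visible contract this implies, assuming hypothetical svm_acquire()/svm_release() names for the SVM lock primitives (the real GeNIMA interface may differ): stores go into the local page copy, and the protocol work happens only at the synchronization points.

```c
/* Sketch of lazy release consistency as seen by the programmer.
 * svm_acquire/svm_release are hypothetical names for the SVM library's
 * lock operations; shared_grid is a page-based shared region. */
extern void svm_acquire(int lock_id);
extern void svm_release(int lock_id);
extern double *shared_grid;

void update_cell(int i, double v)
{
    svm_acquire(0);        /* pull in write notices / invalidate stale pages */
    shared_grid[i] += v;   /* ordinary store into the local page copy        */
    svm_release(0);        /* compute diffs, send write notices to the home  */
}
```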

  7. MP Programming
  • Uses MPI/Pro, built on the VIA interface over Giganet
  • VIA: Virtual Interface Architecture
  • Industry-standard interface for system area networks
  • Protected, zero-copy, user-space inter-process communication
  • Giganet NIs (like Myrinet) use a single crossbar switch
  • VIA and VMMC have similar communication overhead (see the microbenchmark sketch below)
  [Chart: VIA vs. VMMC communication overhead, time in msecs]
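A ping-pong microbenchmark of the kind typically used for such overhead comparisons is sketched below with standard MPI calls; the message size and iteration count are illustrative, not the values used in the study.

```c
/* Round-trip (ping-pong) latency between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    char buf[1024];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("average round-trip time: %.3f msec\n",
               (t1 - t0) * 1000.0 / iters);
    MPI_Finalize();
    return 0;
}
```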

  8. Regular Applications: LU and OCEAN
  • LU factorization: factors a matrix into lower and upper triangular matrices
  • Lowest communication requirements among our benchmarks
  • One-to-many, non-personalized communication
  • In SAS, each process directly fetches the pivot block; in MPI, the block owner sends the pivot block to the other processes (see the sketch below)
  • OCEAN: models large-scale eddy and boundary currents
  • Nearest-neighbor communication patterns in a multigrid formulation
  • Red-black Gauss-Seidel multigrid equation solver
  • High communication-to-computation ratio
  • Partitioning by rows instead of by blocks (fewer but larger messages) increased speedup from 14.1 to 15.2 (on 32 processors)
  • MP and SAS partition the subgrids in the same way, but MPI involves more programming
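A C sketch of the MPI-side pivot exchange described above, assuming a round-robin assignment of diagonal blocks to processes (the benchmark's actual block distribution may differ):

```c
/* At step k of the factorization, the owner of diagonal block k makes the
 * factored pivot block available to everyone.  In MPI this is an explicit
 * one-to-many transfer (shown here as a broadcast); in SAS each process
 * would instead load the block directly from shared memory after
 * synchronizing. */
#include <mpi.h>

void share_pivot_block(double *pivot_block, int block_elems,
                       int k, int nprocs, MPI_Comm comm)
{
    int owner = k % nprocs;   /* assumed round-robin block ownership */
    MPI_Bcast(pivot_block, block_elems, MPI_DOUBLE, owner, comm);
}
```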

  9. Irregular Applications: RADIX and N-BODY
  • RADIX sorting: iterative sorting based on histograms
  • Local histograms are merged into a global histogram, which then determines how the keys are permuted
  • Irregular all-to-all communication
  • Large communication-to-computation ratio and high memory bandwidth requirement (can exceed the capacity of a PC-SMP)
  • SAS uses a global binary prefix tree to collect the local histograms; MPI uses Allgather instead of fine-grained communication (see the sketch below)
  • N-BODY: simulates body interactions (galaxies, particles, etc.)
  • 3D Barnes-Hut hierarchical octree method
  • Most complex code; highly irregular, fine-grained communication
  • Computes forces on the particles, then updates their positions
  • Significantly different MPI and SAS tree-building algorithms
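A C sketch of the MPI histogram-merge step mentioned above; the digit width and buffer names are illustrative.

```c
/* Each process counts its local keys for the current digit, then
 * MPI_Allgather gives every process all local histograms, from which the
 * global offsets for permuting keys can be computed. */
#include <mpi.h>

#define RADIX_BITS 8
#define BUCKETS    (1 << RADIX_BITS)

void merge_histograms(int *local_hist, int *all_hists,
                      int nprocs, MPI_Comm comm)
{
    /* all_hists must hold nprocs * BUCKETS ints */
    MPI_Allgather(local_hist, BUCKETS, MPI_INT,
                  all_hists, BUCKETS, MPI_INT, comm);
    /* Each rank can now scan all_hists to find where its keys belong in
       the globally permuted order for this digit. */
}
```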

  10. N-BODY Implementation Differences
  [Diagram: side-by-side comparison of the SAS and MPI tree-building approaches; the MPI version must explicitly distribute and collect cells and particles.]

  11. Improving the N-BODY SAS Implementation
  [Diagram: SAS shared tree with the high-level cells duplicated on each node.]
  • Algorithm becomes much more like message passing
  • Replication is not a "natural" programming style for SAS

  12. Performance of LU
  [Chart: execution time (sec) breakdown into LOCAL, RMEM, and SYNC for MPI and SAS; 6144 x 6144 matrix on 32 processors]
  • Communication requirements are small compared to our other applications
  • SAS and MPI have similar performance characteristics
  • Protocol overhead of running the SAS version is a small fraction of overall time (speedups on 32 processors: SAS = 21.78, MPI = 22.43)
  • For applications with low communication requirements, it is possible to achieve high scalability on PC clusters using both MPI and SAS

  13. Performance of OCEAN
  [Chart: execution time (sec) breakdown into LOCAL, RMEM, and SYNC for MPI and SAS; 514 x 514 grid on 32 processors]
  • SAS performance is significantly worse than MPI (speedups on 32 processors: SAS = 6.49, MPI = 15.20)
  • SAS suffers from expensive synchronization overhead: after each nearest-neighbor communication, a barrier synchronization is required
  • 50% of the synchronization overhead is spent waiting; the rest is protocol processing
  • Synchronization cost in MPI is much lower because the send/receive pairs synchronize implicitly

  14. Performance of RADIX
  [Chart: execution time (sec) breakdown into LOCAL, RMEM, and SYNC for MPI and SAS; 32M integers on 32 processors]
  • MPI performance is more than three times better than SAS (speedups on 32 processors: SAS = 2.07, MPI = 7.78)
  • Poor SAS speedup is due to memory bandwidth contention
  • Once again, SAS suffers from the high protocol overhead of maintaining page coherence: computing diffs, creating timestamps, generating write notices, and garbage collection

  15. Performance of N-BODY
  [Chart: execution time (sec) breakdown into LOCAL, RMEM, and SYNC for MPI and SAS; 128K particles on 32 processors]
  • SAS performance is about half that of MPI (speedups on 32 processors: SAS = 14.30, MPI = 26.94)
  • Synchronization overhead dominates the SAS runtime
  • 82% of barrier time is spent on protocol handling
  • If very high performance is the goal, message passing is necessary on commodity SMP clusters

  16. Origin2000 (Hardware Cache Coherency)
  [Diagram: Origin2000 node and communication architecture - two R12K processors with L2 caches connected through a Hub to memory, directory, and router, with an additional directory for systems larger than 32 processors.]
  • Previous results showed that on a hardware-supported cache-coherent multiprocessor platform, SAS achieved MPI performance for this set of applications

  17. Hybrid Performance on PC Cluster
  • The latest teraflop-scale systems contain large numbers of SMPs; a novel paradigm combines two layers of parallelism
  • Allows codes to benefit from loop-level parallelism and shared-memory algorithms in addition to coarse-grained parallelism (see the sketch below)
  • Tradeoff: SAS may reduce intra-SMP communication, but possibly incurs additional overhead for explicit synchronization
  • Complexity example: hybrid N-BODY requires two types of tree building: a distributed local tree for MPI and a globally shared tree for SAS
  • The hybrid performance gain (11% at most) does not compensate for the increased programming complexity
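A minimal sketch of that two-level structure, assuming one MPI process per SMP node; POSIX threads stand in for the intra-node shared-memory layer (the study used SAS/SVM rather than pthreads), and the array, computation, and the final barrier-as-halo-exchange placeholder are all illustrative.

```c
/* Coarse-grained parallelism via MPI across nodes, fine-grained work
 * shared among the four processors inside a node. */
#include <mpi.h>
#include <pthread.h>

#define THREADS_PER_NODE 4
#define CHUNK (1 << 20)

static double node_chunk[CHUNK];      /* this node's part of the grid */

static void *intra_node_work(void *arg)
{
    long tid = (long)arg;
    long n = CHUNK / THREADS_PER_NODE;
    /* each of the 4 processors updates a quarter of the node's chunk */
    for (long i = tid * n; i < (tid + 1) * n; i++)
        node_chunk[i] *= 0.5;          /* illustrative computation */
    return NULL;
}

void hybrid_step(MPI_Comm comm)
{
    pthread_t t[THREADS_PER_NODE];
    for (long i = 0; i < THREADS_PER_NODE; i++)
        pthread_create(&t[i], NULL, intra_node_work, (void *)i);
    for (int i = 0; i < THREADS_PER_NODE; i++)
        pthread_join(t[i], NULL);
    /* coarse-grained layer: exchange boundary data between nodes
       (a barrier stands in for the halo exchange here) */
    MPI_Barrier(comm);
}
```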

  18. MPI Collective Function: MPI_Allreduce
  • How can collective communication be better structured on PC-SMP clusters?
  • We explore algorithms for MPI_Allreduce and MPI_Allgather
  • The MPI/Pro version is labeled "Original" (its exact algorithms are undocumented)
  • For MPI_Allreduce, the structure of our 4-way SMPs motivates us to modify the deepest level of the binary tree (B-Tree) to a quadtree (B-Tree-4), as sketched below
  • There is no difference between using SAS or MPI communication at the lowest level
  • Execution time (in μsecs) on 32 processors for one double-precision variable: Original 1117, B-Tree 1035, B-Tree-4 981
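A hedged C sketch of the B-Tree-4 structure using MPI communicator splitting; the study's version is hand-built on top of send/receive, and the rank-to-node mapping below (four consecutive ranks per node) is an assumption.

```c
/* The four processors of each SMP node combine their values first (the
 * quadtree leaf level), node leaders then reduce across nodes (the upper
 * binary-tree levels), and the result is broadcast back inside each node. */
#include <mpi.h>

double allreduce_btree4(double local, MPI_Comm comm)
{
    int rank, local_rank;
    double node_sum = 0.0, global_sum = 0.0;
    MPI_Comm intra, inter;

    MPI_Comm_rank(comm, &rank);

    /* quadtree leaf level: the 4 ranks of one node */
    MPI_Comm_split(comm, rank / 4, rank, &intra);
    MPI_Comm_rank(intra, &local_rank);

    /* upper levels: one leader (local rank 0) per node */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &inter);

    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, intra);
    if (local_rank == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, inter);
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, intra);

    MPI_Comm_free(&intra);
    if (inter != MPI_COMM_NULL)
        MPI_Comm_free(&inter);
    return global_sum;
}
```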

  19. MPI Collective Function: MPI_Allgather
  • Several algorithms were explored, initially B-Tree and B-Tree-4
  • B-Tree-4*: after a processor at Level 0 collects the data, it sends it to Level 1 and below; however, Level 1 already contains the data from its own subtree
  • It is therefore redundant to broadcast ALL the data back; only the missing data needs to be exchanged (this can be extended down to the lowest level of the tree, bounded by the size of the SMP; see the sketch below)
  • The improved communication functions result in up to a 9% performance gain (most time is spent in the send/receive functions)
  [Table: Time (msecs) for P=32 (8 nodes)]
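Along the same lines, a hedged sketch of the B-Tree-4* idea, reusing the per-node and leader communicators from the previous sketch; the refinement is that a node's own block is never re-broadcast. One element per rank and four ranks per node are illustrative assumptions.

```c
/* Hierarchical allgather: each 4-way node gathers locally, node leaders
 * exchange blocks across nodes, and leaders forward only the blocks their
 * node does not already hold. */
#include <mpi.h>

void allgather_btree4(double *mine, double *all,
                      int nprocs, int node,    /* this rank's node index */
                      MPI_Comm intra, MPI_Comm inter)
{
    int local_rank;
    MPI_Comm_rank(intra, &local_rank);

    /* Step 1: everyone in the node collects the node's block, placed
       directly at its final offset in the result array. */
    MPI_Allgather(mine, 1, MPI_DOUBLE, &all[node * 4], 1, MPI_DOUBLE, intra);

    /* Step 2: node leaders exchange node blocks across nodes. */
    if (local_rank == 0)
        MPI_Allgather(MPI_IN_PLACE, 4, MPI_DOUBLE, all, 4, MPI_DOUBLE, inter);

    /* Step 3: the B-Tree-4* refinement - forward only the blocks that
       come from other nodes; the node's own block is already in place. */
    if (node > 0)
        MPI_Bcast(all, node * 4, MPI_DOUBLE, 0, intra);
    if (node < nprocs / 4 - 1)
        MPI_Bcast(&all[(node + 1) * 4], nprocs - (node + 1) * 4,
                  MPI_DOUBLE, 0, intra);
}
```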

  20. Conclusions
  • Examined the performance of several regular and irregular applications using MP (MPI/Pro over VIA on Giganet) and SAS (GeNIMA over VMMC on Myrinet) on a 32-processor PC-SMP cluster
  • SAS provides substantial ease of programming, especially for more complex codes that are irregular and dynamic
  • Unlike previous research on hardware-supported CC-SAS machines, SAS achieved only about half the parallel efficiency of MPI for most of our applications (LU was an exception, where performance was similar)
  • The high overhead of SAS is due to the excessive cost of the SVM protocol for maintaining page coherence and implementing synchronization
  • Hybrid codes offered no significant performance advantage over pure MPI, but increased programming complexity and reduced portability
  • Presented new algorithms for improved SMP collective communication functions
  • If very high performance is the goal, the difficulty of MPI programming appears to be necessary on today's commodity SMP clusters
