Message Passing Vs. Shared Address Space on a Cluster of SMPs

Leonid Oliker, NERSC/LBNL (www.nersc.gov/~oliker)

Hongzhang Shan and Jaswinder Pal Singh, Princeton University

Rupak Biswas, NASA Ames Research Center



Overview

  • Scalable computing using clusters of PCs has become an attractive platform for high-end scientific computing

  • Currently, message passing (MP) and shared address space (SAS) are the leading programming paradigms

    • MPI is more mature and provides performance & portability; however, code development can be very difficult

    • SAS provides substantial ease of programming, but performance may suffer due to poor spatial locality and protocol overhead

    • We compare performance of MP and SAS models using best implementations available to us (MPI/Pro and GeNIMA SVM)

  • Also examine hybrid programming (MPI + SAS)

  • Platform: eight 4-way 200 MHz Pentium Pro SMPs (32 procs)

  • Applications: regular (LU, OCEAN), irregular (RADIX, N-BODY)

  • Propose / investigate improved collective comm on SMP clusters



Architectural Platform

32-processor Pentium Pro system: eight 4-way SMP nodes, 200 MHz processors, 8 KB L1 cache, 512 KB L2 cache, 512 MB memory per node

Interconnect: Giganet or Myrinet single crossbar switch; each network interface has a 33 MHz processor; node-to-network bandwidth is constrained by the 133 MB/s PCI bus



Comparison of programming models

[Diagram] SAS: P0 and P1 access a single shared array A through ordinary loads and stores. MPI: P0 and P1 each hold a private copy (A0, A1); the transfer A1 = A0 is carried out by an explicit Send/Receive pair through the communication library. A minimal code sketch follows.
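
As a concrete illustration of the diagram, the following minimal C/MPI sketch (written for this transcript, not taken from the talk) moves an array A from P0 to P1 with an explicit Send/Receive pair; the comment at the end shows how the same exchange would look as plain loads and stores under SAS.

/* Run with at least two MPI ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double A[N];                    /* each rank's private copy (A0, A1, ...) */

    if (rank == 0) {
        for (int i = 0; i < N; i++) A[i] = i + 1.0;
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);     /* P0: Send    */
    } else if (rank == 1) {
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* P1: Receive */
        printf("P1 received A[0] = %f\n", A[0]);              /* A1 = A0     */
    }

    /* In the SAS model the same exchange is implicit:
     *   shared double A[N];       // one logical array visible to all procs
     *   if (me == 0) A[i] = ...;  // plain stores by P0
     *   barrier();                // synchronization point
     *   if (me == 1) x = A[i];    // plain loads by P1                      */

    MPI_Finalize();
    return 0;
}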



SAS Programming

  • SAS in software: page-based shared virtual memory (SVM)

  • Use GeNIMA protocol built with VMMC on Myrinet network

  • VMMC – Virtual Memory Mapped Communication

    • Protected reliable user-level comm; variable size packets

    • Allows data transfer directly between two virtual memory address spaces

  • Single 16-way Myrinet crossbar switch

    • High-speed system area network with point-to-point links

    • Each NI connects nodes to network with two unidirectional links of 160 MB/s peak bandwidth

  • What is the SVM overhead compared with hardware supported cache-coherent system (Origin2000)?



GeNIMA Protocol

  • GeNIMA (GEneral-purpose NI support in a shared Memory Abstraction): Synchronous home-based lazy-release consistency

  • Uses virtual memory mgmt sys for page-level coherence

  • Most current systems use asynchronous interrupts for both data exchange and protocol handling

  • Asynchronous message handling on network interface (NI) eliminates need to interrupt receiving host processor

  • Use general-purpose NI mechanism to move data between network and user-level memory & for mutual exclusion

  • Protocol handling on host processor at “synchronous” points – when a process is sending / receiving messages

  • Procs can modify local page copies until synchronization
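
To make the last point concrete, here is a small self-contained C toy (invented for this write-up, not GeNIMA code) of the twin/diff idea commonly used in home-based lazy release consistency: a process snapshots a page at acquire time, writes its local copy freely, and at release ships only the modified words to the page's home copy.

#include <stdio.h>
#include <string.h>

#define PAGE_WORDS 8

/* Home copy of one shared page (would live on the page's home node). */
static int home_page[PAGE_WORDS];

/* Called at lock acquire / barrier: snapshot the current page contents. */
static void make_twin(const int *local, int *twin) {
    memcpy(twin, local, PAGE_WORDS * sizeof(int));
}

/* Called at lock release / barrier: compare the local copy against the twin
 * and apply only the changed words (the "diff") to the home copy. */
static int flush_diff(const int *local, const int *twin, int *home) {
    int changed = 0;
    for (int i = 0; i < PAGE_WORDS; i++) {
        if (local[i] != twin[i]) {   /* word was modified since acquire */
            home[i] = local[i];
            changed++;
        }
    }
    return changed;
}

int main(void) {
    int local[PAGE_WORDS] = {0}, twin[PAGE_WORDS];

    make_twin(local, twin);          /* acquire: remember pre-write state   */
    local[2] = 42;                   /* writes hit the local copy only      */
    local[5] = 7;
    int n = flush_diff(local, twin, home_page);  /* release: ship the diff  */

    printf("propagated %d modified words; home_page[2]=%d home_page[5]=%d\n",
           n, home_page[2], home_page[5]);
    return 0;
}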



MP Programming

  • Use MPI/Pro, built on the VIA interface over Giganet

  • VIA - Virtual Interface Architecture

    • Industry standard interface for system area networks

    • Protected zero-copy user-space inter-process communication

  • Giganet (like Myrinet) NIs use a single crossbar switch

  • VIA and VMMC have similar communication overhead

[Chart: communication time (msecs) for VIA and VMMC]
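
The overhead comparison above is the kind of number a simple ping-pong microbenchmark produces; a minimal MPI version (illustrative only, with arbitrary message size and iteration count) is sketched below.

/* Run with at least two MPI ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000, bytes = 1024;   /* arbitrary choices */
    char buf[1024];
    memset(buf, 0, sizeof buf);
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                     /* rank 0 sends, then waits     */
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {              /* rank 1 echoes the message    */
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg round-trip time: %f msec\n", (t1 - t0) / iters * 1e3);

    MPI_Finalize();
    return 0;
}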



Regular Applications: LU and OCEAN

  • LU factorization: Factors a matrix into lower and upper triangular parts

    • Lowest communication requirements among our benchmarks

    • One-to-many non-personalized communication

    • In SAS, each process directly fetches the pivot block; in MPI, the block owner sends the pivot block to the other processes (a hedged MPI sketch follows this list)

  • OCEAN: Models large-scale eddy and boundary currents

    • Nearest-neighbor comm patterns in a multigrid formulation

    • Red-black Gauss-Seidel multigrid equation solver

    • High communication-to-computation ratio

    • Partitioning by rows instead of by blocks (fewer but larger messages) increased speedup from 14.1 to 15.2 (on 32 procs)

    • MP and SAS partition subgrids in the same way, but MPI involves more programming effort
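
As referenced above, a hedged sketch of the MPI-side pivot-block exchange in LU (block size, block count, and data layout are placeholders, not the talk's actual code): the process owning the current diagonal block broadcasts it, and every process uses it to update its own blocks.

#include <mpi.h>
#include <stdlib.h>

#define B 64                                 /* block size (placeholder) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *pivot = calloc((size_t)B * B, sizeof(double));
    int nblocks = 8;                         /* number of diagonal blocks (placeholder) */

    for (int k = 0; k < nblocks; k++) {
        int owner = k % nprocs;              /* owner of pivot block k */
        if (rank == owner) {
            /* ... factor the local diagonal block k into pivot ... */
        }
        /* Owner sends the pivot block to everyone; in the SAS version each
         * process instead fetches the block with ordinary loads once a flag
         * or barrier indicates it is ready. */
        MPI_Bcast(pivot, B * B, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        /* ... every process updates the blocks it owns using pivot ... */
    }

    free(pivot);
    MPI_Finalize();
    return 0;
}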



Irregular Applications: RADIX and N-BODY

  • RADIX sorting: Iterative sorting based on histograms

    • Local histograms are combined into a global histogram, which is then used to permute the keys

    • Irregular all-to-all communication

    • Large comm-to-comp ratio, and high memory bandwidth requirement (can exceed capacity of PC-SMP)

    • SAS uses a global binary prefix tree to collect local histograms; MPI uses Allgather instead of fine-grained comm (see the sketch after this list)

  • N-BODY: Simulates body interaction (galaxy, particle, etc)

    • 3D Barnes-Hut hierarchical octree method

    • Most complex code, highly irregular fine-grained comm

    • Compute forces on particles, then update their positions

    • Significantly different MPI and SAS tree-building algorithms
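
A hedged sketch of the Allgather-based histogram exchange mentioned for RADIX (radix width, key count, and key generation are placeholders): each process builds a local histogram for the current digit, and MPI_Allgather gives every process all local histograms, from which global offsets for the key permutation can be computed.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RADIX 16                              /* buckets per pass (placeholder) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nkeys = 1 << 16;                      /* keys per process (placeholder) */
    unsigned *keys = malloc(nkeys * sizeof(unsigned));
    for (int i = 0; i < nkeys; i++) keys[i] = (unsigned)rand();

    int local_hist[RADIX] = {0};
    for (int i = 0; i < nkeys; i++)           /* histogram of the lowest digit */
        local_hist[keys[i] % RADIX]++;

    /* Every process receives every local histogram (nprocs x RADIX counts). */
    int *all_hist = malloc((size_t)nprocs * RADIX * sizeof(int));
    MPI_Allgather(local_hist, RADIX, MPI_INT,
                  all_hist, RADIX, MPI_INT, MPI_COMM_WORLD);

    /* Global offset of this process's first bucket-0 key: bucket-0 keys held
     * by lower-ranked processes come first in the permuted order. */
    long offset = 0;
    for (int p = 0; p < rank; p++) offset += all_hist[p * RADIX + 0];
    if (rank == 0) printf("process 0 bucket-0 offset = %ld\n", offset);

    free(keys); free(all_hist);
    MPI_Finalize();
    return 0;
}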



N-BODY Implementation Differences

[Diagram: how cells and particles are distributed and collected in the SAS and MPI implementations]



Improving N-BODY SAS Implementation

  • Duplicate the high-level cells of the SAS shared tree on each process

  • Algorithm becomes much more like message passing

  • Replication not “natural” programming style for SAS



Performance of LU

  • Communication requirements small compared to our other apps

  • SAS and MPI have similar performance characteristics

  • Protocol overhead of running the SAS version is a small fraction of overall time (speedups on 32p: SAS = 21.78, MPI = 22.43)

  • For applications with low comm requirements, it is possible to achieve high scalability on PC clusters using both MPI and SAS

[Chart: MPI vs. SAS execution time (sec) on a 6144 x 6144 matrix, 32 processors, broken down into LOCAL, RMEM, and SYNC components]



Performance of OCEAN

  • SAS performance significantly worse than MPI (speedups on 32p: SAS = 6.49, MPI = 15.20)

  • SAS suffers from expensive synchronization overhead: after each nearest-neighbor comm, a barrier sync is required

  • 50% of sync overhead spent waiting, rest is protocol processing

  • Sync in MPI is much lower due to implicit send / receive pairs

[Chart: MPI vs. SAS execution time (sec) on a 514 x 514 grid, 32 processors, broken down into LOCAL, RMEM, and SYNC components]



Performance of RADIX

  • MPI performance more than three times better than SAS (speedups on 32p: SAS = 2.07, MPI = 7.78)

  • Poor SAS speedup due to memory bandwidth contention

  • Once again, SAS suffers from the high protocol overhead of maintaining page coherence: computing diffs, creating timestamps, generating write notices, and garbage collection

[Chart: MPI vs. SAS execution time (sec) sorting 32M integers, 32 processors, broken down into LOCAL, RMEM, and SYNC components]



Performance of N-BODY

  • SAS performance about half that of MPI (speedups on 32p: SAS = 14.30, MPI = 26.94)

  • Synchronization overhead dominates SAS runtime

  • 82% of barrier time spent on protocol handling

  • If very high performance is the goal, message passing is necessary for commodity SMP clusters

[Chart: MPI vs. SAS execution time (sec) for 128K particles, 32 processors, broken down into LOCAL, RMEM, and SYNC components]



Origin2000 (Hardware Cache Coherency)

[Diagram: Origin2000 node and communication architecture: two R12K processors with L2 caches connected through a Hub to local memory, directory (extra directory for >32P), and router]

Previous results showed that on a hardware-supported cache-coherent multiprocessor platform, SAS matched MPI performance for this set of applications



Hybrid Performance on PC Cluster

  • Latest teraflop-scale systems contain a large number of SMPs; a novel paradigm combines two layers of parallelism

  • Allows codes to benefit from loop-level parallelism and shared-memory algorithms in addition to coarse-grained parallelism

  • Tradeoff: SAS may reduce intra-SMP communication, but possibly incur additional overhead for explicit synchronization

  • Complexity example: Hybrid N-BODY requires two types of tree-building: MPI – distributed local tree, SAS – globally shared tree

  • Hybrid performance gain (11% max) does not compensate for increased programming complexity
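
For illustration, a minimal two-level sketch of the hybrid idea, using OpenMP threads as a stand-in for the intra-node shared-memory layer (the talk's hybrid codes layered the SVM-based SAS model over MPI, so this is an analogy, not their implementation): MPI handles coarse-grained decomposition across SMP nodes, and loop-level parallelism runs inside each node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Coarse-grained decomposition across MPI processes (one per node). */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;
    double local = 0.0;

    /* Loop-level parallelism across the processors within the node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = lo; i < hi; i++)
        local += (double)i;

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}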



MPI Collective Function: MPI_Allreduce

  • How to better structure collective communication on PC-SMP clusters?

  • We explore algorithms for MPI_Allreduce and MPI_Allgather

  • MPI/Pro version labeled “Original” (exact algorithms undocumented)

  • For MPI_Allreduce, structure of our 4-way SMP motivates us to modify the deepest level of the B-Tree to a quadtree (B-Tree-4)

  • No difference in using SAS or MPI communication at lowest level

  • Execution time (in msecs) on 32 procs for one double-precision variable: Original 1117, B-Tree 1035, B-Tree-4 981
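
A hedged sketch of the hierarchy idea behind B-Tree-4 (not MPI/Pro's or the talk's exact algorithm): reduce within each 4-way node first, combine across node leaders, then broadcast back within the node. The assumption that 4 consecutive MPI ranks share one SMP node is part of the sketch.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split ranks into per-node communicators of 4, plus a communicator that
     * contains one leader (local rank 0) per node. */
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &node_comm);
    int local;
    MPI_Comm_rank(node_comm, &local);
    MPI_Comm_split(MPI_COMM_WORLD, local == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    double x = rank + 1.0, node_sum = 0.0, total = 0.0;

    /* Deepest level: reduce within the 4-way node (the quadtree leaf). */
    MPI_Reduce(&x, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Upper levels: combine across node leaders only. */
    if (local == 0)
        MPI_Allreduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, leader_comm);

    /* Fan the result back out inside each node. */
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, node_comm);

    if (rank == 0) printf("allreduce result = %f\n", total);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}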



MPI Collective Function: MPI_Allgather

  • Several algorithms were explored: Initially, B-Tree and B-Tree-4

  • B-Tree-4*: After a processor at Level 0 collects data, it sends it to Level 1 and below; however, Level 1 already contains data from its own subtree

    • Thus it is redundant to broadcast ALL the data back; instead, only the necessary data needs to be exchanged (this can be extended to the lowest level of the tree, bounded by the size of the SMP)

  • Improved communication functions result in up to 9% performance gain (most time spent in send / receive functions)

[Table: MPI_Allgather time (msecs) for P=32 (8 nodes) for the explored algorithms]



Conclusions

  • Examined performance for several regular and irregular applications using MP (MPI/Pro over VIA on Giganet) and SAS (GeNIMA over VMMC on Myrinet) on a 32-processor PC-SMP cluster

  • SAS provides substantial ease of programming, esp. for more complex codes which are irregular and dynamic

  • Unlike previous research on hardware-supported CC-SAS machines, SAS achieved about half the parallel efficiency of MPI for most of our applications (LU was an exception, where performance was similar)

  • High overhead for SAS due to excessive cost of SVM protocol associated with maintaining page coherence and implementing synch

  • Hybrid codes offered no significant performance advantage over pure MPI, but increased programming complexity and reduced portability

  • Presented new algorithms for improved SMP communication functions

  • If very high performance is the goal, the difficulty of MPI programming appears to be necessary for commodity SMP clusters of today

