SHMEM Programming Model

Hung-Hsun Su

UPC Group, HCS lab

1/23/2004


Outline

  • Background

  • Nuts and Bolts

  • GPSHMEM

  • Performance

  • Conclusion

  • Reference


Background – What is SHMEM?

  • SHared MEMory library

  • Based on SPMD model

  • Available for C / Fortran

  • Hybrid Message Passing / Shared Memory Programming Model

    • Message Passing Like

      • Explicit communication, replication and synchronization

      • Specification of remote data location (processor id) is required

    • Shared Memory like

      • Provides logically shared memory system view

      • Communication requires the processor on one side only

        • Allows any processing element (PE) to access memory in a remote PE without involving the microprocessor on the remote PE (put / get)

        • Non-blocking data transfer


Background – What is SHMEM?

  • The address of a variable on the remote processor must be known for a transfer

    • Symmetric variables have the same address on all PEs

  • Remotely accessible data objects (symmetric variables)

    • Global variables

    • Local static variables

    • Variables in common blocks

    • Fortran variables modified by a !DIR$ SYMMETRIC directive

    • C variables modified by a #pragma symmetric directive
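The "same address on all PEs" rule can be pictured with a single-process model. The sketch below is not SHMEM code and calls no SHMEM routine. The names NPES, NELEMS, sym_data, model_put and model_get are hypothetical, chosen only for illustration. Each row of sym_data stands in for one PE's copy of a symmetric array, so one offset names the same variable on every PE and the pe argument selects whose copy is touched.

```c
#include <assert.h>
#include <string.h>

#define NPES   4   /* hypothetical number of simulated PEs */
#define NELEMS 8   /* elements in each PE's copy of the symmetric array */

/* One copy of the "symmetric" array per simulated PE. The same offset
   refers to the same variable on every PE. */
static int sym_data[NPES][NELEMS];

/* Model of a put: copy len ints from a local buffer into PE pe's copy,
   starting at offset. Only the calling side takes part. */
void model_put(size_t offset, const int *source, size_t len, int pe)
{
    assert(pe >= 0 && pe < NPES);
    assert(offset + len <= NELEMS);
    memcpy(&sym_data[pe][offset], source, len * sizeof(int));
}

/* Model of a get: copy len ints from PE pe's copy into a local buffer. */
void model_get(int *target, size_t offset, size_t len, int pe)
{
    assert(pe >= 0 && pe < NPES);
    assert(offset + len <= NELEMS);
    memcpy(target, &sym_data[pe][offset], len * sizeof(int));
}
```

In this model a put to PE 1 leaves the copies on the other PEs untouched, which mirrors why the caller must name both the symmetric variable and the target PE.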


Background – Why program in SHMEM?

  • Easier to program in than MPI / PVM

  • Low latency, high bandwidth data transfer

    • Puts

    • Gets

  • Provides efficient collective communication

    • Gather / Scatter

    • All-to-all

    • Broadcast

    • Reductions

  • Provides mechanisms to implement mutual exclusion

    • Atomic swap

    • Locking

  • Provides synchronization mechanisms

    • Barrier

    • Fence, Quiet


Background – Supported Platforms

  • SHMEM

    • Cray T3D, T3E, PVP

    • SGI Irix, Origin

    • Compaq SC

    • IBM SP

    • Quadrics Linux Cluster

    • SCI (?)

  • GPSHMEM (Version 1.0)

    • IBM SP

    • SGI Origin

    • Cray J90, T3E

    • Unix/Linux

    • Windows NT

    • Myrinet (?)


Nuts & Bolts – Initialization

  • Include header shmem.h / shmem.fh to access the library

  • shmem_init() – Initializes SHMEM

  • my_pe() – Get the PE ID of the local processor

  • num_pes() – Get the total number of PEs in the system

    #include <stdio.h>
    #include <stdlib.h>
    #include "shmem.h"

    int main(int argc, char **argv)
    {
        int me, npes;   /* named so they do not shadow my_pe() / num_pes() */

        shmem_init();       /* initialize SHMEM */
        me = my_pe();       /* ID of the local PE */
        npes = num_pes();   /* total number of PEs */
        printf("Hello World from process %d of %d\n", me, npes);
        exit(0);
    }


Nuts & Bolts – Data Transfer

  • Put

    • Specific Variable

      • void shmem_TYPE_p(TYPE *addr, TYPE value, int pe)

        • TYPE = double, float, int, long, short

    • Contiguous Object

      • void shmem_put(void *target, const void *source, size_t len, int pe)

      • void shmem_TYPE_put(TYPE *target, const TYPE*source, size_t len, int pe)

        • TYPE = double, float, int, long, longdouble, longlong, short

      • void shmem_putSS(void *target, const void *source, size_t len, int pe)

        • Storage Size (SS) = 32, 64 (default), 128, mem (any size)


Nuts & Bolts – Data Transfer

  • Get

    • Specific Variable

      • TYPE shmem_TYPE_g(TYPE *addr, int pe)

        • TYPE = double, float, int, long, short

    • Contiguous Object

      • void shmem_get(void *target, const void *source, size_t len, int pe)

      • void shmem_TYPE_get(TYPE *target, const TYPE*source, size_t len, int pe)

        • TYPE = double, float, int, long, longdouble, longlong, short

      • void shmem_getSS(void *target, const void *source, size_t len, int pe)

        • Storage Size (SS) = 32, 64 (default), 128, mem (any size)
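One detail that is easy to get wrong with the sized routines is the unit of len: for shmem_putSS / shmem_getSS with a numeric SS it counts elements of SS bits, not bytes, while the mem variants count bytes. The helper below is a hypothetical illustration (words_for is not a SHMEM routine) of converting a byte count into the element count expected by the 32-, 64- or 128-bit variants. It assumes the buffer size is an exact multiple of the element size.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a buffer size in bytes into a count of ss_bits-wide elements,
   as needed for the len argument of a sized put or get.
   The buffer must be an exact multiple of the element size. */
size_t words_for(size_t nbytes, size_t ss_bits)
{
    assert(ss_bits == 32 || ss_bits == 64 || ss_bits == 128);
    size_t word_bytes = ss_bits / 8;
    assert(nbytes % word_bytes == 0);
    return nbytes / word_bytes;
}
```

For example, an array occupying 64 bytes would be passed to a 64-bit put with a length of words_for(64, 64), that is 8 elements.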


Nuts & Bolts – Collective Communication

  • Broadcast

    • void shmem_broadcast(void *target, void *source, int nlong, int PE_root, int PE_start, int PE_group, int PE_size, long *pSync)

    • One-to-all communication

  • Collection

    • void shmem_collect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)

    • void shmem_fcollect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)

    • Concatenates data items from the source array into the target array over the defined set of PEs. The resultant target array consists of the contribution from the first PE in the set, followed by the contribution from the second PE, and so on.

pSync – symmetric work array. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the routine. It is used to prevent overlapping collective communication.
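The set of PEs taking part in a collective, and the order in which collect lays out their data, can be sketched in plain C. This is a single-process model, not SHMEM code. It assumes that the PE_group argument in the signatures above is the log2 of the stride between consecutive PEs (the Cray man pages call this argument logPE_stride). The names active_set_pe and model_fcollect are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* PE number of the i-th member (i = 0 .. PE_size-1) of the active set,
   assuming the stride between members is 2^log_pe_stride. */
int active_set_pe(int pe_start, int log_pe_stride, int i)
{
    assert(pe_start >= 0 && log_pe_stride >= 0 && i >= 0);
    return pe_start + i * (1 << log_pe_stride);
}

/* Model of the fcollect result: sources[i] is the nelems-element
   contribution of the i-th active-set member; target receives the
   contributions back to back in active-set order. */
void model_fcollect(int *target, const int *const sources[],
                    size_t nelems, int pe_size)
{
    assert(pe_size > 0);
    for (int i = 0; i < pe_size; i++)
        for (size_t k = 0; k < nelems; k++)
            target[(size_t)i * nelems + k] = sources[i][k];
}
```

Under this assumption, a call with PE_start = 2, a log2 stride of 1 and PE_size = 4 would involve PEs 2, 4, 6 and 8.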


Nuts & Bolts – Synchronization

  • Barrier

    • void shmem_barrier_all(void)

      • Suspends the calling PE until all PEs have called this function

    • void shmem_barrier(int PE_start, int PE_group, int PE_size, long *pSync)

      • Barrier operation on subset of PEs

  • Wait

    • Suspends the calling PE until a remote PE writes to var a value NOT equal to the one specified

    • void shmem_wait(long *var, long value)

    • void shmem_TYPE_wait(TYPE *var, TYPE value)

      • TYPE = int, long, longlong, short

  • Conditional Wait

    • Same as wait except the comparison can now be >=, >, ==, !=, <, <=

    • void shmem_wait_until(long *var, int cond, long value)
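The relationship between the two wait forms can be written down as a predicate. The sketch below is a local model only: the enum model_cmp and the functions model_cond_met and model_wait_done are hypothetical stand-ins, not the library's comparison constants or routines. It shows the condition under which a wait would return, with the plain wait as the "not equal" special case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the library's comparison constants. */
enum model_cmp {
    MODEL_CMP_EQ, MODEL_CMP_NE,
    MODEL_CMP_GT, MODEL_CMP_GE,
    MODEL_CMP_LT, MODEL_CMP_LE
};

/* True when a conditional wait on var would return, i.e. when
   "var <cond> value" holds. */
bool model_cond_met(long var, enum model_cmp cond, long value)
{
    switch (cond) {
    case MODEL_CMP_EQ: return var == value;
    case MODEL_CMP_NE: return var != value;
    case MODEL_CMP_GT: return var >  value;
    case MODEL_CMP_GE: return var >= value;
    case MODEL_CMP_LT: return var <  value;
    case MODEL_CMP_LE: return var <= value;
    }
    assert(!"unknown comparison");
    return false;
}

/* The plain wait returns once var is no longer equal to value. */
bool model_wait_done(long var, long value)
{
    return model_cond_met(var, MODEL_CMP_NE, value);
}
```

A typical use of the real routines is a flag that starts at 0 on the waiting PE and is set to a nonzero value by a remote put; in the model, model_wait_done(flag, 0) becomes true at that point.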


Nuts & Bolts – Synchronization

  • Fence

    • All put operations issued to a particular PE before the call are guaranteed to be delivered before any remote write to the same PE issued after the call

    • Ensures ordering of remote write (put) operations

  • Quiet

    • Waits for completion of all outstanding remote writes initiated from the calling PE


Nuts & Bolts – Atomic Operations

  • Atomic Swap

    • Unconditional

      • long shmem_swap(long *target, long value, int pe)

    • Conditional

      • int shmem_int_cswap(int *target, int cond, int value, int pe)

  • Arithmetic

    • add, increment

      • int shmem_int_fadd(int *target, int value, int pe)
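The return values of these routines are worth spelling out: each one returns the value the target held before the operation. The sketch below mirrors those semantics on a local variable using C11 atomics. It is an analogy, not SHMEM code: there is no remote PE, and model_swap, model_int_cswap and model_int_fadd are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Unconditional swap: store value, return the previous contents. */
long model_swap(atomic_long *target, long value)
{
    assert(target != NULL);
    return atomic_exchange(target, value);
}

/* Conditional swap: store value only if the target equals cond.
   Returns the previous contents either way. */
int model_int_cswap(atomic_int *target, int cond, int value)
{
    assert(target != NULL);
    int expected = cond;
    /* On failure, expected is overwritten with the current contents;
       on success it still equals cond, which was the old value. */
    atomic_compare_exchange_strong(target, &expected, value);
    return expected;
}

/* Fetch-and-add: add value, return the previous contents. */
int model_int_fadd(atomic_int *target, int value)
{
    assert(target != NULL);
    return atomic_fetch_add(target, value);
}
```

A caller can tell whether a conditional swap succeeded by comparing the returned value with cond, which is the usual building block for a simple lock.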


Nuts & Bolts – Collective Reduction

  • Collective logical operations

    • and, or, xor

    • void shmem_int_and_to_all(int *target, int *source, int nreduce, int PE_start, int PE_group, int PE_size, int *pWrk, long *pSync)

  • Collective comparison operations

    • max, min

    • void shmem_double_max_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)

  • Collective arithmetic operations

    • product, sum

    • void shmem_double_prod_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
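The reductions work element by element: element k of the target is the reduction of element k of the source arrays across all PEs in the active set, and every participating PE receives the same result. The following single-process model illustrates this for sum and max. It is not SHMEM code, and the names model_int_sum_to_all and model_int_max_to_all are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* sources[i] is the nreduce-element source array of the i-th PE in the
   active set. target[k] becomes the sum of sources[i][k] over all i. */
void model_int_sum_to_all(int *target, const int *const sources[],
                          size_t nreduce, int pe_size)
{
    assert(pe_size > 0);
    for (size_t k = 0; k < nreduce; k++) {
        int acc = 0;
        for (int i = 0; i < pe_size; i++)
            acc += sources[i][k];
        target[k] = acc;
    }
}

/* Same layout, but target[k] becomes the maximum of sources[i][k]. */
void model_int_max_to_all(int *target, const int *const sources[],
                          size_t nreduce, int pe_size)
{
    assert(pe_size > 0);
    for (size_t k = 0; k < nreduce; k++) {
        int best = sources[0][k];
        for (int i = 1; i < pe_size; i++)
            if (sources[i][k] > best)
                best = sources[i][k];
        target[k] = best;
    }
}
```

In the real routines the pWrk and pSync arrays supply scratch and synchronization space for this computation; the model needs neither because everything happens in one process.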


Nuts & Bolts – Other

  • Address Manipulation

    • shmem_ptr - Returns a pointer to a data object on a remote PE

  • Cache Control

    • shmem_clear_cache_inv - Disables automatic cache coherency mode

    • shmem_set_cache_inv - Enables automatic cache coherency mode

    • shmem_set_cache_line_inv - Enables automatic line cache coherency mode

    • shmem_udcflush - Makes the entire user data cache coherent

    • shmem_udcflush_line - Makes coherent a cache line


Nuts & Bolts – Example (Array copy)

    #include <stdio.h>
    #include <mpp/shmem.h>
    #include <intrinsics.h>

    int me, npes, i;
    int source[8], dest[8];   /* global, therefore symmetric */

    int main(void)
    {
        /* Get PE information */
        me = _my_pe();
        npes = _num_pes();

        /* Initialize and send on PE 1 */
        if (me == 1) {
            for (i = 0; i < 8; i++)
                source[i] = i + 1;
            /* Length is given in 64-bit words, not bytes */
            shmem_put64(dest, source, 8*sizeof(dest[0])/8, 0);
        }

        /* Make sure the transfer is complete */
        shmem_barrier_all();

        /* Print from the receiving PE */
        if (me == 0) {
            _shmem_udcflush();
            printf(" DEST ON PE 0:");
            for (i = 0; i < 8; i++)
                printf(" %d%c", dest[i], (i < 7) ? ',' : '\n');
        }
        return 0;
    }


GPSHMEM

  • Ames Lab / Pacific Northwest National Lab collaborative project

  • Communication library similar to the SHMEM library, but aiming for full portability

  • Mostly the T3D components with some “extensions” of functionality

  • Research Quality at this point

ARMCI = A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems


Performance – Latency (Origin 2000)


Performance – Latency (T3E 600)


Performance – Bandwidth

Taken from http://infm.cineca.it/documenti/incontro_infm/comunicazio/sld015.htm


Performance – Bandwidth


Performance – Broadcast


Performance – All to all


Performance – Ocean

On SGI Origin 2000


Performance – Radix

On SGI Origin 2000


Conclusion

  • Hybrid MP / Shared Memory programming model

  • Compared to MP

    • Pros

      • Easier to use

      • Lower latency, higher bandwidth communication

      • More scalable (within limits)

      • Remote CPU not interrupted during transfer

    • Cons

      • Limited platform support (as of now)


Reference

  • Ricky A. Kendall et al., GPSHMEM and other Parallel Programming Models, PowerPoint presentation

  • Hongzhang Shan and Jaswinder Pal Singh, A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000, http://citeseer.nj.nec.com/rd/48418321%2C296348%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/14068/http:zSzzSzwww.cs.princeton.eduzSz%7EshzzSzpaperszSzics99.pdf/a-comparison-of-mpi.pdf

  • Quadrics SHMEM Programming Manual, http://www.psc.edu/~oneal/compaq/ShmemMan.pdf

  • Karl Feind, Shared Memory Access (SHMEM) Routines

  • Glenn Luecke et al., The Performance and Scalability of SHMEM and MPI-2 One-Sided Routines on a SGI Origin 2000 and a Cray T3E-600, http://dsg.port.ac.uk/Journals/PEMCS/papers/paper19.pdf

  • Patrick H. Worley, CCSM Component Performance Benchmarking and Status of the CRAY X1 at ORNL, http://www.csm.ornl.gov/~worley/talks/index.html

