
SGI Altix ICE Architecture Rev. 1.2a Kevin Nolte, SGI, Professional Services


Presentation Transcript


  1. SGI Altix ICE Architecture, Rev. 1.2a, Kevin Nolte, SGI, Professional Services

  2. Altix ICE 8400
  • Altix ICE 8400 Rack:
  • 42U rack (30” W x 40” D)
  • 4 blade enclosures, each up to 16 two-socket nodes
  • Single- or dual-plane IB 4x interconnect
  • Minimal switch topology scales to 1000s of nodes
  • SGI® Altix® ICE Compute Blade: up to two 4-core sockets, 96 GB, 2-IB

  3. Overview of the IRU
  • The basic building block is an 18U-high IRU that contains the following:
  • Sixteen IP93 compute blades
  • Four network extender blades
  • One or two CMC blades
  • Six 2837-watt 12V power supplies and two 2837-watt 48V power supplies
  • IRU = Individual Rack Unit
  • CMC = Chassis Management Controller

  4. Terminology: Socket/Processor
  • socket = processor = node

  5. SGI Altix ICE Application Environment Primer, Rev. 1.2a, Ken Taylor, SGI, Professional Services

  6. Agenda
  • Application porting
  • Code optimization
  • Programming environment and libraries
  • Pinning for OpenMP and MPI
  • SGI-provided software tools

  7. Application Porting
  • Intel Xeon X5690, x86_64
  • 64-bit compiler and lib64
  • -g -traceback -fpe0 (sets -ftz)
  • Data representation: little-endian
  • -convert big_endian|ibm
  • env F_UFMTENDIAN=big
  • env FORT_CONVERTn=big_endian
  • OPEN (UNIT=n, CONVERT=…)
  • Conversion has a performance impact (see the sketch below)
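A minimal shell sketch of the endian-conversion options on this slide, assuming the Intel Fortran compiler; readbig.f90, the executable name, and the unit number 10 are placeholders. The compile-time flag converts every unformatted unit, while the environment variables select conversion at run time without recompiling:

    # Compile-time conversion of all unformatted units
    ifort -g -traceback -fpe0 -convert big_endian -o readbig readbig.f90

    # Run-time alternatives (no recompile needed)
    export F_UFMTENDIAN=big             # treat all unformatted units as big-endian
    export FORT_CONVERT10=big_endian    # unit 10 only; the "n" in FORT_CONVERTn is the unit number
    ./readbig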

  8. Application Porting
  • Basic I/O architecture considerations
  • No local disk drive (NFS and Lustre file systems)
  • /tmp is tmpfs, 150 MB
  • Torque standard out and standard error go to /var/spool, 2 GB

  9. Application Porting
  • Fortran I/O
  • Fortran record length: the default RECL unit is a 4-byte word
  • -assume byterecl makes RECL units bytes
  • Fortran-standard portable RECL specification using the INQUIRE statement: INQUIRE (IOLENGTH=iol) I, A, B, J (see the sketch below)
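A short sketch of the portable RECL idea, assuming ifort; reclen_demo.f90 is a placeholder source file whose key statements are shown as comments. Because INQUIRE(IOLENGTH=) reports the record length in whatever unit the current compilation expects for RECL, the same source stays correct with or without -assume byterecl:

    # reclen_demo.f90 (placeholder) contains, for example:
    #   INQUIRE (IOLENGTH=iol) I, A, B, J
    #   OPEN (UNIT=10, FILE='data.bin', ACCESS='DIRECT', RECL=iol, FORM='UNFORMATTED')
    #   WRITE (10, REC=1) I, A, B, J
    ifort -assume byterecl reclen_demo.f90 -o reclen_demo
    ./reclen_demo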

  10. Code Optimization
  • Compute
  • I/O
  • Communication

  11. Code Optimization
  • Key parallel programming models
  • MPI-2.2 standard
  • OpenMP 3.1 standard
  • New parallel programming models
  • SGI UPC
  • Fortran 2008 coarrays (Intel ifort 12.1; see the sketch below)
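A compile-line sketch for the coarray support noted above, assuming Intel ifort 12.1 or later and a placeholder source file hello_caf.f90:

    # Shared-memory coarray images within a single node; -coarray=distributed targets multiple nodes
    ifort -coarray=shared hello_caf.f90 -o hello_caf
    ./hello_caf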

  12. Code Optimization
  • Code vectorization
  • Intel SIMD
  • -xSSE4.2 (Westmere-EP processor)
  • -opt-report=3 (see the compile sketch below)
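A compile-line sketch for the vectorization flags above, assuming the Intel compiler; solver.f90 and -O3 are illustrative:

    # Generate SSE4.2 code for Westmere-EP and emit a detailed optimization report
    ifort -O3 -xSSE4.2 -opt-report=3 -c solver.f90
    # The report shows which loops were vectorized and why others were not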

  13. Code Optimization
  • I/O
  • Well-formed I/O
  • Lustre file system
  • Big I/O: striping with lfs setstripe (see the sketch below)
  • Lustre caching and direct I/O
  • MPI-IO Lustre accelerator (SGI, Intel, MVAPICH2)
  • NFS
  • Better for small, random I/O (e.g. code compilations)
  • Parallel I/O issues
  • Shared file
  • Read-all versus read-one-then-broadcast
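A sketch of the lfs striping step referenced in the list above; the directory, stripe count, and stripe size are illustrative, and newer Lustre releases spell the stripe-size option -S rather than -s:

    # Stripe a results directory across 8 OSTs with a 4 MB stripe size for large sequential I/O
    lfs setstripe -c 8 -s 4m /lustre/scratch/$USER/results
    lfs getstripe /lustre/scratch/$USER/results    # confirm the layout new files will inherit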

  14. Code Optimization
  • I/O
  • Intel Fortran I/O library
  • FORT_BUFFERED, FORT_BLOCKSIZE, FORT_BUFFERCOUNT (see the sketch below)
  • Disable buffering for small, random I/O
  • Fortran 2003 ASYNCHRONOUS='YES'
  • Linux
  • Linux page-cache scaling: the "cached" figure can grow too large
  • Direct I/O
  • st_blksize (from the stat command)
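An environment sketch for the Intel Fortran run-time I/O variables named above; the values and the executable name are illustrative, not tuned recommendations:

    export FORT_BUFFERED=yes          # enable library-level buffering of Fortran I/O
    export FORT_BLOCKSIZE=1048576     # 1 MB transfer block
    export FORT_BUFFERCOUNT=4         # buffers per unit
    # For small, random-access I/O leave FORT_BUFFERED unset (or set it to no)
    ./myapp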

  15. Code Optimization
  • Communication
  • SGI MPT (see the sketch below)
  • MPI_BUFS_PER_PROC
  • MPI_STATS
  • MPInside 3.5.4
  • MPI_BUFFER_MAX (single-copy)
  • MPI_IB_RAILS 2
  • MPI_COLL_
  • MPI_FASTSTART
  • IB failover
  • MPI_IB_RAILS 2|1+
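A shell sketch of the MPT variables named above; the values, rank count, and executable are illustrative, and suitable settings depend on the application and the MPT release:

    export MPI_BUFS_PER_PROC=256      # more per-process message buffers
    export MPI_BUFFER_MAX=32768       # messages above this size use single-copy transfers
    export MPI_IB_RAILS=2             # use both InfiniBand planes (dual rail)
    export MPI_STATS=1                # emit per-rank message statistics
    mpirun -np 64 ./myapp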

  16. Code Optimization
  • Communication
  • SGI MPT variables to always set (see the sketch below)
  • MPI_VERBOSE
  • MPI_DISPLAY_SETTINGS
  • MPI_DSM_VERBOSE
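The three diagnostic variables above in shell form; a sketch only, and the exact messages they produce vary by MPT version:

    export MPI_VERBOSE=1              # report MPT option processing at startup
    export MPI_DISPLAY_SETTINGS=1     # print the values of the MPI_* variables in effect
    export MPI_DSM_VERBOSE=1          # report where each rank is placed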

  17. Programming Environment and Libraries
  • Module environment
  • csh: source /usr/share/Modules/init/csh
  • bash: . /usr/share/Modules/init/bash
  • module purge
  • module load modules   # RHEL error
  • module avail (prefix)
  • module load mpt-2.06
  • module load intel-fc intel-cc intel-mkl

  18. Programming Environment and Libraries
  • SGI libraries
  • SGI MPI 1.4
  • SGI MPT 2.05
  • SGI perfboost, perfcatcher, test
  • SGI omplace
  • SGI MPInside
  • SGI PerfSuite
  • SGI FFIO
  • Upcoming MPT 2.06: IB fail-over fixes and others

  19. SGI-Provided Software Tools
  • SGI tools
  • SGI perfboost, perfcatcher, test
  • SGI omplace
  • SGI MPInside
  • SGI PerfSuite
  • SGI FFIO
  • NUMA tools
  • cpumap, dplace, dlook
  • Linux /sys/devices/system

  20. Pinning for OpenMP and MPI: SGI MPT
  • Placement control for a mix of MPI and OpenMP
  • MPI_OPENMP_INTEROP
  • Preferred SGI MPT method: mpirun -np ranks omplace [OPTIONS] program args
  • [OPTIONS]
  • -b basecpu: base CPU at which to begin allocating threads [default 0]; relative to the current cpuset
  • -c cpulist: defines the effective cpulist
  • -nt threads: number of threads per MPI process [defaults to 1 or OMP_NUM_THREADS]
  • -vv: shows the generated dplace placement file
  • Distribute ranks and threads evenly across processors and LLC
  • Check the topology

  21. Pinning for OpenMP and MPI: SGI MPT
  % mpirun -np 2 omplace -nt 4 -vv ./testmpiomp.x
  omplace information: MPI type is SGI MPI, 4 threads, thread model is intel
  placement file /tmp/omplace.file.13498:
    fork skip=0 exact cpu=0-23:4
    thread oncpu=0 cpu=1-3 noplace=1 exact
    thread oncpu=4 cpu=5-7 noplace=1 exact
    thread oncpu=8 cpu=9-11 noplace=1 exact
    thread oncpu=12 cpu=13-15 noplace=1 exact
    thread oncpu=16 cpu=17-19 noplace=1 exact
    thread oncpu=20 cpu=21-23 noplace=1 exact
  MPI: dplace use detected, MPI_DSM_... environment variables ignored
  rank 0 name cam
  rank 1 name cam
  rank 0 np 2 nt 4 thread 0 i 1 cpu 0
  rank 0 np 2 nt 4 thread 3 i 4 cpu 3
  rank 0 np 2 nt 4 thread 1 i 2 cpu 1
  rank 0 np 2 nt 4 thread 2 i 3 cpu 2
  rank 1 np 2 nt 4 thread 0 i 1 cpu 4
  rank 1 np 2 nt 4 thread 2 i 3 cpu 6
  rank 1 np 2 nt 4 thread 3 i 4 cpu 7
  rank 1 np 2 nt 4 thread 1 i 2 cpu 5

  22. Pinning for OpenMP and MPI: Intel MPI
  • Placement control for a mix of MPI and OpenMP
  • Intel MPI and Intel OpenMP use abstract placement specifications
  % mpirun -genv I_MPI_PIN_DOMAIN=cache -np 2 ./testmpiomp-impi.x
  rank 0 name cam
  rank 1 name cam
  rank 0 np 2 nt 4 thread 0 i 1 cpu 17
  rank 0 np 2 nt 4 thread 3 i 4 cpu 16
  rank 0 np 2 nt 4 thread 1 i 2 cpu 14
  rank 0 np 2 nt 4 thread 2 i 3 cpu 15
  rank 1 np 2 nt 4 thread 0 i 1 cpu 23
  rank 1 np 2 nt 4 thread 2 i 3 cpu 21
  rank 1 np 2 nt 4 thread 1 i 2 cpu 20
  rank 1 np 2 nt 4 thread 3 i 4 cpu 22
  • Add KMP_AFFINITY for Intel OpenMP thread placement and pinning (see the sketch below)
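A sketch combining the Intel MPI domain pinning shown above with Intel OpenMP thread affinity; the KMP_AFFINITY value and thread count are illustrative, and the binary name is taken from the slide:

    export OMP_NUM_THREADS=4
    export KMP_AFFINITY=verbose,granularity=fine,compact   # pin threads within each rank's domain
    mpirun -genv I_MPI_PIN_DOMAIN=cache -np 2 ./testmpiomp-impi.x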
