
Altix 4700


Presentation Transcript


  1. Altix 4700

  2. ccNUMA Architecture • Distributed Memory - Shared address space

  3. Altix HLRB II – Phase 2
  • 19 partitions with 9728 cores in total
  • Each partition has 256 Itanium dual-core processors, i.e., 512 cores
  • Clock rate 1.6 GHz, 4 Flops per cycle per core: 12.8 GFlop/s per processor (6.4 GFlop/s per core)
  • 13 high-bandwidth partitions
    • Blades with 1 processor (2 cores) and 4 GB memory
    • Front-side bus 533 MHz (8.5 GB/s)
  • 6 high-density partitions
    • Blades with 2 processors (4 cores) and 4 GB memory
    • Same memory bandwidth
  • Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core); see the check below
  • Memory: 39 TB
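  A quick check of the peak figure from the numbers above: 1.6 GHz × 4 Flops/cycle = 6.4 GFlop/s per core, 2 cores × 6.4 GFlop/s = 12.8 GFlop/s per processor, and 9728 cores × 6.4 GFlop/s = 62.26 TFlop/s ≈ 62.3 TFlop/s for the whole machine.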

  4. Memory Hierarchy
  • L1D: 16 KB, 1 cycle latency, 25.6 GB/s bandwidth, 64-byte cache lines
  • L2D: 256 KB, 6 cycles, 51 GB/s, 128-byte cache lines
  • L3: 9 MB, 14 cycles, 51 GB/s, 128-byte cache lines

  5. Interconnect
  • NUMAlink 4
  • 2 links per blade
  • Each link: 2 × 3.2 GB/s bandwidth
  • MPI latency 1–5 µs

  6. Disks
  • Direct-attached disks (temporary large files): 600 TB, 40 GB/s bandwidth
  • Network-attached disks (home directories): 60 TB, 800 MB/s bandwidth

  7. Environment
  • Footprint: 24 m × 12 m
  • Weight: 103 metric tons
  • Electrical power: ~1 MW

  8. NUMAlink Building Block
  [Diagram: a NUMAlink building block – groups of four compute blades (8 cores per group with high-bandwidth blades, 16 cores with high-density blades) plus an I/O blade with PCI/FC, connected through pairs of NUMAlink 4 level-1 routers; the I/O blades also attach to a SAN switch and 10 GE.]

  9. Blades and Rack

  10. Interconnection in a Partition

  11. Interconnection of Partitions
  • Gray squares: 1 partition with 512 cores each (L: login, B: batch)
  • Lines: 2 NUMAlink 4 planes with 16 cables; each cable 2 × 3.2 GB/s

  12. Interactive Partition
  • Login cores: 32 for compile & test
  • Interactive batch jobs
    • 476 cores, managed by PBS
    • daytime interactive usage
    • small-scale and nighttime batch processing
    • single partition only
  • High-density blades: 4 cores share each blade's memory
  [Diagram: core layout of the interactive partition – 4 cores reserved for the OS, 32 login cores, 476 cores for interactive batch.]

  13. 18 Batch Partitions
  • Batch jobs
    • 510 (508) cores, managed by PBS (see the sketch after this slide)
    • large-scale parallel jobs
    • single- or multi-partition jobs
  • 5 partitions with high-density blades
  • 13 partitions with high-bandwidth blades
  [Diagram: core layout of a batch partition – 4 cores reserved for the OS; the remaining cores, in building blocks of 8 (16 with high-density blades), run batch jobs.]
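  The slides only state that jobs are managed by PBS. As a rough illustration of what a submission could look like (directive values, the resource syntax, and the MPT-style mpirun launch line are assumptions that depend on the PBS version and site configuration, not something given in the slides):

  #!/bin/bash
  #PBS -N altix_example        # job name (illustrative)
  #PBS -l ncpus=512            # request one partition's worth of cores (assumed syntax)
  #PBS -l walltime=01:00:00    # one hour wall-clock limit
  cd $PBS_O_WORKDIR            # run from the submission directory
  mpirun -np 512 ./a.out       # launch the MPI job (SGI MPT-style launcher assumed)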

  14. Bandwidth

  15. Coherence Implementation
  • SHUB2 supports up to 8192 SHUBs (32768 cores)
  • Coherence domain of up to 1024 SHUBs (4096 cores)
    • SGI term: "sharing mode"
    • Directory with one bit per SHUB (see the sketch below)
    • Multiple shared copies are supported
  • Accesses from other coherence domains
    • SGI term: "exclusive sharing mode"
    • Always translated into exclusive accesses
    • Only a single copy is supported
    • Directory stores the address of the SHUB (13 bits)
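  A minimal sketch of the two directory-entry formats this implies; everything beyond the bit counts stated on the slide (type names, the grouping into structs) is an assumption made only for illustration:

  /* Illustrative only: directory-entry formats implied by the slide.        */
  #include <stdint.h>

  #define SHUBS_PER_DOMAIN 1024   /* a coherence domain spans up to 1024 SHUBs */

  /* "Sharing mode": within a coherence domain the directory keeps one
   * presence bit per SHUB, so multiple SHUBs may hold a shared copy.         */
  typedef struct {
      uint64_t presence[SHUBS_PER_DOMAIN / 64];   /* 1024-bit presence vector */
  } dir_entry_sharing_t;

  /* "Exclusive sharing mode": accesses from other coherence domains are
   * always made exclusive, so the directory only records the 13-bit ID of
   * the single owning SHUB (13 bits address up to 8192 SHUBs).               */
  typedef struct {
      uint16_t owner_shub;                        /* 13-bit SHUB ID           */
  } dir_entry_exclusive_t;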

  16. SHMEM Latency Model for Altix
  • SHMEM get latency is the sum of:
    • 80 nsec for the function call
    • 260 nsec for memory latency
    • 340 nsec for the first hop
    • 60 nsec per hop
    • 20 nsec per meter of NUMAlink cable
  • Example
    • 64-processor system: max 4 hops, max total cable length 4 meters
    • Total SHMEM get latency: 80 + 260 + 340 + 60×4 + 20×4 = 1000 nsec
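  The same model written out as a tiny helper function (the function name and the use of double are my own; the constants are exactly those listed above):

  /* SHMEM get latency model from the slide; all constants in nanoseconds.  */
  static double shmem_get_latency_ns(int hops, double cable_meters)
  {
      return 80.0                  /* function call          */
           + 260.0                 /* memory latency         */
           + 340.0                 /* first hop              */
           + 60.0 * hops           /* per-hop router cost    */
           + 20.0 * cable_meters;  /* NUMAlink cable length  */
  }

  /* 64-processor example: shmem_get_latency_ns(4, 4.0) == 1000.0 nsec.     */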

  17. Parallel Programming Models
  • Intra-host, within one Linux image (512 cores): OpenMP, Pthreads, MPI, SHMEM, global segments
  • Intra-coherency-domain (4096 cores) and across the entire machine: MPI, SHMEM, global segments

  18. Barrier Synchronization
  • Frequent in OpenMP, SHMEM, and MPI single-sided operations (MPI_Win_fence)
  • Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB (see the sketch below)
  • Uses uncached loads to reduce NUMAlink traffic
  [Diagram: two CPUs, a hub, and a router, with the fetch-op variable located at the hub.]
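  A generic sketch of the idea (this is not SGI's code; the sense-reversal scheme and the use of C11 atomics are assumptions used only to illustrate how per-node fetch-op counters arranged as a tree keep contention off any single variable):

  /* Generic combining-tree barrier sketch using C11 atomics (illustration
   * only).  Arrivals are counted in per-node fetch-op counters arranged as
   * a tree, so no single counter is hit by every CPU; waiters then spin on
   * a sense flag (on the Altix this spin would use uncached loads to keep
   * the polling off NUMAlink).                                             */
  #include <stdatomic.h>
  #include <stddef.h>

  typedef struct barrier_node {
      atomic_int           count;     /* fetch-op arrival counter           */
      int                  expected;  /* arrivals needed at this node       */
      struct barrier_node *parent;    /* NULL at the root                   */
  } barrier_node_t;

  static atomic_int release_sense;    /* flipped by the last arriver        */

  /* Called by every thread; 'leaf' is the tree node assigned to its group
   * and 'my_sense' is a thread-local value toggled before each barrier.    */
  void tree_barrier(barrier_node_t *leaf, int my_sense)
  {
      barrier_node_t *node = leaf;

      /* Walk up the tree; only the last arriver at each node continues.    */
      while (node != NULL &&
             atomic_fetch_add(&node->count, 1) + 1 == node->expected) {
          atomic_store(&node->count, 0);        /* reset for the next round */
          node = node->parent;
      }

      if (node == NULL) {
          /* Last arriver at the root releases everyone.                    */
          atomic_store(&release_sense, my_sense);
      } else {
          /* All other threads spin until the sense flag flips.             */
          while (atomic_load(&release_sense) != my_sense)
              ;                                  /* busy-wait               */
      }
  }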

  19. Programming Models
  • OpenMP within a Linux image (see the sketch below)
  • MPI
  • SHMEM
  • Shared segments (System V and Global Shared Memory)
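  OpenMP is the one model on this list without an example later in the deck; as a minimal, generic sketch (not taken from the slides) of the shared-memory style available within a single Linux image:

  /* Minimal generic OpenMP example: a parallel reduction within one
   * Linux image (i.e., within a single 512-core partition).            */
  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      double sum = 0.0;

      #pragma omp parallel for reduction(+:sum)
      for (int i = 1; i <= 1000000; i++)
          sum += 1.0 / i;

      printf("threads=%d  partial harmonic sum=%f\n",
             omp_get_max_threads(), sum);
      return 0;
  }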

  20. SHMEM
  • Can be used in MPI programs where all processes execute the same code
  • Enables access within and across partitions
  • Works on static data and on symmetric heap data (shmalloc or shpalloc)
  • Info: man intro_shmem

  21. Example
  #include <stdio.h>
  #include <mpi.h>
  #include <mpp/shmem.h>

  int main(int argc, char **argv)
  {
      long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
      static long target[10];     /* static data is symmetric (remotely accessible) */
      int myrank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

      if (myrank == 0) {
          /* put 10 elements into target on PE 1 */
          shmem_long_put(target, source, 10, 1);
      }
      shmem_barrier_all();        /* sync sender and receiver */

      if (myrank == 1)
          printf("target[0] on PE %d is %ld\n", myrank, target[0]);

      MPI_Finalize();
      return 0;
  }
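  Slide 20 also mentions symmetric heap data; the following is a variant of the same example, assuming SGI SHMEM's shmalloc/shfree from mpp/shmem.h, that puts into a symmetric-heap buffer instead of a static array:

  #include <stdio.h>
  #include <mpi.h>
  #include <mpp/shmem.h>

  int main(int argc, char **argv)
  {
      long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
      int  myrank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

      /* Symmetric heap allocation: every PE allocates the same-size buffer. */
      long *target = (long *) shmalloc(10 * sizeof(long));

      if (myrank == 0)
          shmem_long_put(target, source, 10, 1);   /* put into PE 1's buffer */

      shmem_barrier_all();                         /* sync sender and receiver */

      if (myrank == 1)
          printf("target[0] on PE %d is %ld\n", myrank, target[0]);

      shfree(target);
      MPI_Finalize();
      return 0;
  }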

  22. Global Shared Memory Programming
  • Allocation of a shared memory segment via the collective GSM_alloc
  • Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance
  • A GSM segment can be distributed across partitions
    • GSM_ROUNDROBIN: pages are distributed round-robin across processes
    • GSM_SINGLERANK: places all pages near a single process
    • GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory
  • Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions

  23. Example
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>
  #include <mpi_gsm.h>

  #define ARRAY_LEN 1024            /* any length; the value is not given on the slide */

  int main(int argc, char **argv)
  {
      int rank, rc, i, placement, flags;
      size_t size;
      int *shared_buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      placement = GSM_ROUNDROBIN;
      flags = 0;
      size = ARRAY_LEN * sizeof(int);
      rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

      /* Have one rank initialize the shared memory region */
      if (rank == 0) {
          for (i = 0; i < ARRAY_LEN; i++)
              shared_buf[i] = i;
      }
      MPI_Barrier(MPI_COMM_WORLD);

      /* Have every rank verify it can read from the shared memory */
      for (i = 0; i < ARRAY_LEN; i++) {
          if (shared_buf[i] != i) {
              printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
              printf("Rank %d - FAILED shared memory test.\n", rank);
              exit(1);
          }
      }

      MPI_Finalize();
      return 0;
  }

  24. Summary
  • Altix 4700 is a ccNUMA system with >60 TFlop/s peak performance
  • MPI messages are sent with a two-copy or a single-copy protocol
  • Hierarchical coherence implementation: intra-node, within a coherence domain, and across coherence domains
  • Programming models: OpenMP, MPI, SHMEM, GSM

  25. The Compute Cube of LRZ
