Parallel Programming on the SGI Origin2000

Taub Computer Center, Technion
Moshe Goldberg, mgold@tx.technion.ac.il
With thanks to Igor Zacharov / Benoit Marchand, SGI
Mar 2004 (v1.2)

• Parallelization Concepts
• SGI Computer Design
• Efficient Scalar Design
• Parallel Programming - OpenMP
• Parallel Programming - MPI

2) SGI Computer Design

Origin2000/3000 architecture features

Important hardware and software components:
* node board: processors + memory
* node interconnect topology and configurations
* scalability of the architecture
* directory-based cache coherency
* single system image components

Origin2000 node board

Origin node board

HUB crossbar ASIC:
- Single chip integrates all four functions:
* processor interface: two rxK processors on the same bus
* memory interface, integrating the memory controller and (directory-based) cache coherency
* interface to the CrayLink Interconnect to other nodes in the system
* interface to I/O devices with XIO-to-PCI bridges
- Memory access characteristics:
* read bandwidth, single processor: 460 MB/s sustained
* average access latency: 315 ns to restart the processor pipeline

Origin2000 node components

Origin router interconnect

- Router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network
* 4-dimensional interconnect
- The interconnect topology is determined by the size of the computer (number of nodes):
* direct (back-to-back) connection for 2 nodes (4 cpu)
* strongly connected cube for up to 32 cpu
* hypercube for up to 64 cpu
* hypercube of hypercubes for up to 256 cpu

Origin2000 – two nodes

Origin2000 module connections

Origin2000 interconnect

Origin2000 interconnect: 32 processors, 64 processors

Origin2000 interconnect

Directory-based uniform cache

Cache line use is recorded in a directory, which resides in memory.

Origin cache coherence

- A memory page is divided into data blocks of 32 words or 128 bytes each (the L2 cache line size)
- Each data request transfers one data block (128 bytes)
- Each data block has associated presence and state information in directory memory:
* presence: 64 bits
* state: 3 bits
* data block (cache line): 128 bytes (32 words)
- If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
- The HUB runs the cache coherence protocol, updating the state of the data block and notifying the nodes for which the presence bit is set

Origin address space

- Physically the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- The CPU translates from virtual to physical address space

Physical address layout:
* bits 39-32: node id (8 bits)
* bits 31-0: node offset (32 bits, 4 GB)

[Figure: virtual pages 0, 1, 2, ... n are mapped through the TLB to physical pages; node ids 0, 1, 2, 3, ... either have memory present or are an empty slot]

TLB = Translation Look-aside Buffer

Summary: Origin2000 properties

- Single machine image
* behaves like a large workstation
* same compilers
* time sharing
* all old SGI code (binaries) will run
* OS schedules the hardware resources on the machine
- processor scalability: 2-1024 cpu
- I/O scalability
- all memory and I/O devices are directly addressable
* no limitations on the size of a single program; it can use all available memory
* no limitations on the location of the data; all disks can be used in a single file system
- 64-bit operating system and file system
* HPC features: checkpoint/restart, queueing system
- machine stability

Origin2000/3000 architecture goal

Hardware design: distributed memory
But to a programmer, it looks like shared memory

Example: Simple Memory Access

Parix run limits

(1) NQS queues on parix
(2) Interactive: maximum cputime = 15 minutes

Two ways to run a batch job

(1) Parameters in command line
(2) Parameters in script file

QSUB options

Output of command: "qstat -a"

Exercise 1 – login and submit a job
