Parallel Programming on the SGI Origin2000. Taub Computer Center, Technion. Moshe Goldberg, mgold@tx.technion.ac.il. With thanks to Igor Zacharov / Benoit Marchand, SGI. Mar 2004 (v1.2).
Parallel Programming on the SGI Origin2000
• Parallelization Concepts
• SGI Computer Design
• Efficient Scalar Design
• Parallel Programming - OpenMP
• Parallel Programming - MPI
Origin2000/3000 architecture features
Important hardware and software components:
  * node board: processors + memory
  * node interconnect topology and configurations
  * scalability of the architecture
  * directory-based cache coherency
  * single system image components
Origin node board
HUB crossbar ASIC:
- Single chip integrates all four functions:
  * processor interface: two R1x000 processors on the same bus
  * memory interface, integrating the memory controller and (directory) cache coherency
  * interface to the CrayLink Interconnect to other nodes in the system
  * interface to I/O devices with XIO-to-PCI bridges
- Memory access characteristics:
  * read bandwidth, single processor: 460 MB/s sustained
  * average access latency: 315 ns to restart the processor pipeline
Origin router interconnect
- Router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network
  * 4-dimensional interconnect
- The interconnect topology is determined by the size of the computer (number of nodes):
  * direct (back-to-back) connection for 2 nodes (4 cpu)
  * strongly connected cube for up to 32 cpu
  * hypercube for up to 64 cpu
  * hypercube of hypercubes for up to 256 cpu
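The processor counts above are consistent with the port counts if one assumes that a d-dimensional hypercube of routers uses d of the four router-to-router links, with two nodes per router and two processors per node. The sketch below works through that arithmetic and computes the hop count between two routers as the Hamming distance of their ids. The function names and the binary router numbering are illustrative assumptions, not SGI's documented scheme, and the "hypercube of hypercubes" configuration is not modeled.

```python
NODES_PER_ROUTER = 2   # from the slide: 2 CrayLink ports go to nodes (HUBs)
CPUS_PER_NODE = 2      # from the node board slide: two processors per HUB
ROUTER_LINKS = 4       # from the slide: 4 ports go to other routers


def hypercube_cpus(dim):
    """Processors in a dim-dimensional hypercube of routers (assumed model)."""
    if not 0 <= dim <= ROUTER_LINKS:
        raise ValueError("a router has only 4 router-to-router links")
    routers = 2 ** dim
    return routers * NODES_PER_ROUTER * CPUS_PER_NODE


def neighbors(router_id, dim):
    """Router ids one hop away: flip one bit of the id per dimension."""
    return [router_id ^ (1 << bit) for bit in range(dim)]


def hops(a, b):
    """Minimum router-to-router hops = Hamming distance of the two ids."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    for dim in range(ROUTER_LINKS + 1):
        print(f"{dim}-d hypercube of routers: {hypercube_cpus(dim)} cpu")
    print("neighbors of router 0101:", neighbors(0b0101, 4))
    print("hops 0000 -> 1111:", hops(0b0000, 0b1111))
```

Under these assumptions a 3-dimensional cube of routers gives 32 cpu and a 4-dimensional hypercube gives 64 cpu, matching the limits listed on the slide.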
Origin2000 interconnect
[Figure: interconnect topology for 32 processors and for 64 processors]
Directory-based uniform cache
Cache line use is recorded in a directory, which resides in memory
Origin cache coherence
- A memory page is divided into data blocks of 32 words or 128 bytes each (L2 cache line size)
- Each data request transfers one data block (128 bytes)
- Each data block has associated presence and state information, held in directory memory:
    directory entry: presence (64 bits) + state (3 bits)
    data block (cache line): 128 bytes (32 words)
- If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
- The HUB runs the cache coherence protocol, updating the state of the data block and notifying nodes for which the presence bit is set
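The bookkeeping described above can be sketched as a presence bitmask plus a state label per 128-byte block. This is a simplified illustration only: the state names ("unowned", "shared", "exclusive") and the class and method names are hypothetical, and the real HUB protocol has more states and message types than shown here.

```python
LINE_BYTES = 128      # data block / L2 cache line size, from the slide
PRESENCE_BITS = 64    # presence field width, from the slide
STATE_BITS = 3        # state field width, from the slide


class DirectoryEntry:
    """Presence bitmask plus a state label for one 128-byte data block."""

    def __init__(self):
        self.presence = 0        # bit i set => node i holds a copy
        self.state = "unowned"   # hypothetical state name

    @staticmethod
    def _check(node):
        if not 0 <= node < PRESENCE_BITS:
            raise ValueError("node id does not fit the 64-bit presence field")

    def sharers(self):
        """Node ids whose presence bit is currently set."""
        return [n for n in range(PRESENCE_BITS) if (self.presence >> n) & 1]

    def read(self, node):
        """A node requests the block: set its presence bit."""
        self._check(node)
        self.presence |= 1 << node
        self.state = "shared"

    def write(self, node):
        """A node wants to modify the block: return the nodes to notify."""
        self._check(node)
        to_notify = [n for n in self.sharers() if n != node]
        self.presence = 1 << node
        self.state = "exclusive"
        return to_notify


def block_index(address):
    """Index of the 128-byte data block that contains a byte address."""
    return address // LINE_BYTES


def directory_overhead():
    """Directory bits per data bit: (64 + 3) / (128 * 8)."""
    return (PRESENCE_BITS + STATE_BITS) / (LINE_BYTES * 8)


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.read(3)
    entry.read(10)
    print("sharers:", entry.sharers())
    print("notify on write by node 3:", entry.write(3))
    print("directory overhead:", directory_overhead())
```

With the field widths from the slide, the directory costs 67 bits for every 1024 bits of data, or roughly 6.5% extra memory.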
Origin address space
- Physically, the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically, memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- The CPU translates from the virtual to the physical address space using the TLB (Translation Look-aside Buffer)
- Physical address layout:
    bits 39-32: node id (8 bits)
    bits 31-0: node offset (32 bits, 4 GB)
[Figure: virtual pages 0, 1, 2, ..., n mapped through the TLB to physical pages on nodes 0, 1, 2, 3, ...; each node id is either an empty slot or has memory present]
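The physical address layout above (8-bit node id in bits 39-32, 32-bit node offset in bits 31-0) can be illustrated with a few lines of bit arithmetic. This is a minimal sketch; the function names are made up for illustration, and the TLB lookup itself is not modeled because the slide does not give the page size.

```python
NODE_ID_BITS = 8     # bits 39-32 of the physical address
OFFSET_BITS = 32     # bits 31-0 of the physical address (4 GB per node)
VIRTUAL_BITS = 44    # virtual address width from the slide


def split_physical(addr):
    """Split a 40-bit physical address into (node id, node offset)."""
    if not 0 <= addr < 1 << (NODE_ID_BITS + OFFSET_BITS):
        raise ValueError("physical address does not fit in 40 bits")
    node_id = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    return node_id, offset


def join_physical(node_id, offset):
    """Build a physical address from a node id and a node offset."""
    if not 0 <= node_id < 1 << NODE_ID_BITS:
        raise ValueError("node id does not fit in 8 bits")
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset does not fit in 32 bits")
    return (node_id << OFFSET_BITS) | offset


if __name__ == "__main__":
    node, off = split_physical(0x05_0000_1000)
    print(f"node id = {node}, offset = {off:#x}")
    print("virtual space in TB:", (1 << VIRTUAL_BITS) // 2**40)
    print("per-node window in GB:", (1 << OFFSET_BITS) // 2**30)
```

The last two lines confirm the sizes quoted on the slide: 44 bits address 16 TB, and a 32-bit offset covers 4 GB per node.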
Summary: Origin2000 properties
- Single machine image
  * behaves like a large workstation
  * same compilers
  * time sharing
  * all old SGI code (binaries) will run
  * OS schedules the hardware resources on the machine
- Processor scalability: 2-1024 cpu
- I/O scalability
- All memory and I/O devices are directly addressable
  * no limitations on the size of a single program; it can use all available memory
  * no limitations on the location of the data; all disks can be used in a single file system
- 64-bit operating system and file system
  * HPC features: checkpoint/restart, queueing system
- Machine stability
Origin2000/3000 architecture goal
Hardware design: distributed memory
But to a programmer, it looks like shared memory
Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum cputime = 15 minutes
Two ways to run a batch job
(1) Parameters in command line
(2) Parameters in script file