
DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 2: ARCHITECTURE

Presentation Transcript


  1. DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 2: ARCHITECTURE Dr. Nor Asilah Wati Abdul Hamid Room 2.15 Ext : 6532 FSKTM, UPM

  2. HPC Architectures

  3. HPC Architectures • Flynn's Taxonomy provides a simple, but very broad, classification of architectures for high-performance computers: • Single Instruction, Single Data (SISD) A single processor with a single instruction stream, operating sequentially on a single data stream. • Single Instruction, Multiple Data (SIMD) A single instruction stream is broadcast to every processor; all processors execute the same instructions in lock-step on their own local data streams. • Multiple Instruction, Multiple Data (MIMD) Each processor can independently execute its own instruction stream on its own local data stream. • SISD machines are the traditional single-processor, sequential computers - also known as the Von Neumann architecture, as opposed to "non-Von" parallel computers. • SIMD machines are synchronous, with more fine-grained parallelism - they run a large number of parallel processes, one for each data element in a parallel vector or array. • MIMD machines are asynchronous, with more coarse-grained parallelism - they run a smaller number of parallel processes, one for each processor, operating on the large chunks of data local to each processor. (A short sketch contrasting the SIMD and MIMD programming styles follows below.)
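The contrast between the SIMD and MIMD styles can be illustrated with a small C sketch (not part of the original slides; it assumes a standard MPI installation, and the functions do_master_work/do_worker_work are hypothetical placeholders). The first loop applies one operation to every element of an array, the lock-step data-parallel idea behind SIMD; the MPI fragment lets each process choose its own instruction stream based on its rank, the SPMD way of expressing MIMD parallelism.

    /* simd_vs_mimd.c -- illustrative sketch only (assumes MPI is installed). */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    /* Hypothetical placeholders for rank-specific work. */
    static void do_master_work(void)      { printf("master: coordinating\n"); }
    static void do_worker_work(int rank)  { printf("worker %d: computing\n", rank); }

    int main(int argc, char **argv)
    {
        /* SIMD idea: one operation applied, conceptually in lock-step,
           to every element of a data array. */
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];          /* same instruction, many data elements */
        printf("c[N-1] = %g\n", c[N-1]);

        /* MIMD idea: each process runs its own instruction stream on its own
           local data; here the stream is selected by the process rank. */
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            do_master_work();
        else
            do_worker_work(rank);
        MPI_Finalize();
        return 0;
    }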

  4. Distributed Memory (NUMA) • The data set is distributed among processors; each processor accesses only its own data from local memory. If data from another section of memory (i.e. another processor) is required, it is obtained by passing a message containing the data between the processors (see the message-passing sketch below). • One or more host processors provide links to the outside world (disk and file access, compilers, Ethernet, Unix, etc). For SIMD machines, a host (or control) processor may broadcast the instructions to all processors. • Much larger overhead (latency) for accessing non-local data, but can scale to large numbers (thousands) of processors for many applications. [Diagram: processors P1, P2, P3, each with its own local memory M1, M2, M3, connected by a communication network, with a host processor attached to the network.]
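A minimal sketch of this message-passing style, assuming a standard MPI installation (the data values are illustrative and not from the slides): process 1 cannot read process 0's local memory directly, so the data is copied between the two local memories with an explicit send and a matching receive. Run with at least two processes, e.g. mpirun -np 2.

    /* msg_passing.c -- distributed memory: data moves only via messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local_data[4] = {1.0, 2.0, 3.0, 4.0};  /* lives in local memory only */

        if (rank == 0) {
            /* Processor 0 sends a copy of its local data to processor 1. */
            MPI_Send(local_data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double remote_copy[4];
            /* Processor 1 receives the data into its own local memory. */
            MPI_Recv(remote_copy, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %g %g %g %g\n",
                   remote_copy[0], remote_copy[1], remote_copy[2], remote_copy[3]);
        }

        MPI_Finalize();
        return 0;
    }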

  5. Shared Memory (UMA) • Each processor has access to all the memory, through a shared memory bus and/or communication network. • Requires locks and semaphores to avoid more than one processor accessing or updating the same section of memory at the same time (a small locking sketch follows below). • Lower overhead for accessing non-local data, but difficult to scale to large numbers of processors; usually used for small numbers (order 100 or less) of processors. • Hierarchical shared memory attempts to provide scalability by using an additional message-passing approach between SMP clusters that emulates true shared memory, but with non-uniform access times. [Diagram: processors P1, P2, ..., Pn connected by a communication network to a shared global memory.]
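The need for locks can be shown with a short POSIX-threads sketch (illustrative only, assuming a POSIX system; the counter and iteration count are made-up example values): both threads update the same word of shared memory, so a mutex serialises the updates and prevents a lost-update race.

    /* shared_counter.c -- shared memory: a mutex protects a shared word. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                           /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                 /* one updater at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }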

  6. Main HPC Architectures • SISD - mainframes, workstations, PCs • SIMD Shared Memory - Vector machines, Cray and imitators (NEC, Hitachi, Fujitsu, etc) • MIMD Shared Memory - Encore, Alliant, Sequent, KSR, Tera, Silicon Graphics, Sun, DEC/Compaq, HP • SIMD Distributed Memory - ICL/AMT/CPP DAP, TMC CM-2, Maspar • MIMD Distributed Memory - nCUBE, Intel, Transputers, TMC CM-5, plus more recent PC and workstation clusters (IBM SP2, DEC Alpha, Sun) connected with various networking/switching technologies. Note that modern sequential machines (workstations and PCs) are not purely SISD - modern processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc) in order to achieve one or more arithmetic operations per clock cycle. Many concepts from yesterday's supercomputers are used in today's PCs.

  7. Issues for Distributed Memory Architectures Latency and bandwidth for accessing distributed memory are the main performance issue. • Efficiency in parallel processing is usually related to the ratio of time for calculation vs time for communication - the higher the ratio, the better the performance (a toy model of this ratio follows below). • Processors are rapidly increasing in speed (Moore's Law), but it is hard to obtain similar increases in speed for accessing memory, so memory access has become a bottleneck. Caching and memory hierarchies are used, but it is difficult to develop compilers to utilize these effectively. • The problem is even more severe when access to distributed memory is needed, since there is an extra level in the memory hierarchy, with latency and bandwidth that can be orders of magnitude slower than local memory access. Scalability to more processors is a key issue. • Access times to "distant" processors should not be very much slower than access to "nearby" processors, since non-local and collective (all-to-all) communication is important for many programs. • This can be a problem for large parallel computers (hundreds or thousands of processors). Many different approaches to network topology and switching have been tried in attempting to alleviate this problem.
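The calculation-to-communication ratio can be made concrete with a toy model (a sketch with made-up example numbers, not figures from the slides): parallel efficiency is roughly the fraction of total time spent doing useful computation, so as communication time grows relative to calculation time, efficiency falls.

    /* efficiency.c -- toy model: efficiency ~ t_calc / (t_calc + t_comm). */
    #include <stdio.h>

    static double efficiency(double t_calc, double t_comm)
    {
        return t_calc / (t_calc + t_comm);
    }

    int main(void)
    {
        /* Example numbers are illustrative only. */
        double t_calc = 10.0;                 /* seconds of useful computation */
        for (double t_comm = 0.1; t_comm <= 10.0; t_comm *= 10.0)
            printf("t_comm = %5.1f s  ->  efficiency = %.2f\n",
                   t_comm, efficiency(t_calc, t_comm));
        return 0;
    }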

  8. Distributed Memory Access • Latency is the overhead in setting up a connection between processors for passing data. This is the most crucial problem for all parallel computers - obtaining good performance over a range of applications depends critically on low latency for accessing remote data. Current processors can perform hundreds of Flops per microsecond, whereas typical latencies can be 1–100 microseconds. • Bandwidth is the amount of data per unit time that can be passed between processors. This needs to be large enough to support efficient passing of large amounts of data between processors, as well as collective communications and I/O for large data sets. • Scalability is how well latency and bandwidth scale with the addition of more processors. This is usually only a problem for supercomputers with hundreds or thousands of processors. Smaller configurations of tens of processors can usually be efficiently connected. Many different kinds of network topologies have been used. (A sketch of the standard latency-plus-bandwidth cost model follows below.)
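The interplay of latency and bandwidth is usually captured by the linear cost model t(n) = latency + n / bandwidth. The sketch below uses assumed, illustrative parameter values (10 microseconds latency, 100 bytes per microsecond): small messages are dominated by latency, large messages by bandwidth.

    /* msg_cost.c -- linear message-cost model: t(n) = latency + n / bandwidth. */
    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative network parameters. */
        double latency_us   = 10.0;            /* microseconds per message     */
        double bytes_per_us = 100.0;           /* i.e. about 100 MB/s          */

        for (long n = 8; n <= 8L * 1024 * 1024; n *= 64) {
            double t_us = latency_us + (double)n / bytes_per_us;
            printf("%9ld bytes : %12.1f us (latency share %.0f%%)\n",
                   n, t_us, 100.0 * latency_us / t_us);
        }
        return 0;
    }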

  9. Network Architectures Network topology is how the processors are connected. • 2D or 3D mesh is simple and ideal for programs with mostly nearest-neighbour communication. General communications can be good if fast routers are used. • Hypercube is an attempt to minimize the number of "hops" between any two processors. Doesn't scale well - too many wires for the large dimensions required for large numbers of processors. • Switched connections have all processors directly connected to one or more high-speed switches, which introduce an overhead but can be quite fast. • Many other more complex or hierarchical topologies are also used. Network topology should be transparent to the user. Portable interfaces such as MPI and HPF provide point-to-point and collective communications at a high level; details of the implementation for a specific network topology are left to the compiler. (A small hop-counting sketch for the mesh and hypercube follows below.)
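As an illustration of how topology determines distance (a sketch, not from the slides; the node labels and coordinates are arbitrary examples): in a hypercube the number of hops between two nodes is the number of bit positions in which their binary labels differ, while in a 2D mesh without wrap-around it is the Manhattan distance between their grid coordinates.

    /* hops.c -- hop counts in a hypercube and a 2D mesh (illustrative). */
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypercube: hops = number of differing bits between the node labels. */
    static int hypercube_hops(unsigned a, unsigned b)
    {
        unsigned diff = a ^ b;
        int hops = 0;
        while (diff) { hops += diff & 1u; diff >>= 1; }
        return hops;
    }

    /* 2D mesh (no wrap-around): hops = Manhattan distance between nodes. */
    static int mesh_hops(int x1, int y1, int x2, int y2)
    {
        return abs(x1 - x2) + abs(y1 - y2);
    }

    int main(void)
    {
        printf("hypercube: node 0 -> node 63 : %d hops\n", hypercube_hops(0, 63));
        printf("2D mesh:   (0,0) -> (7,7)    : %d hops\n", mesh_hops(0, 0, 7, 7));
        return 0;
    }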

  10. Vector Architectures Key ideas: • vector registers; • pipelines; • fast I/O channels; • fast physical memory - no cache; • banks of memory and disk systems; • multiple instructions possible at once; • multi-head (multi-processor) systems use shared memory to communicate; • can fake other communication paradigms using SHMEM Put and Get; • lots of custom compiler technology; • lots of automatic and user optimisations (compiler directives); • physical engineering - liquid cooling, minimal wire lengths; • custom silicon or GaAs... The vector ideas have been incorporated into many other machines and processors. (A vectorisable loop of the kind these machines target is sketched below.)
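A classic example of the kind of loop these machines and their compilers were built for is the DAXPY operation y = a*x + y. The sketch below is illustrative only: the independent iterations can be streamed through vector registers and arithmetic pipelines, and the OpenMP simd pragma shown here stands in for the vendor-specific vectorisation directives mentioned above (the function name daxpy and the array sizes are my own choices).

    /* daxpy.c -- a vectorisable loop: y = a*x + y (illustrative sketch). */
    #include <stdio.h>

    #define N 1000000

    /* 'restrict' tells the compiler x and y do not overlap, and the pragma
       (OpenMP 4.0 'simd') asks it to vectorise -- roughly the role played by
       Cray-style compiler directives on the classic vector machines. */
    static void daxpy(int n, double a, const double *restrict x, double *restrict y)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        static double x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
        daxpy(N, 3.0, x, y);
        printf("y[0] = %g\n", y[0]);   /* 5.0 */
        return 0;
    }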

  11. Vector Machines Historically some of the major vector supercomputers include: • Cray 1, Cray XMP, Cray YMP, Cray 2, C90, J90, T90 (Cray Research Inc.) • Cray 3, Cray 4 (Cray Computer Corp.) • CDC Cyber 205, ETA • Fujitsu VP • Hitachi S810 • NEC SX2, SX3, SX5 Early vector machines were very successful due to: • large amounts of fast memory to allow large problem sizes • very fast processors (relative to scalar processors of the time) • straightforward porting of code using compiler directives • good compilers Commodity processors are now providing better price/performance, so the future of vector systems is unclear. We may see vector processors on commodity chips for image processing etc.

  12. Processors for SIMD Architectures Thousands of cheap, simple processors, or hundreds of more expensive, high-end processors? • Lots of puny processors • Early SIMD machines (DAP, CM-1) used thousands of feeble, single-bit processors. • This approach is good for applications with mostly logic operations, bit manipulation, integer arithmetic, e.g. image processing, neural networks, data processing. Not good for floating point arithmetic. • With this model, can put many processors on a single chip, then connect many such chips to produce high-speed but very compact machines (e.g. image processing boards in smart missiles). • Not so many heavyweight processors • Later SIMD machines (CM-5, MP-2) used more powerful 32-bit processors, often with vector floating point units. • This approach is better for applications that are floating point intensive. • With this model, a SIMD machine can be viewed as a large, distributed memory version of a vector machine (Cray), but more like a parallel array (2D grid) model than a vector (1D pipeline) model.

  13. Programming for SIMD Architectures • The SIMD model can only efficiently support regular, synchronous applications, which are perhaps only half of all HPC applications. Thus SIMD computers are usually used as special-purpose machines, e.g. for image processing, regular grid problems, neural networks, data mining, data processing. • The key issue is whether SIMD machines are more cost-effective than MIMD machines for the applications they can handle. Currently this is true only for a small number of image processing and data processing applications, since commodity processors have greatly increased in power. Is this limited segment of a limited HPC market enough to keep the remaining companies (Maspar and CPP) commercially viable, when under increasing pressure from e.g. SGI, IBM, Sun, Compaq? • SIMD machines are easier to program (using high-level data parallel languages) than MIMD machines (using lower-level message-passing languages), but the possible applications are restricted. Most modern general-purpose parallel computers are MIMD, but provide compilers for data parallel languages that emulate a SIMD model on a MIMD message-passing machine.

  14. SIMD Distributed Memory Machines • ICL/AMT/CPP DAP 1K to 4K proprietary bit-serial processing elements (PEs). 64 PEs per chip. Later models have optional floating point unit (FPU). Connected as 2D grid, with fast row/column data highways. • TMC CM-1 and CM-2 16K to 64K proprietary bit-serial PEs, multiple processors per chip. Connected as hypercube. 32 PEs can share a vector FPU in the CM-2. • Maspar MP-1 and MP-2 1K to 16K proprietary PEs, 4-bit for MP-1 (16 to a chip), 32-bit for MP-2. 2D grid, with connections to 8 nearest neighbors. SIMD machines were originally targeted at specific problems like image processing and AI (CM-1 could initially only be programmed in parallel Lisp!). Attempted to become more general purpose by adding FPUs, but could not compete with MIMD machines. Now reverting to special-purpose architecture. Some interest in building SIMD-like supercomputers using programmable logic chips (FPGAs). Many processors now have SIMD extensions (G4 Velocity Engine, Pentium SSE, Athlon 3D Now!, graphics chips).

  15. Processors for MIMD Architectures • Early MIMD machines (e.g. Caltech's Cosmic Cube) used cheap off-the-shelf processors, with purpose-built inter-processor communications hardware. The motivation for building early parallel computers was that many cheap microprocessors could give similar performance to an expensive Cray vector supercomputer. • Later machines (e.g. nCUBE, transputers) used proprietary processors, with on-chip communications hardware. These could not compete with the rapid increase in performance of mass-produced processors for workstations and PCs; e.g. a year-old 16-processor nCUBE or transputer machine typically had the same performance as a new single-processor workstation. Moreover, we now have multi-core processors. • Current MIMD machines use off-the-shelf processors (as before), usually the RISC processors used in state-of-the-art high-performance workstations (IBM RS-6000, SGI MIPS, DEC Alpha, Sun UltraSPARC, HP PA-RISC). Processors for PCs are now of comparable performance, and the first machine to reach 1 Teraflop (1 trillion floating point operations per second), the 9200-processor ASCI Red from Intel, uses Pentium Pro processors. These machines still need special communications hardware, and expensive high-speed networks and switches.

  16. Massively Parallel Processors • The first MIMD parallel computers were tightly coupled machines, with racks of motherboards containing processors plus memory plus communications hardware. Each board might contain multiple nodes (processors plus memory). • Local memory is usually quite small (256 Kbytes for early machines) and has to hold a copy of the operating system and the user program, as well as local data. Could not use a full-featured OS like Unix, so used a small proprietary OS (often with undesirable "features"). • Often called massively parallel processors (MPPs) since their components are cheap and tightly coupled, enabling them to scale to large numbers of processors. • Examples are nCUBE, Intel Paragon, TMC CM-5, Meiko, Cray T3E. • Dominated the HPC parallel processing market during the 80s and early 90s.

  17. MIMD Distributed Memory Machines • Hypercubes (Cosmic Cube, nCUBE, Intel) The Caltech Cosmic Cube used a hypercube of Intel 8086/8087 processors. Commercial versions came from nCUBE and Intel. nCUBE used a hypercube of 16 to 1K proprietary processors with on-chip communications logic. Intel developed the iPSC in various versions, 16 to 128 processors (286, 386, vector units). The Intel Paragon has up to order 1000 processors; ASCI Red has almost 10,000. • Transputers Designed for parallel processing in embedded systems. Memory and 4 comms links built in. Machines of 16 to a few hundred processors were built by Meiko etc. using a 2D mesh; some were reconfigurable. • TMC CM-5 32 to 16K SPARC processors with optional vector units. Runs SIMD (data parallel) or MIMD (message passing) programs. Connected in a "fat tree" network, with higher bandwidth higher up the tree. Bandwidth to the farthest processors is only 4 times slower than to neighbours. • Workstation clusters IBM SP2, Compaq/DEC, Sun. These are racks of workstations connected by (usually proprietary) high-speed networks and switches.

  18. Workstation or PC Clusters • In the last few years, improved networking and switching technology has made it possible to connect clusters of workstations with latencies and bandwidths comparable to tightly-coupled machines. • Each node may be more expensive, but more powerful (latest processors, lots of memory, local disk). • Most major Unix workstation manufacturers (IBM, Compaq/DEC, Sun) offer parallel computers of this kind, as compute servers or for online transaction processing. • With improved performance (and price/performance) of PCs, and the availability of Linux and Solaris (and NT) for PCs, many groups are building clusters of PCs connected by 100Mb Ethernet or high-speed Myrinet/ServerNet/GigaNet networks. • Examples: a 116-node dual-Pentium cluster at Adelaide University, a 1000-PC cluster built by a US company. • Sometimes called Beowulf clusters, after initial work by the Beowulf project at NASA. • Beowulf clusters using Alpha processors are also very popular for floating-point intensive scientific computing, e.g. CPlant I and II at Sandia National Lab in the US have 400 and 592 Alpha-based workstations.

  19. Advantages of Workstation Clusters • Cost benefits of using mass-produced workstation hardware. • Can provide more memory and disk per processor. • Each processor can run the familiar Unix (or NT) OS. • The machine can either be used to run multi-processor parallel programs for a small number of users, or as a pool of independent processors to run multiple sequential jobs from many users (job-level parallelism). • Users do not need to learn a new OS, or know anything about parallel computing, in order to use the machine. • Companies are much more willing to purchase clusters of workstations with familiar hardware and OS from mainstream computer companies (IBM, DEC, Sun, HP), rather than massively parallel supercomputers with unfamiliar hardware and OS from "flaky" start-ups (nCUBE, Thinking Machines, Meiko). Workstation clusters (mostly IBM SP2 and Beowulf clusters) are now the main distributed memory machines. Tightly-coupled custom MPPs (e.g. Cray T3E) are now serving only a niche market for very high-end science and engineering applications and will probably soon disappear.

  20. Enterprise Shared Memory Systems • In the last few years, parallel computing has entered the mainstream with wide adoption of symmetric multi-processor (SMP) shared memory machines. • Dual and quad processor PCs, Macs and Unix workstations are now available on the desktop. • Enterprise (large business-class) SMP machines have taken over from mainframes as servers: - Sun HPC Enterprise Systems (2-proc E250 up to 64-proc E10000) - SGI Origin (1-4 proc Origin 200 up to 64-proc Origin 2000) - Compaq AlphaServer (2-proc DS20 up to 64-proc SC) - HP (1-8 proc N4000 up to 32-proc V2600) • High-end SMP machines are very expensive (a few million dollars). • Shared memory does not scale to more than about 64 processors (usually all that customers want or can afford!), but SMP machines can be clustered together with high-speed networks. • Mainly targeted at business applications (particularly large on-line services, databases, etc), since they offer reliability, lots of memory, and ease of programming. • Sun and HP are most popular for business apps, SGI and Compaq most popular for scientific apps.

  21. Cache Coherent Non-Uniform Memory Access • Cache coherence (also cache coherency) refers to the consistency of data stored in the local caches of a shared resource: when one processor updates a memory location, any copies held in other processors' caches must be updated or invalidated. Cache coherence is a special case of memory coherence. In a ccNUMA system the hardware maintains this coherence across memories whose access times are non-uniform. (A small sketch of a coherent flag/payload handoff follows below.)
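A minimal sketch of where coherence matters (illustrative only, using POSIX threads and C11 atomics; the producer/consumer structure and the value 42 are my own choices, not from the slides): one thread writes a flag that another thread keeps reading, possibly from its own cache, and it is the hardware coherence protocol that invalidates or updates the stale cached copy so the store becomes visible.

    /* coherence.c -- sketch: a flag written by one core, polled by another. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int ready;         /* may be cached close to both cores       */
    static int payload;              /* data published once 'ready' is set      */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;
        /* The store must propagate to the consumer's cache; the coherence
           protocol (e.g. by invalidating the stale copy) makes that happen. */
        atomic_store(&ready, 1);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (atomic_load(&ready) == 0)
            ;                        /* spin until the coherent value arrives   */
        printf("payload = %d\n", payload);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }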

  22. HPC Hardware • Processor • Networks

  23. Processor • Intel – Xeon, Itanium • AMD Phenom • IBM Blue Gene • IBM PowerPC 970 • Sun SPARC

  24. Networks • Fast Ethernet/Gigabit Ethernet • Myrinet • QsNet • InfiniBand • SCI
