
High Performance Computing – CISC 811


Presentation Transcript


  1. High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics

  2. Today’s Lecture Distributed Memory Computing I • Part 1: Networking issues and distributed memory architectures (hardware) • Part 2: Brief overview of PVM • Part 3: Introduction to MPI

  3. Part 1: Concepts & Distributed Memory Architectures • Overview: Message passing APIs, RDMA • Networking, TCP/IP • Specialist interconnects • Distributed memory machine design and balance

  4. Parallel APIs from the decomposition–communication perspective [Figure: MPI, PVM, SHMEM, CAF, UPC, HPF and OpenMP placed on two axes, Decomposition (implicit → explicit) and Communication (implicit → explicit); MPI, PVM and SHMEM sit at the explicit-communication end. The figure also distinguishes APIs that operate effectively on distributed memory architectures from those that are shared memory only (OpenMP)]

  5. Message Passing • The concept of sequential processes communicating via messages was developed by Hoare in the 1970s • Hoare, CAR, Comm ACM, 21, 666 (1978) • Each process has its own local memory store • Remote data needs are served by passing messages containing the desired data • Naturally carries over to distributed memory architectures • Two ways of expressing message passing: • Coordination of message passing at the language level (e.g. Occam) • Calls to a message passing library • Two types of message passing • Point-to-point (one-to-one) • Broadcast (one-to-all, all-to-all)

  6. Broadcast versus point-to-point [Figure: one process broadcasting to all other processes versus a single pair of processes exchanging a message] • Broadcast (one-to-all): collective operation, involves a group of processes • Point-to-point (one-to-one): non-collective operation, involves a pair of processes
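
To make the distinction concrete, here is a minimal hedged sketch in C using MPI (the library introduced on slides 11-12 and in Part 3): rank 0 first sends a value to rank 1 point-to-point, then broadcasts it to every process. The value 42 and tag 0 are illustrative choices, not from the slides.

    /* Hedged sketch: point-to-point versus broadcast communication in MPI.
       Compile with an MPI C compiler (e.g. mpicc) and run with mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Point-to-point (one-to-one): rank 0 sends a single int to rank 1 */
        if (rank == 0) {
            value = 42;                          /* illustrative value */
            if (nprocs > 1)
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Broadcast (one-to-all): rank 0's value is delivered to every rank */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d has value %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }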

  7. Message passing APIs • Message passing APIs dominate • They often reflect the underlying hardware design • Legacy codes can frequently be converted more easily • Allows explicit management of the memory hierarchy • Message Passing Interface (MPI) is the predominant API • Parallel Virtual Machine (PVM) is an earlier API that possesses some useful features over MPI • It remains the best paradigm for heterogeneous systems

  8. PVM – An overview (http://www.csm.ornl.gov/pvm/pvm_home.html) • API can be traced back to 1989 • Geist & Sunderam developed an experimental version • Daemon based • Each host runs a daemon that controls resources • Processes can be created and destroyed dynamically • PVM console • Each user may actively configure their host environment • Process groups for domain decomposition • The PVM group server controls this aspect • Limited number of collective operations • Barriers, broadcast, reduction • Roughly 40 functions in the API

  9. PVM API and programming model • PVM most naturally fits a master-worker model • The master process is responsible for I/O • Workers are spawned by the master • Each process has a unique identifier • Messages are typed and tagged • The system is aware of the data-type, allowing easy portability across a heterogeneous network • Messages are passed via a three-phase process • Clear (initialize) buffer • Pack buffer • Send buffer

  10. Example code
      tid = pvm_mytid();                        /* get my task id              */
      if (tid == source) {                      /* sender                      */
          bufid = pvm_initsend(PvmDataDefault); /* 1. clear/initialize buffer  */
          info  = pvm_pkint(&i1, 1, 1);         /* 2. pack one int ...         */
          info  = pvm_pkfloat(vec1, 2, 1);      /*    ... and two floats       */
          info  = pvm_send(dest, tag);          /* 3. send buffer to dest      */
      } else if (tid == dest) {                 /* receiver                    */
          bufid = pvm_recv(source, tag);        /* blocking receive            */
          info  = pvm_upkint(&i2, 1, 1);        /* unpack in the same order    */
          info  = pvm_upkfloat(vec2, 2, 1);
      }
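
Slide 9 notes that workers are spawned by the master, but the slides do not show that step. The following is a hedged sketch of a master that spawns workers with pvm_spawn; the worker executable name "worker" and the task count of 4 are assumptions made for illustration, not part of the lecture.

    /* Hedged sketch of the master side of a PVM master-worker program. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NTASKS 4                       /* illustrative worker count */

    int main(void)
    {
        int tids[NTASKS];
        int mytid = pvm_mytid();           /* enrol in PVM and get my task id */

        printf("master task id = t%x\n", mytid);

        /* spawn NTASKS copies of the (hypothetical) "worker" executable
           anywhere in the configured host pool */
        int numt = pvm_spawn("worker", NULL, PvmTaskDefault, "", NTASKS, tids);
        if (numt < NTASKS)
            fprintf(stderr, "only spawned %d of %d workers\n", numt, NTASKS);

        /* ... pack and send work to tids[i], receive results (see slide 10) ... */

        pvm_exit();                        /* leave the virtual machine */
        return 0;
    }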

  11. MPI – An overview • API can be traced back to 1992 • First unofficial meeting of the MPI Forum at Supercomputing '92 • The mechanism for creating processes is not specified within the API • Different mechanisms on different platforms • The MPI 1.x standard does not allow for creating or destroying processes • Process groups are central to the parallel model • ‘Communicators’ • Richer set of collective operations than PVM • Derived data-types are an important advance • Can specify a data-type so the pack-unpack step is handled implicitly • 125 functions in the API
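
As a hedged illustration of the derived data-type idea (not code from the lecture), the sketch below uses MPI_Type_vector to describe one column of a row-major 4x4 array so that MPI performs the strided pack/unpack implicitly. The array contents and ranks are arbitrary; run with at least two processes.

    /* Hedged sketch: an MPI derived datatype describing a strided column. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double a[4][4];
        MPI_Datatype column;
        int rank, nprocs, i, j;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (i = 0; i < 4; i++)              /* fill with recognisable values */
            for (j = 0; j < 4; j++)
                a[i][j] = 10 * i + j;

        /* a column of a row-major 4x4 array: 4 blocks of 1 double, stride 4 */
        MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (nprocs >= 2) {
            if (rank == 0)   /* send the second column without manual packing */
                MPI_Send(&a[0][1], 1, column, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&a[0][1], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }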

  12. MPI API and programming model • More naturally a true SPMD (single program, multiple data) programming model • Oriented toward HPC applications • The master-worker model can still be implemented effectively • As for PVM, each process has a unique identifier • Messages are typed, tagged and flagged with a communicator • Messaging can be a single-stage operation • Can send specific variables without the need for packing • Packing is still an option
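
For contrast with the PVM example on slide 10, here is a hedged sketch of the same kind of transfer in MPI: a two-element float vector goes out in a single call, with no explicit pack step. Ranks, tag and data values are illustrative; run with at least two processes.

    /* Hedged sketch: single-stage MPI send of typed data, no packing. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        float vec[2] = {1.0f, 2.0f};
        int rank, nprocs, tag = 99;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (nprocs >= 2) {
            if (rank == 0)       /* "source": typed data goes straight into the send call */
                MPI_Send(vec, 2, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
            else if (rank == 1)  /* "dest": no unpack step needed either */
                MPI_Recv(vec, 2, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }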

  13. Remote Direct Memory Access • Message passing involves a number of expensive operations: • CPUs must be involved (possibly the OS kernel too) • Buffers are often required • RDMA cuts down on the CPU overhead: the CPU sets up channels for the DMA engine, which then writes directly into the destination buffer without constantly taxing the CPU • Frequently described as “zero-copy” messaging • Message passing APIs have been designed around this concept (though it is usually called remote memory access) • Cray SHMEM
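
The slide names Cray SHMEM as an API built around this one-sided model. The sketch below is written against the OpenSHMEM descendant of that interface and is an illustration only (function names are OpenSHMEM; the buffer contents are made up): PE 0 puts data directly into PE 1's symmetric buffer without PE 1 posting a receive.

    /* Hedged sketch of one-sided (RDMA-style) communication in the SHMEM model. */
    #include <stdio.h>
    #include <shmem.h>

    static double buf[4];                  /* symmetric: same address on every PE */

    int main(void)
    {
        double local[4] = {1.0, 2.0, 3.0, 4.0};

        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* PE 0 writes directly into PE 1's memory; PE 1 posts no receive */
        if (me == 0 && npes > 1)
            shmem_double_put(buf, local, 4, 1);

        shmem_barrier_all();               /* make the put visible before reading */

        if (me == 1)
            printf("PE 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);

        shmem_finalize();
        return 0;
    }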

  14. RDMA illustrated [Figure: two hosts, A and B, each with a CPU, a memory/buffer and a NIC with an RDMA engine; packets travel directly between the two buffers via the NICs, bypassing the CPUs]

  15. Networking issues • Networks have played a profound role in the evolution of parallel APIs • Examine network fundamentals in more detail • Provides better understanding of programming issues • Reasons for library design (especially RDMA)

  16. OSI network model • Grew out of a 1982 attempt by the ISO to develop Open Systems Interconnection (there were too many vendor-proprietary protocols at that time) • Motivated from a theoretical rather than practical standpoint • The system of layers taken together = the protocol stack • Each layer communicates with its peer layer on the remote host • The proposed stack was too complex and had too much freedom: not adopted • e.g. the X.400 email standard required several books of definitions • The simplified Internet TCP/IP protocol stack eventually grew out of the OSI model • e.g. the SMTP email standard takes a few pages

  17. Conceptual structure of the OSI model • Upper level: Layer 7 – Application (http, ftp, …); Layer 6 – Presentation (data standards); Layer 5 – Session (application sessions) • Data transfer: Layer 4 – Transport (TCP, UDP, …) • Lower level: Layer 3 – Network (IP, …), which handles routing; Layer 2 – Data link (Ethernet, …); Layer 1 – Physical (signal)

  18. Internet Protocol Suite • Protocol stack on which the internet runs • Occasionally called TCP/IP protocol stack • Doesn’t map perfectly to OSI model • OSI model lacks richness at lower levels • Motivated by engineering rather than concepts • Higher levels of OSI model were mapped into a single application layer • Expanded some layering concepts within the OSI model (e.g. internetworking was added to the network layer)

  19. Internet Protocol Suite • “Layer 7” Application: e.g. FTP, HTTP, DNS • Layer 4 Transport: e.g. TCP, UDP, RTP, SCTP • Layer 3 Network: IP • Layer 2 Data link: e.g. Ethernet, token ring • Layer 1 Physical: e.g. T1, E1

  20. Internet Protocol (IP) • Data-oriented protocol used by hosts for communicating data across a packet-switched inter-network • Addressing and routing are handled at this level • IP sends and receives data between two IP addresses • Data segment = packet (or datagram) • Packet delivery is unreliable – packets may arrive corrupted, duplicated, out of order, or not at all • The lack of delivery guarantees allows fast switching

  21. IP Addressing • On an Ethernet network, routing at the data link layer occurs between 6-byte MAC (Media Access Control) addresses • IP adds its own configurable address scheme on top of this • 4-byte address, expressed as four decimal numbers in the range 0-255 • Note 0 and 255 are both reserved numbers • The division of the numbers determines the network number versus the node number • Subnet masks determine how these are divided • Classes of networks are described by the first number in the IP address and the number of network addresses • [192:255].35.91.* = class C network (254 hosts) (subnet mask 255.255.255.0) • [128:191].132.*.* = class B network (65,534 hosts) ( “ 255.255.0.0) • [1:126].*.*.* = class A network (16 million hosts) ( “ 255.0.0.0) • Note the 35.91 in the class C example and the 132 in the class B example could be different values; they are filled in to show how the network address is defined

  22. Subnets • Class A networks are extremely large and are better dealt with by subdivision • Any network class can be subdivided into subnets • Broadcasts then work within each subnet • Subnet netmasks are defined by extending the netmask beyond the usual 1-byte boundary • e.g. 10000000 = 128, 11000000 = 192, 11100000 = 224 (as the sketch below illustrates)
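
A hedged sketch of the netmask arithmetic behind slides 21-22: the network number is the bitwise AND of the address and the mask, and the broadcast address sets all host bits to one. The address 192.168.35.91 and the /27 mask 255.255.255.224 are made-up values in the spirit of the class C example.

    /* Hedged sketch: splitting an IPv4 address into network and host parts. */
    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct in_addr addr, mask, net, bcast;
        char buf[INET_ADDRSTRLEN];

        inet_pton(AF_INET, "192.168.35.91",   &addr);   /* made-up host address     */
        inet_pton(AF_INET, "255.255.255.224", &mask);   /* /27 subnet of a class C  */

        net.s_addr   = addr.s_addr & mask.s_addr;       /* network number           */
        bcast.s_addr = addr.s_addr | ~mask.s_addr;      /* host bits all set to one */

        printf("network:   %s\n", inet_ntop(AF_INET, &net,   buf, sizeof buf));
        printf("broadcast: %s\n", inet_ntop(AF_INET, &bcast, buf, sizeof buf));
        return 0;
    }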

  23. Class C subnets [Table: the possible subnet divisions of a class C network (netmask, number of subnets, hosts per subnet) – not reproduced here] *Classic IP rules say you cannot use subnets with all zeros or all ones in the network portion **A host address of all zeros frequently means “this host” in many IP implementations, while all ones is the broadcast address ***This netmask specifies the entire host address

  24. Transmission Control Protocol (TCP) • TCP is responsible for dividing the application’s data stream, error correction, and opening the channel (port) between applications • Applications send a byte stream to TCP • TCP divides the byte stream into appropriately sized segments (set by the MTU* of the IP layer) • Each segment is given two sequence numbers (sequence and acknowledgement) to enable the byte stream to be reconstructed • Each segment also has a checksum to ensure correct packet delivery • Segments are passed to the IP layer for delivery *maximum transmission unit
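
For concreteness, this is a hedged sketch of the 16-bit one's-complement Internet checksum (RFC 1071) that TCP computes over each segment (the real TCP checksum additionally covers a pseudo-header containing the IP addresses). The sample bytes are arbitrary.

    /* Hedged sketch of the RFC 1071 Internet checksum used by TCP/UDP/IP. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {              /* sum 16-bit words */
            sum += *p++;
            len -= 2;
        }
        if (len == 1)                  /* odd trailing byte */
            sum += *(const uint8_t *)p;

        while (sum >> 16)              /* fold carries back into the low 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;         /* one's complement of the sum */
    }

    int main(void)
    {
        uint8_t segment[] = {0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46};  /* arbitrary bytes */
        printf("checksum: 0x%04x\n", inet_checksum(segment, sizeof segment));
        return 0;
    }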

  25. UDP: Alternative to TCP • UDP = User Datagram Protocol • Only adds a checksum and multiplexing capability – the limited functionality allows a streamlined implementation: faster than TCP • No confirmation of delivery • Unreliable protocol: if you need reliability you must build it on top of this layer • Suitable for real-time applications where error correction is irrelevant (e.g. streaming media, voice over IP) • DNS and DHCP both use UDP
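
A hedged sketch of UDP's fire-and-forget behaviour using the standard BSD sockets API: sendto() hands the datagram to the stack and returns with no confirmation of delivery. The destination address 192.168.1.10 and port 9999 are made up.

    /* Hedged sketch: a minimal UDP sender - no connection, no delivery guarantee. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);              /* UDP socket */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in dest;
        memset(&dest, 0, sizeof dest);
        dest.sin_family = AF_INET;
        dest.sin_port   = htons(9999);                        /* arbitrary port */
        inet_pton(AF_INET, "192.168.1.10", &dest.sin_addr);   /* arbitrary host */

        const char msg[] = "hello";
        /* no connection setup, no retransmission, no acknowledgement */
        if (sendto(fd, msg, sizeof msg, 0,
                   (struct sockaddr *)&dest, sizeof dest) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }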

  26. Encapsulation of layers [Figure: each layer wraps the data handed down from the layer above with its own header – application data; TCP header + data at the transport layer; IP header + data at the network layer; Ethernet header + data at the data link layer]

  27. Link Layer • For high performance clusters the link layer frequently determines the networking above it • All high performance interconnects emulate IP • Some significantly better (e.g. IP over Myrinet) than others (e.g. IP over Infiniband) • Each data link thus brings its own networking layer with it

  28. Overheads associated with TCP/IP • Moving data from the application to the TCP layer involves a copy from user space to the OS kernel • Recall the memory wall issue? Bandwidth to memory is not growing fast enough • Rule of thumb: each TCP bit requires 1 Hz of CPU speed to perform the copy sufficiently quickly • “Zero copy” implementations remove the copy from user space to the OS (DMA) • TCP off-load engines (TOEs) remove checksumming etc. from the CPU • RDMA and removal of the OS TCP overheads will be necessary for 10Gb Ethernet to function effectively [Figure: data path from the user application, through a copy into the OS TCP stack, down to the data link]

  29. Ping-Ping, Ping-Pong • Message passing benchmarking examines two main modes of communication • Ping-Ping: a single node fires off multiple messages back-to-back (bandwidth test) • Ping-Pong: a pair of nodes cooperate on a send-receive-send exchange (latency and bandwidth test) [Figure: timelines of the two messaging patterns between node 0 and node 1]
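
A hedged sketch of a ping-pong microbenchmark of the kind described above, written with MPI: ranks 0 and 1 bounce a message back and forth, and half the averaged round-trip time for a small message approximates the latency; repeating with large messages gives bandwidth. The message size and repetition count are arbitrary choices.

    /* Hedged sketch: MPI ping-pong latency/bandwidth microbenchmark. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREPS 1000                         /* arbitrary repetition count */

    int main(int argc, char **argv)
    {
        int rank, nbytes = 8;                  /* small message: latency test */
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {                   /* send, then wait for the echo */
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {            /* echo the message back */
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)                         /* half the round trip = one-way time */
            printf("%d bytes: %.2f us one-way\n",
                   nbytes, 1e6 * (t1 - t0) / (2.0 * NREPS));

        free(buf);
        MPI_Finalize();
        return 0;
    }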

  30. Overview of interconnect fabrics • Broadly speaking, interconnects break down into two camps: commodity vs specialist • Commodity: Ethernet (cost < $50 per port) • Specialist: everything else (cost > $800 per port) • Specialist interconnects primarily provide two features over gigabit: • Higher bandwidth • Lower message latency

  31. Gigabit Ethernet • IEEE 802.3z • Most popular interconnect on the top500 – 213 systems (none in the top 10) • Optimal performance of around 80 MB/s • Significantly higher latency than the specialist interconnects (roughly 10x, at 25 µs) • However, fast libraries have been written (e.g. GAMMA, Scali) with latencies around 10 µs • Popular because of hardware economics • Cables very cheap • Switches getting cheaper all the time (32-port switch = $250) • NICs beginning to include additional co-processing hardware for off-loading TCP overheads from the CPU (TCP off-load engines: TOEs) • The project to improve performance of the network layer (VIA) has been derailed

  32. 10 Gigabit Ethernet • Relatively recent technology, still very expensive (estimates are in the range of $2000-$3000 per port) • Solutions in the HPC arena are limited to products supplied by already established HPC vendors, e.g. Myrinet, Quadrics • Commoditization is expected, but it is difficult to pick out a driver for it – few people need a GB/s out of their desktop

  33. Myrinet • www.myrinet.com • Proprietary standard (1994), updated in 2000 to increase bandwidth • Second most popular interconnect on the top500 (79 systems), 1 system in the top 10 • The E model has standard bandwidth, 770 MB/s • Myrinet 10G is a 10 Gigabit Ethernet solution offering 1250 MB/s (new) • Latency 2.6 µs (recent library update) • Most systems implemented via a fully switched network • Dual channel cards available for higher bandwidth requirements

  34. Scalable Coherent Interface • Developed out of an IEEE standard (1992), updated in 2000 (popular in Europe) • No systems in the top500 • Originally conceived as a standard for building NUMA machines • Switchless topologies fixed to be either ring or torus (3d also) • Latency around 4.5 µs, but is well optimized for short messages • Bandwidth ~350 MB/s per ring channel (no PCI-X or PCI-Express implementation yet) • Card costs ~$1500 for the 3d torus version • European solution – hence less popular in N. America

  35. Infiniband • The Infiniband (open) standard is designed to cover many arenas, from database servers to HPC • 78 systems on the top500, 3 systems in the top 10 • Large market => potential to become commoditized like gigabit • Serial bus; can add bandwidth by adding more channels (e.g. 1x, 4x, 12x standards, each 1x = 250 MB/s) • Double data rate option now available (250→500 MB/s) • Highest bandwidth observed thus far ~900 MB/s for a 4x single data rate link • Slow adoption among HPC until very recently – was initially troubled by higher latencies (around 6.5 µs, lower now) • Usually installed as a fully switched fabric • Cards frequently come with 2 channels for redundancy

  36. Costs and future • Current cost around $1000 per port (for cards + switch + cable) • PCI-Express cards set to reduce the cost of the host channel adapter (HCA = NIC) by removing memory from the card (cost may fall to $600 per port) • Software support growing; a common Linux interface is under development (www.openib.org) • Discussion of installing directly on the motherboard, which would improve the adoption rate • Latency is coming down all the time – a lot of people are working on this issue

  37. Quadrics (Elan4) • www.quadrics.com • 14 systems on top500, 1 system in top 10 • Most expensive specialist interconnect ~1600 to 3000 USD per port depending upon configuration • Had lowest latency of any available interconnect (1.5 µs) • New product from Pathscale “Infinipath” may be slightly faster • Theoretical 1066 MB/s bandwidth, 900 achieved • Fat-tree network topology • High efficiency RDMA engine on board • NIC co-processor capable of off-loading communication operations from CPU

  38. Quick History • Quadrics grew out of Meiko (Bristol UK) • Meiko CS-1 transputer based system, CS-2 built around HyperSparc and the first generation Elan network processor

  39. Quadrics (Elan4) • Full 64-bit addressing • STEN=Short Transaction Engine • RDMA engine is pipelined • 64-bit RISC processor • 2.6 Gbytes/s total

  40. Optimizing short messages • I/O writes coming from the PCI-X bus can be formatted directly into network transactions • These events can be handled independently of DMA operations – avoids setting up DMA channel • Method becomes less efficient than using DMA engine at around 1024 byte messages

  41. A few Elan details • Performance strongly influenced by PCI implementation & speed of memory bus • Opteron (Hypertransport) is best of breed PCI • Network fault detection and tolerance is done in hardware (all communication is acknowledged) • Elanlib library provides low level interface to point-to-point messaging, and tagged messaging • MPI libs built upon either elanlib or even lower level library elan3lib

  42. Breakdown of transaction times: worst case 8-byte put [Chart: component times in nanoseconds; from a presentation by John Taylor, Quadrics Ltd]

  43. QsNetII MPI All Reduce performance: off-load [Chart]

  44. Interconnect I/O bus comparison [Chart: bandwidth in MB/s]

  45. Latencies [Chart: latency in µs]

  46. Summary Part 1 • TCP/IP has a large number of overheads that require OS input • Specialist interconnects alleviate these problems and have significantly higher throughput and lower latencies • Cost remains a barrier to the widespread adoption of higher performance interconnects • RDMA will appear in the next generation of gigabit (10Gb)

  47. Part 2: • Parallel Virtual Machine • Brief overview of PVM environment and console • Programming with PVM

  48. Why PVM? • PVM was envisaged to run on heterogeneous networks • Technically MPI can run on heterogeneous networks too, but it is not designed around that idea (no common method for starting programs) • PVM is an entire framework • Completely specifies how programs start on different architectures (spawning) • Resource managers are designed into the framework • Some parallel languages use the PVM framework (e.g. Glasgow Haskell) • Interoperable at the language level • Could have C master thread and FORTRAN workers • Fault tolerance is easily integrated into programs • Primarily a result of the master-worker type parallelism that is preferred • Comparatively small distribution – only a few MB • Secure message transfer • “Free” book: • http://www.netlib.org/pvm3/book/pvm-book.html

  49. Underlying design principles • User configuration of the ‘host pool’ • Machines may be added or deleted during operation • Transparent access to hardware • Process-based granularity • The atom of computation is the task • Explicit message passing • Message sizes are limited only by available memory • Network and multiprocessor support • Messages and tasks are interoperable between the two • Multiple language support (Java added recently) • Evolution and incorporation of new techniques should be simple • i.e. not involve defining a new standard
