
High Performance Communications

This lecture continues the discussion of the Eager and Rendezvous message protocols and then covers polling vs. interrupts, scalability patterns, and low-latency communication in HPC. It examines the features, advantages, and trade-offs of the Eager and Rendezvous protocols and the special protocols for DSM, and also touches on buffering, completion semantics, packetization, non-contiguous datatypes, inter-node tuning, and the impact of eager thresholds.


Presentation Transcript


  1. High Performance Communications Lecture #3 Alex Margolin Winter 2017

  2. Lecture Outline • Eager & Rendezvous protocols (contd.) • Polling vs. Interrupts • Scalability patterns <Break> • Low-latency communication in HPC

  3. Part 1:Eager & Rendezvous

  4. Message protocols • Message consists of “envelope” and data • Envelope contains tag, communicator, length, source information, plus impl. private data • Short • Message data (message for short) sent with envelope • Eager • Message sent assuming destination can store • Rendezvous • Message not sent until destination oks

  5. Special Protocols for DSM • Message passing is a good way to use distributed shared memory (DSM) machines because it provides a way to express memory locality. • Put • Sender puts to destination memory (user or MPI buffer). Like Eager. • Get • Receiver gets data from sender or MPI buffer. Like Rendezvous. • Short, long, rendezvous versions of these

  6. The “Eager” Protocol • On the Responder – post some receive buffers (no opcode needed) • On the Requestor – send to a remote QP (by QPN), using the SEND opcode • Sends are typically “SIGNALED”: we want a work completion for each once it’s done • Optional: “completion moderation”, meaning not every send is “signaled” • we’ll talk more about it under “Flow control” • This protocol has minimal startup overhead and is used to implement low-latency message passing for smaller messages. • The downside: memcpy! (not “zero-copy”) • What happens if the Responder is not ready? • Completion with error! • status is not IBV_WC_SUCCESS • [Diagram: Process 0 eagerly sends data packets to Process 1 over time]
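The eager exchange maps directly onto the verbs API. Below is a minimal sketch, assuming an already-connected QP, a registered memory region mr, and buffer/length variables (all assumed names); error handling omitted:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Responder: pre-post a receive buffer so an incoming SEND has somewhere to land. */
struct ibv_sge rsge = { .addr = (uintptr_t)rbuf, .length = rlen, .lkey = mr->lkey };
struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &rsge, .num_sge = 1 };
struct ibv_recv_wr *bad_rwr;
ibv_post_recv(qp, &rwr, &bad_rwr);

/* Requestor: eager-send the message with the SEND opcode, signaled so we get a WC. */
struct ibv_sge ssge = { .addr = (uintptr_t)sbuf, .length = slen, .lkey = mr->lkey };
struct ibv_send_wr swr = { .wr_id = 2, .sg_list = &ssge, .num_sge = 1,
                           .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED };
struct ibv_send_wr *bad_swr;
ibv_post_send(qp, &swr, &bad_swr);
```

If the responder has not posted a receive, the send fails and the requestor sees a work completion whose status is not IBV_WC_SUCCESS.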

  7. Eager Features • Reduces synchronization delays • Simplifies programming (just MPI_Send) • Requires significant buffering • May require active involvement of CPU to drain network at receiver’s end • May introduce additional copy (buffer to final destination)

  8. How Scalable is Eager Delivery? • Buffering must be reserved for arbitrary senders • User-model mismatch (users often expect buffering to be allocated entirely to “used” connections) • A common approach in implementations is to provide the same buffering for all members of MPI_COMM_WORLD; this is optimizing for non-scalable computations

  9. The “Rendezvous” Protocol • Rendezvous is a fancy word for “meeting” • On the Requestor – “eager-send” a notification (just size and location info): “May I send?” • On the Responder – wait for the user OR allocate space, then call RDMA_READ • OR send back local info (address + rkey), and the requestor calls RDMA_WRITE • This protocol is used for transferring large messages, when the sender is not sure whether the receiver actually has the buffer space to hold the entire message. • [Diagram: Process 0 asks “May I Send?”, Process 1 replies “Yes”, and only then is the data transferred]
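In the RDMA_WRITE variant, the responder replies with the address and rkey of its receive buffer, and the requestor then writes the payload straight into it. A hedged sketch of the requestor side, assuming remote_addr and rkey arrived in the responder's reply and the local buffer is already registered (illustrative names):

```c
struct ibv_sge sge = { .addr = (uintptr_t)sbuf, .length = slen, .lkey = mr->lkey };
struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                          .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED };
struct ibv_send_wr *bad_wr;

/* Target the buffer the responder advertised in its reply. */
wr.wr.rdma.remote_addr = remote_addr;  /* remote virtual address from the reply */
wr.wr.rdma.rkey        = rkey;         /* remote key covering that buffer */

ibv_post_send(qp, &wr, &bad_wr);       /* data lands directly in the remote buffer (zero-copy) */
```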

  10. Rendezvous Features • Robust and safe • (except for limit on the number of envelopes…) • May remove copy (user to user direct) • More complex programming (waits/tests) • May introduce synchronization delays (waiting for receiver to ok send)

  11. Short Protocol • Data is part of the envelope • Otherwise like eager protocol • May be performance optimization in interconnection system for short messages, particularly for networks that send fixed-length packets

  12. User vs. System Buffering • Where is data stored (or staged) while being sent? • User’s memory • Allocated on the fly • Preallocated • System memory • May be limited • Special memory may be faster

  13. Completion semantics • Non-blocking – operation does not wait for completion • Blocking – function returns when the buffer is available (not on completion!) • Synchronous – completion of the send requires the start (not the completion) of the receive • Includes “sanity errors” such as a closed connection object • Asynchronous – communication and computation take place simultaneously • Ready – a correct send requires that a matching receive is already posted
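In MPI these semantics correspond to different send calls. A brief illustration (buf, n, and dest are placeholders):

```c
#include <mpi.h>

/* Blocking: returns once buf may be reused (possibly buffered, not necessarily delivered). */
MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

/* Synchronous: completes only after the matching receive has started. */
MPI_Ssend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

/* Non-blocking/asynchronous: returns immediately; overlap computation, then wait. */
MPI_Request req;
MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
/* ... computation overlapping the communication ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* Ready: only correct if the matching receive is already posted. */
MPI_Rsend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
```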

  14. Packetization • Some networks send data in discrete chunks called packets • Data is sent in individual packets; the last packet may be shorter • [Diagram: Process 0 sends data to Process 1 as a series of packets]

  15. Non-contiguous Datatypes • Provided to allow MPI implementations to avoid an extra copy before the data reaches the network • Not widely implemented yet • Handling of important special cases • Constant stride • Contiguous structures
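For the constant-stride special case, MPI derived datatypes let the library handle the non-contiguous layout, for example sending one column of a row-major matrix (sketch; matrix, nrows, ncols, col, and dest are assumed):

```c
/* Describe column `col` of a row-major nrows x ncols double matrix without packing it by hand. */
MPI_Datatype column;
MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);   /* nrows blocks of 1 element, stride ncols */
MPI_Type_commit(&column);
MPI_Send(&matrix[0][col], 1, column, dest, 0, MPI_COMM_WORLD);
MPI_Type_free(&column);
```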

  16. Inter-node Tuning: Eager Thresholds • Switching from Eager to Rendezvous transfer • Default: architecture dependent on common platforms, chosen to achieve both the best performance and a small memory footprint • The threshold can be modified by users to get smooth performance across message sizes • Memory footprint can increase along with the eager threshold • [Charts: Impact of Eager Threshold; Eager vs. Rendezvous]

  17. Part 2:Polling & Interrupts

  18. Ping-pong Measurements • Client • round-trip-time 15.7 microseconds • user CPU time 100% of elapsed time • kernel CPU time 0% of elapsed time • Server • round-trip time 15.7 microseconds • user CPU time 100% of elapsed time • kernel CPU time 0% of elapsed time • InfiniBand QDR 4x through a switch

  19. How to reduce 100% CPU usage • Cause is “busy polling” to wait for completions • in tight loop on ibv_poll_cq() • burns CPU since most calls find nothing • Why is “busy polling” used at all? • simple to write such a loop • gives very fast response to a completion • (i.e., gives low latency)

  20. ”busy polling” to get completions • start loop • ibv_poll_cq() to get any completion in queue • exit loop if a completion is found • end loop
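A minimal C version of that loop (cq is an existing ibv_cq; error handling trimmed):

```c
struct ibv_wc wc;
int n;

/* Spin until a work completion shows up; each call returns immediately. */
do {
    n = ibv_poll_cq(cq, 1, &wc);   /* 0 = CQ empty, >0 = completions returned, <0 = error */
} while (n == 0);

if (n < 0 || wc.status != IBV_WC_SUCCESS) {
    /* handle the error (failed poll or failed work request) */
}
```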

  21. How to eliminate “busy polling” • Cannot make ibv_poll_cq() block • no flag parameter • no timeout parameter • Must replace busy loop with “wait – wakeup” • Solution is a “wait-for-event” mechanism • ibv_req_notify_cq() - tell CA to send an “event” when next WC enters CQ • ibv_get_cq_event() - blocks until gets “event” • ibv_ack_cq_event() - acknowledges “event”

  22. API for receiver wait-wakeup WIRE USER CHANNEL ADAPTER allocate virtual memory register recv queue metadata . . . ibv_post_recv() parallel activity control control ibv_req_notify_cq() wait ibv_get_cq_event() data packets completion queue blocked ACK . . . status wakeup control ibv_ack_cq_events() ibv_poll_cq() access

  23. “wait-for-event” to get completions • start loop • ibv_poll_cq() to get any completion in the CQ • exit loop if a completion is found • ibv_req_notify_cq() to arm the CA to send an event when the next WC enters the CQ • ibv_poll_cq() to get any completion that arrived between steps 2 & 4 • exit loop if a completion is found • ibv_get_cq_event() to wait until the CA sends the event • ibv_ack_cq_events() to acknowledge the event • end loop
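Put together in C, the wait-for-event loop looks roughly like this, assuming the CQ was created on a completion channel comp_channel (error handling omitted):

```c
struct ibv_wc wc;
struct ibv_cq *ev_cq;
void *ev_ctx;

for (;;) {
    /* Steps 1-2: grab a completion if one is already queued. */
    if (ibv_poll_cq(cq, 1, &wc) > 0)
        break;

    /* Step 3: arm the CQ so the CA raises an event on the next completion. */
    ibv_req_notify_cq(cq, 0);               /* 0 = notify on any completion, not only solicited */

    /* Steps 4-5: catch a completion that slipped in between polling and arming. */
    if (ibv_poll_cq(cq, 1, &wc) > 0)
        break;

    /* Step 6: block (yield the CPU) until the CA delivers the event. */
    ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx);

    /* Step 7: acknowledge the event, then loop back and poll for the completion. */
    ibv_ack_cq_events(ev_cq, 1);
}
```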

  24. ping-pong measurements with wait • Client • round-trip-time 21.1 microseconds – up 34% • user CPU time 9.0% of elapsed time • kernel CPU time 9.1% of elapsed time • total CPU time 18% of elapsed time – down 82% • Server • round-trip time 21.1 microseconds – up 34% • user CPU time 14.5% of elapsed time • kernel CPU time 6.5% of elapsed time • total CPU time 21% of elapsed time – down 79%

  25. Using multiple CPU cores • So far we discussed a single process, but today’s CPUs are multi-core • CPU affinity improves performance • Why? Because the scheduler isn’t always optimal, plus we can pass it a “hint” • How? • API: sched_setaffinity, pthread_setaffinity_np • Command line: taskset -c • More cores = more bandwidth? • Only if the CPU is the bottleneck… • Typically true for small messages • CPU is the bottleneck • Typically false for large RDMAs • Unless memory access is the bottleneck!
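A small sketch of pinning the current process to one core with the Linux affinity API (the core number is illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                                  /* allow only this core */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = calling process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```

From the shell, `taskset -c 2 ./pingpong` achieves the same pinning without changing the program.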

  26. Part 3:Scalability Patterns

  27. Server architectures • Thread-based Server Architectures • Multi-Process/Multi-Threaded Architectures • Pro: isolation, per-context management (+FT) • Con: significant overhead per context… • Event-driven Server Architectures • Potentially use non-blocking I/O multiplexing • Pro: better resource utilization (per-resource queues) • Con: who stores the state…? Concurrent Programming for Scalable Web Architectures - Diploma Thesis by Benjamin Erb

  28. Reactor vs. Proactor pattern • The Reactor pattern [Sch95] targets synchronous, non-blocking I/O handling and relies on an event notification interface. • On startup, an application following this pattern registers a set of resources (e.g. a socket) and events (e.g. a new connection) it is interested in. For each resource event the application is interested in, an appropriate event handler must be provided (a callback or hook method). • The core component of the Reactor pattern is a synchronous event demultiplexer that awaits events on these resources using a blocking event notification interface. • Whenever the synchronous event demultiplexer receives an event (e.g. a new client connection), it notifies a dispatcher and awaits the next event. The dispatcher processes the event by selecting the associated event handler and triggering the callback/hook execution. • In contrast, the Proactor pattern [Pya97] leverages truly asynchronous, non-blocking I/O operations, as provided by interfaces such as POSIX AIO. As a result, the Proactor can be considered an entirely asynchronous variant of the Reactor pattern seen before. • It relies on completion events instead of blocking event notification interfaces. A proactive initiator represents the main application thread and is responsible for initiating asynchronous I/O operations. When issuing such an operation, it always registers a completion handler and a completion dispatcher. • The execution of the asynchronous operation is governed by the asynchronous operation processor, an entity that is part of the OS in practice. When the I/O operation has completed, the completion dispatcher is notified. Next, the completion handler processes the resulting event.

  29. Reactor vs. Proactor Example: WWW

  30. The “Active message” paradigm • The concept: messages containing “data + action” • The action encodes what to do with the data • No need to “save state” – decide based on the “action” • This is an extension of the reactor pattern, useful for small “services”. • How to apply it? • At init time – create handler functions for each type of action. • At run time – each incoming packet has a header with the “action” • The appropriate callback is called with the payload of that packet. Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, and Andrew Lumsdaine. 2010. AM++: a generalized active message framework. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10).
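The dispatch mechanism can be as simple as a handler table indexed by the action id in the packet header. A sketch with illustrative names (am_header, am_register, am_dispatch are not from any particular library):

```c
#include <stdint.h>
#include <stddef.h>

#define AM_MAX_ACTIONS 256

/* Hypothetical packet header: the action id encodes what to do with the payload. */
struct am_header {
    uint16_t action;
    uint32_t payload_len;
};

typedef void (*am_handler_t)(const void *payload, size_t len);

static am_handler_t am_handlers[AM_MAX_ACTIONS];

/* Init time: register one handler per action type. */
void am_register(uint16_t action, am_handler_t fn) { am_handlers[action] = fn; }

/* Run time: every incoming packet is routed by its header; no per-connection state is needed. */
void am_dispatch(const void *packet)
{
    const struct am_header *h = packet;
    am_handlers[h->action]((const char *)packet + sizeof(*h), h->payload_len);
}
```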

  31. Thread-less libraries and progress routines • How do we know when an action (say, a system call) is complete? • Blocking/synchronous calls – when the call returns • Non-blocking/asynchronous calls – when it lets you know • Return value (from calling it repeatedly), signaled file descriptor, signal… • When waiting for an incoming message in user-level networks, two choices: • Polling for completions (e.g. ibv_poll_cq()) • Pro: low latency (best-case IB latency is hundreds of nanoseconds) • Con: keeps the CPU busy • Waiting on a file descriptor (e.g. completion channels) • Pro: yields the CPU • Con: kernel-user context switch (tens of microseconds) • If you need low latency – use periodic “progress calls” (you control the rate) • Like everything else in CS – it’s a tradeoff

  32. Part 4:Low-latency in HPC

  33. High-Performance Computing (HPC) • Performance: • A quantifiable measure of the rate of doing (computational) work • Multiple such measures of performance • Delineated at the level of the basic operation • ops – operations per second • ips – instructions per second • flops – floating-point operations per second • The rate at which a benchmark program executes • A carefully crafted and controlled code used to compare systems • Linpack Rmax (Linpack flops) • gups (billion updates per second) • Others… • Two perspectives on performance • Peak performance • Maximum theoretical performance possible for a system • Sustained performance • Observed performance for a particular workload and run • Varies across workloads and possibly between runs • Current top peak FP performance: 125,436,000 GFlops = 125,436 TFlops = 125.436 PFlops: Sunway TaihuLight (@ National Supercomputing Center, Wuxi, China), 10,649,600 total processor cores (40,960 Sunway SW26010 260C 260-core processors @ 1.45 GHz)

  34. Supercomputer #1: Sunway TaihuLight

  35. Drivers of Modern HPC Cluster Architectures • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) • [Diagram: cluster building blocks - multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2, Stampede, Tianhe-1A, Titan]

  36. Towards Exascale Systems Courtesy: Prof. Jack Dongarra

  37. Two Major Categories of Applications • Scientific Computing • Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model • Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. • Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing • Focuses on large data and data analysis • Hadoop (HDFS, HBase, MapReduce) • Spark is emerging for in-memory computing • Memcached is also used for Web 2.0

  38. Applications (Science & Engineering) • MPI is widely used in large scale parallel applications in science and engineering • Atmosphere, Earth, Environment • Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics • Bioscience, Biotechnology, Genetics • Chemistry, Molecular Sciences • Geology, Seismology • Mechanical Engineering - from prosthetics to spacecraft • Electrical Engineering, Circuit Design, Microelectronics • Computer Science, Mathematics

  39. Example application areas: turbomachinery (gas turbine/compressor), biology, transportation & traffic, drilling, astrophysics • [Slide of application images]

  40. Parallel Programming Models • Programming models provide abstract machine models • Models can be mapped onto different types of systems • e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and hybrid MPI+PGAS models are gradually gaining importance • [Diagram: three models - Shared Memory Model (SHMEM, DSM) with processes P1-P3 sharing one memory; Distributed Memory Model (MPI, the Message Passing Interface) with one memory per process; Partitioned Global Address Space (PGAS: Global Arrays, UPC, Chapel, X10, CAF, …) with a logical shared memory layered over per-process memories]

  41. Overview of the MVAPICH2 Project • High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) • MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002 • MVAPICH2-X (MPI + PGAS), available since 2011 • Used by more than 2,500 organizations in 76 countries • More than 317,000 (> 0.3 million) downloads from the OSU site directly • Empowering many TOP500 clusters (Nov ’15 ranking) • 10th-ranked 519,640-core cluster (Stampede) at TACC • 13th-ranked 185,344-core cluster (Pleiades) at NASA • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others • Empowering Top500 systems for over a decade • System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (10th in Nov ’15, 519,640 cores, 5.168 PFlops)

  42. Latency: MPI over IB with MVAPICH2 • [Charts: small-message latency for TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR (2.8 GHz deca-core IvyBridge, PCIe Gen3, through an IB switch) and ConnectX-4-EDR (2.8 GHz deca-core Haswell, PCIe Gen3, back-to-back)]

  43. Bandwidth: MPI over IB with MVAPICH2 • [Charts: bandwidth for TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR (2.8 GHz deca-core IvyBridge, PCIe Gen3, through an IB switch) and ConnectX-4-EDR (2.8 GHz deca-core Haswell, PCIe Gen3, back-to-back)]

  44. MVAPICH2 Challenges for Exascale • Scalability from millions to billions of processors • Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided RMA) • Extremely small memory footprint • Collective communication • Offload and non-blocking • Integrated support for GPGPUs • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy awareness

  45. Questions?
