
Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters




Presentation Transcript


  1. Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters Paweł Pisarczyk pawel.pisarczyk@atm.com.pl Jarosław Węgliński jaroslaw.weglinski@atm.com.pl Cracow, 16 October 2006

  2. Agenda • Introduction • HPC cluster interconnects • Message propagation model • Experimental setup • Results • Conclusions

  3. Who we are • joint stock company • founded in 1994, earlier (since 1991) operating as a department within PP ATM • IPO in September 2004 (Warsaw Stock Exchange) • major shares owned by founders (Polish citizens) • no state capital involved • financial data • share capital of about €6 million • 2005 sales of €29.7 million • about 230 employees

  4. Mission • building business value through innovative information & communication technology initiatives creating new markets in Poland and abroad • ATM's competitive advantage is based on combining three key competences: • integration of comprehensive IT systems • telecommunication services • consulting and software development

  5. Achievements • 1991 Poland’s first company connected to Internet • 1993 Poland’s first commercial ISP • 1994 Poland’s first LAN with ATM backbone • 1994 Poland’s first supercomputer on the Dongarra’s Top 500 list • 1995 Poland’s first MAN in ATM technology • 1996 Poland’s first corporate network with voice & data integration • 2000 Poland’s first prototype Interactive TV system over a public network • 2002 Poland’s first validated MES system for a pharmaceutical factory • 2003 Poland’s first commercial, public Wireless LAN • 2004 Poland’s first public IP content billing system

  6. Client base (based on 2005 sales revenues)

  7. HPC clusters developed by ATM • 2004 - Poznan Supercomputing and Networking Center • 238 Itanium2 CPUs, 119 x HP rx2600 nodes with Gigabit Ethernet interconnect • 2005 - University of Podlasie • 34 Itanium2 CPUs, 17 x HP rx2600 nodes with Gigabit Ethernet interconnect and the Lustre 1.2 filesystem • 2005 - Poznan Supercomputing and Networking Center • 86 dual-core Opteron CPUs, 42 Sun SunFire v20z and 1 Sun SunFire v40z with Gigabit Ethernet interconnect • 2006 - Military University of Technology, Faculty of Engineering, Chemistry and Applied Physics • 32 Itanium2 CPUs, 16 x HP rx1620 with Gigabit Ethernet interconnect • 2006 - Gdansk University of Technology, Department of Pharmaceutical Technology and Chemistry • 22 Itanium2 CPUs (11 x HP rx1620) with Gigabit Ethernet interconnect

  8. Selected software projects related to distributed systems • Distributed Multimedia Archive in the Interactive Television (iTVP) Project • scalable storage for the iTVP platform with the ability to process the stored content • ATM Objects • scalable storage for a multimedia content distribution platform • a system for the Cinem@n company (founded by ATM and Monolith) • Cinem@n will introduce distribution services for high-quality movie, news and entertainment digital content • Spread Screens Manager • platform for POS TV • the system is currently used by Zabka (a retail chain) and Neckermann (a travel service) • about 300 terminals presenting multimedia content, located in many Polish cities

  9. Selected current projects • ATMFS • distributed filesystem for petabyte-scale storage based on COTS hardware • based on variable-sized chunks • advanced replication and enhanced error detection • dependability evaluation based on a software fault injection technique • FastGig • an RDMA stack for Gigabit Ethernet-based interconnects • message-passing latency reduction • increased application performance

  10. Uses of computer networks in HPC clusters • Exchange of messages between cluster nodes to coordinate distributed computation • requires both high peak throughput and low latency • inefficiency is observed when the time consumed by a single computation step is comparable to the message-passing time • Access to shared data through a network or cluster file system • requires high bandwidth when transferring data in blocks of a defined size • filesystem and storage drivers try to reduce the number of I/O operations issued (by buffering data and aggregating transfers)
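
The sensitivity to message-passing latency described above is commonly quantified with a ping-pong micro-benchmark (NetPipe, used later in this presentation, follows the same idea). Below is a minimal sketch of such a test in C using standard MPI calls; the message size and iteration count are illustrative values, not parameters taken from the study.

    /* Minimal MPI ping-pong latency sketch (illustrative only).
     * Rank 0 sends a small message to rank 1 and waits for the echo;
     * half of the average round-trip time approximates one-way latency. */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE   64      /* small message, bytes (arbitrary) */
    #define ITERATIONS 1000    /* number of round trips (arbitrary) */

    int main(int argc, char **argv)
    {
        char buf[MSG_SIZE] = {0};
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERATIONS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / ITERATIONS / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }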

  11. Comparison of characteristics of interconnect technologies * Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes, “Cluster Interconnect Overview”, Scalable Computing Laboratory, Ames Laboratory

  12. Gigabit Ethernet interconnect characteristics • Popular technology for low-cost cluster interconnects • Satisfactory throughput for long frames (1000 bytes and longer) • High latency and low throughput for small frames • These drawbacks are mostly caused by the design of existing network interfaces • What is the influence of the network stack implementation on communication latency?

  13. Message propagation model • Latency between passing the message to/from the MPI library and passing the data to/from the network stack • Time difference between the sendto()/recvfrom() functions and the driver's start_xmit()/interrupt functions • Execution time of the driver functions • Processing time of the network interface • Propagation latency and latency introduced by active network elements
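
Read as a sum of delay components along the transmit/receive path, the model above can be summarised roughly as

    T_total ≈ T_mpi + T_syscall + T_driver + T_nic + T_wire

where T_mpi is the MPI library overhead, T_syscall the time between sendto()/recvfrom() and the driver's start_xmit()/interrupt functions, T_driver the driver execution time, T_nic the interface processing time, and T_wire the propagation plus active-element latency. These symbol names are introduced here for readability only and do not necessarily correspond to the Ti labels (T3, T5) referenced later in the conclusions.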

  14. Experimental setup • Two HP rx2600 servers • 2 x Intel Itanium2 1.3 GHz, 3 MB cache • Debian GNU/Linux Sarge 3.1 operating system (kernel 2.6.8-2-mckinley-smp) • Gigabit Ethernet interfaces • Broadcom BCM5701 chipset connected via the PCI-X bus • To eliminate additional delays that may be introduced by external active network devices, the servers were connected with crossover cables • Two NIC drivers were tested: tg3 (a polling NAPI driver) and bcm5700 (an interrupt-driven driver)

  15. Tools used for measurements • The NetPipe package for measuring throughput and latency for TCP and several MPI implementations • For low-level testing, test programs working directly on Ethernet frames were developed • The test programs and NIC drivers were modified to allow measuring, inserting and transferring timestamps
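
For readers unfamiliar with testing "directly on Ethernet frames": on Linux this is typically done with packet sockets. The sketch below shows a minimal raw-frame sender under that assumption; the interface name, destination MAC and EtherType are placeholders, and this is not the actual test program developed for the study.

    /* Minimal raw Ethernet frame sender using a Linux packet socket
     * (illustrative sketch; requires root / CAP_NET_RAW). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    int main(void)
    {
        const char *ifname = "eth0";                /* assumed interface name */
        unsigned char dst[ETH_ALEN] = {0x00,0x11,0x22,0x33,0x44,0x55}; /* example peer MAC */
        unsigned char frame[ETH_FRAME_LEN];
        struct sockaddr_ll addr;

        int fd = socket(AF_PACKET, SOCK_RAW, htons(0x88B5)); /* experimental EtherType */
        if (fd < 0) { perror("socket"); return 1; }

        memset(&addr, 0, sizeof(addr));
        addr.sll_family = AF_PACKET;
        addr.sll_ifindex = if_nametoindex(ifname);
        addr.sll_halen = ETH_ALEN;
        memcpy(addr.sll_addr, dst, ETH_ALEN);

        /* Build a minimal frame: dst MAC, src MAC (left zeroed here), EtherType, payload. */
        memset(frame, 0, sizeof(frame));
        memcpy(frame, dst, ETH_ALEN);
        frame[12] = 0x88; frame[13] = 0xB5;
        strcpy((char *)frame + 14, "latency probe");

        if (sendto(fd, frame, 64, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }

A matching receiver would open the same kind of socket and timestamp frames as they arrive, which is the basis for the per-stage latency measurements discussed later.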

  16. Throughput characteristic for tg3 driver

  17. Latency characteristic for tg3 driver

  18. Results for the tg3 driver • The overhead introduced by the MPI library is relatively low • There is a big difference between transmission latencies in ping-pong and streaming modes • The latency introduced for small frames is similar to the latency of a 115 kbps UART transmitting a single byte (for reference, one byte framed as 10 bits at 115.2 kbps takes roughly 87 µs) • We can deduce that some mechanism in the transmission path delays the transmission of single packets • What is the difference between the NAPI and interrupt-driven drivers?

  19. Interrupt-driven driver vs. NAPI driver (throughput characteristics)

  20. Interrupt-driven driver vs. NAPI driver (latency characteristics)

  21. Interrupt-driven driver vs. NAPI driver (latency characteristics) - details

  22. Comparison of the bcm5700 and tg3 drivers • With the default configuration, the bcm5700 driver has worse characteristics than tg3 • The interrupt-driven version (default configuration) cannot achieve more than 650 Mb/s of throughput for frames of any size • After disabling interrupt coalescing, the performance of the bcm5700 driver exceeded the results obtained with the tg3 driver • Disabling polling can improve the characteristics of the network driver, but NAPI is not the major cause of the transmission delay
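
Interrupt coalescing can be disabled through driver module parameters or, where the driver supports it, through the generic ethtool interface. The following is a hedged sketch of the ethtool route (the interface name is a placeholder, and on the 2.6.8 kernel used here the bcm5700 module parameters may have been the actual mechanism); the command-line equivalent is ethtool -C with zero delay and a one-frame threshold.

    /* Sketch: disable RX/TX interrupt coalescing via the ethtool ioctl
     * (illustrative; assumes the driver implements ETHTOOL_GCOALESCE/SCOALESCE). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/types.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        struct ethtool_coalesce ec;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */
        ifr.ifr_data = (char *)&ec;

        ec.cmd = ETHTOOL_GCOALESCE;                    /* read current settings */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

        ec.cmd = ETHTOOL_SCOALESCE;                    /* interrupt per frame, no delay */
        ec.rx_coalesce_usecs = 0;
        ec.rx_max_coalesced_frames = 1;
        ec.tx_coalesce_usecs = 0;
        ec.tx_max_coalesced_frames = 1;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

        close(fd);
        return 0;
    }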

  23. Tools for message processing time measurement • Timestamps were inserted into the message at each processing stage • Processing stages on the transmitter side: • sendto() function • bcm5700_start_xmit() • interrupt notifying frame transmission • Processing stages on the receiver side: • interrupt notifying frame reception • netif_rx() • recvfrom() function • The CPU clock cycle counter was used as a high-precision timer (resolution of 0.77 ns = 1/1.3 GHz)
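
On Itanium2 the cycle counter in question is the interval time counter (ar.itc). The sketch below shows how such timestamps can be read in user space and converted to nanoseconds, assuming the 1.3 GHz clock of the test machines; the kernel-side instrumentation of bcm5700_start_xmit() and netif_rx() is analogous but is not reproduced here.

    /* Sketch: cycle-counter timestamps on ia64 (Itanium2), converted to ns. */
    #include <stdio.h>

    #define CPU_HZ 1300000000ULL   /* 1.3 GHz Itanium2, per the experimental setup */

    static inline unsigned long long read_cycles(void)
    {
    #if defined(__ia64__)
        unsigned long long t;
        asm volatile ("mov %0=ar.itc" : "=r"(t));   /* interval time counter */
        return t;
    #else
        return 0;                                   /* placeholder on other CPUs */
    #endif
    }

    int main(void)
    {
        unsigned long long t0 = read_cycles();
        /* ... stage being measured, e.g. a sendto() call ... */
        unsigned long long t1 = read_cycles();

        printf("elapsed: %.1f ns\n", (t1 - t0) * 1e9 / (double)CPU_HZ);
        return 0;
    }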

  24. Transmitter latency in streaming mode (timing diagram: Send/Answer exchange, with values of 2 µs and 17 µs marked)

  25. Distribution of delays in transmission path between cluster nodes

  26. Conclusions • We estimate that RDMA-based communication can reduce the MPI message propagation time from 43 µs to 23 µs (roughly doubling performance for short messages) • There is also a possibility of reducing the T3 and T5 latencies by changing the network interface configuration (transmit and receive thresholds) • The conducted research did not consider differences between network interfaces (the T3 and T5 delays may be longer or shorter than measured) • The latency introduced by a switch was also omitted • The FastGig project includes not only a communication library, but also a measurement and communication profiling framework
