
An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments


Presentation Transcript


  1. An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments. G. Narayanaswamy, P. Balaji, and W. Feng. Dept. of Computer Science, Virginia Tech; Mathematics and Computer Science, Argonne National Laboratory

  2. High-end Computing Trends • High-end Computing (HEC) Systems • Continue to increase in scale and capability • Multicore architectures • A significant driving force for this trend • Quad-core processors from Intel/AMD • IBM Cell, Sun Niagara, Intel Terascale processors • High-speed Network Interconnects • 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics • Different stacks use different amounts of hardware support • How do these two components interact with each other?

  3. Multicore Architectures • Multi-processor vs. multicore systems • Not all of the processor hardware is replicated in multicore systems • Hardware units such as caches might be shared between the different cores • Multiple processing units are embedded on the same processor die, so inter-core communication is faster than inter-processor communication • On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier

  4. Interactions of Protocols with Multicores • Depending on how the stack works, different protocols have different interactions with multicore systems • Study based on host-based TCP/IP and iWARP • TCP/IP has significant interaction with multicore systems • Large impacts on application performance • The iWARP stack itself does not interact directly with multicore systems • Software libraries built on top of iWARP DO interact (buffering of data, copies) • Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)

  5. TCP/IP Interaction vs. iWARP Interaction • [Diagram: with host-based TCP/IP, packet processing on packet arrival is performed by the TCP/IP stack on the host, independent of the application process and statically tied to a single core; with iWARP, packet processing is offloaded to the network adapter and the remaining host processing is closely tied to the application process] • TCP/IP is in some ways more asynchronous or "centralized" with respect to host processing compared to iWARP (or other high-performance software stacks)
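The point that host-based TCP/IP processing is statically tied to a single core can be observed directly on a Linux host. The following is a minimal sketch, assuming Linux and that the 10GE interface's interrupt lines in /proc/interrupts are labeled with the device name ("eth2" here is a hypothetical name): it prints the per-core interrupt counts for that NIC, and the core whose count keeps growing is the one doing the packet processing, regardless of where the application runs.

```c
/* Sketch: show per-core interrupt counts for a given NIC.
 * Assumptions: Linux host; the NIC's lines in /proc/interrupts
 * contain the interface name (e.g. "eth2", hypothetical). */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "eth2";  /* hypothetical device name */
    FILE *f = fopen("/proc/interrupts", "r");
    char line[1024];

    if (!f) { perror("fopen"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, dev))        /* each matching line has one count per core */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```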

  6. Presentation Layout • Introduction and Motivation • Treachery of Multicore Architectures • Application Process to Core Mapping Techniques • Conclusions and Future Work

  7. MPI Bandwidth over TCP/IP

  8. MPI Bandwidth over iWARP

  9. TCP/IP Interrupts and Cache Misses

  10. MPI Latency over TCP/IP (Intel Platform)

  11. Presentation Layout • Introduction and Motivation • Treachery of Multicore Architectures • Application Process to Core Mapping Techniques • Conclusions and Future Work

  12. Application Behavior Pre-analysis • A four-core system is effectively a 3.5 core system • A part of a core has to be dedicated to communication • Interrupts, Cache misses • How do we schedule 4 application processes on 3.5 cores? • If the application is exactly synchronized, there is not much we can do • Otherwise, we have an opportunity! • Study with GROMACS and LAMMPS
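One concrete way to act on the "3.5 cores" observation is explicit process-to-core pinning. The sketch below illustrates only the mechanism, not the paper's exact scheduling policy: it assumes a Linux quad-core node where core 0 also services the NIC interrupts, and uses a purely hypothetical mapping table so that each MPI rank pins itself to a chosen core with sched_setaffinity().

```c
/* Sketch of selective process-to-core pinning.
 * Assumptions: Linux, quad-core node, core 0 also handles NIC
 * interrupts; the mapping table is hypothetical. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical mapping: the rank expected to communicate least is
     * placed on core 0, which shares cycles with interrupt handling. */
    int core_of_rank[4] = { 1, 2, 3, 0 };
    int core = core_of_rank[rank % 4];

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* pid 0 = this process */
        perror("sched_setaffinity");
    else
        printf("rank %d pinned to core %d\n", rank, core);

    /* ... application computation and MPI communication ... */

    MPI_Finalize();
    return 0;
}
```

In practice the same effect can be obtained externally, for example by launching each rank under taskset, but doing it in-process makes the chosen mapping explicit.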

  13. GROMACS Overview • Developed at the University of Groningen • Simulates the molecular dynamics of biochemical particles • The root distributes a "topology" file corresponding to the molecular structure • Simulation time is broken down into a number of steps • Processes synchronize at each step • Performance reported as the number of nanoseconds of molecular interactions that can be simulated per day
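A minimal sketch of the communication structure described above (this is not GROMACS source; the buffer size, step count, and the explicit barrier are assumptions used only to make the structure concrete): the root broadcasts the topology to all ranks, and the ranks then synchronize at every simulation step.

```c
/* Sketch: root distributes the topology, ranks synchronize per step.
 * Assumptions: serialized topology size, step count, and the explicit
 * barrier are all placeholders for illustration. */
#include <mpi.h>

#define TOPO_BYTES 4096   /* assumed size of the serialized topology */
#define NSTEPS     1000   /* assumed number of simulation steps */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char topology[TOPO_BYTES];
    /* Rank 0 would read the topology file here (elided), then broadcast it. */
    MPI_Bcast(topology, TOPO_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);

    for (int step = 0; step < NSTEPS; step++) {
        /* ... compute forces and update particle positions ... */
        MPI_Barrier(MPI_COMM_WORLD);   /* processes synchronize each step */
    }

    MPI_Finalize();
    return 0;
}
```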

  14. GROMACS: Random Scheduling [Figure: process placement across Machine 1 cores and Machine 2 cores]

  15. GROMACS: Selective Scheduling [Figure: process placement across Machine 1 cores and Machine 2 cores]

  16. LAMMPS Overview • Molecular dynamics simulator developed at Sandia • Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains • Each subdomain is allotted to a different process • Interaction required only between neighboring subdomains, which improves scalability • Used the Lennard-Jones liquid simulation within LAMMPS • [Diagram: two quad-core machines (Core 0 through Core 3) connected by the network]
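The spatial-decomposition idea maps naturally onto MPI's Cartesian topology support. The following is a minimal sketch of the general technique, not LAMMPS's actual code: ranks are arranged into a 3-D grid and each rank learns only its six face neighbors, which is what keeps communication local and the application scalable.

```c
/* Sketch: 3-D spatial decomposition via an MPI Cartesian topology.
 * Assumptions: periodic boundaries and one subdomain per rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);          /* factor nprocs into a 3-D grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    /* Neighbors along x, y, z: interaction only with adjacent subdomains. */
    int left, right, down, up, back, front;
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Cart_shift(cart, 1, 1, &down, &up);
    MPI_Cart_shift(cart, 2, 1, &back, &front);

    int rank;
    MPI_Comm_rank(cart, &rank);
    printf("rank %d neighbors: x(%d,%d) y(%d,%d) z(%d,%d)\n",
           rank, left, right, down, up, back, front);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```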

  17. LAMMPS: Random Scheduling [Figure: process placement across Machine 1 cores and Machine 2 cores]

  18. LAMMPS: Intended Communication Pattern • [Diagram: in each step, every process posts MPI_Irecv() for its neighbors, issues its MPI_Send()s, completes them with MPI_Wait(), performs its computation, and then begins the next step's MPI_Irecv()/MPI_Send() pair]
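A minimal sketch of the per-step exchange the diagram depicts (the neighbor ranks, message size, and tag are assumptions; this is not the LAMMPS source): receives are pre-posted with MPI_Irecv(), boundary data goes out with MPI_Send(), and each process calls MPI_Wait() before computing on the received data.

```c
/* Sketch: one step of the intended neighbor exchange.
 * Assumptions: two neighbors ("left"/"right"), fixed message size N,
 * tag 0; buffers are supplied by the caller. */
#include <mpi.h>

#define N 1024

void exchange_step(MPI_Comm comm, int left, int right,
                   double sendbuf[2][N], double recvbuf[2][N])
{
    MPI_Request req[2];

    /* Pre-post the receives from both neighbors. */
    MPI_Irecv(recvbuf[0], N, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recvbuf[1], N, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* Send boundary data to both neighbors. */
    MPI_Send(sendbuf[0], N, MPI_DOUBLE, left,  0, comm);
    MPI_Send(sendbuf[1], N, MPI_DOUBLE, right, 0, comm);

    /* Wait for both receives to complete, then compute on the data. */
    MPI_Wait(&req[0], MPI_STATUS_IGNORE);
    MPI_Wait(&req[1], MPI_STATUS_IGNORE);

    /* ... computation using recvbuf ... */
}
```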

  19. LAMMPS: Actual Communication Pattern • [Diagram: MPI_Send() data passes through the MPI buffer, socket send buffer, socket receive buffer, and application receive buffer; the "slower" and faster cores reach MPI_Wait() and their computation at different times, producing "out-of-sync" communication between processes]

  20. LAMMPS: Selective Scheduling [Figure: process placement across Machine 1 cores and Machine 2 cores]

  21. Presentation Layout • Introduction and Motivation • Treachery of Multicore Architectures • Application Process to Core Mapping Techniques • Conclusions and Future Work

  22. Concluding Remarks and Future Work • Multicore architectures and high-speed networks are becoming prominent in high-end computing systems • The interaction of these components is important and interesting! • For TCP/IP, scheduling order drastically impacts performance • For iWARP, scheduling order adds no overhead • Scheduling processes more intelligently significantly improves application performance • Does not affect iWARP and other high-performance stacks, making the approach portable as well as efficient • Future work: dynamic process-to-core scheduling!

  23. Thank You Contacts: Ganesh Narayanaswamy: cnganesh@cs.vt.edu Pavan Balaji: balaji@mcs.anl.gov Wu-chun Feng: feng@cs.vt.edu For More Information: http://synergy.cs.vt.edu http://www.mcs.anl.gov/~balaji

  24. Backup Slides

  25. MPI Latency over TCP/IP (AMD Platform)
