
Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects


Presentation Transcript


  1. Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects
     Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas

  2. Motivation
  • Typically, clusters today use multiple interconnects
    • Interprocess communication (IPC): Myrinet, InfiniBand, etc.
    • I/O: Fibre Channel, SCSI
    • Fast LAN: 10 GigE
  • However, this increases system and management cost
  • Can we use a single interconnect for all types of traffic?
    • Which one?
  • High network speeds: 10-40 Gbit/s

  3. Trends and Constraints
  • Most interconnects use a similar physical layer, but differ in
    • The protocol semantics and guarantees they provide
    • The protocol implementation on the NIC and in the network core
  • Higher-layer protocols (e.g. TCP/IP, NFS) are independent of the interconnect technology
  • 10+ Gbit/s Ethernet is particularly attractive, but ...
    • It is typically associated with higher overheads
    • It requires more support at the edge due to its simpler network core

  4. This Work
  • How well can a protocol do over 10-40 GigE?
  • Scale throughput efficiently over multiple links
  • Analyze protocol overhead at the host CPU
  • Propose and evaluate optimizations for reducing host CPU overhead
    • Implemented without H/W support

  5. Outline
  • Motivation
  • Protocol design over Ethernet
  • Experimental results
  • Conclusions and future work

  6. Standard Protocol Processing
  • Sources of overhead (mapped onto a plain sockets transfer in the sketch below)
    • System call to issue the operation
    • Memory copies at the sender and the receiver
    • Protocol packet processing
    • Interrupt notification for freeing the send-side buffer and for packet arrival
    • Extensive device accesses
    • Context switch from the interrupt to the receive thread for packet processing
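
For illustration only, the same overhead sources show up in an ordinary blocking-sockets transfer. The sketch below gives a user-space view over a socketpair (the buffer size and the use of socketpair() are assumptions made to keep it self-contained, not the paper's protocol); the interrupt, device accesses, and interrupt-to-receive-thread context switch happen inside the kernel and are only noted in comments.

/* Hypothetical user-space view of where the listed overheads are paid. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char out[1500] = "payload", in[1500];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    /* Send side: one system call plus a copy from the user buffer into
     * kernel memory. */
    write(sv[0], out, sizeof(out));

    /* On a real NIC path the kernel would now take an interrupt on packet
     * arrival, access the device, run protocol processing, and
     * context-switch to the thread blocked in read() below. */

    /* Receive side: one system call plus a copy from kernel memory into
     * the user buffer. */
    n = read(sv[1], in, sizeof(in));
    printf("received %zd bytes\n", n);

    close(sv[0]);
    close(sv[1]);
    return 0;
}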

  7. Our Base Protocol
  • Improves on MultiEdge [IPDPS’07]
    • Support for multiple links with different schedulers
    • H/W coalescing for send- and receive-side interrupts
    • S/W coalescing in the interrupt handler (sketched below)
  • Still requires
    • System calls
    • One copy at the send side and one at the receive side
    • A context switch in the receive path
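
A minimal sketch of the S/W coalescing idea: one handler invocation drains a batch of received packets (up to a budget) instead of one packet per interrupt. The ring, packet type, budget value, and all names are stand-ins invented for the sketch; this is not the MultiEdge code.

#include <stdio.h>

#define RING_SIZE 256
#define BUDGET     64                  /* bound the work done per handler run */

struct packet { int len; };

static struct packet ring[RING_SIZE];
static int head, tail;                 /* head: next to consume, tail: next free */

static struct packet *rx_ring_pop(void)
{
    if (head == tail)
        return NULL;                   /* ring empty */
    return &ring[head++ % RING_SIZE];
}

static void process_packet(struct packet *p) { (void)p; /* protocol work */ }

static void rx_interrupt_handler(void)
{
    struct packet *p;
    int handled = 0;

    /* With S/W coalescing the handler batches everything that has already
     * arrived, so later arrivals in the batch cost no extra interrupts. */
    while (handled < BUDGET && (p = rx_ring_pop()) != NULL) {
        process_packet(p);
        handled++;
    }
    printf("handled %d packets in one interrupt\n", handled);
}

int main(void)
{
    for (tail = 0; tail < 100; tail++)  /* pretend 100 packets have arrived */
        ring[tail % RING_SIZE].len = 1500;

    rx_interrupt_handler();             /* drains up to BUDGET of them */
    rx_interrupt_handler();             /* picks up the remainder */
    return 0;
}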

  8. Evaluation Methodology
  • Research questions
    • How does the protocol scale with the number of links?
    • What are the important overheads at 10 Gbit/s?
    • What is the impact of link scheduling?
  • We use two nodes connected back-to-back
    • Dual-CPU (Opteron 244)
    • 1-8 links of 1 Gbit/s (Intel)
    • 1 link of 10 Gbit/s (Myricom)
  • We focus on
    • Throughput: end-to-end, reported by benchmarks
    • Detailed CPU breakdowns: extensive kernel instrumentation (illustrated below)
    • Packet-level statistics: flow control, out-of-order arrivals
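
The CPU breakdowns rely on timing individual protocol phases. A rough, hypothetical illustration of that accounting is below: read a monotonic clock around each phase and accumulate per-phase totals. The phase names, the clock_gettime() clock, and the stand-in workloads are assumptions; the actual instrumentation lives inside the kernel protocol code.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

enum phase { PH_SYSCALL, PH_COPY, PH_PROTOCOL, PH_INTERRUPT, PH_MAX };
static const char *phase_name[PH_MAX] =
    { "syscall", "copy", "protocol", "interrupt" };
static uint64_t phase_ns[PH_MAX];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Account the time spent in 'stmt' to phase 'ph'. */
#define TIMED(ph, stmt) do {             \
        uint64_t t0_ = now_ns();         \
        stmt;                            \
        phase_ns[ph] += now_ns() - t0_;  \
    } while (0)

/* Stand-ins for real protocol work, only to exercise the buckets. */
static void fake_copy(void)     { volatile long s = 0; for (long i = 0; i < 2000000; i++) s += i; }
static void fake_protocol(void) { volatile long s = 0; for (long i = 0; i < 500000; i++) s += i; }

int main(void)
{
    TIMED(PH_COPY, fake_copy());
    TIMED(PH_PROTOCOL, fake_protocol());

    for (int i = 0; i < PH_MAX; i++)
        printf("%-10s %8.3f ms\n", phase_name[i], phase_ns[i] / 1e6);
    return 0;
}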

  9. Throughput Scalability: One Way

  10. What If…
  • We were able to avoid certain overheads
    • Interrupts: use polling instead (sketched below)
    • Data copying: remove the copies from the send and receive paths
  • We examine two more protocol configurations
    • Poll: realistic, but consumes one CPU
    • NoCopy: artificial, as data are not actually delivered
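
A minimal sketch of the Poll configuration's idea: a dedicated thread busy-polls the descriptor ring's completion flags instead of sleeping until an interrupt wakes it, trading one spinning CPU for the interrupt and context-switch cost. The ring layout, the flag encoding, and the thread that stands in for the NIC are assumptions made to keep the example self-contained (compile with -pthread).

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 1024
#define NPACKETS  100000

static atomic_int done[RING_SIZE];     /* set by the "NIC" when a slot is filled */

static void *nic_thread(void *arg)     /* stands in for DMA plus the device */
{
    (void)arg;
    for (int i = 0; i < NPACKETS; i++) {
        int slot = i % RING_SIZE;
        while (atomic_load(&done[slot]))
            ;                          /* simple flow control: wait for a free slot */
        atomic_store(&done[slot], 1);  /* "packet arrived" */
    }
    return NULL;
}

int main(void)
{
    pthread_t nic;
    long received = 0;

    pthread_create(&nic, NULL, nic_thread, NULL);

    /* Poll loop: spin on the next descriptor until it completes. */
    for (int i = 0; received < NPACKETS; i = (i + 1) % RING_SIZE) {
        while (!atomic_load(&done[i]))
            ;                          /* burn cycles instead of sleeping */
        atomic_store(&done[i], 0);     /* consume the slot */
        received++;
    }

    pthread_join(nic, NULL);
    printf("received %ld packets by polling\n", received);
    return 0;
}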

  11. Poll Results

  12. NoCopy Results

  13. Memory Throughput
  • Copy performance is tied to memory throughput
  • Max memory throughput (NUMA, with Linux support)
    • Read: 20 Gbit/s
    • Write: 15 Gbit/s
  • Max copy throughput
    • 8 Gbit/s per CPU accessing local memory (see the microbenchmark sketch below)
  • Overall, multiple links approach the memory throughput
    • Copies will become even more important in the future
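
Numbers like these typically come from a copy microbenchmark. A minimal sketch is below: time repeated memcpy() of a buffer larger than the caches and report Gbit/s. The buffer size and iteration count are arbitrary choices for the sketch, and pinning threads and memory to NUMA nodes (e.g. with numactl) is omitted, although real measurements need it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64UL * 1024 * 1024)  /* 64 MB, well beyond the caches */
#define ITERS     32

int main(void)
{
    char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    struct timespec t0, t1;

    if (!src || !dst)
        return 1;
    memset(src, 1, BUF_BYTES);          /* fault pages in before timing */
    memset(dst, 2, BUF_BYTES);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("copy throughput: %.2f Gbit/s\n",
           (double)BUF_BYTES * ITERS * 8 / 1e9 / sec);

    free(src);
    free(dst);
    return 0;
}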

  14. Packet Scheduling for Multiple Links
  • Evaluated three packet schedulers (compared in the sketch below)
    • Static round robin (SRR)
      • Suitable for identical links
    • Weighted static round robin (WSRR)
      • Assigns packets proportionally to link throughput
      • Does not consider link load
    • Weighted dynamic (WD)
      • Assigns packets proportionally to link throughput
      • Also considers link load
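
The three policies can be compared with the small sketch below. The link speeds, the bytes-queued load metric, and the crude drain model in main() are simplifying assumptions; the real schedulers operate on the protocol's per-link transmit queues.

#include <stdio.h>

#define NLINKS 5

static const double speed_gbps[NLINKS] = { 1, 1, 1, 1, 10 };
static double queued_bytes[NLINKS];    /* outstanding bytes per link (used by WD) */
static int rr_next;                    /* cursor shared by the round-robin policies */

/* SRR: rotate over the links, ignoring speed and load. */
static int srr_pick(void)
{
    return rr_next++ % NLINKS;
}

/* WSRR: rotate, but give each link a share proportional to its speed. */
static int wsrr_pick(void)
{
    static double credit[NLINKS];
    for (;;) {
        int l = rr_next++ % NLINKS;
        credit[l] += speed_gbps[l];
        if (credit[l] >= 10.0) {       /* 10 == speed of the fastest link */
            credit[l] -= 10.0;
            return l;
        }
    }
}

/* WD: also look at load; pick the link whose queue drains soonest. */
static int wd_pick(void)
{
    int best = 0;
    for (int l = 1; l < NLINKS; l++)
        if (queued_bytes[l] / speed_gbps[l] <
            queued_bytes[best] / speed_gbps[best])
            best = l;
    return best;
}

int main(void)
{
    int count[NLINKS] = { 0 };

    for (int i = 0; i < 14000; i++) {  /* schedule 14000 equal-size packets */
        int l = wd_pick();             /* swap in srr_pick() or wsrr_pick() to compare */
        queued_bytes[l] += 1500;
        count[l]++;
        /* crude drain model: each link transmits in proportion to its speed */
        for (int k = 0; k < NLINKS; k++) {
            queued_bytes[k] -= 150.0 * speed_gbps[k];
            if (queued_bytes[k] < 0)
                queued_bytes[k] = 0;
        }
    }

    for (int l = 0; l < NLINKS; l++)
        printf("link %d (%2g Gbit/s): %d packets\n", l, speed_gbps[l], count[l]);
    return 0;
}

Under this toy model, WD should steer most packets to the 10 Gbit/s link while keeping the 1 Gbit/s queues short, whereas SRR spreads packets evenly regardless of speed.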

  15. Multi-link Scheduler Results
  • Setup
    • 4 x 1 Gbit/s links plus 1 x 10 Gbit/s link
    • NoCopy + Poll configuration

  16. Lessons Learned
  • Multiple links introduce overheads
    • The base protocol scales up to 4 x 1 Gbit/s links
    • Removing interrupts allows scaling to 6 x 1 Gbit/s links
    • Beyond 6 Gbit/s, copying becomes dominant
    • Removing copies allows scaling to 8-10 Gbit/s
  • The dynamic weighted scheduler (WD) performs best
    • About 10% better than the simpler static alternative (WSRR)

  17. Future Work
  • Eliminate even the single remaining copy
    • Use page remapping without H/W support
  • More efficient interrupt coalescing
    • Share the interrupt handler among multiple NICs
  • Distribute the protocol over multiple cores
    • Possibly dedicate cores to network processing

  18. Related Work
  • User-level communication systems and protocols
    • Myrinet, InfiniBand, etc.
    • Break the kernel abstraction and require H/W support
    • Not successful with commercial applications and I/O
  • iWARP
    • Requires H/W support
  • Ongoing work and efforts
    • TCP/IP optimizations and offload
    • Complex and expensive
    • Important for WAN setups rather than datacenters

  19. Thank you! Questions?
  Contact: Stavros Passas, stabat@ics.forth.gr
