
Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects


Presentation Transcript


  1. Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects
     Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas

  2. Motivation
  • Typically, clusters today use multiple interconnects
    • Interprocess communication (IPC): Myrinet, InfiniBand, etc.
    • I/O: Fibre Channel, SCSI
    • Fast LAN: 10 GigE
  • However, this increases system and management cost
  • Can we use a single interconnect for all types of traffic?
    • Which one?
  • High network speeds: 10-40 Gbit/s

  3. Trends and Constraints
  • Most interconnects use a similar physical layer, but differ in
    • The protocol semantics and guarantees they provide
    • The protocol implementation on the NIC and in the network core
  • Higher-layer protocols (e.g. TCP/IP, NFS) are independent of the interconnect technology
  • 10+ Gbit/s Ethernet is particularly attractive, but ...
    • It is typically associated with higher overheads
    • It requires more support at the edge due to its simpler network core

  4. This Work
  • How well can a protocol do over 10-40 GigE?
  • Scale throughput efficiently over multiple links
  • Analyze protocol overhead at the host CPU
  • Propose and evaluate optimizations for reducing host CPU overhead
    • Implemented without H/W support

  5. Outline
  • Motivation
  • Protocol design over Ethernet
  • Experimental results
  • Conclusions and future work

  6. Standard Protocol Processing
  • Sources of overhead (mapped onto a plain sockets transfer in the sketch below)
    • System call to issue the operation
    • Memory copies at the sender and the receiver
    • Protocol packet processing
    • Interrupt notification for freeing the send-side buffer and for packet arrival
    • Extensive device accesses
    • Context switch from the interrupt to the receive thread for packet processing
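
For illustration only, the same overhead sources show up in an ordinary blocking-sockets transfer. The sketch below gives a user-space view over a socketpair (the buffer size and the use of socketpair() are assumptions made to keep it self-contained, not the paper's protocol); the interrupt, device accesses, and interrupt-to-receive-thread context switch happen inside the kernel and are only noted in comments.

/* Hypothetical user-space view of where the listed overheads are paid. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char out[1500] = "payload", in[1500];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    /* Send side: one system call plus a copy from the user buffer into
     * kernel memory. */
    write(sv[0], out, sizeof(out));

    /* On a real NIC path the kernel would now take an interrupt on packet
     * arrival, access the device, run protocol processing, and
     * context-switch to the thread blocked in read() below. */

    /* Receive side: one system call plus a copy from kernel memory into
     * the user buffer. */
    n = read(sv[1], in, sizeof(in));
    printf("received %zd bytes\n", n);

    close(sv[0]);
    close(sv[1]);
    return 0;
}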

  7. Our Base Protocol
  • Improves on MultiEdge [IPDPS’07]
    • Support for multiple links with different schedulers
    • H/W coalescing for send- and receive-side interrupts
    • S/W coalescing in the interrupt handler (sketched below)
  • Still requires
    • System calls
    • One copy at the send side and one at the receive side
    • A context switch in the receive path
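
A minimal sketch of the S/W coalescing idea: one handler invocation drains a batch of received packets (up to a budget) instead of one packet per interrupt. The ring, packet type, budget value, and all names are stand-ins invented for the sketch; this is not the MultiEdge code.

#include <stdio.h>

#define RING_SIZE 256
#define BUDGET     64                  /* bound the work done per handler run */

struct packet { int len; };

static struct packet ring[RING_SIZE];
static int head, tail;                 /* head: next to consume, tail: next free */

static struct packet *rx_ring_pop(void)
{
    if (head == tail)
        return NULL;                   /* ring empty */
    return &ring[head++ % RING_SIZE];
}

static void process_packet(struct packet *p) { (void)p; /* protocol work */ }

static void rx_interrupt_handler(void)
{
    struct packet *p;
    int handled = 0;

    /* With S/W coalescing the handler batches everything that has already
     * arrived, so later arrivals in the batch cost no extra interrupts. */
    while (handled < BUDGET && (p = rx_ring_pop()) != NULL) {
        process_packet(p);
        handled++;
    }
    printf("handled %d packets in one interrupt\n", handled);
}

int main(void)
{
    for (tail = 0; tail < 100; tail++)  /* pretend 100 packets have arrived */
        ring[tail % RING_SIZE].len = 1500;

    rx_interrupt_handler();             /* drains up to BUDGET of them */
    rx_interrupt_handler();             /* picks up the remainder */
    return 0;
}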

  8. Evaluation Methodology
  • Research questions
    • How does the protocol scale with the number of links?
    • What are the important overheads at 10 Gbit/s?
    • What is the impact of link scheduling?
  • We use two nodes connected back-to-back
    • Dual-CPU (Opteron 244)
    • 1-8 links of 1 Gbit/s (Intel)
    • 1 link of 10 Gbit/s (Myricom)
  • We focus on
    • Throughput: end-to-end, reported by benchmarks
    • Detailed CPU breakdowns: extensive kernel instrumentation (illustrated below)
    • Packet-level statistics: flow control, out-of-order arrivals
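
The CPU breakdowns rely on timing individual protocol phases. A rough, hypothetical illustration of that accounting is below: read a monotonic clock around each phase and accumulate per-phase totals. The phase names, the clock_gettime() clock, and the stand-in workloads are assumptions; the actual instrumentation lives inside the kernel protocol code.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

enum phase { PH_SYSCALL, PH_COPY, PH_PROTOCOL, PH_INTERRUPT, PH_MAX };
static const char *phase_name[PH_MAX] =
    { "syscall", "copy", "protocol", "interrupt" };
static uint64_t phase_ns[PH_MAX];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Account the time spent in 'stmt' to phase 'ph'. */
#define TIMED(ph, stmt) do {             \
        uint64_t t0_ = now_ns();         \
        stmt;                            \
        phase_ns[ph] += now_ns() - t0_;  \
    } while (0)

/* Stand-ins for real protocol work, only to exercise the buckets. */
static void fake_copy(void)     { volatile long s = 0; for (long i = 0; i < 2000000; i++) s += i; }
static void fake_protocol(void) { volatile long s = 0; for (long i = 0; i < 500000; i++) s += i; }

int main(void)
{
    TIMED(PH_COPY, fake_copy());
    TIMED(PH_PROTOCOL, fake_protocol());

    for (int i = 0; i < PH_MAX; i++)
        printf("%-10s %8.3f ms\n", phase_name[i], phase_ns[i] / 1e6);
    return 0;
}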

  9. Throughput Scalability: One Way

  10. What If…
  • We were able to avoid certain overheads
    • Interrupts: use polling instead (sketched below)
    • Data copying: remove the copies from the send and receive paths
  • We examine two more protocol configurations
    • Poll: realistic, but consumes one CPU
    • NoCopy: artificial, as data are not actually delivered
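
A minimal sketch of the Poll configuration's idea: a dedicated thread busy-polls the descriptor ring's completion flags instead of sleeping until an interrupt wakes it, trading one spinning CPU for the interrupt and context-switch cost. The ring layout, the flag encoding, and the thread that stands in for the NIC are assumptions made to keep the example self-contained (compile with -pthread).

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 1024
#define NPACKETS  100000

static atomic_int done[RING_SIZE];     /* set by the "NIC" when a slot is filled */

static void *nic_thread(void *arg)     /* stands in for DMA plus the device */
{
    (void)arg;
    for (int i = 0; i < NPACKETS; i++) {
        int slot = i % RING_SIZE;
        while (atomic_load(&done[slot]))
            ;                          /* simple flow control: wait for a free slot */
        atomic_store(&done[slot], 1);  /* "packet arrived" */
    }
    return NULL;
}

int main(void)
{
    pthread_t nic;
    long received = 0;

    pthread_create(&nic, NULL, nic_thread, NULL);

    /* Poll loop: spin on the next descriptor until it completes. */
    for (int i = 0; received < NPACKETS; i = (i + 1) % RING_SIZE) {
        while (!atomic_load(&done[i]))
            ;                          /* burn cycles instead of sleeping */
        atomic_store(&done[i], 0);     /* consume the slot */
        received++;
    }

    pthread_join(nic, NULL);
    printf("received %ld packets by polling\n", received);
    return 0;
}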

  11. Poll Results

  12. NoCopy Results

  13. Memory Throughput
  • Copy performance is tied to memory throughput
  • Max memory throughput (NUMA, with Linux support)
    • Read: 20 Gbit/s
    • Write: 15 Gbit/s
  • Max copy throughput
    • 8 Gbit/s per CPU accessing local memory (see the microbenchmark sketch below)
  • Overall, multiple links approach the memory throughput
    • Copies will become even more important in the future
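
Numbers like these typically come from a copy microbenchmark. A minimal sketch is below: time repeated memcpy() of a buffer larger than the caches and report Gbit/s. The buffer size and iteration count are arbitrary choices for the sketch, and pinning threads and memory to NUMA nodes (e.g. with numactl) is omitted, although real measurements need it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64UL * 1024 * 1024)  /* 64 MB, well beyond the caches */
#define ITERS     32

int main(void)
{
    char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    struct timespec t0, t1;

    if (!src || !dst)
        return 1;
    memset(src, 1, BUF_BYTES);          /* fault pages in before timing */
    memset(dst, 2, BUF_BYTES);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("copy throughput: %.2f Gbit/s\n",
           (double)BUF_BYTES * ITERS * 8 / 1e9 / sec);

    free(src);
    free(dst);
    return 0;
}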

  14. Packet Scheduling for Multiple Links
  • Evaluated three packet schedulers (compared in the sketch below)
    • Static round robin (SRR)
      • Suitable for identical links
    • Weighted static round robin (WSRR)
      • Assigns packets proportionally to link throughput
      • Does not consider link load
    • Weighted dynamic (WD)
      • Assigns packets proportionally to link throughput
      • Also considers link load
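
The three policies can be compared with the small sketch below. The link speeds, the bytes-queued load metric, and the crude drain model in main() are simplifying assumptions; the real schedulers operate on the protocol's per-link transmit queues.

#include <stdio.h>

#define NLINKS 5

static const double speed_gbps[NLINKS] = { 1, 1, 1, 1, 10 };
static double queued_bytes[NLINKS];    /* outstanding bytes per link (used by WD) */
static int rr_next;                    /* cursor shared by the round-robin policies */

/* SRR: rotate over the links, ignoring speed and load. */
static int srr_pick(void)
{
    return rr_next++ % NLINKS;
}

/* WSRR: rotate, but give each link a share proportional to its speed. */
static int wsrr_pick(void)
{
    static double credit[NLINKS];
    for (;;) {
        int l = rr_next++ % NLINKS;
        credit[l] += speed_gbps[l];
        if (credit[l] >= 10.0) {       /* 10 == speed of the fastest link */
            credit[l] -= 10.0;
            return l;
        }
    }
}

/* WD: also look at load; pick the link whose queue drains soonest. */
static int wd_pick(void)
{
    int best = 0;
    for (int l = 1; l < NLINKS; l++)
        if (queued_bytes[l] / speed_gbps[l] <
            queued_bytes[best] / speed_gbps[best])
            best = l;
    return best;
}

int main(void)
{
    int count[NLINKS] = { 0 };

    for (int i = 0; i < 14000; i++) {  /* schedule 14000 equal-size packets */
        int l = wd_pick();             /* swap in srr_pick() or wsrr_pick() to compare */
        queued_bytes[l] += 1500;
        count[l]++;
        /* crude drain model: each link transmits in proportion to its speed */
        for (int k = 0; k < NLINKS; k++) {
            queued_bytes[k] -= 150.0 * speed_gbps[k];
            if (queued_bytes[k] < 0)
                queued_bytes[k] = 0;
        }
    }

    for (int l = 0; l < NLINKS; l++)
        printf("link %d (%2g Gbit/s): %d packets\n", l, speed_gbps[l], count[l]);
    return 0;
}

Under this toy model, WD should steer most packets to the 10 Gbit/s link while keeping the 1 Gbit/s queues short, whereas SRR spreads packets evenly regardless of speed.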

  15. Multi-link Scheduler Results
  • Setup
    • 4 x 1 Gbit/s links plus 1 x 10 Gbit/s link
    • NoCopy + Poll configuration

  16. Lessons Learned
  • Multiple links introduce overheads
    • The base protocol scales up to 4 x 1 Gbit/s links
    • Removing interrupts allows scaling to 6 x 1 Gbit/s links
    • Beyond 6 Gbit/s, copying becomes dominant
    • Removing copies allows scaling to 8-10 Gbit/s
  • The dynamic weighted scheduler (WD) performs best
    • About 10% better than the simpler static alternative (WSRR)

  17. Future Work
  • Eliminate even the single remaining copy
    • Use page remapping without H/W support
  • More efficient interrupt coalescing
    • Share the interrupt handler among multiple NICs
  • Distribute the protocol over multiple cores
    • Possibly dedicate cores to network processing

  18. Related Work
  • User-level communication systems and protocols
    • Myrinet, InfiniBand, etc.
    • Break the kernel abstraction and require H/W support
    • Not successful with commercial applications and I/O
  • iWARP
    • Requires H/W support
  • Ongoing work and efforts
    • TCP/IP optimizations and offload
    • Complex and expensive
    • Important for WAN setups rather than datacenters

  19. Thank you! Questions?
  Contact: Stavros Passas, stabat@ics.forth.gr
