
Scalable Networking for Next-Generation Computing Platforms



  1. Scalable Networking for Next-Generation Computing Platforms
  Yoshio Turner*, Tim Brecht*‡, Greg Regnier§, Vikram Saletore§, John Janakiraman*, Brian Lynn*
  *Hewlett Packard Laboratories  §Intel Corporation  ‡University of Waterloo

  2. Outline
  • Motivation: enable applications to scale to next-generation network and I/O performance on standard computing platforms
  • Proposed technology strategy:
    • Embedded Transport Acceleration (ETA)
    • Asynchronous I/O (AIO) programming model
    • Web server application as the evaluation vehicle
  • Evaluation plan
  • Conclusions
  SAN-3 workshop – HPCA-10

  3. Motivation: Next-Generation Platform Requirements
  • Low-overhead packet and protocol processing for next-generation commodity interconnects (e.g., 10 GigE)
    • Current systems: performance impeded by interrupts, context switches, and data copies
    • Existing proposals:
      • TCP Offload Engines (TOE): special hardware; cost and time-to-market issues
      • RDMA: new protocol; requires support at both endpoints
  • Increased I/O concurrency for high link utilization
    • I/O bandwidth is increasing
    • I/O latency is fixed or slowly decreasing toward its limit
    → Need a larger number of in-flight operations to fill the pipe

  4. Proposed Technology Strategy
  • Embedded Transport Acceleration (ETA) architecture
    • Intel Labs project: the prototype architecture dedicates one or more processors, "Packet Processing Engines" (PPEs), to perform all network packet processing
    • Low-overhead processing: the PPE interacts with network interfaces and applications directly via cache-coherent shared memory (bypassing the OS kernel)
    • Application interface: VIA-style user-level communication
  • Asynchronous I/O (AIO) programming model
    • Splits file/socket operations into two phases:
      • Post an I/O operation request (non-blocking call)
      • Asynchronously receive completion event information
    • High I/O concurrency even for a single-threaded application
    • Initial focus: ETA socket AIO (future extensions to file AIO)
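The split two-phase model above can be illustrated with a small toy sketch: posting an operation returns immediately, and the result arrives later on a completion event queue. All names here (AioChannel, CompletionEvent, post, get_event) are illustrative stand-ins, not the actual ETA/DUSI API.

```python
# Toy model of split-phase AIO: non-blocking post, asynchronous completion.
import queue
import threading

class CompletionEvent:
    def __init__(self, op, result):
        self.op = op
        self.result = result

class AioChannel:
    """Accepts posted operations and delivers completions on an event queue."""
    def __init__(self):
        self.event_queue = queue.Queue()

    def post(self, op, work):
        # Non-blocking: hand the operation to a worker and return at once.
        def run():
            self.event_queue.put(CompletionEvent(op, work()))
        threading.Thread(target=run).start()

    def get_event(self):
        # Retrieve the next completion event (blocks until one is ready).
        return self.event_queue.get()

ch = AioChannel()
ch.post("recv", lambda: b"hello")   # returns immediately
ch.post("send", lambda: 5)          # many operations can be in flight at once
events = {ch.get_event().op for _ in range(2)}
print(events)  # both completions arrive, in either order
```

Because posting never blocks, a single thread can keep many operations in flight, which is the concurrency argument the slides make for filling a high-bandwidth, fixed-latency pipe.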

  5. Key Advantages
  • Potentially enables Ethernet and TCP to approach the latency and throughput of System Area Networks
  • Uses standard system processor/memory resources:
    • Automatically tracks semiconductor cost-performance trends
    • Leverages microarchitecture trends: multiple cores, hardware multithreading
    • Leverages standard software development environments → rapid development
  • Extensibility: a fully programmable PPE supports evolving data center functionality
    • Unified IP-based fabric for all I/O
    • RDMA
  • AIO increases network-centric application scalability

  6. Overview of the ETA Architecture
  • Partitioned server architecture:
    • Host: application execution
    • Packet Processing Engine (PPE)
  • Host-PPE Direct Transport Interface (DTI)
    • VIA/InfiniBand-like queuing structures in cache-coherent shared host memory (OS bypass)
    • Optimized for sockets/TCP
  • Direct User Socket Interface (DUSI)
    • Thin software layer to support user-level applications

  7. ETA Overview: Partitioned Architecture
  [Diagram: user applications and kernel applications/file system on the host CPU(s) reach the ETA host interface through legacy sockets, direct access, or iSCSI; shared memory connects the host to the PPE, which runs TCP/IP and the driver for the network fabric (LAN, storage, IPC)]

  8. ETA Overview: Direct Transport Interface (DTI) Queuing Structure
  [Diagram: shared host memory between the host and the Packet Processing Engine, containing data buffers, DTI doorbells, the DTI event queue, an anonymous buffer pool, and the DTI Tx and Rx queues]
  • Asynchronous socket operations: connect, accept, listen, etc.
  • TCP buffering semantics: an anonymous buffer pool supports non-pre-posted or out-of-order (OOO) receive packets
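The anonymous-buffer-pool semantics can be sketched as a toy model: data arriving before any receive descriptor is posted lands in the anonymous pool, and a descriptor posted later drains it. The class and method names are illustrative, not the actual ETA structures.

```python
# Toy model of DTI buffering: pre-posted receives vs. the anonymous pool.
from collections import deque

class DTI:
    def __init__(self):
        self.rx_queue = deque()     # posted receive descriptors (app buffers)
        self.anon_pool = deque()    # holds data that arrived with no descriptor
        self.event_queue = deque()  # completion events to the application

    def packet_arrived(self, data):
        if self.rx_queue:
            buf = self.rx_queue.popleft()
            buf[:] = data                      # zero-copy-style fill of app buffer
            self.event_queue.append(("rx_complete", bytes(buf)))
        else:
            # TCP semantics: keep the data, deliver when a descriptor appears.
            self.anon_pool.append(data)

    def post_recv(self, buf):
        if self.anon_pool:
            buf[:] = self.anon_pool.popleft()  # drain early-arrived data
            self.event_queue.append(("rx_complete", bytes(buf)))
        else:
            self.rx_queue.append(buf)

dti = DTI()
dti.packet_arrived(b"early")   # arrives before any descriptor is posted
dti.post_recv(bytearray(5))    # descriptor posted later drains the pool
print(dti.event_queue[0])      # ('rx_complete', b'early')
```

The real PPE moves data rather than Python byte strings, but the ordering rule is the same: receives complete whether or not the application managed to pre-post a buffer.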

  9. API for Asynchronous I/O (AIO)
  • Layer a socket AIO API above the ETA architecture
  • Investigate the impact of AIO API features on application structure and performance
  • Initial focus: ETA Direct User Socket Interface (DUSI) API
    • Provides asynchronous socket operations: connect, listen, accept, send, receive
  • AIO examples:
    • File/socket: Windows AIO with completion ports, POSIX AIO
    • File I/O: Linux AIO (recently introduced)
    • Socket I/O with OS bypass: ETA DUSI, Open Group Sockets API Extensions

  10. ETA Direct User Socket Interface (DUSI) AIO API
  • Queuing structure setup for sockets:
    • One Direct Transport Interface (DTI) per socket
    • Event queues: created separately from DTIs
  • Memory registration:
    • Pin user-space memory regions and provide address translation information to ETA for zero-copy transfers
    • Provide access keys (protection tags)
  • The application posts socket I/O operation requests to the DTI Tx and Rx work queues
  • The PPE delivers operation completion events to the DTI event queues
  • Both operation posting and event delivery are lightweight (no OS involvement)
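The registration step can be sketched as a toy: the application registers a buffer and receives a protection tag, and the engine validates the tag before touching the memory. The names (RegistrationTable, engine_write) and the tag format are hypothetical; real registration also pins pages and records address translations, which this sketch omits.

```python
# Toy sketch of memory registration with protection tags for zero-copy writes.
import secrets

class RegistrationTable:
    def __init__(self):
        self._regions = {}

    def register(self, buf):
        key = secrets.token_hex(4)   # protection tag handed back to the app
        self._regions[key] = buf     # engine's view of the registered region
        return key

    def engine_write(self, key, offset, data):
        # The engine refuses transfers that lack a valid protection tag.
        if key not in self._regions:
            raise PermissionError("invalid protection tag")
        self._regions[key][offset:offset + len(data)] = data

table = RegistrationTable()
app_buf = bytearray(8)
tag = table.register(app_buf)
table.engine_write(tag, 0, b"zerocopy")
print(app_buf)  # the engine wrote directly into the registered buffer
```

The point of the tag is that the PPE can write into application memory without a kernel transition, yet a stray or malicious descriptor cannot name arbitrary memory.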

  11. AIO Event Queue Binding
  • AIO API design issue: assignment of events to event queues
  • Flexible binding lets applications separate or group events to facilitate operation scheduling
  • DUSI: each DTI work queue can be bound at socket creation to any event queue
    • Allows separating or grouping events from different sockets
    • Allows separating events by type (transmit, receive)
  • Alternatives for event queue binding:
    • Windows: per-socket
    • Linux and POSIX AIO: per-operation
    • Open Group Sockets API Extensions: per-operation-type
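DUSI-style flexible binding can be illustrated with a toy model in which each socket's Tx and Rx queues are bound, at creation time, to whichever event queues the application chooses. Here both sockets group their events by type (all receives on one queue, all sends on another); the class names are illustrative.

```python
# Toy model of per-type event queue binding chosen at socket creation.
from collections import deque

class EventQueue:
    def __init__(self, name):
        self.name, self.events = name, deque()

class BoundSocket:
    def __init__(self, sid, tx_eq, rx_eq):
        # Binding is fixed at creation: Tx and Rx may use different queues.
        self.sid, self.tx_eq, self.rx_eq = sid, tx_eq, rx_eq

    def complete(self, kind):
        # The engine delivers the completion to the bound queue.
        eq = self.tx_eq if kind == "tx" else self.rx_eq
        eq.events.append((self.sid, kind))

# Group events by type across sockets: one queue for receives, one for sends.
rx_events, tx_events = EventQueue("rx"), EventQueue("tx")
s1 = BoundSocket(1, tx_events, rx_events)
s2 = BoundSocket(2, tx_events, rx_events)
s1.complete("rx")
s2.complete("tx")
print(rx_events.events, tx_events.events)
```

Per-socket binding (as in Windows) would instead pass each socket its own queue for both kinds; the toy makes it clear that DUSI's choice subsumes both groupings.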

  12. Retrieving AIO Completion Events
  • AIO API design issue: the application interface for retrieving events
  • DUSI: lightweight mechanism bypassing the OS
    • Event queues in shared memory
    • Callbacks: similar to Windows
    • Event tags
  • Application monitoring of multiple event queues:
    • Poll for events (acceptable for a small number of queues)
    • No events → block in the OS on multiple queues
      • Uncommon case in a busy server → acceptable in this case to use the OS signaling mechanism
      • Useful for simultaneous use of different AIO APIs
  • Race conditions: user-level responsibility
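The poll-then-block pattern described above can be sketched as follows: spin over the event queues a bounded number of times without entering the kernel, and only fall back to a (costlier) blocking wait when the server is idle. The queue primitives here are stand-ins for shared-memory event queues and the OS signaling mechanism.

```python
# Sketch of poll-then-block event retrieval across multiple event queues.
import queue

def get_next_event(event_queues, spin_rounds=100, block_timeout=1.0):
    # Fast path: poll each queue without any OS involvement.
    for _ in range(spin_rounds):
        for q in event_queues:
            try:
                return q.get_nowait()
            except queue.Empty:
                continue
    # Slow path (uncommon on a busy server): block, dividing the
    # timeout budget across the queues.
    for q in event_queues:
        try:
            return q.get(timeout=block_timeout / len(event_queues))
        except queue.Empty:
            continue
    return None

q1, q2 = queue.Queue(), queue.Queue()
q2.put(("recv_complete", 42))
ev = get_next_event([q1, q2])
print(ev)  # found on the polling fast path: ('recv_complete', 42)
```

A real OS-bypass implementation would block on a single kernel object armed by all queues rather than timing out on each in turn, but the fast-path/slow-path split is the same.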

  13. AIO for Files and Sockets
  • File AIO support
    • OS (e.g., Linux AIO, POSIX AIO)
    • Future: ETA support for file I/O (e.g., via iSCSI or DAFS)
  • Unified application processing of file/socket events
    • The ETA PPE and the OS kernel may both supply event queues
    • Blocking on event queues of different types is facilitated by the OS signal mechanism (as in DUSI)
    • Unified event queues may be desirable: they require efficient coordination of ETA and OS access to the event queues
  • Support for zero-copy sendfile(): integration of ETA with OS management of the shared file buffer in system memory

  14. Initial Demonstration Vehicle: Web Server Application
  • Plan: demonstrate the value of ETA/AIO for network-centric applications
  • Initial target: a web server application
    • A single request may require multiple I/Os
    • Stresses system resources (especially OS resources)
    • Must multiplex thousands to tens of thousands of concurrent connections
  • Web server architecture alternatives:
    • SPED (single-process event-driven)
    • MP (multi-process) or MT (multi-threaded)
    • Hybrid approach: AMPED (asymmetric multi-process event-driven)
  → The AIO model favors SPED for raw performance

  15. The userver
  • Open-source micro web server
    • Extensive tracing and statistics facilities
    • SPED model: run one process per host CPU
    • Existing support for Unix non-blocking socket I/O and event notification via Linux epoll()
  • Modified to support socket AIO (eventually file AIO)
    • Generic AIO interface: can be mapped to a variety of underlying AIO APIs (DUSI, Linux AIO, etc.)
  • Comparison: web server performance with and without the ETA engine
    • With standard Linux: processes share the file buffer cache, using sendfile() for zero-copy file transfer
    • With ETA: mmap() files into a shared address space
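The "generic AIO interface" idea can be sketched as a thin abstraction: the server is written once against a small interface, and adapters map it onto different back ends (DUSI, Linux AIO, and so on). The interface and back-end names below are invented for illustration; the userver's actual interface is in C.

```python
# Sketch of a generic AIO interface with pluggable back ends.
class GenericAio:
    """The one interface the server codes against."""
    def post_recv(self, sock, nbytes):
        raise NotImplementedError

    def next_event(self):
        raise NotImplementedError

class InMemoryBackend(GenericAio):
    """Stand-in back end so the sketch is self-contained; a real adapter
    would translate these calls into DUSI or Linux AIO operations."""
    def __init__(self):
        self._pending = []

    def post_recv(self, sock, nbytes):
        self._pending.append(("recv", sock, b"x" * nbytes))

    def next_event(self):
        return self._pending.pop(0) if self._pending else None

def serve_one(aio):
    # Server logic is written once, independent of the back end.
    aio.post_recv(sock=7, nbytes=4)
    return aio.next_event()

result = serve_one(InMemoryBackend())
print(result)  # ('recv', 7, b'xxxx')
```

Swapping back ends then changes only which adapter is constructed, which is what makes an apples-to-apples comparison of DUSI against the standard Linux stack feasible.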

  16. Web Server Event Scheduling
  • Balance accepting new connections against processing of existing connections
  • Scheduling:
    • Separate queues for accept(), read(), and write()/close() completion events
    • Process based on current queue lengths
  [Chart: early results with non-blocking I/O showing the throughput impact of the frequency of accepting new connections]
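A queue-length-driven policy of the kind described can be sketched as follows. The weighting scheme (a fixed bias toward the accept queue) is an assumption for illustration, not the userver's actual policy.

```python
# Sketch of scheduling by completion-queue length, with an accept bias.
from collections import deque

def pick_queue(accept_q, read_q, write_q, accept_bias=2):
    # Weight the accept queue so new connections are not starved, but let
    # long read/write backlogs win so existing connections keep moving.
    weights = {
        "accept": len(accept_q) * accept_bias,
        "read": len(read_q),
        "write": len(write_q),
    }
    return max(weights, key=weights.get)

accept_q = deque([1])
read_q = deque([2, 3, 4, 5])
write_q = deque()
print(pick_queue(accept_q, read_q, write_q))  # read backlog dominates: 'read'
```

The chart referenced above measures exactly this trade-off: accept too eagerly and existing connections stall; accept too rarely and throughput drops because the server runs out of work.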

  17. Evaluation Plans
  • Goal: evaluate the approach and compare it to design alternatives
  • Construct a functional prototype of the proposed stack (Linux)
    • Extend the existing ETA prototype's kernel-level interface to user level with OS bypass (DUSI)
    • Extend the userver to use socket AIO, with a mapping layer to DUSI
    • Evaluate on a 10 GigE-based client/server setup using a SPECweb-type workload
  • Current ETA prototype: promising kernel-level micro-benchmark performance
  • Expectation: ETA + AIO will show significantly higher scalability than the existing Linux network implementation

  18. Proposed Stack/Comparison
  [Diagram: at user level, the userver runs either over sockets (Linux sockets library) or over AIO with a mapping layer onto the ETA Direct User Sockets Interface (DUSI); in the kernel, the Linux stack (TCP, UDP, raw IP) and the ETA kernel agent handle the control path, while the DTI data path goes directly to the ETA Packet Processing Engine and packet driver over the network interfaces]

  19. Kernel-Level ETA Prototype

  20. Kernel-Level ETA Prototype

  21. Evaluation Plans: Analyses and Comparisons
  • Compare the proposed stack to a well-tuned conventional system: checksum offload, TCP segmentation offload, interrupt moderation (NAPI)
  • Examine micro-architectural impacts: use VTune/OProfile to measure CPU, memory, and cache usage, interrupts, data copies, and context switches
  • Compare to TOE
  • Extend the analysis to application domains beyond the web server: e.g., storage, transaction processing
  • Port a highly scalable user-level threading package (the UC Berkeley Capriccio project) to ETA
    • Benefit: a familiar threaded programming model with efficient AIO and OS bypass "under the hood"

  22. Summary
  • Proposed a technology strategy combining ETA and AIO to enable industry-standard platforms to scale to next-generation network performance
    • Cost-performance, time-to-market, and flexibility advantages over alternative approaches
    • Ethernet/TCP can approach the performance of today's SANs, moving toward a unified data center I/O fabric based on commodity hardware
  • Status
    • Promising initial experimental results for kernel-level ETA
    • Prototype implementation of the proposed stack nearly complete
    • Testing environment setup based on 10 GigE
