Faster!

U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Vidhyashankar Venkataraman

CS614 Presentation

Background – Fast Computing
  • Emergence of MPP – Massively Parallel Processors in the early 90’s
    • Repackage hardware components to form a dense configuration of very large parallel computing systems
    • But require custom software
  • Alternative: NOW (Berkeley) – Network Of Workstations
    • Built from inexpensive workstations joined by low-latency, high-bandwidth, scalable interconnection networks
    • Interconnected through fast switches
    • Challenge: To build a scalable system that is able to use the aggregate resources in the network to execute parallel programs efficiently
  • Problem with traditional networking architectures
    • Software path through kernel involves several copies - processing overhead
    • In faster networks, may not get application speed-up commensurate with network performance
  • Observations:
    • For small messages, processing overhead dominates network latency
    • Most applications use small messages
      • E.g., UCB NFS trace: 50% of the bits sent were in messages of 200 bytes or less
Issues (contd.)
  • Flexibility concerns:
    • Protocol processing in the kernel limits flexibility
    • Greater flexibility if application-specific information is integrated into protocol processing
    • The protocol can then be tuned to the application’s needs
    • E.g., customized retransmission of video frames
U-Net Philosophy
  • Achieve flexibility and performance by
    • Removing kernel from the critical path
    • Placing entire protocol stack at user level
    • Allowing protected user-level access to network
    • Supplying full bandwidth to small messages
    • Supporting both novel and legacy protocols
Do MPPs do this?
  • Parallel machines like Meiko CS-2, Thinking Machines CM-5
    • Have tried to solve the problem of providing user-level access to network
    • Use of custom network and network interface – No flexibility
  • U-Net targets applications on standard workstations
    • Using off-the-shelf components
Basic U-Net Architecture
  • Virtualize the network device so that each process has the illusion of owning the network interface (NI)
  • A mux/demux device virtualizes the NI and offers protection
  • Kernel removed from the critical path – involved only in setup
The U-Net Architecture
  • Building blocks
    • Application endpoints
    • Communication segments (CS) – regions of memory that hold message data
    • Message queues (send, receive, and free queues)
  • Sending
    • Assemble the message in the CS
    • Enqueue a message descriptor on the send queue
  • Receiving (poll-driven or event-driven)
    • Dequeue the message descriptor
    • Consume the message
    • Enqueue the buffer on the free queue
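To make these building blocks concrete, here is a minimal C sketch of an endpoint with its communication segment and send/receive/free queues. All names, sizes, and the descriptor layout are assumptions made for this sketch; they are not the actual U-Net interface.

```c
#include <stdint.h>
#include <string.h>

#define CS_SIZE      (64 * 1024)   /* communication segment size (assumed) */
#define QUEUE_DEPTH  64            /* descriptors per queue (assumed) */

/* A descriptor names a buffer inside the communication segment. */
struct unet_desc {
    uint32_t offset;   /* buffer offset within the communication segment */
    uint32_t length;   /* message length (or buffer size) in bytes */
    uint32_t tag;      /* destination channel tag (e.g. an ATM VCI) */
};

struct unet_queue {
    struct unet_desc entries[QUEUE_DEPTH];
    volatile uint32_t head, tail;
};

/* An application endpoint: a pinned communication segment plus three queues. */
struct unet_endpoint {
    uint8_t cs[CS_SIZE];        /* communication segment: holds message data */
    struct unet_queue send_q;   /* descriptors of messages to transmit */
    struct unet_queue recv_q;   /* descriptors of messages that have arrived */
    struct unet_queue free_q;   /* empty buffers handed back for reception */
};

/* Send path: assemble the message in the CS, then enqueue a descriptor. */
static int unet_send(struct unet_endpoint *ep, uint32_t tag,
                     const void *msg, uint32_t len, uint32_t buf_off)
{
    struct unet_queue *q = &ep->send_q;
    uint32_t next = (q->tail + 1) % QUEUE_DEPTH;

    if (next == q->head || buf_off + len > CS_SIZE)
        return -1;                          /* queue full or buffer out of range */
    memcpy(&ep->cs[buf_off], msg, len);     /* copy the payload into the CS */
    q->entries[q->tail] = (struct unet_desc){ buf_off, len, tag };
    q->tail = next;                         /* the NI polls the queue and transmits */
    return 0;
}
```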

U-Net Architecture (contd.)
  • Event handling (upcalls)
    • Can be a UNIX signal handler or a user-level interrupt handler
    • Cost of upcalls amortized by batching receptions
  • Mux/demux
    • Each endpoint is uniquely identified by a tag (e.g., a VCI in ATM)
    • The OS performs initial route setup and security checks, then registers the tag with U-Net for that application
    • The message tag is mapped to a communication channel
  • Have to preallocate buffers – memory overhead!
  • Protected user-level access to the NI: ensured by demarcating protection boundaries
    • Defined by endpoints and communication channels
    • Applications cannot interfere with each other because
      • Endpoints, CS and message queues user-owned
      • Outgoing messages tagged with originating endpoint address
      • Incoming messages demuxed by U-Net and sent to correct endpoint
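Continuing the sketch above, the tag-based demultiplexing described here might look roughly as follows; the tag-table layout and the drop behavior are assumptions made for illustration, not the actual firmware logic.

```c
#define MAX_TAGS 1024

static struct unet_endpoint *tag_table[MAX_TAGS];  /* filled in by the OS at channel setup */

/* Receive path: the message tag selects the owning endpoint, and the payload
 * is deposited into one of that endpoint's free buffers. */
static void unet_demux(uint32_t tag, const void *frame, uint32_t len)
{
    struct unet_endpoint *ep = (tag < MAX_TAGS) ? tag_table[tag] : NULL;
    if (ep == NULL)
        return;                                   /* no channel registered: drop */

    struct unet_queue *fq = &ep->free_q, *rq = &ep->recv_q;
    uint32_t next = (rq->tail + 1) % QUEUE_DEPTH;
    if (next == rq->head || fq->head == fq->tail)
        return;                                   /* receive queue full or no free buffer: drop */

    struct unet_desc buf = fq->entries[fq->head]; /* claim a free buffer */
    if (len > buf.length || buf.offset + len > CS_SIZE)
        return;                                   /* frame does not fit the buffer: drop */
    fq->head = (fq->head + 1) % QUEUE_DEPTH;

    memcpy(&ep->cs[buf.offset], frame, len);      /* deposit payload in the endpoint's CS */
    rq->entries[rq->tail] = (struct unet_desc){ buf.offset, len, tag };
    rq->tail = next;                              /* application polls recv_q or gets an upcall */
}
```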
Zero-copy and True zero-copy
  • Two levels of sophistication, depending on whether a copy is made in the CS
    • Base-Level Architecture
      • Zero-copy: data is copied into an intermediate buffer in the CS
      • CSes are allocated, aligned, and pinned to physical memory
      • Optimization for small messages
    • Direct-access Architecture
      • True zero copy : Data sent directly out of data structure
      • Also specify offset where data has to be deposited
      • CS spans the entire process address space
  • Limitations in I/O addressing force a fallback to the zero-copy (base-level) architecture
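The two architectures imply two descriptor shapes; the hypothetical structs below simply restate the distinction (field names are invented for illustration).

```c
#include <stdint.h>

/* Base-level ("zero-copy"): the payload has already been copied into the
 * pinned communication segment, so the descriptor only names a CS offset. */
struct base_level_desc {
    uint32_t cs_offset;     /* buffer inside the communication segment */
    uint32_t length;
    uint32_t tag;
};

/* Direct-access ("true zero-copy"): the CS conceptually spans the whole
 * address space, so data is sent straight out of the application's data
 * structure, and an offset says where it should be deposited at the receiver. */
struct direct_access_desc {
    void    *src_addr;      /* application data structure, no intermediate copy */
    uint32_t length;
    uint32_t tag;
    uint32_t dest_offset;   /* offset at which the receiver wants the data */
};
```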
Kernel-emulated Endpoint
  • Communication segments and message queues are scarce resources
  • Provide a single kernel-emulated endpoint
  • Cost: performance overhead
U-Net Implementation
  • U-Net architectures implemented in two systems
    • Using Fore Systems SBA 100 and 200 ATM network interfaces
    • But why ATM?
    • Setup: SPARCstations 10 and 20 running SunOS 4.1.3, connected through an ASX-200 ATM switch with 140 Mbps fiber links
  • SBA-200 firmware
    • 25 MHz On-board i960 processor, 256 KB RAM, DMA capabilities
    • Complete redesign of firmware
  • Device Driver
    • Protection offered through VM system (CS’es)
    • Also through <VCI, communication channel> mappings
U-Net Performance
  • RTT and bandwidth measurements
  • Small messages: 65 μs RTT (with an optimization for single-cell messages)
  • The fiber is saturated at message sizes of about 800 bytes
U-Net Active Messages Layer
  • An RPC-like communication primitive that can be implemented efficiently on a wide range of hardware
  • A basic communication primitive in NOW
  • Allow overlapping of communication with computation
  • Message contains data & ptr to handler
    • Reliable Message delivery
    • Handler moves data into data structures for some (ongoing) operation
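A minimal sketch of the active-message idea in C follows; the types and function names are illustrative only and do not reflect the UAM interface.

```c
#include <stdint.h>
#include <string.h>

typedef void (*am_handler_t)(void *app_state, const void *data, uint32_t len);

struct am_message {
    am_handler_t handler;   /* handler pointer carried in the message (meaningful here
                               because sender and receiver are assumed to run the same
                               program image) */
    uint32_t     len;
    uint8_t      data[32];  /* small payload, e.g. 0-32 B as in the micro-benchmarks */
};

/* Sender side: build the message and hand it to the transport (e.g. U-Net).
 * Reliable delivery is the responsibility of the AM layer. */
static void am_send(struct am_message *out, am_handler_t h,
                    const void *data, uint32_t len)
{
    out->handler = h;
    out->len = len > sizeof out->data ? (uint32_t)sizeof out->data : len;
    memcpy(out->data, data, out->len);
    /* ... enqueue the message on the transport's send queue ... */
}

/* Receiver side: run the handler, which moves the data into the data
 * structures of the ongoing computation, overlapping communication with it. */
static void am_deliver(const struct am_message *in, void *app_state)
{
    in->handler(app_state, in->data, in->len);
}
```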
AM – Micro-benchmarks
  • Single-cell RTT
    • RTT ~ 71 μs for a 0-32 B message
    • Overhead of 6 μs over raw U-Net – Why?
  • Block store BW
    • 80% of the maximum limit with blocks of 2KB size
    • Almost saturated at 4KB
    • Good performance!
Split-C application benchmarks
  • Parallel Extension to C
  • Implemented on top of UAM
  • Tested on 8 processors
  • ATM cluster performs close to CS-2
TCP/IP and UDP/IP over U-Net
  • Good performance necessary to show flexibility
  • Traditional IP-over-ATM shows very poor performance
    • E.g., TCP achieves only 55% of the maximum bandwidth
  • TCP and UDP over U-Net show improved performance
    • Primarily because of tighter application-network coupling
  • IP-over-U-Net:
    • IP-over-ATM does not correspond exactly to IP-over-U-Net
    • Demultiplexing packets that arrive on the same VCI is not possible
Performance Graphs
  • UDP performance: saw-tooth behavior for Fore UDP
  • TCP performance

  • U-Net provides a virtual view of the network interface, enabling user-level access to high-speed communication devices
  • The two main goals were performance and flexibility, pursued by keeping the kernel off the critical path
  • Achieved? See the results table in the original slides (not reproduced in this transcript)

  • Small kernel OSes have most services implemented as separate user-level processes
  • Have separate, communicating user processes
    • Improve modular structure
    • More protection
    • Ease of system design and maintenance
  • Cross-domain and cross-machine communication are treated alike – problems?
    • Fails to isolate the common case
    • This has both performance and simplicity costs
  • Measurements show cross-domain predominance
    • V System – 97%
    • Taos Firefly – 94%
    • Sun UNIX+NFS Diskless – 99.4%
    • But how about RPCs these days?
  • Taos takes 464 μs for a Null() cross-domain call, against an inherent minimum of 109 μs – roughly 3.5x overhead
  • Most interactions are simple with small numbers of arguments
    • This could be used to make optimizations
Overheads in Cross-domain Calls
  • Stub Overhead – Additional execution path
  • Message buffer overhead – Cross-domain calls can involve four copy operations for any RPC
  • Context switch – VM context switch from client’s domain to the server’s and vice versa on return
  • Scheduling – indirection between the abstract (programmer-visible) threads and the concrete threads managed by the kernel scheduler
Available solutions?
  • Eliminating kernel copies (DASH system)
  • Handoff scheduling (Mach and Taos)
  • In SRC RPC :
    • Message buffers globally shared!
    • Trades safety for performance
Solution proposed : LRPCs
  • Written for the Firefly system
  • Mechanism for communication between protection domains in the same system
  • Motto: strive for performance without forgoing safety
  • Basic Idea : Similar to RPCs but,
    • Do not context switch to server thread
    • Change the context of the client thread instead, to reduce overhead
Overview of LRPCs
  • Design
    • Client calls server through kernel trap
    • Kernel validates caller
    • Kernel dispatches client thread directly to server’s domain
    • Client provides server with a shared argument stack and its own thread
    • Return through the kernel to the caller
Implementation – Binding
  • [Binding figure] The server thread registers its interface with the name server; the client thread traps to the kernel to bind; the kernel allocates A-stacks, linkage records, and a Binding Object (BO), and sends the BO and the A-stack list back to the client
Data Structures used and created
  • The kernel receives a Procedure Descriptor List (PDL) from the server’s clerk
    • Contains a PD for each procedure
      • Includes the entry address, among other information
  • The kernel allocates argument stacks (A-stacks), shared by the client and server domains, for each PD
  • Allocates a linkage record for each A-stack to record the caller’s return address
  • Allocates a Binding Object – the client’s key for accessing the server’s interface
  • The client stub traps to the kernel for a call after
    • Pushing arguments onto the A-stack
    • Storing the BO, the procedure identifier, and the address of the A-stack in registers
  • Kernel
    • Validates the client, verifies the A-stack, and locates the PD and linkage record
    • Stores the return address in the linkage record and pushes it onto a stack
    • Switches the client thread’s context to the server, running on a new execution stack (E-stack) from the server’s domain
    • Calls the server stub corresponding to the PD
  • Server
    • The client thread runs in the server’s domain using the E-stack
    • Can access the parameters on the A-stack
    • Places return values in the A-stack
    • Returns to the kernel through the stub
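The following C structs are a hedged sketch of the binding-time data structures named above (PDL/PD, A-stack, linkage record, Binding Object); all field names and sizes are assumptions made for illustration, not the Firefly kernel's definitions.

```c
#include <stdint.h>
#include <stddef.h>

#define A_STACK_SIZE 256            /* argument stacks are typically small (assumed size) */

struct a_stack {                    /* shared read-write by the client and server domains */
    uint8_t data[A_STACK_SIZE];     /* arguments on call, results on return */
    size_t  used;
};

struct linkage_record {             /* one per A-stack, kernel-private */
    void *caller_return_addr;       /* where to resume in the client on return */
    void *caller_stack;             /* client stack pointer to restore */
};

struct procedure_descriptor {       /* one PD per procedure in the interface (from the PDL) */
    void   (*entry)(struct a_stack *);  /* server entry address called by the kernel */
    size_t   a_stack_size;              /* bytes of arguments/results it needs */
};

struct binding_object {             /* the client's key to the server's interface */
    uint32_t id;                        /* validated by the kernel on every call */
    struct procedure_descriptor *pdl;   /* procedure descriptor list from the clerk */
    size_t   n_procs;
    struct a_stack        *a_stacks;    /* pairwise-allocated, mapped in both domains */
    struct linkage_record *linkages;    /* one linkage record per A-stack */
};
```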
Stub Generation
  • LRPC stub automatically generated in assembly language for simple execution paths
    • Sacrifices portability for performance
  • Maintains both local and remote stubs
    • The first instruction in the local stub is a branch statement
What is optimized here?
  • Using the same thread in different domains reduces overhead
    • Avoids scheduling decisions
    • Saves on cost of saving and restoring thread state
  • Pairwise A-stack allocation guarantees protection from third party domain
    • Within? Asynchronous updates?
  • Validate client using BO – To provide security
  • Elimination of redundant copies through use of A-stack!
    • One copy instead of four in traditional cross-domain RPCs
    • Sometimes two – optimizations apply
But… Is it really good enough?
  • Trades off memory management costs for the reduction of overhead
    • A-stacks have to be allocated at bind time
      • But size generally small
  • Will LRPC work even if a server migrates from a remote machine to the local machine?
Other Issues – Domain Termination
  • Domain termination
    • An LRPC in a terminated server domain should be returned to the client
    • An LRPC should not be returned to the caller if the caller has terminated
  • Handled using binding objects
    • Revoke the terminating domain’s binding objects
    • For threads running LRPCs in the terminated domain, restart new threads in the corresponding caller domains
    • Invalidate active linkage records – a returning thread is sent back to the first domain with an active linkage record
    • Otherwise the thread is destroyed
Multiprocessor Issues
  • LRPC minimizes use of shared data structures on the critical path
    • Guaranteed by pairwise allocation of A-stacks
  • Cache contexts on idle processors
    • Idle processors hold idling threads in the server’s context
    • When a client thread makes an LRPC to the server, it exchanges processors with the idle thread in the server’s context
    • Reduces context-switch overhead
Evaluation of LRPC

Performance of four test programs (times in μs), run on the C-VAX Firefly and averaged over 100,000 calls (table not reproduced in this transcript)
Cost Breakdown for the Null LRPC
  • “Minimum” refers to the inherent minimum overhead
  • 18 μs are spent in the client stub and 3 μs in the server stub
  • 25% of the time is spent in TLB misses
Throughput on a Multiprocessor
  • Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
  • Speedup of 3.7 with 4 processors relative to 1 processor
  • Speedup of 4.3 with 5 processors
  • SRC RPC shows inferior performance due to a global lock held during the critical transfer path
  • LRPC Combines
    • Control Transfer and communication model of capability systems
    • Programming semantics and large-grained protection model of RPCs
  • Enhances performance by isolating the common case

We will see ‘NOW’ later in one of the subsequent 614 presentations.