Faster!

Vidhyashankar Venkataraman, CS614 Presentation. U-Net: A User-Level Network Interface for Parallel and Distributed Computing.

Presentation Transcript


Vidhyashankar Venkataraman

CS614 Presentation

Background – Fast Computing

  • Emergence of MPP – Massively Parallel Processors in the early 90’s

    • Repackage hardware components to form a dense configuration of very large parallel computing systems

    • But require custom software

  • Alternative : NOW (Berkeley) – Network Of Workstations

    • Workstations joined by inexpensive, low-latency, high-bandwidth, scalable interconnect networks

    • Interconnected through fast switches

    • Challenge: To build a scalable system that is able to use the aggregate resources in the network to execute parallel programs efficiently

Issues

  • Problem with traditional networking architectures

    • Software path through kernel involves several copies - processing overhead

    • In faster networks, may not get application speed-up commensurate with network performance

  • Observations:

    • Small messages : Processing overhead is more dominant than network latency

    • Most applications use small messages

      • E.g., UCB NFS trace: 50% of the bits sent were in messages of 200 bytes or less

Issues (contd.)

  • Flexibility concerns:

    • Protocol processing in kernel

    • Greater flexibility if application specific information is integrated into protocol processing

    • Can tune protocol to application’s needs

    • E.g., customized retransmission of video frames

U-Net Philosophy

  • Achieve flexibility and performance by

    • Removing kernel from the critical path

    • Placing entire protocol stack at user level

    • Allowing protected user-level access to network

    • Supplying full bandwidth to small messages

    • Supporting both novel and legacy protocols

Do MPPs do this?

  • Parallel machines like Meiko CS-2, Thinking Machines CM-5

    • Have tried to solve the problem of providing user-level access to network

    • Use of custom network and network interface – No flexibility

  • U-Net targets applications on standard workstations

    • Using off-the-shelf components

Basic U-Net Architecture

  • Virtualize the network device so that each process has the illusion of owning the NI

  • A mux/demux device virtualizes the NI

  • Offers protection!

  • Kernel removed from the critical path

  • Kernel involved only in setup

The U-Net Architecture

  • Building blocks

    • Application endpoints

    • Communication segment (CS): a region of memory

    • Message queues

  • Sending

    • Assemble the message in the CS

    • Enqueue a message descriptor

  • Receiving

    • Poll-driven or event-driven

    • Dequeue the message descriptor

    • Consume the message

    • Enqueue the buffer in the free queue

[Figure: an application endpoint]
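The send and receive steps on this slide can be sketched as a toy model. This is only an illustration under stated assumptions: the class `Endpoint`, the helper `deliver` (which stands in for the network interface), and the split of the segment into application-managed send buffers and NI-owned receive buffers are all hypothetical names and simplifications, not the U-Net API.

```python
from collections import deque


class Endpoint:
    """Toy model of one endpoint: a communication segment plus three queues."""

    def __init__(self, num_buffers=4, buf_size=64):
        self.buf_size = buf_size
        # Communication segment: one region of memory, split into fixed buffers.
        self.segment = bytearray(2 * num_buffers * buf_size)
        # Assumption: the application manages its own send buffers.
        self.tx_free = list(range(num_buffers))
        # Free queue: receive buffers handed to the NI in advance.
        self.free_q = deque(range(num_buffers, 2 * num_buffers))
        self.send_q = deque()   # descriptors of outgoing messages
        self.recv_q = deque()   # descriptors of delivered messages

    def _slot(self, index, length):
        start = index * self.buf_size
        return slice(start, start + length)

    def send(self, payload):
        """Assemble the message in the CS, then enqueue a descriptor."""
        if len(payload) > self.buf_size:
            raise ValueError("payload larger than one buffer")
        index = self.tx_free.pop()
        self.segment[self._slot(index, len(payload))] = payload
        self.send_q.append((index, len(payload)))

    def poll(self):
        """Dequeue a descriptor, consume the message, recycle the buffer."""
        if not self.recv_q:
            return None
        index, length = self.recv_q.popleft()
        data = bytes(self.segment[self._slot(index, length)])
        self.free_q.append(index)   # enqueue the buffer in the free queue
        return data


def deliver(src, dst):
    """Stand-in for the NI: move one message from src to dst."""
    index, length = src.send_q.popleft()
    data = src.segment[src._slot(index, length)]
    src.tx_free.append(index)
    if not dst.free_q:
        return False                # no free receive buffer: message dropped
    rx = dst.free_q.popleft()
    dst.segment[dst._slot(rx, length)] = data
    dst.recv_q.append((rx, length))
    return True
```

In this sketch the kernel never appears on the data path: `send` and `poll` touch only the endpoint's own segment and queues, which is the point the slide makes.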

U-Net Architecture (contd.)

  • More on event handling (upcalls)

    • Can be a UNIX signal handler or a user-level interrupt handler

    • Amortize the cost of upcalls by batching receptions

  • Mux/demux

    • Each endpoint is uniquely identified by a tag (e.g., VCI in ATM)

    • The OS performs initial route setup and security tests and registers a tag in U-Net for that application

    • The message tag is mapped to a communication channel
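A minimal sketch of tag-based demultiplexing and batched upcalls, assuming a plain dictionary as the tag table. The names `Mux`, `register`, `demux`, and `upcall` are hypothetical and chosen only for illustration; in the real system the registration step is done by the OS at setup time.

```python
class Mux:
    """Toy demultiplexer: maps a message tag to one endpoint's inbox."""

    def __init__(self):
        self.channels = {}          # tag -> inbox list of the owning endpoint

    def register(self, tag, inbox):
        # Setup path (done once, by the OS in the real system).
        if tag in self.channels:
            raise ValueError("tag already registered")
        self.channels[tag] = inbox

    def demux(self, tag, payload):
        # Data path: no kernel involvement, just a table lookup.
        inbox = self.channels.get(tag)
        if inbox is None:
            return False            # unknown tag: message is not delivered
        inbox.append(payload)
        return True


def upcall(inbox, handler):
    """Deliver every pending message in one handler invocation (batching)."""
    batch = list(inbox)
    inbox.clear()
    if batch:
        handler(batch)
    return len(batch)
```

Calling `upcall` once for several queued messages is what amortizes the per-upcall cost across receptions.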

Observations

  • Have to preallocate buffers – memory overhead!

  • Protected user-level access to the NI: ensured by demarcating protection boundaries

    • Defined by endpoints and communication channels

    • Applications cannot interfere with each other because

      • Endpoints, CS’es and message queues are user-owned

      • Outgoing messages tagged with originating endpoint address

      • Incoming messages demuxed by U-Net and sent to correct endpoint

Zero-copy and True Zero-copy

  • Two levels of sophistication depending on whether copy is made at CS

    • Base-Level Architecture

      • Zero-copy: data is copied once, through an intermediate buffer in the CS

      • CS’es are allocated, aligned, pinned to physical memory

      • Optimization for small messages

    • Direct-access Architecture

      • True zero copy : Data sent directly out of data structure

      • Also specify offset where data has to be deposited

      • CS spans the entire process address space

  • Limitations in I/O addressing force implementations to fall back on the base-level (zero-copy) architecture
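The difference between the two levels can be illustrated with Python buffers. This is a loose analogy, not U-Net code: `base_level_send` and `direct_access_send` are hypothetical names, and a `memoryview` stands in for what the NI would read when transmitting.

```python
def base_level_send(segment, offset, data):
    """Base level: copy the data once into a buffer inside the CS."""
    end = offset + len(data)
    segment[offset:end] = data                 # the single intermediate copy
    return memoryview(segment)[offset:end]     # what the NI would transmit


def direct_access_send(app_buffer, start, length):
    """Direct access: hand the NI a view of the application's own data."""
    return memoryview(app_buffer)[start:start + length]   # no copy made
```

A later change to the application's buffer is visible through the direct-access view but not through the base-level one, which shows where the copy happened.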

Kernel-emulated Endpoint

  • Communication segments and message queues are scarce resources

  • Provide a single kernel-emulated endpoint

  • Cost: performance overhead

U-Net Implementation

  • U-Net architectures implemented in two systems

    • Using Fore Systems SBA-100 and SBA-200 ATM network interfaces

    • But why ATM?

    • Setup: SPARCstation 10s and 20s running SunOS 4.1.3, with an ASX-200 ATM switch and 140 Mbps fiber links

  • SBA-200 firmware

    • 25 MHz On-board i960 processor, 256 KB RAM, DMA capabilities

    • Complete redesign of firmware

  • Device Driver

    • Protection offered through VM system (CS’es)

    • Also through <VCI, communication channel> mappings

U-Net Performance

  • RTT and bandwidth measurements

  • Small messages: 65 μs RTT (optimization for single cells)

  • Fiber saturated at message sizes of 800 bytes

U-Net Active Messages Layer

  • An RPC that can be implemented efficiently on a wide range of hardware

  • A basic communication primitive in NOW

  • Allow overlapping of communication with computation

  • Message contains data and a pointer to a handler

    • Reliable Message delivery

    • Handler moves data into data structures for some (ongoing) operation
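The idea that a message names its own handler can be sketched as follows. This is a toy model under stated assumptions: a handler name in a registry stands in for the handler pointer, and `handlers`, `register_handler`, `make_message`, `receive`, and `accumulate` are hypothetical names. Reliable delivery is not modeled.

```python
handlers = {}          # registry: handler name -> function
partial_sums = {}      # data structure of some ongoing computation


def register_handler(fn):
    """Record a handler so that incoming messages can refer to it by name."""
    handlers[fn.__name__] = fn
    return fn


@register_handler
def accumulate(key, value):
    # The handler moves message data straight into the computation's state.
    partial_sums[key] = partial_sums.get(key, 0) + value


def make_message(handler_name, *data):
    """A message is a handler reference plus its data."""
    return (handler_name, data)


def receive(message):
    """On arrival, run the named handler on the message's data."""
    handler_name, data = message
    handlers[handler_name](*data)
```

For example, receiving `make_message("accumulate", "x", 3)` and then `make_message("accumulate", "x", 4)` leaves the running total for `"x"` in `partial_sums` without any intermediate buffering step in the receiver.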

AM – Micro-benchmarks

  • Single-cell RTT

    • RTT ~ 71 μs for a 0-32 B message

    • Overhead of 6 μs over raw U-Net – Why?

  • Block store BW

    • 80% of the maximum limit with blocks of 2KB size

    • Almost saturated at 4KB

    • Good performance!

Split-C Application Benchmarks

  • Parallel Extension to C

  • Implemented on top of UAM

  • Tested on 8 processors

  • ATM cluster performs close to CS-2

TCP/IP and UDP/IP over U-Net

  • Good performance necessary to show flexibility

  • Traditional IP-over-ATM shows very poor performance

    • E.g., TCP: only 55% of max BW

  • TCP and UDP over U-Net show improved performance

    • Primarily because of tighter application-network coupling

  • IP-over-U-Net:

    • IP-over-ATM does not exactly correspond to IP-over-UNet

    • Demultiplexing for the same VCI is not possible

Performance Graphs

[Figure: UDP performance, showing saw-tooth behavior for Fore UDP]

[Figure: TCP performance]


U-Net provides virtual view of network interface to enable user-level access to high-speed communication devices

The two main goals were to achieve performance and flexibility

By avoiding kernel in critical path

Achieved? Look at the table below…


Lightweight Remote Procedure Calls

Motivation

  • Small kernel OSes have most services implemented as separate user-level processes

  • Have separate, communicating user processes

    • Improve modular structure

    • More protection

    • Ease of system design and maintenance

  • Cross-domain & cross-machine communication treated equal - Problems?

    • Fails to isolate the common case

    • Performance and Simplicity considerations

Measurements

  • Measurements show cross-domain predominance

    • V System – 97%

    • Taos Firefly – 94%

    • Sun UNIX+NFS Diskless – 99.4%

    • But how about RPCs these days?

  • Taos takes 109 μs for a Null() local call and 464 μs for RPC: about 4.3x as long (355 μs of overhead)

  • Most interactions are simple with small numbers of arguments

    • This could be used to make optimizations

Overheads in Cross-domain Calls

  • Stub Overhead – Additional execution path

  • Message buffer overhead – Cross-domain calls can involve four copy operations for any RPC

  • Context switch – VM context switch from client’s domain to the server’s and vice versa on return

  • Scheduling – Abstract and Concrete threads

Available solutions?

  • Eliminating kernel copies (DASH system)

  • Handoff scheduling (Mach and Taos)

  • In SRC RPC :

    • Message buffers globally shared!

    • Trades safety for performance

Solution proposed: LRPCs

  • Written for the Firefly system

  • Mechanism for communication between protection domains in the same system

  • Motto: strive for performance without forgoing safety

  • Basic Idea : Similar to RPCs but,

    • Do not context switch to server thread

    • Change the context of the client thread instead, to reduce overhead

Overview of LRPCs

  • Design

    • Client calls server through kernel trap

    • Kernel validates caller

    • Kernel dispatches client thread directly to server’s domain

    • Client provides server with a shared argument stack and its own thread

    • Return through the kernel to the caller

Implementation – Binding

[Figure: binding sequence involving the client thread and the server thread. Recoverable labels: register with name server; trap for … (label truncated); send BO; A-stack list; allocates A-stacks; linkage records; Binding Object (BO)]

Data Structures Used and Created

  • Kernel receives Procedure Descriptor List (PDL) from Clerk

    • Contains a PD for each procedure

      • Entry Address apart from other information

  • Kernel allocates Argument stacks (A-stacks) shared by client-server domains for each PD

  • Allocates linkage record for each A-Stack to record caller’s address

  • Allocates Binding Object - the client’s key to access the server’s interface

Calling

  • Client stub traps to the kernel for the call after

    • Pushing arguments in A-stack

    • Storing BO, procedure identifier, address of A-stack in registers

  • Kernel

    • Validates client, verifies A-stack and locates PD & linkage

    • Stores the return address in the linkage record and pushes the linkage onto a stack

    • Switches the client thread’s context to the server by running it on a new execution stack (E-stack) from the server’s domain

    • Calls the server’s stub corresponding to PD

  • Server

    • Client thread runs in server’s domain using E-stack

    • Can access parameters of A-stack

    • Return values in A-stack

    • Calls back kernel through stub
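The call path above can be walked through in a small simulation. This is a sketch under stated assumptions, not the Firefly implementation: `Binding`, `Kernel`, `client_stub`, and `add_server_stub` are hypothetical names, an integer stands in for the Binding Object, Python lists stand in for A-stacks, and the E-stack and domain switch are reduced to a direct function call on the caller's own thread.

```python
class Binding:
    """Kernel-side state created at bind time (simplified)."""

    def __init__(self, procedures, num_astacks):
        self.procedures = procedures                      # stand-in for the PDL
        self.astacks = [[] for _ in range(num_astacks)]   # shared argument stacks
        self.linkages = [None] * num_astacks              # one linkage per A-stack


class Kernel:
    def __init__(self):
        self.bindings = {}
        self.next_bo = 1

    def bind(self, procedures, num_astacks=2):
        """Return a Binding Object and the A-stack list to the client."""
        bo = self.next_bo
        self.next_bo += 1
        binding = Binding(procedures, num_astacks)
        self.bindings[bo] = binding
        return bo, binding.astacks

    def trap(self, bo, proc, astack_index, return_address):
        # Validate the client, verify the A-stack, locate the PD and linkage.
        binding = self.bindings.get(bo)
        if binding is None:
            raise PermissionError("invalid binding object")
        if proc not in binding.procedures:
            raise LookupError("unknown procedure")
        if not 0 <= astack_index < len(binding.astacks):
            raise IndexError("bad A-stack")
        # Store the return address in the linkage record.
        binding.linkages[astack_index] = return_address
        try:
            # The client's own thread now runs the server's stub.
            binding.procedures[proc](binding.astacks[astack_index])
        finally:
            return_to = binding.linkages[astack_index]
            binding.linkages[astack_index] = None
        return return_to


def client_stub(kernel, bo, astacks, proc, *args):
    """Push arguments on an A-stack, trap, then read the result back."""
    astack = astacks[0]
    astack.clear()
    astack.extend(args)
    kernel.trap(bo, proc, 0, return_address="client_stub")
    return astack.pop()


def add_server_stub(astack):
    """Server procedure: reads parameters from and returns via the A-stack."""
    b = astack.pop()
    a = astack.pop()
    astack.append(a + b)
```

Note that arguments and the return value pass through the same shared A-stack, so the kernel never copies them; that is the property the later slide on copy counts relies on.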

Stub Generation

  • LRPC stub automatically generated in assembly language for simple execution paths

    • Sacrifices portability for performance

  • Maintains local and remote stubs

    • First instruction in the local stub is a branch statement

What is optimized here?

  • Using the same thread in different domains reduces overhead

    • Avoids scheduling decisions

    • Saves on cost of saving and restoring thread state

  • Pairwise A-stack allocation guarantees protection from third party domain

    • Within? Asynchronous updates?

  • Validate client using BO – To provide security

  • Elimination of redundant copies through use of A-stack!

    • 1 against 4 in traditional cross-domain RPCs

    • Sometimes two? Optimizations apply

Argument Copy

But… Is it really good enough?

  • Trades off memory management costs for the reduction of overhead

    • A-stacks have to be allocated at bind time

      • But size generally small

  • Will LRPC work even if a server migrates from a remote machine to the local machine?

Other Issues – Domain Termination

  • Domain Termination

    • An LRPC outstanding in a terminated server domain should be returned to the client

    • An LRPC should not be returned to the caller if the caller has terminated

  • Use binding objects

    • Revoke binding objects

    • For threads running LRPCs in the terminating domain, restart new threads in the corresponding caller

    • Invalidate active linkage records: a thread is returned to the first domain with an active linkage

    • Otherwise destroyed

Multiprocessor Issues

  • LRPC minimizes use of shared data structures on the critical path

    • Guaranteed by pairwise allocation of A-stacks

  • Cache contexts on idle processors

    • Threads idle in the server’s context on idle processors

    • When a client thread makes a call to the server, swap processors

    • Reduces context-switch overhead

Evaluation of LRPC

[Table: performance of four test programs, time in μs; run on a C-VAX Firefly, averaged over 100,000 calls]

Cost Breakdown for the Null LRPC

  • "Minimum" refers to the inherent minimum overhead

  • 18 μs spent in the client stub and 3 μs in the server stub

  • 25% of the time spent in TLB misses

Throughput on a Multiprocessor

  • Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor

  • Speedup of 3.7 with 4 processors relative to 1 processor

  • Speedup of 4.3 with 5 processors

  • SRC RPC: inferior performance due to a global lock held during the critical transfer path
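One way to read the speedup figures is as parallel efficiency, that is, speedup divided by processor count. The short calculation below uses only the two speedups quoted on the slide; the function name `parallel_efficiency` is an illustrative choice.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


# Speedups quoted on the slide: processor count -> speedup over one processor.
reported = {4: 3.7, 5: 4.3}

efficiencies = {
    n: round(parallel_efficiency(s, n), 3) for n, s in reported.items()
}
```

With these inputs the efficiency is roughly 0.93 on four processors and 0.86 on five, so the call path scales close to linearly in this test.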

Conclusion

  • LRPC Combines

    • Control Transfer and communication model of capability systems

    • Programming semantics and large-grained protection model of RPCs

  • Enhances performance by isolating the common case

NOW

We will see ‘NOW’ later in one of the subsequent CS614 presentations