Faster!

U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Vidhyashankar Venkataraman

CS614 Presentation

Background – Fast Computing
  • Emergence of MPP – Massively Parallel Processors in the early 90’s
    • Repackage hardware components to form a dense configuration of very large parallel computing systems
    • But require custom software
  • Alternative: NOW (Berkeley) – Network Of Workstations
    • Built from inexpensive workstations joined by low-latency, high-bandwidth, scalable interconnection networks
    • Interconnected through fast switches
    • Challenge: To build a scalable system that is able to use the aggregate resources in the network to execute parallel programs efficiently
  • Problem with traditional networking architectures
    • Software path through kernel involves several copies - processing overhead
    • In faster networks, may not get application speed-up commensurate with network performance
  • Observations:
    • For small messages, processing overhead dominates network latency
    • Most applications use small messages
      • E.g., UCB NFS trace: 50% of the bits sent were in messages of 200 bytes or less
Issues (contd.)
  • Flexibility concerns:
    • Protocol processing in the kernel limits flexibility
    • Greater flexibility if application-specific information is integrated into protocol processing
    • The protocol can then be tuned to the application’s needs
    • E.g., customized retransmission of video frames
U-Net Philosophy
  • Achieve flexibility and performance by
    • Removing kernel from the critical path
    • Placing entire protocol stack at user level
    • Allowing protected user-level access to network
    • Supplying full bandwidth to small messages
    • Supporting both novel and legacy protocols
Do MPPs do this?
  • Parallel machines like Meiko CS-2, Thinking Machines CM-5
    • Have tried to solve the problem of providing user-level access to network
    • Use of custom network and network interface – No flexibility
  • U-Net targets applications on standard workstations
    • Using off-the-shelf components
Basic U-Net Architecture
  • Virtualize the network device so that each process has the illusion of owning the network interface (NI)
  • A mux/demux device virtualizes the NI and offers protection
  • Kernel removed from the critical path – involved only in setup
The U-Net Architecture
  • Building blocks
    • Application endpoints
    • Communication segments (CS) – regions of memory that hold message data
    • Message queues (send, receive, and free queues)
  • Sending
    • Assemble the message in the CS
    • Enqueue a message descriptor on the send queue
  • Receiving (poll-driven or event-driven)
    • Dequeue the message descriptor
    • Consume the message
    • Enqueue the buffer on the free queue
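To make these building blocks concrete, here is a minimal C sketch of an endpoint with its communication segment and send/receive/free queues. All names, sizes, and the descriptor layout are assumptions made for this sketch; they are not the actual U-Net interface.

```c
#include <stdint.h>
#include <string.h>

#define CS_SIZE      (64 * 1024)   /* communication segment size (assumed) */
#define QUEUE_DEPTH  64            /* descriptors per queue (assumed) */

/* A descriptor names a buffer inside the communication segment. */
struct unet_desc {
    uint32_t offset;   /* buffer offset within the communication segment */
    uint32_t length;   /* message length (or buffer size) in bytes */
    uint32_t tag;      /* destination channel tag (e.g. an ATM VCI) */
};

struct unet_queue {
    struct unet_desc entries[QUEUE_DEPTH];
    volatile uint32_t head, tail;
};

/* An application endpoint: a pinned communication segment plus three queues. */
struct unet_endpoint {
    uint8_t cs[CS_SIZE];        /* communication segment: holds message data */
    struct unet_queue send_q;   /* descriptors of messages to transmit */
    struct unet_queue recv_q;   /* descriptors of messages that have arrived */
    struct unet_queue free_q;   /* empty buffers handed back for reception */
};

/* Send path: assemble the message in the CS, then enqueue a descriptor. */
static int unet_send(struct unet_endpoint *ep, uint32_t tag,
                     const void *msg, uint32_t len, uint32_t buf_off)
{
    struct unet_queue *q = &ep->send_q;
    uint32_t next = (q->tail + 1) % QUEUE_DEPTH;

    if (next == q->head || buf_off + len > CS_SIZE)
        return -1;                          /* queue full or buffer out of range */
    memcpy(&ep->cs[buf_off], msg, len);     /* copy the payload into the CS */
    q->entries[q->tail] = (struct unet_desc){ buf_off, len, tag };
    q->tail = next;                         /* the NI polls the queue and transmits */
    return 0;
}
```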

U-Net Architecture (contd.)
  • Event handling (upcalls)
    • Can be a UNIX signal handler or a user-level interrupt handler
    • Cost of upcalls amortized by batching receptions
  • Mux/demux
    • Each endpoint is uniquely identified by a tag (e.g., a VCI in ATM)
    • The OS performs initial route setup and security checks, then registers the tag with U-Net for that application
    • The message tag is mapped to a communication channel
  • Have to preallocate buffers – memory overhead!
  • Protected user-level access to the NI: ensured by demarcating protection boundaries
    • Defined by endpoints and communication channels
    • Applications cannot interfere with each other because
      • Endpoints, CS and message queues user-owned
      • Outgoing messages tagged with originating endpoint address
      • Incoming messages demuxed by U-Net and sent to correct endpoint
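Continuing the sketch above, the tag-based demultiplexing described here might look roughly as follows; the tag-table layout and the drop behavior are assumptions made for illustration, not the actual firmware logic.

```c
#define MAX_TAGS 1024

static struct unet_endpoint *tag_table[MAX_TAGS];  /* filled in by the OS at channel setup */

/* Receive path: the message tag selects the owning endpoint, and the payload
 * is deposited into one of that endpoint's free buffers. */
static void unet_demux(uint32_t tag, const void *frame, uint32_t len)
{
    struct unet_endpoint *ep = (tag < MAX_TAGS) ? tag_table[tag] : NULL;
    if (ep == NULL)
        return;                                   /* no channel registered: drop */

    struct unet_queue *fq = &ep->free_q, *rq = &ep->recv_q;
    uint32_t next = (rq->tail + 1) % QUEUE_DEPTH;
    if (next == rq->head || fq->head == fq->tail)
        return;                                   /* receive queue full or no free buffer: drop */

    struct unet_desc buf = fq->entries[fq->head]; /* claim a free buffer */
    if (len > buf.length || buf.offset + len > CS_SIZE)
        return;                                   /* frame does not fit the buffer: drop */
    fq->head = (fq->head + 1) % QUEUE_DEPTH;

    memcpy(&ep->cs[buf.offset], frame, len);      /* deposit payload in the endpoint's CS */
    rq->entries[rq->tail] = (struct unet_desc){ buf.offset, len, tag };
    rq->tail = next;                              /* application polls recv_q or gets an upcall */
}
```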
Zero-copy and True zero-copy
  • Two levels of sophistication, depending on whether a copy is made in the CS
    • Base-Level Architecture
      • Zero-copy: data is copied into an intermediate buffer in the CS
      • CSes are allocated, aligned, and pinned to physical memory
      • Optimization for small messages
    • Direct-access Architecture
      • True zero copy : Data sent directly out of data structure
      • Also specify offset where data has to be deposited
      • CS spans the entire process address space
  • Limitations in I/O addressing force a fallback to the zero-copy (base-level) architecture
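The two architectures imply two descriptor shapes; the hypothetical structs below simply restate the distinction (field names are invented for illustration).

```c
#include <stdint.h>

/* Base-level ("zero-copy"): the payload has already been copied into the
 * pinned communication segment, so the descriptor only names a CS offset. */
struct base_level_desc {
    uint32_t cs_offset;     /* buffer inside the communication segment */
    uint32_t length;
    uint32_t tag;
};

/* Direct-access ("true zero-copy"): the CS conceptually spans the whole
 * address space, so data is sent straight out of the application's data
 * structure, and an offset says where it should be deposited at the receiver. */
struct direct_access_desc {
    void    *src_addr;      /* application data structure, no intermediate copy */
    uint32_t length;
    uint32_t tag;
    uint32_t dest_offset;   /* offset at which the receiver wants the data */
};
```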
Kernel-emulated Endpoint
  • Communication segments and message queues are scarce resources
  • Provide a single kernel-emulated endpoint
  • Cost: performance overhead
U-Net Implementation
  • U-Net architectures implemented in two systems
    • Using Fore Systems SBA 100 and 200 ATM network interfaces
    • But why ATM?
    • Setup: SPARCstations 10 and 20 running SunOS 4.1.3, connected through an ASX-200 ATM switch with 140 Mbps fiber links
  • SBA-200 firmware
    • 25 MHz On-board i960 processor, 256 KB RAM, DMA capabilities
    • Complete redesign of firmware
  • Device Driver
    • Protection offered through VM system (CS’es)
    • Also through <VCI, communication channel> mappings
U-Net Performance
  • RTT and bandwidth measurements
  • Small messages: 65 μs RTT (with an optimization for single-cell messages)
  • The fiber is saturated at message sizes of about 800 bytes
U-Net Active Messages Layer
  • An RPC-like communication primitive that can be implemented efficiently on a wide range of hardware
  • A basic communication primitive in NOW
  • Allow overlapping of communication with computation
  • Message contains data & ptr to handler
    • Reliable Message delivery
    • Handler moves data into data structures for some (ongoing) operation
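A minimal sketch of the active-message idea in C follows; the types and function names are illustrative only and do not reflect the UAM interface.

```c
#include <stdint.h>
#include <string.h>

typedef void (*am_handler_t)(void *app_state, const void *data, uint32_t len);

struct am_message {
    am_handler_t handler;   /* handler pointer carried in the message (meaningful here
                               because sender and receiver are assumed to run the same
                               program image) */
    uint32_t     len;
    uint8_t      data[32];  /* small payload, e.g. 0-32 B as in the micro-benchmarks */
};

/* Sender side: build the message and hand it to the transport (e.g. U-Net).
 * Reliable delivery is the responsibility of the AM layer. */
static void am_send(struct am_message *out, am_handler_t h,
                    const void *data, uint32_t len)
{
    out->handler = h;
    out->len = len > sizeof out->data ? (uint32_t)sizeof out->data : len;
    memcpy(out->data, data, out->len);
    /* ... enqueue the message on the transport's send queue ... */
}

/* Receiver side: run the handler, which moves the data into the data
 * structures of the ongoing computation, overlapping communication with it. */
static void am_deliver(const struct am_message *in, void *app_state)
{
    in->handler(app_state, in->data, in->len);
}
```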
AM – Micro-benchmarks
  • Single-cell RTT
    • RTT ~ 71 μs for a 0-32 B message
    • Overhead of 6 μs over raw U-Net – Why?
  • Block store BW
    • 80% of the maximum limit with blocks of 2KB size
    • Almost saturated at 4KB
    • Good performance!
Split-C application benchmarks
  • Parallel Extension to C
  • Implemented on top of UAM
  • Tested on 8 processors
  • ATM cluster performs close to CS-2
TCP/IP and UDP/IP over U-Net
  • Good performance necessary to show flexibility
  • Traditional IP-over-ATM shows very poor performance
    • E.g., TCP achieves only 55% of the maximum bandwidth
  • TCP and UDP over U-Net show improved performance
    • Primarily because of tighter application-network coupling
  • IP-over-U-Net:
    • IP-over-ATM does not correspond exactly to IP-over-U-Net
    • Demultiplexing packets that arrive on the same VCI is not possible
Performance Graphs
  • UDP performance: saw-tooth behavior for Fore UDP
  • TCP performance

  • U-Net provides a virtual view of the network interface, enabling user-level access to high-speed communication devices
  • The two main goals were performance and flexibility, pursued by keeping the kernel off the critical path
  • Achieved? See the results table in the original slides (not reproduced in this transcript)

  • Small kernel OSes have most services implemented as separate user-level processes
  • Have separate, communicating user processes
    • Improve modular structure
    • More protection
    • Ease of system design and maintenance
  • Cross-domain and cross-machine communication are treated alike – problems?
    • Fails to isolate the common case
    • This has both performance and simplicity costs
  • Measurements show cross-domain predominance
    • V System – 97%
    • Taos Firefly – 94%
    • Sun UNIX+NFS Diskless – 99.4%
    • But how about RPCs these days?
  • Taos takes 464 μs for a Null() cross-domain call, against an inherent minimum of 109 μs – roughly 3.5x overhead
  • Most interactions are simple with small numbers of arguments
    • This could be used to make optimizations
Overheads in Cross-domain Calls
  • Stub Overhead – Additional execution path
  • Message buffer overhead – Cross-domain calls can involve four copy operations for any RPC
  • Context switch – VM context switch from client’s domain to the server’s and vice versa on return
  • Scheduling – indirection between the abstract (programmer-visible) threads and the concrete threads managed by the kernel scheduler
Available solutions?
  • Eliminating kernel copies (DASH system)
  • Handoff scheduling (Mach and Taos)
  • In SRC RPC :
    • Message buffers globally shared!
    • Trades safety for performance
Solution proposed : LRPCs
  • Written for the Firefly system
  • Mechanism for communication between protection domains in the same system
  • Motto: strive for performance without forgoing safety
  • Basic Idea : Similar to RPCs but,
    • Do not context switch to server thread
    • Change the context of the client thread instead, to reduce overhead
Overview of LRPCs
  • Design
    • Client calls server through kernel trap
    • Kernel validates caller
    • Kernel dispatches client thread directly to server’s domain
    • Client provides server with a shared argument stack and its own thread
    • Return through the kernel to the caller
Implementation – Binding
  • [Binding figure] The server thread registers its interface with the name server; the client thread traps to the kernel to bind; the kernel allocates A-stacks, linkage records, and a Binding Object (BO), and sends the BO and the A-stack list back to the client
Data Structures used and created
  • The kernel receives a Procedure Descriptor List (PDL) from the server’s clerk
    • Contains a PD for each procedure
      • Includes the entry address, among other information
  • The kernel allocates argument stacks (A-stacks), shared by the client and server domains, for each PD
  • Allocates a linkage record for each A-stack to record the caller’s return address
  • Allocates a Binding Object – the client’s key for accessing the server’s interface
  • The client stub traps to the kernel for a call after
    • Pushing arguments onto the A-stack
    • Storing the BO, the procedure identifier, and the address of the A-stack in registers
  • Kernel
    • Validates the client, verifies the A-stack, and locates the PD and linkage record
    • Stores the return address in the linkage record and pushes it onto a stack
    • Switches the client thread’s context to the server, running on a new execution stack (E-stack) from the server’s domain
    • Calls the server stub corresponding to the PD
  • Server
    • The client thread runs in the server’s domain using the E-stack
    • Can access the parameters on the A-stack
    • Places return values in the A-stack
    • Returns to the kernel through the stub
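The following C structs are a hedged sketch of the binding-time data structures named above (PDL/PD, A-stack, linkage record, Binding Object); all field names and sizes are assumptions made for illustration, not the Firefly kernel's definitions.

```c
#include <stdint.h>
#include <stddef.h>

#define A_STACK_SIZE 256            /* argument stacks are typically small (assumed size) */

struct a_stack {                    /* shared read-write by the client and server domains */
    uint8_t data[A_STACK_SIZE];     /* arguments on call, results on return */
    size_t  used;
};

struct linkage_record {             /* one per A-stack, kernel-private */
    void *caller_return_addr;       /* where to resume in the client on return */
    void *caller_stack;             /* client stack pointer to restore */
};

struct procedure_descriptor {       /* one PD per procedure in the interface (from the PDL) */
    void   (*entry)(struct a_stack *);  /* server entry address called by the kernel */
    size_t   a_stack_size;              /* bytes of arguments/results it needs */
};

struct binding_object {             /* the client's key to the server's interface */
    uint32_t id;                        /* validated by the kernel on every call */
    struct procedure_descriptor *pdl;   /* procedure descriptor list from the clerk */
    size_t   n_procs;
    struct a_stack        *a_stacks;    /* pairwise-allocated, mapped in both domains */
    struct linkage_record *linkages;    /* one linkage record per A-stack */
};
```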
Stub Generation
  • LRPC stub automatically generated in assembly language for simple execution paths
    • Sacrifices portability for performance
  • Maintains both local and remote stubs
    • The first instruction in the local stub is a branch statement
What is optimized here?
  • Using the same thread in different domains reduces overhead
    • Avoids scheduling decisions
    • Saves on cost of saving and restoring thread state
  • Pairwise A-stack allocation guarantees protection from third party domain
    • Within? Asynchronous updates?
  • Validate client using BO – To provide security
  • Elimination of redundant copies through use of A-stack!
    • One copy instead of four in traditional cross-domain RPCs
    • Sometimes two – optimizations apply
But… Is it really good enough?
  • Trades off memory management costs for the reduction of overhead
    • A-stacks have to be allocated at bind time
      • But size generally small
  • Will LRPC work even if a server migrates from a remote machine to the local machine?
Other Issues – Domain Termination
  • Domain termination
    • An LRPC in a terminated server domain should be returned to the client
    • An LRPC should not be returned to the caller if the caller has terminated
  • Handled using binding objects
    • Revoke the terminating domain’s binding objects
    • For threads running LRPCs in the terminated domain, restart new threads in the corresponding caller domains
    • Invalidate active linkage records – a returning thread is sent back to the first domain with an active linkage record
    • Otherwise the thread is destroyed
Multiprocessor Issues
  • LRPC minimizes use of shared data structures on the critical path
    • Guaranteed by pairwise allocation of A-stacks
  • Cache contexts on idle processors
    • Idle processors hold idling threads in the server’s context
    • When a client thread makes an LRPC to the server, it exchanges processors with the idle thread in the server’s context
    • Reduces context-switch overhead
Evaluation of LRPC

Performance of four test programs (times in μs), run on the C-VAX Firefly and averaged over 100,000 calls (table not reproduced in this transcript)
Cost Breakdown for the Null LRPC
  • “Minimum” refers to the inherent minimum overhead
  • 18 μs are spent in the client stub and 3 μs in the server stub
  • 25% of the time is spent in TLB misses
Throughput on a Multiprocessor
  • Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
  • Speedup of 3.7 with 4 processors relative to 1 processor
  • Speedup of 4.3 with 5 processors
  • SRC RPC shows inferior performance due to a global lock held during the critical transfer path
  • LRPC Combines
    • Control Transfer and communication model of capability systems
    • Programming semantics and large-grained protection model of RPCs
  • Enhances performance by isolating the common case

We will see ‘NOW’ later in one of the subsequent 614 presentations.