
CS160 – Lecture 3

Clusters. Introduction to PVM and MPI



Introduction to PC Clusters

  • What are PC Clusters?

  • How are they put together?

  • Examining the lowest-level messaging pipeline

  • Relative application performance

  • Starting with PVM and MPI



Clusters, Beowulfs, and more

  • How do you put a “Pile-of-PCs” into a room and make them do real work?

    • Interconnection technologies

    • Programming them

    • Monitoring

    • Starting and running applications

    • Running at Scale



Beowulf Cluster

  • Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network

    • Dual Intel PIIIs with fast ethernet, Linux

      • Program with PVM, MPI, …

    • Single Alpha PCs running Linux



Beowulf Clusters cont’d

  • Interconnection network is usually fast ethernet running TCP/IP

    • (Relatively) slow network

    • Programming model is message passing

  • Most people now associate the name “Beowulf” with any cluster of PCs

    • Beowulfs are differentiated from high-performance clusters by the network

  • www.beowulf.org has lots of information



High-Performance Clusters

Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM

  • Killer micros: low-cost gigaflop processors, here for a few thousand dollars per processor

  • Killer networks: gigabit network hardware and high-performance software (e.g. Fast Messages), soon at a few hundred dollars per connection

  • Leverage commodity HW and SW (*nix/Windows NT), build key technologies

    => high-performance computing in a RICH software environment



Cluster Research Groups

  • Many other cluster groups that have had impact

    • Active Messages/Network of workstations (NOW) UCB

    • Basic Interface for Parallelism (BIP) Univ. of Lyon

    • Fast Messages (FM) / High Performance Virtual Machines (HPVM) (UIUC/UCSD)

    • Real World Computing Partnership (Japan)

    • (SHRIMP) Scalable High-performance Really Inexpensive Multi-Processor (Princeton)



Clusters are Different

  • A pile of PCs is not a large-scale SMP server.

    • Why? Performance and programming model

  • A cluster’s closest cousin is an MPP

    • What’s the major difference? Clusters run N copies of the OS, MPPs usually run one.


Ideal Model: HPVMs

[Figure: the application program sees a uniform "Virtual Machine Interface" that hides the actual system configuration]

  • HPVM = High Performance Virtual Machine

  • Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity

  • Simplifies use of complex resources



Virtualization of Machines

  • Want the illusion that a collection of machines is a single machine

    • Start, stop, monitor distributed programs

    • Programming and debugging should work seamlessly

    • PVM (Parallel Virtual Machine) was the first, widely-adopted virtualization for parallel computing

  • This illusion is only partially complete in any software system. Some issues:

    • Node heterogeneity.

    • Real network topology can lead to contention

  • Unrelated – What is a Java Virtual Machine?



High-Performance Communication

[Figure: switched multi-gigabit networks with user-level access vs. switched 100 Mbit networks with OS-mediated access]

  • Level of network interface support + NIC/network router latency

    • Overhead and latency of communication determine deliverable bandwidth

  • High-performance communication enables programmability!

    • Low-latency, low-overhead, high-bandwidth cluster communication

    • … much more is needed …

      • Usability issues, I/O, Reliability, Availability

      • Remote process debugging/monitoring



Putting a cluster together

  • (16, 32, 64, … X) Individual Node

    • E.g. dual-processor Pentium III/733, 1 GB memory, Ethernet

  • Scalable High-speed network

    • Myrinet, Giganet, Servernet, Gigabit Ethernet

  • Message-passing libraries

    • TCP, MPI, PVM, VIA

  • Multiprocessor job launch

    • Portable Batch System (PBS)

    • Load Sharing Facility

    • PVM spawn, mpirun, rsh

  • Techniques for system management

    • VA Linux Cluster Manager (VACM)

    • High Performance Technologies Inc (HPTI)



Communication Style is Message Passing

[Figure: a message at machine A is broken into numbered packets (1, 2, 3, 4) that cross the network and are reassembled at machine B]

  • How do we efficiently get a message from Machine A to Machine B?

  • How do we efficiently break a large message into packets and reassemble at receiver?

  • How does receiver differentiate among message fragments (packets) from different senders?
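A minimal sketch of this fragmentation/reassembly idea in C. The header fields (src, seq, total, len) and the fragment size are illustrative assumptions, not FM's or any particular NIC's wire format:

    #include <string.h>

    #define PKT_PAYLOAD 1500                /* assumed fragment payload size */

    struct pkt {
        int  src;                           /* sender node ID                 */
        int  seq;                           /* fragment number within message */
        int  total;                         /* fragments in this message      */
        int  len;                           /* bytes used in data[]           */
        char data[PKT_PAYLOAD];
    };

    /* Sender: split a message into numbered fragments and hand each to the NIC. */
    void send_message(int src, const char *msg, int nbytes,
                      void (*tx)(const struct pkt *))
    {
        int total = (nbytes + PKT_PAYLOAD - 1) / PKT_PAYLOAD;
        for (int seq = 0; seq < total; seq++) {
            struct pkt p = { .src = src, .seq = seq, .total = total };
            p.len = (seq == total - 1) ? nbytes - seq * PKT_PAYLOAD : PKT_PAYLOAD;
            memcpy(p.data, msg + seq * PKT_PAYLOAD, p.len);
            tx(&p);                         /* e.g. programmed I/O or DMA to the NIC */
        }
    }

    /* Receiver: the (src, seq) pair identifies which message a fragment belongs
       to and where it goes, even when fragments from different senders interleave. */
    void deliver_fragment(const struct pkt *p, char *reassembly_buf_for_src)
    {
        memcpy(reassembly_buf_for_src + p->seq * PKT_PAYLOAD, p->data, p->len);
    }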



Will use the details of FM to illustrate some communication engineering


FM on Commodity PCs

[Figure: FM software layers - the FM host library, the FM device driver, and the FM NIC firmware]

  • Host Library: API presentation, flow control, segmentation/reassembly, multithreading

  • Device driver: protection, memory mapping, scheduling monitors

  • NIC Firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing

[Figure: a Pentium II/III host (~450 MIPS) connects over the P6 bus and PCI to the NIC (~33 MIPS) and its 1280 Mbps link]


Fast Messages 2.x Performance

[Chart: bandwidth (MB/s) vs. message size (4 to 65,536 bytes); bandwidth climbs past 100 MB/s, with n1/2 marked at a small message size]

  • Latency 8.8 µs, bandwidth 100+ MB/s, N1/2 ~250 bytes

  • Fast in absolute terms (comparable to MPPs, internal memory BW)

  • Delivers a large fraction of hardware performance for short messages

  • Technology transferred into emerging cluster standards such as Intel/Compaq/Microsoft's Virtual Interface Architecture (VIA)



Comments about Performance

  • Latency and bandwidth are the most basic measurements of message-passing machines

    • We will discuss performance models in detail because

      • Latency and bandwidth do not tell the entire story (a simple linear cost model is sketched below)

  • High-performance clusters exhibit

    • 10X the deliverable bandwidth of Ethernet

    • 20X – 30X improvement in latency
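As a first-order cost model (a standard approximation, not taken from these slides), the time to move an n-byte message can be written with a startup cost \alpha and an asymptotic bandwidth \beta:

    T(n) = \alpha + \frac{n}{\beta}, \qquad n_{1/2} = \alpha\,\beta

where n_{1/2} is the message size at which half of the asymptotic bandwidth is delivered. With FM 2.x's ~100 MB/s and N1/2 ~250 bytes, the effective per-message startup on the bandwidth curve is roughly n_{1/2}/\beta \approx 2.5 µs, smaller than the quoted one-way latency because streaming pipelines away part of the overhead.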



How does FM really get Speed?

  • Protected user-level access to network (OS-bypass)

  • Efficient credit-based flow control (sketched after this list)

    • Assumes a reliable hardware network [only OK for System Area Networks]

    • No buffer overruns (the sender stalls if no receive space is available)

  • Early de-multiplexing of incoming packets

    • multithreading, use of NT user-schedulable threads

  • Careful implementation with many tuning cycles

    • Overlapping DMAs (Recv), Programmed I/O send

    • No interrupts! Polling only.
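A minimal sketch of the credit idea in C. The credit count, function names, and the explicit credit-return message are illustrative assumptions; FM's actual implementation details differ:

    #define CREDITS_PER_PEER 16     /* assumed: receive buffer slots reserved per sender */

    static int credits = CREDITS_PER_PEER;   /* credits this sender holds for one receiver */

    /* Sender: each packet consumes a credit.  With no credits left, the sender
       stalls and polls the network (no interrupts) until credits come back,
       so the receiver's pinned buffers can never be overrun. */
    void send_with_credit(const void *packet, int len,
                          void (*tx)(const void *, int),
                          void (*poll_network)(void))
    {
        while (credits == 0)
            poll_network();         /* may deliver credit-return messages */
        credits--;
        tx(packet, len);
    }

    /* Receiver: when a handler finishes with a buffer slot, return a credit
       (explicit here; in practice it can be piggybacked on other traffic). */
    void buffer_freed(int sender, void (*send_credit)(int))
    {
        send_credit(sender);
    }

    /* Sender: called when a credit-return message arrives. */
    void credit_returned(void)
    {
        credits++;
    }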



OS-Bypass Background

  • Suppose you want to perform a sendto on a standard IP socket (see the sketch after this list)

    • Operating System mediates access to the network device

      • Must trap into the kernel to ensure authorization on each and every message (very time-consuming)

      • Message is copied from user program to kernel packet buffers

      • Protocol information about each packet is generated by the OS and attached to a packet buffer

      • Message is finally sent out onto the physical device (ethernet)

  • Receiving does the inverse with a recvfrom

    • Packet to kernel buffer, OS strip of header, reassembly of data, OS mediation for authorization, copy into user program
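For contrast with OS-bypass (next slide), here is the conventional path in C using a UDP socket; the address and port are illustrative. Each sendto/recvfrom below traps into the kernel, which checks authorization, copies the data, and builds or strips the protocol headers:

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>

    int kernel_mediated_example(const char *buf, size_t len)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);           /* kernel-managed endpoint */
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5000);                     /* illustrative port    */
        inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr); /* illustrative address */

        /* Trap into the kernel; data is copied to kernel packet buffers,
           UDP/IP headers are attached, then the frame goes to the driver. */
        sendto(s, buf, len, 0, (struct sockaddr *)&dst, sizeof dst);

        /* The inverse: kernel buffers, header stripping, copy back to user space. */
        char reply[2048];
        recvfrom(s, reply, sizeof reply, 0, NULL, NULL);
        return 0;
    }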



OS-Bypass

  • A user program is given a protected slice of the network interface

    • Authorization is done once (not per message)

  • Outgoing packets get directly copied or DMAed to network interface

    • Protocol headers added by user-level library

  • Incoming packets get routed by network interface card (NIC) into user-defined receive buffers

    • NIC must know how to differentiate incoming packets. This is called early demultiplexing.

  • Outgoing and incoming message copies are eliminated.

  • Traps to OS kernel are eliminated
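A conceptual sketch of the bypassed send path in C. The endpoint layout, doorbell register, and header format are invented for illustration; real NICs (Myrinet, VIA hardware) each define their own:

    #include <stdint.h>
    #include <string.h>

    /* Assumed setup (done once, with OS help): the driver maps a slice of NIC
       memory -- a send queue plus a doorbell register -- into this process. */
    struct nic_endpoint {
        volatile char     *send_queue;   /* memory-mapped NIC send slots */
        volatile uint32_t *doorbell;     /* memory-mapped "go" register  */
        uint32_t           slot, nslots, slot_size;
    };

    /* Data path: no system call.  The user-level library writes its own header
       and the payload into NIC memory with programmed I/O, then rings the
       doorbell; the NIC demultiplexes on handler_id at the receiver. */
    void bypass_send(struct nic_endpoint *ep, uint16_t handler_id,
                     const void *payload, uint16_t len)
    {
        volatile char *slot = ep->send_queue + ep->slot * ep->slot_size;
        uint16_t hdr[2] = { handler_id, len };            /* illustrative header */
        memcpy((char *)slot, hdr, sizeof hdr);
        memcpy((char *)slot + sizeof hdr, payload, len);
        *ep->doorbell = ep->slot;                         /* tell the NIC to send */
        ep->slot = (ep->slot + 1) % ep->nslots;
    }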


Packet Pathway

[Figure: the sender builds packets with programmed I/O and its NIC DMAs them to the network; the receiving NIC DMAs packets into a pinned DMA receive region, and user-level handlers (Handler 1, Handler 2) move the data into user message buffers]

  • Concurrency of I/O busses

  • Sender specifies receiver handler ID

  • Flow control keeps DMA region from being overflowed




Fast Messages 1.x – An example message passing API and library

  • API: Berkeley Active Messages

    • Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)

  • Focus on short-packet performance:

    • Programmed IO (PIO) instead of DMA

    • Simple buffering and flow control

    • Map I/O device to user space (OS bypass)

Sender:

  FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures

Receiver:

  FM_extract();
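A hedged sketch of how these two calls fit together; the extern prototypes and the handler signature are inferred from the listing above and from the description of handlers on the next slide, not the exact FM 1.1 declarations:

    /* Prototypes as inferred from the API listing above. */
    extern void FM_send(int dest, int handler, void *buf, int size);
    extern int  FM_extract(void);

    /* Receiver side: an active-message handler, invoked by FM when a matching
       message arrives rather than by an explicit receive (see next slide). */
    void sum_handler(void *buf, int size)          /* assumed handler shape */
    {
        int *vals = (int *)buf;
        int n = size / (int)sizeof(int);
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += vals[i];
        (void)sum;                                 /* ... use the result ... */
    }

    /* Sender side: name the remote handler and push the data. */
    void send_work(int dest, int sum_handler_id, int *data, int n)
    {
        FM_send(dest, sum_handler_id, data, n * (int)sizeof(int));
    }

    /* Receiver main loop: handlers only run when the process services the
       network, e.g. by calling FM_extract() in a polling loop. */
    void progress_loop(void)
    {
        for (;;)
            FM_extract();
    }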



What is an active message?

  • Usually, message passing has a send with a corresponding explicit receive at the destination.

  • Active messages specify a function to invoke (activate) when message arrives

    • Function is usually called a message handler

      The handler gets called when the message arrives, not by the destination doing an explicit receive.


FM 1.x Performance (6/95)

[Chart: bandwidth (MB/s) vs. message size (16 to 2048 bytes) for FM and 1 Gb Ethernet; FM climbs toward ~20 MB/s]

  • Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]

  • Hardware limits PIO performance, but N1/2 = 54 bytes

  • Delivers 17.5 MB/s @ 128-byte messages (140 Mbps, greater than deliverable OC-3 ATM)



The FM Layering Efficiency Issue

  • How good is the FM 1.1 API?

  • Test: build a user-level library on top of it and measure the available performance

    • MPI chosen as representative user-level library

    • porting of MPICH 1.0 (ANL/MSU) to FM

  • Purpose: to study what services are important in layering communication libraries

    • integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]


MPI on FM 1.x - Inefficient Layering of Protocols

[Chart: bandwidth (MB/s) vs. message size (16 to 2048 bytes); MPI-FM delivers well under the ~20 MB/s of raw FM]

  • First implementation of MPI on FM was ready in Fall 1995

  • Disappointing performance: only a fraction of FM bandwidth was available to MPI applications


MPI-FM Efficiency

[Chart: efficiency (%) vs. message size (16 to 2048 bytes); MPI-FM efficiency stays well below 100%]

  • Result: FM is fast, but its interface is not efficient to layer on top of


MPI-FM Layering Inefficiencies

[Figure: MPI attaches a header to the source buffer, and FM must copy header and data again into the destination buffer]

  • Too many copies due to header attachment/removal and a lack of coordination between the transport and application layers



Redesign API - FM 2.x

  • Sending

    • FM_begin_message(NodeID, Handler, size)

    • FM_send_piece(stream,buffer,size) // gather

    • FM_end_message()

  • Receiving

    • FM_receive(buffer,size) // scatter

    • FM_extract(total_bytes) // rcvr flow control
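A sketch of how a layered library (e.g. the MPI port discussed next) might use this interface to move a header and a payload without an intermediate staging copy. The prototypes, stream type, and handler shape are inferred from the listing above; the real FM 2.x declarations may differ:

    /* Prototypes as inferred from the API listing above. */
    typedef void FM_stream;
    extern FM_stream *FM_begin_message(int dest, int handler, unsigned size);
    extern void      FM_send_piece(FM_stream *s, void *buf, unsigned size);
    extern void      FM_end_message(void);
    extern void      FM_receive(void *buf, unsigned size);
    extern unsigned  FM_extract(unsigned total_bytes);

    struct msg_header { int tag, src, len; };        /* illustrative MPI-style header */

    /* Sender: gather -- the header and the user payload leave from their own
       buffers as separate pieces of one message. */
    void layered_send(int dest, int handler, struct msg_header *hdr,
                      void *payload, unsigned len)
    {
        FM_stream *s = FM_begin_message(dest, handler, (unsigned)sizeof *hdr + len);
        FM_send_piece(s, hdr, sizeof *hdr);          /* piece 1: header  */
        FM_send_piece(s, payload, len);              /* piece 2: payload */
        FM_end_message();
    }

    /* Receiver handler: scatter -- pull the header first, then steer the
       payload straight into the matching application buffer (no extra copy). */
    void layered_handler(void)                       /* assumed handler shape */
    {
        struct msg_header hdr;
        static char app_buf[65536];                  /* stand-in for the MPI receive buffer */
        FM_receive(&hdr, sizeof hdr);
        FM_receive(app_buf, (unsigned)hdr.len);
    }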


MPI-FM 2.x Improved Layering

[Figure: with gather-scatter, MPI's header and the source buffer are sent as separate pieces and land directly in the destination buffer]

  • Gather-scatter interface + handler multithreading enable efficient layering and data manipulation without copies


MPI on FM 2.x

[Chart: bandwidth (MB/s) vs. message size (4 to 65,536 bytes) for FM and MPI-FM; MPI-FM closely tracks raw FM, reaching ~90 MB/s]

  • MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead

    • Short messages much better than IBM SP2, PCI limited

    • Latency ~ SGI O2K



MPI-FM 2.x Efficiency

[Chart: efficiency (%) vs. message size (4 to 65,536 bytes); MPI-FM efficiency approaches 100%]

  • High transfer efficiency, approaching 100% [Lauria, Pakin et al., HPDC-7 '98]

  • Other systems are much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)




HPVM III (“NT Supercluster”)

  • 256xPentium II, April 1998, 77Gflops

    • 3-level fat tree (large switches), scalable bandwidth, modular extensibility

  • => 512xPentium III (550 MHz) Early 2000, 280 GFlops

    • Both in partnership with the National Center for Supercomputing Applications (NCSA)

[Photos: the 77 GF cluster, April 1998, and the 280 GF cluster, early 2000]



Supercomputer Performance Characteristics

System                  Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                1200          ~2           ~2,500
SGI Origin2000          500           ~0.5         ~1,000
HPVM NT Supercluster    300           ~3.2         ~6,000
Berkeley NOW II         100           ~3.2         ~2,000
IBM SP2                 550           ~3.7         ~38,000
Beowulf (100 Mbit)      300           ~25          ~200,000

  • Compute/communicate and compute/latency ratios

  • Clusters can provide programmable characteristics at a dramatically lower system cost
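To make the ratio columns concrete, take the HPVM NT Supercluster row and assume roughly the MPI-FM figures quoted earlier (~90+ MB/s deliverable bandwidth, a network round trip of a few tens of microseconds):

    \frac{300\ \mathrm{Mflops}}{\sim 94\ \mathrm{MB/s}} \approx 3.2\ \mathrm{flops/byte}, \qquad
    300\times10^{6}\ \mathrm{flops/s} \times \sim 20\ \mu\mathrm{s} \approx 6{,}000\ \mathrm{flops\ per\ network\ RT}

The larger these ratios, the more computation an application must perform per byte communicated (and per round trip) to stay efficient, which is why the 100 Mbit Beowulf row is the hardest to program for fine-grained communication.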



Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems

Preconditioned conjugate gradient method with a multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)



Is the detail important? Is there something easier?

  • The details of a particular high-performance interface illustrate some of the complexity of these systems

    • Performance and scaling are very important. Sometimes the underlying structure needs to be understood to reason about applications.

  • The class will focus on distributed computing algorithms and interfaces at a higher level (message passing)



How do we program/run such machines?

  • PVM (Parallel Virtual Machine) provides

    • Simple message passing API

    • Construction of virtual machine with a software console

    • Ability to spawn (start), kill (stop), monitor jobs

      • XPVM is a graphical console, performance monitor

  • MPI (Message Passing Interface)

    • Complex and complete message passing API

    • De facto, community-defined standard

    • No defined method for job management

      • mpirun is provided as a tool with the MPICH distribution

    • Commercial and non-commercial tools for monitoring and debugging

      • Jumpshot, VaMPIr, …
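Minimal, hedged examples of the two APIs (generic textbook patterns in C, not code from this lecture). First MPI, where rank 0 sends one integer to rank 1; launched with MPICH's mpirun, e.g. "mpirun -np 2 ./a.out":

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            n = 42;
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* dest 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* src 0, tag 0 */
            printf("rank 1 received %d\n", n);
        }
        MPI_Finalize();
        return 0;
    }

And the PVM flavor, where a master spawns workers inside an already-running virtual machine and packs/sends one integer; the "worker" executable name is hypothetical:

    #include <pvm3.h>

    int main(void)
    {
        int tids[4], n = 42;
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids); /* start 4 tasks */
        pvm_initsend(PvmDataDefault);                                 /* new send buffer */
        pvm_pkint(&n, 1, 1);                                          /* pack one int    */
        pvm_send(tids[0], 1);                                         /* message tag 1   */
        pvm_exit();                                                   /* leave the VM    */
        return 0;
    }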



Next Time …

  • Parallel Programming Paradigms

    Shared Memory

    Message passing


  • Login