
CS160 – Lecture 3

Clusters: Introduction to PVM and MPI


Introduction to PC Clusters

  • What are PC Clusters?

  • How are they put together?

  • Examining the lowest level messaging pipeline

  • Relative application performance

  • Starting with PVM and MPI


Clusters, Beowulfs, and more

  • How do you put a “Pile-of-PCs” into a room and make them do real work?

    • Interconnection technologies

    • Programming them

    • Monitoring

    • Starting and running applications

    • Running at Scale


Beowulf Cluster

  • Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network

    • Dual Intel PIIIs with Fast Ethernet, running Linux

      • Program with PVM, MPI, …

    • Single Alpha PCs running Linux


Beowulf Clusters cont’d

  • Interconnection network is usually Fast Ethernet running TCP/IP

    • (Relatively) slow network

    • Programming model is message passing

  • Most people now associate the name “Beowulf” with any cluster of PCs

    • Beowulfs are differentiated from high-performance clusters by the network

  • www.beowulf.org has lots of information


High-Performance Clusters

Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM

  • Killer micros: low-cost gigaflop processors, here now for a few thousand dollars per processor

  • Killer networks: gigabit network hardware, high-performance software (e.g., Fast Messages), soon at a few hundred dollars per connection

  • Leverage HW, commodity SW (*nix/Windows NT), build key technologies

    => high performance computing in a RICH software environment


Cluster Research Groups

  • Many other cluster groups that have had impact

    • Active Messages / Network of Workstations (NOW), UC Berkeley

    • Basic Interface for Parallelism (BIP), Univ. of Lyon

    • Fast Messages (FM) / High Performance Virtual Machines (HPVM), UIUC/UCSD

    • Real World Computing Partnership (Japan)

    • Scalable High-performance Really Inexpensive Multi-Processor (SHRIMP), Princeton


Clusters are Different

  • A pile of PCs is not a large-scale SMP server.

    • Why? Performance and programming model

  • A cluster’s closest cousin is an MPP

    • What’s the major difference? Clusters run N copies of the OS, MPPs usually run one.


Ideal Model: HPVM’s

[Diagram: Application Program → “Virtual Machine Interface” → actual system configuration]

  • HPVM = High Performance Virtual Machine

  • Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity

  • Simplifies use of complex resources


Virtualization of Machines

  • Want the illusion that a collection of machines is a single machine

    • Start, stop, monitor distributed programs

    • Programming and debugging should work seamlessly

    • PVM (Parallel Virtual Machine) was the first, widely-adopted virtualization for parallel computing

  • This illusion is only partially complete in any software system. Some issues:

    • Node heterogeneity.

    • Real network topology can lead to contention

  • Unrelated – What is a Java Virtual Machine?


High-Performance Communication

[Diagram: switched 100 Mbit networks with OS-mediated access vs. switched multigigabit networks with user-level access]

  • Level of network interface support + NIC/network router latency

    • Overhead and latency of communication => deliverable bandwidth

  • High-performance communication => programmability!

    • Low-latency, low-overhead, high-bandwidth cluster communication

    • … much more is needed …

      • Usability issues, I/O, Reliability, Availability

      • Remote process debugging/monitoring


Putting a cluster together

  • (16, 32, 64, … X) Individual Node

    • E.g., dual-processor Pentium III/733, 1 GB memory, Ethernet

  • Scalable High-speed network

    • Myrinet, Giganet, Servernet, Gigabit Ethernet

  • Message-passing libraries

    • TCP, MPI, PVM, VIA

  • Multiprocessor job launch

    • Portable Batch System

    • Load Sharing Facility

    • PVM spawn, mpirun, rsh

  • Techniques for system management

    • VA Linux Cluster Manager (VACM)

    • High Performance Technologies Inc (HPTI)


Communication Style is Message Passing

[Diagram: a message on Machine A is broken into packets (1–4) and sent to Machine B, where the packets are reassembled]

  • How do we efficiently get a message from Machine A to Machine B?

  • How do we efficiently break a large message into packets and reassemble at receiver?

  • How does receiver differentiate among message fragments (packets) from different senders?
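A message-passing library hides these mechanics from the application. As an illustration (a minimal sketch, not from the lecture), standard MPI: the library packetizes and reassembles the buffer, and the receiver tells senders apart by source rank and tag.

/* Minimal MPI sketch (illustrative): the library handles packetization and
 * reassembly; the source rank and tag distinguish senders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)            /* 1 MB payload: sent as many packets internally */
#define TAG_DATA 42            /* tag lets the receiver tell message streams apart */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(N);
    if (rank != 0) {
        /* every non-root rank plays "Machine A": send one large message */
        MPI_Send(buf, N, MPI_CHAR, 0, TAG_DATA, MPI_COMM_WORLD);
    } else {
        /* rank 0 plays "Machine B": messages arrive reassembled, and
         * status.MPI_SOURCE identifies which sender each one came from */
        MPI_Status status;
        for (int i = 1; i < size; i++) {
            MPI_Recv(buf, N, MPI_CHAR, MPI_ANY_SOURCE, TAG_DATA,
                     MPI_COMM_WORLD, &status);
            printf("received %d bytes from rank %d\n", N, status.MPI_SOURCE);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}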



FM on Commodity PCs

[Diagram: FM host library, FM device driver, and FM NIC firmware layers]

  • Host Library: API presentation, flow control, segmentation/reassembly, multithreading

  • Device driver: protection, memory mapping, scheduling monitors

  • NIC Firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing

[Diagram: Pentium II/III host (~450 MIPS) connected over the P6 bus and PCI to the NIC (~33 MIPS), with a 1280 Mbps network link]


Fast Messages 2.x Performance

[Figure: FM 2.x bandwidth (MB/s) vs. message size (4 to 65,536 bytes), reaching 100+ MB/s; N1/2 marked]

  • Latency 8.8 µs, bandwidth 100+ MB/s, N1/2 ~250 bytes

  • Fast in absolute terms (compares to MPP’s, internal memory BW)

  • Delivers a large fraction of hardware performance for short messages

  • Technology transferred in emerging cluster standards Intel/Compaq/Microsoft’s Virtual Interface Architecture.


Comments about Performance

  • Latency and bandwidth are the most basic measurements of message-passing machines

    • Will discuss performance models in detail because

      • Latency and bandwidth do not tell the entire story

  • High-performance clusters exhibit

    • 10X improvement in deliverable bandwidth over Ethernet

    • 20X – 30X improvement in latency


How does FM really get Speed?

  • Protected user-level access to network (OS-bypass)

  • Efficient credit-based flow control (see the sketch after this list)

    • assumes reliable hardware network [only OK for System Area Networks]

    • No buffer overruns ( stalls sender if no receive space)

  • Early de-multiplexing of incoming packets

    • multithreading, use of NT user-schedulable threads

  • Careful implementation with many tuning cycles

    • Overlapping DMAs (Recv), Programmed I/O send

    • No interrupts! Polling only.
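A minimal sketch of the credit-based flow control idea above (illustrative C, not FM’s actual implementation): the sender holds one credit per pinned receive buffer, spends a credit per packet, and stalls when credits run out; the receiver returns credits as it drains buffers, so the DMA region can never be overrun.

/* Credit-based flow control sketch (single-process simulation). */
#include <stdio.h>

#define CREDITS 8                 /* one credit per pinned receive buffer   */

static int credits = CREDITS;     /* sender-side count of free recv buffers */

static int try_send_packet(int seq)
{
    if (credits == 0)             /* no receive space left: stall the sender */
        return 0;
    credits--;                    /* consume one credit per packet sent      */
    printf("sent packet %d (credits left %d)\n", seq, credits);
    return 1;
}

static void receiver_drains(int n)
{
    credits += n;                 /* simulated: receiver frees n buffers and returns n credits */
}

int main(void)
{
    int seq = 0;
    while (seq < 20) {
        if (!try_send_packet(seq)) {
            receiver_drains(4);   /* in a real system this arrives over the network */
            continue;             /* retry: the receive region is never overrun */
        }
        seq++;
    }
    return 0;
}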


OS-Bypass Background

  • Suppose you want to perform a sendto on a standard IP socket?

    • Operating System mediates access to the network device

      • Must trap into the kernel to ensure authorization on each and every message (very time consuming)

      • Message is copied from user program to kernel packet buffers

      • Protocol information about each packet is generated by the OS and attached to a packet buffer

      • Message is finally sent out onto the physical device (ethernet)

  • Receiving does the inverse with a recvfrom

    • Packet to kernel buffer, OS strip of header, reassembly of data, OS mediation for authorization, copy into user program
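For contrast, the OS-mediated path described here is the ordinary sockets path. A minimal UDP example using standard calls (the port and address are arbitrary): every sendto traps into the kernel, which copies the user buffer, builds protocol headers, and queues the packet; recvfrom reverses the process.

/* Kernel-mediated UDP send: each call below is a system call (trap), and the
 * payload is copied from the user buffer into kernel packet buffers. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);      /* trap into the kernel */

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                  /* arbitrary example port */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    char msg[] = "hello";
    /* user buffer -> kernel buffer copy, header generation, device transmit */
    sendto(s, msg, sizeof msg, 0, (struct sockaddr *)&dst, sizeof dst);

    close(s);
    return 0;
}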


OS-Bypass

  • A user program is given a protected slice of the network interface

    • Authorization is done once (not per message)

  • Outgoing packets get directly copied or DMAed to network interface

    • Protocol headers added by user-level library

  • Incoming packets get routed by network interface card (NIC) into user-defined receive buffers

    • NIC must know how to differentiate incoming packets. This is called early demultiplexing.

  • Outgoing and incoming message copies are eliminated.

  • Traps to OS kernel are eliminated
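A rough sketch of the user-level send path (assumptions: the driver has already mapped a NIC send ring into the process once at setup; here plain memory stands in for NIC memory, and the slot layout and doorbell are hypothetical):

/* OS-bypass send sketch: after one-time setup/authorization, the user-level
 * library writes packets straight into a mapped NIC send ring -- no traps,
 * no kernel copies. Ordinary memory stands in for NIC space here. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RING_SLOTS 64
#define SLOT_BYTES 2048

struct slot { uint32_t dest; uint32_t handler_id; uint32_t len; char data[SLOT_BYTES]; };

static struct slot *ring;        /* in a real system: mmap of NIC memory, set up once by the driver */
static unsigned head;            /* producer index, owned by the user-level library */

void user_level_send(uint32_t dest, uint32_t handler_id, const void *buf, uint32_t len)
{
    struct slot *s = &ring[head % RING_SLOTS];
    s->dest = dest;              /* protocol header written by the user-level library itself */
    s->handler_id = handler_id;
    s->len = len;
    memcpy(s->data, buf, len);   /* programmed-I/O-style copy directly into "NIC" memory */
    head++;                      /* real hardware: ring a doorbell register so the NIC transmits */
}

int main(void)
{
    ring = calloc(RING_SLOTS, sizeof *ring);   /* stand-in for the mapped NIC region */
    user_level_send(1, 7, "ping", 5);
    free(ring);
    return 0;
}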


Packet Pathway

[Diagram: on send, packets go by programmed I/O from the user message buffer to the NIC; on receive, the NIC DMAs packets into a pinned DMA receive region, and user-level handlers move them into user message buffers]

  • Concurrency of I/O busses

  • Sender specifies receiver handler ID

  • Flow control keeps DMA region from being overflowed



Fast Messages 1.x – An example message passing API and library

  • API: Berkeley Active Messages

    • Key distinctions: guarantees(reliable, in-order, flow control), network-processor decoupling (DMA region)

  • Focus on short-packet performance:

    • Programmed IO (PIO) instead of DMA

    • Simple buffering and flow control

    • Map I/O device to user space (OS bypass)

Sender:

FM_send(NodeID,Handler,Buffer,size);

// handlers are remote procedures

Receiver:

FM_extract()


What is an active message?

  • Usually, message passing has a send with a corresponding explicit receive at the destination.

  • Active messages specify a function to invoke (activate) when message arrives

    • Function is usually called a message handler

      The handler gets called when the message arrives, not by the destination doing an explicit receive.
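A conceptual sketch of handler activation (plain C, not the actual FM or Active Messages libraries): the polling loop, driven by something like FM_extract from the previous slide, reads the handler ID carried in each arriving message and calls the corresponding function, so the receiver never posts an explicit receive.

/* Conceptual active-message dispatch (illustrative only):
 * each arriving message names a handler; the poll/extract loop invokes it. */
#include <stdio.h>

typedef void (*am_handler)(const void *data, int len);

static void ping_handler(const void *data, int len) { printf("ping: %.*s\n", len, (const char *)data); }
static void ack_handler(const void *data, int len)  { (void)data; (void)len; printf("ack\n"); }

static am_handler handler_table[] = { ping_handler, ack_handler };

struct message { int handler_id; int len; char payload[64]; };

/* Called for every arrived message: the message itself says which
 * function to run; the receiver never does an explicit receive. */
static void dispatch(const struct message *m)
{
    handler_table[m->handler_id](m->payload, m->len);
}

int main(void)
{
    struct message m = { 0, 5, "hello" };
    dispatch(&m);                /* as if the extract loop had just pulled this off the network */
    return 0;
}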


FM 1.x Performance (6/95)

[Figure: bandwidth (MB/s, 0–20) vs. message size (16–2048 bytes) for FM and 1 Gb Ethernet]

  • Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing ’95]

  • Hardware limits PIO performance, but N1/2 = 54 bytes

  • Delivers 17.5 MB/s @ 128-byte messages (140 Mbps, greater than deliverable OC-3 ATM)


The FM Layering Efficiency Issue

  • How good is the FM 1.1 API?

  • Test: build a user-level library on top of it and measure the available performance

    • MPI chosen as representative user-level library

    • porting of MPICH 1.0 (ANL/MSU) to FM

  • Purpose: to study what services are important in layering communication libraries

    • integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]


MPI on FM 1.x – Inefficient Layering of Protocols

[Figure: bandwidth (MB/s, 0–20) vs. message size (16–2048 bytes) for FM and MPI-FM]

  • First implementation of MPI on FM was ready in Fall 1995

  • Disappointing performance, only fraction of FM bandwidth available to MPI applications


MPI-FM Efficiency

[Figure: MPI-FM efficiency (%, 0–100) vs. message size (16–2048 bytes)]

  • Result: FM is fast, but its interface is not efficient to layer on


MPI-FM Layering Inefficiencies

[Diagram: the MPI header and source buffer are copied through the MPI and FM layers on the way to the destination buffer]

  • Too many copies due to header attachment/removal, lack of coordination between transport and application layers


Redesign API – FM 2.x

  • Sending

    • FM_begin_message(NodeID, Handler, size)

    • FM_send_piece(stream,buffer,size) // gather

    • FM_end_message()

  • Receiving

    • FM_receive(buffer,size) // scatter

    • FM_extract(total_bytes) // rcvr flow control
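The benefit of the gather interface can be illustrated with POSIX writev as a stand-in (this is not the FM API): the header and the user payload are handed to the transport as two pieces, so they never have to be copied into one contiguous staging buffer. FM_send_piece plays the same role on the send side, and FM_receive scatters on the receive side.

/* Gather-style send with POSIX writev (stand-in for FM_send_piece):
 * header and payload go out as separate pieces, avoiding a staging copy. */
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct msg_header { int tag; int len; };

ssize_t send_with_header(int fd, int tag, const char *payload, int len)
{
    struct msg_header h = { tag, len };
    struct iovec pieces[2];
    pieces[0].iov_base = &h;                 /* piece 1: the header           */
    pieces[0].iov_len  = sizeof h;
    pieces[1].iov_base = (void *)payload;    /* piece 2: user data, uncopied  */
    pieces[1].iov_len  = (size_t)len;
    return writev(fd, pieces, 2);            /* transport gathers the pieces  */
}

int main(void)
{
    /* fd would normally be a socket; stdout keeps the sketch standalone */
    const char *data = "payload bytes";
    return send_with_header(STDOUT_FILENO, 42, data, (int)strlen(data)) < 0;
}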


MPI-FM 2.x Improved Layering

[Diagram: with gather-scatter, the MPI header and the source buffer are delivered directly into the destination buffer without intermediate copies]

  • Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies


MPI on FM 2.x

[Figure: bandwidth (MB/s, 0–100) vs. message size (4–65,536 bytes) for FM and MPI-FM]

  • MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead

    • Short-message performance much better than IBM SP2; bandwidth is PCI-limited

    • Latency ~ SGI O2K



MPI-FM 2.x Efficiency

[Figure: MPI-FM efficiency (%, 0–100) vs. message size (4–65,536 bytes), approaching 100%]

  • High Transfer Efficiency, approaches 100% [Lauria, Pakin et al. HPDC7 ‘98]

  • Other systems much lower even at 1KB (100Mbit: 40%, 1Gbit: 5%)



HPVM III (“NT Supercluster”)

  • 256xPentium II, April 1998, 77Gflops

    • 3-level fat tree (large switches), scalable bandwidth, modular extensibility

  • => 512xPentium III (550 MHz) Early 2000, 280 GFlops

    • Both with National Center for Supercomputing Applications

[Photos: 77 GF cluster, April 1998; 280 GF cluster, early 2000]


Supercomputer Performance Characteristics

System                  Mflops/Proc   Flops/Byte   Flops/Network RT

Cray T3E                1200          ~2           ~2,500
SGI Origin2000          500           ~0.5         ~1,000
HPVM NT Supercluster    300           ~3.2         ~6,000
Berkeley NOW II         100           ~3.2         ~2,000
IBM SP2                 550           ~3.7         ~38,000
Beowulf (100 Mbit)      300           ~25          ~200,000

  • Compute/communicate and compute/latency ratios

  • Clusters can provide programmable characteristics at a dramatically lower system cost
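A plausible reading of these ratios (an interpretation, not stated on the slide): flops/byte is the per-processor compute rate divided by deliverable bandwidth, and flops/network RT is the compute rate times the round-trip latency. For the HPVM row this is consistent with the FM numbers shown earlier:

  300 Mflops ÷ ~94 MB/s ≈ 3.2 flops/byte
  300 Mflops × ~20 µs round trip ≈ 6,000 flops per network RT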


Solving 2D Navier-Stokes Kernel – Performance of Scalable Systems

Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)


Is the detail important? Is there something easier?

  • Detail of a particular high-performance interface illustrates some of the complexity for these systems

    • Performance and scaling are very important. Sometimes the underlying structure needs to be understood to reason about applications.

  • Class will focus on distributed computing algorithms and interfaces at a higher level (message passing)


How do we program/run such machines?

  • PVM (Parallel Virtual Machine) provides

    • Simple message passing API

    • Construction of virtual machine with a software console

    • Ability to spawn (start), kill (stop), monitor jobs

      • XPVM is a graphical console and performance monitor

  • MPI (Message Passing Interface)

    • Complex and complete message passing API

    • De facto, community-defined standard

    • No defined method for job management

      • mpirun provided as a tool with the MPICH distribution

    • Commercial and non-commercial tools for monitoring and debugging

      • Jumpshot, VaMPIr, …
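As a minimal example of the MPI side (standard MPI calls only; nothing course-specific), the program that mpirun launches as N cooperating copies:

/* Minimal MPI program: mpirun/mpiexec starts N copies, each learns its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                    /* join the parallel job        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* which copy am I?             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* how many copies are running? */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Launched with something like mpirun -np 4 ./hello (exact flags vary by MPI distribution); PVM would instead start tasks from its console or with pvm_spawn.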


Next Time …

  • Parallel Programming Paradigms

    Shared Memory

    Message passing

