
Presentation Transcript



Architectural Interactions in High Performance Clusters

RTPP 98

David E. Culler

Computer Science Division

University of California, Berkeley


Run-Time Framework

[Figure: several Parallel Programs run over per-node RunTime instances; each RunTime sits on its node's Machine Architecture, and the nodes are connected by a Network.]


Two Example RunTime Layers

  • Split-C

    • thin global address space abstraction over Active Messages

    • get, put, read, write

  • MPI

    • thicker message passing abstraction over Active Messages

    • send, receive
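As a minimal illustration of the thicker MPI layer, the C program below exchanges a small message between two ranks with the send/receive primitives that MPI layers over Active Messages. It is a sketch only; the buffer size and message contents are placeholders, not code from the talk.

/* Minimal MPI ping-pong sketch (illustrative; not code from the talk).
 * Rank 0 sends a small buffer to rank 1, which echoes a reply, mirroring
 * the send/receive abstraction the MPI runtime builds over Active Messages.
 * Compile with mpicc and run with 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[64] = "ping";
    if (rank == 0) {
        MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got reply: %s\n", buf);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send("pong", 5, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}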



Split-C over Active Messages

  • Read, Write, Get, Put built on small Active Message request / reply (RPC)

  • Bulk-Transfer (store & get)

[Figure: a request message invokes a request handler on the remote node; its reply invokes a reply handler back at the requester.]
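A minimal single-process sketch of the request/reply pattern behind a Split-C get. The handler and function names are invented for illustration, not the actual GAM/Split-C API; in the real layer the request is a small network packet whose arrival runs the handler on the remote node.

/* Sketch of the request/reply (RPC) pattern behind a Split-C "get".
 * Both "nodes" are simulated inside one process. */
#include <stdio.h>

static int remote_data = 42;            /* word living on the "remote" node */

/* Runs at the requester when the reply arrives: deposit the value and
 * mark the outstanding get as complete. */
static void reply_handler(int value, int *dest, int *done_flag) {
    *dest = value;
    *done_flag = 1;
}

/* Runs on the remote node when the request arrives: read the word and
 * fire a reply carrying it back. */
static void request_handler(int *reply_dest, int *done_flag) {
    reply_handler(remote_data, reply_dest, done_flag);
}

/* Split-phase get: issue the request, then wait until the reply handler
 * sets the completion flag (Split-C's sync() waits on such flags). */
static void get(int *local_dest) {
    int done = 0;
    request_handler(local_dest, &done);   /* "send" the request */
    while (!done) { /* poll the network */ }
}

int main(void) {
    int x = 0;
    get(&x);
    printf("got remote value %d\n", x);
    return 0;
}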


Model Framework: LogP

  • L: Latency in sending a (small) message between modules

  • o: overhead felt by the processor on sending or receiving a msg

  • g: gap between successive sends or receives (1/rate)

  • P: Processors

[Figure: P processor/memory pairs attached to an interconnection network of limited volume (at most L/g messages in flight per processor); o marks the send/receive overhead, g the gap, L the network latency.]

Round-trip time: 2 × (2o + L)
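As a worked illustration of how these parameters compose, the short sketch below evaluates the round-trip expression above and the cost of a burst of small messages. The parameter values are placeholders, not the measurements reported later in the talk.

/* LogP cost arithmetic for small messages (values are placeholders).
 * Times are in microseconds. */
#include <stdio.h>

int main(void) {
    double L = 10.0;   /* network latency                      */
    double o = 3.0;    /* per-message send/receive overhead    */
    double g = 6.0;    /* gap: minimum interval between sends  */
    int    n = 100;    /* messages issued back-to-back         */

    /* One request/reply RPC, as on the slide: 2 x (2o + L). */
    double round_trip = 2.0 * (2.0 * o + L);

    /* A burst of n one-way messages is paced by max(o, g) per message,
     * plus the overhead and latency to deliver the last one. */
    double burst = (n - 1) * (g > o ? g : o) + 2.0 * o + L;

    printf("round trip  = %.1f us\n", round_trip);
    printf("n-msg burst = %.1f us\n", burst);
    return 0;
}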


LogP Summary of Current Machines

[Table: measured LogP parameters for the machines surveyed; only the peak-bandwidth row survives in this transcript: Max MB/s of 38, 141, and 47.]



Methodology

  • Apparatus:

    • 35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5)

    • M2F Lanai + Myricom Net in Fat Tree variant

    • GAM + Split-C

  • Modify the Active Message layer to inflate L, o, g, or G independently

  • Execute a diverse suite of applications and observe effect

  • Evaluate against natural performance models


Adjusting L, o, and g (and G) in situ

[Figure: two host workstations, each running the AM lib, attached through Lanai interface cards to the Myrinet.]

  • o: stall the Ultra on message write (send side) and on message read (receive side)

  • L: defer marking a message as valid until Rx time + L

  • g: delay the Lanai after message injection (after each fragment, for bulk transfers)
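A hedged sketch of the inflation mechanism: adding overhead amounts to spinning the host processor for a chosen delta around each message operation, so the added cost is charged to the CPU just as real overhead is. The function names and calibration constant below are illustrative, not the actual AM layer.

/* Illustrative sketch of inflating the LogP "o" parameter. */
#include <stdio.h>

#define CYCLES_PER_US 167.0      /* placeholder calibration, not measured */

static double extra_o_us = 0.0;  /* extra overhead to add per message op  */

/* Busy-wait so the cost is paid by the processor, like real overhead. */
static void spin_us(double us) {
    volatile double x = 0.0;
    long iters = (long)(us * CYCLES_PER_US);
    for (long i = 0; i < iters; i++)
        x += 1.0;
    (void)x;
}

static void am_request(int dest, int value) {
    spin_us(extra_o_us);          /* inflate send overhead                */
    /* ... write the message into the network interface here ...          */
    (void)dest; (void)value;
}

int main(void) {
    extra_o_us = 5.0;             /* inflate o by 5 microseconds          */
    am_request(1, 42);
    printf("sent one message with o inflated by %.1f us\n", extra_o_us);
    return 0;
}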



Calibration



Applications Characteristics

  • Message Frequency

  • Write-based vs. Read-based

  • Short vs. Bulk Messages

  • Synchronization

  • Communication Balance



Applications used in the Study



Baseline Communication



Application Sensitivity to Communication Performance



Sensitivity to Overhead



Sensitivity to gap (1/msg rate)



Sensitivity to Latency



Sensitivity to bulk BW (1/G)



Modeling Effects of Overhead

  • Tpred = Torig + 2 × max #msgs × o

    • request / response

    • proc with most msgs limits overall time

  • Why does this model under-predict?
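A small sketch that evaluates the prediction above; the inputs are placeholders, not measurements from the study.

/* Overhead model: each of the busiest processor's messages pays the added
 * overhead twice (request and response). */
#include <stdio.h>

static double predict_overhead(double t_orig_us, long max_msgs, double delta_o_us) {
    return t_orig_us + 2.0 * (double)max_msgs * delta_o_us;
}

int main(void) {
    double t_orig = 2.0e6;    /* baseline run time, us          */
    long   msgs   = 50000;    /* messages on the busiest proc   */
    double d_o    = 10.0;     /* added overhead per msg, us     */
    printf("Tpred = %.0f us\n", predict_overhead(t_orig, msgs, d_o));
    return 0;
}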



Modeling Effects of gap

  • Uniform communication model:

    Tpred = Torig, if g < I (the average message interval)

    Tpred = Torig + m × (g − I), otherwise

  • Bursty communication:

    Tpred = Torig + m × g
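The same models, expressed as a small C sketch with placeholder inputs (m is the number of messages, as in the formulas above).

/* Gap model: when the injected gap g exceeds the application's average
 * message interval I, each of the m messages is delayed by (g - I);
 * under bursty traffic every message pays the full g. */
#include <stdio.h>

static double predict_uniform(double t_orig, long m, double g, double I) {
    return (g < I) ? t_orig : t_orig + (double)m * (g - I);
}

static double predict_bursty(double t_orig, long m, double g) {
    return t_orig + (double)m * g;
}

int main(void) {
    double t_orig = 2.0e6;   /* baseline run time, us        */
    long   m      = 50000;   /* messages sent                */
    double I      = 40.0;    /* average message interval, us */
    double g      = 60.0;    /* inflated gap, us             */
    printf("uniform: %.0f us\n", predict_uniform(t_orig, m, g, I));
    printf("bursty:  %.0f us\n", predict_bursty(t_orig, m, g));
    return 0;
}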



Extrapolating to Low Overhead



MPI over AM: ping-pong bandwidth



MPI over AM: start-up



NPB2 Speedup: NOW vs SP2



NOW vs. Origin



Single Processor Performance



Understanding Speedup

Speedup(p) = T1 / MAXp (Tcompute + Tcomm + Twait)

Tcompute = (work/p + extra) × efficiency
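A worked sketch of this decomposition with placeholder per-process timings: speedup is the serial time divided by the slowest process's total of compute, communication, and wait time.

/* Speedup decomposition sketch; all numbers are placeholders. */
#include <stdio.h>

#define P 4

int main(void) {
    double t1 = 100.0;                              /* serial time      */
    double compute[P] = {24.0, 25.0, 26.0, 25.0};   /* per-proc compute */
    double comm[P]    = { 3.0,  2.0,  4.0,  3.0};   /* per-proc comm    */
    double wait[P]    = { 2.0,  3.0,  1.0,  2.0};   /* per-proc wait    */

    double slowest = 0.0;
    for (int i = 0; i < P; i++) {
        double t = compute[i] + comm[i] + wait[i];
        if (t > slowest)
            slowest = t;
    }
    printf("Speedup(%d) = %.2f\n", P, t1 / slowest);
    return 0;
}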



Performance Tools for Clusters

  • Independent data collection on every node

    • Timing

    • Sampling

    • Tracing

  • Little perturbation of global effects
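A minimal sketch of per-node timing in this style: each node accumulates its own timers locally, so instrumentation barely perturbs global behavior, and the per-node records are merged after the run. The accumulator names are invented, not the actual tool from the talk.

/* Per-node data collection sketch using gettimeofday. */
#include <stdio.h>
#include <sys/time.h>

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Per-node accumulators for the phases of interest. */
static double t_compute = 0.0, t_comm = 0.0;

int main(void) {
    double t0 = now_us();
    /* ... compute phase ... */
    t_compute += now_us() - t0;

    t0 = now_us();
    /* ... communication phase ... */
    t_comm += now_us() - t0;

    /* Each node dumps its own record; a post-processing step merges them. */
    printf("node record: compute=%.1f us comm=%.1f us\n", t_compute, t_comm);
    return 0;
}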



Where the Time Goes: LU-a



Where the Time Goes: BT-a


Constant Problem Size Scaling

[Figure: constant-problem-size scaling results; axis values 4, 8, 16, 32, 64, 128, 256.]


Communication Scaling

[Figures: normalized messages per processor, and average message size.]



Communication Scaling: Volume



Extra Work



Cache Working Sets: LU

8-fold reduction in miss rate from 4 to 8 processors



Cache Working Sets: BT



Cycles per Instruction



MPI Internal Protocol

[Figure: sender/receiver message exchange in the MPI internal protocol.]



Revised Protocol

[Figure: sender/receiver message exchange in the revised protocol.]



Sensitivity to Overhead



Conclusions

  • Run Time systems for Parallel Programs must deal with a host of architectural interactions

    • communication

    • computation

    • memory system

  • Build a performance model of your RTPP

    • only way to recognize anomalies

  • Build tools along with the run-time to reflect characteristics and sensitivity back to the parallel program

  • Much can lurk beneath a perfect speedup curve

