
Architectural Interactions in High Performance Clusters

RTPP 98

David E. Culler

Computer Science Division

University of California, Berkeley


Run-Time Framework

[Figure: on each node a Parallel Program runs over a RunTime layer over the Machine Architecture (processor + memory); the nodes (° ° °) are connected by a Network]


Two Example RunTime Layers

  • Split-C

    • thin global address space abstraction over Active Messages

    • get, put, read, write

  • MPI

    • thicker message passing abstraction over Active Messages

    • send, receive
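
To make the contrast concrete, here is a rough C sketch of how an application sees the two layers. The ras_put / ras_sync names are hypothetical stand-ins for Split-C's split-phase global-address-space primitives (declarations only, not an implementation); the MPI call is the standard point-to-point API. Both layers sit on the same Active Message transport.

/* Illustrative sketch only: how an application sees the two runtime layers.
 * ras_put / ras_sync are hypothetical stand-ins for Split-C primitives;
 * the MPI call is the standard point-to-point API. */
#include <mpi.h>

/* Hypothetical thin global-address-space interface (Split-C flavor). */
extern void ras_put(int node, void *remote_dst, const void *local_src, int nbytes);
extern void ras_sync(void);   /* wait for outstanding split-phase operations */

/* One-sided style: the remote processor is involved only via an AM handler. */
void exchange_gas(int peer, double *remote_buf, const double *local_buf, int n)
{
    ras_put(peer, remote_buf, local_buf, n * (int)sizeof(double));
    ras_sync();
}

/* Two-sided style: matching send and receive, with MPI's message matching
 * layered on top of the same transport. */
void exchange_mpi(int peer, double *send_buf, double *recv_buf, int n)
{
    MPI_Sendrecv(send_buf, n, MPI_DOUBLE, peer, 0,
                 recv_buf, n, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}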


Split-C over Active Messages

  • Read, Write, Get, Put built on small Active Message request / reply (RPC)

  • Bulk-Transfer (store & get)

[Figure: a request message invokes a request handler on the remote node, whose reply is consumed by a reply handler on the originating node]
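
The sketch below shows how a blocking remote read can be built from one Active Message request/reply pair. The am_* calls and handler signatures are hypothetical placeholders; the real GAM interface differs in detail.

/* Hypothetical AM primitives carrying two word-sized arguments. */
typedef void (*am_handler_t)(int src_node, void *arg1, void *arg2);
extern void am_request(int dest_node, am_handler_t h, void *a1, void *a2);
extern void am_reply(int src_node, am_handler_t h, void *a1, void *a2);
extern void am_poll(void);                 /* drain the network, run handlers */

static volatile int read_done;             /* completion flag on the requester */

/* Runs on the requesting node: deposit the value, signal completion. */
static void read_reply_handler(int src_node, void *local_dst, void *value)
{
    (void)src_node;
    *(long *)local_dst = (long)value;
    read_done = 1;
}

/* Runs on the remote node: fetch the word and send it back. */
static void read_request_handler(int src_node, void *remote_addr, void *local_dst)
{
    long value = *(long *)remote_addr;
    am_reply(src_node, read_reply_handler, local_dst, (void *)value);
}

/* Blocking read: one round trip, roughly 2 x (2o + L) in LogP terms. */
long split_c_style_read(int node, long *remote_addr)
{
    long result;
    read_done = 0;
    am_request(node, read_request_handler, remote_addr, &result);
    while (!read_done)
        am_poll();
    return result;
}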


Model Framework: LogP

  • L: Latency in sending a (small) message between modules

  • o: overhead felt by the processor on sending or receiving a msg

  • g: gap between successive sends or receives (1/rate)

  • P: Processors

[Figure: P processor/memory (P, M) pairs attached to a limited-volume interconnection network (at most L/g messages in flight per processor); o is paid at send and receive, g between successive injections, L across the network]

Round trip time: 2 x (2o + L)
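
A back-of-the-envelope reading of these parameters, in C. The numeric values are made-up placeholders, not measurements from the Berkeley NOW cluster; the formulas follow the standard LogP accounting.

#include <stdio.h>

int main(void)
{
    double L = 10.0;   /* latency (us), hypothetical value         */
    double o = 3.0;    /* send/receive overhead (us), hypothetical */
    double g = 6.0;    /* gap between messages (us), hypothetical  */
    int    n = 100;    /* messages sent back-to-back               */

    /* One request/reply pair costs a full round trip: 2 x (2o + L). */
    double round_trip = 2.0 * (2.0 * o + L);

    /* A burst of n messages is limited by the larger of o and g once the
     * pipeline is full; at most L/g messages are in flight per processor. */
    double burst = o + (n - 1) * (g > o ? g : o) + L + o;

    printf("round trip      : %.1f us\n", round_trip);
    printf("burst of %3d    : %.1f us\n", n, burst);
    printf("msgs in flight  : %.1f\n", L / g);
    return 0;
}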



Methodology

  • Apparatus:

    • 35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5)

    • M2F Lanai + Myricom Net in Fat Tree variant

    • GAM + Split-C

  • Modify the Active Message layer to inflate L, o, g, or G independently

  • Execute a diverse suite of applications and observe effect

  • Evaluate against natural performance models


Adjusting L, o, and g (and G) in situ

[Figure: two host workstations, each running the AM library over a Lanai network interface, connected by Myrinet; the delays below are inserted at the marked points]

  • o: stall the Ultra on message write (send side) and on message read (receive side)

  • L: defer marking the message as valid until Rx + L

  • g: delay the Lanai after message injection (after each fragment for bulk transfers)
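
A minimal sketch of this kind of delay insertion for the send-overhead case: the send path spins for an extra delta before returning, inflating o without touching the network. The names (am_raw_send, cycle_count) are hypothetical, not the actual GAM or Lanai firmware interface; the L and g delays would be inserted analogously on the NIC side.

extern unsigned long cycle_count(void);           /* hypothetical cycle timer */
extern void am_raw_send(int node, const void *msg, int len);

static unsigned long delta_o_cycles;              /* set by the experiment harness */

static void spin_cycles(unsigned long n)
{
    unsigned long start = cycle_count();
    while (cycle_count() - start < n)
        ;                                         /* burn host CPU time only */
}

void am_send_inflated(int node, const void *msg, int len)
{
    am_raw_send(node, msg, len);
    spin_cycles(delta_o_cycles);                  /* processor stalls, network does not */
}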



Applications Characteristics

  • Message Frequency

  • Write-based vs. Read-based

  • Short vs. Bulk Messages

  • Synchronization

  • Communication Balance









Modeling Effects of Overhead

  • Tpred = Torig + 2 x (max #msgs) x o

    • request / response

    • proc with most msgs limits overall time

  • Why does this model under-predict?
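
A worked reading of the model, with illustrative numbers only (not measurements from the NOW cluster):

#include <stdio.h>

int main(void)
{
    double t_orig   = 10.0e6;  /* original run time in us (hypothetical) */
    double max_msgs = 2.0e5;   /* messages on the busiest processor      */
    double delta_o  = 5.0;     /* added overhead per send/receive, us    */

    /* Factor of 2: each message involves a request and a response. */
    double t_pred = t_orig + 2.0 * max_msgs * delta_o;

    printf("Tpred = %.1f s, predicted slowdown = %.2fx\n",
           t_pred / 1e6, t_pred / t_orig);
    return 0;
}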


Modeling Effects of gap

  • Uniform communication model

    Tpred = Torig, if g < I (the average msg interval)

    Tpred = Torig + m (g - I), otherwise

  • Bursty communication

    Tpred = Torig + m g

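The same model written out as functions; Torig, the message count m, and the average message interval I would come from the application trace:

/* Uniform communication: only the part of g not covered by the natural
 * message interval is exposed. */
double t_pred_uniform(double t_orig, double m, double g, double interval)
{
    if (g < interval)
        return t_orig;                     /* gap hidden behind computation */
    return t_orig + m * (g - interval);
}

/* Bursty communication: every message pays the full gap. */
double t_pred_bursty(double t_orig, double m, double g)
{
    return t_orig + m * g;
}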








Understanding Speedup

SpeedUp(p) = T1 / MAXp (Tcompute + Tcomm + Twait)

Tcompute = (work/p + extra) x efficiency
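
The decomposition as a function: the slowest processor's compute + communication + wait time sets the parallel run time. The per-processor arrays are assumed to come from the kind of per-node tools described on the next slide.

double speedup(double t1, const double *t_compute, const double *t_comm,
               const double *t_wait, int p)
{
    double t_max = 0.0;
    for (int i = 0; i < p; i++) {
        double t = t_compute[i] + t_comm[i] + t_wait[i];
        if (t > t_max)
            t_max = t;                     /* slowest processor dominates */
    }
    return t1 / t_max;
}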


Performance Tools for Clusters

  • Independent data collection on every node

    • Timing

    • Sampling

    • Tracing

  • Little perturbation of global effects
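
A minimal sketch of per-node data collection in this spirit: each node timestamps events into a private buffer and dumps it after the timed region, so the tool perturbs only the node it runs on. This is an assumed design, not the actual Berkeley tool set.

#include <stdio.h>
#include <sys/time.h>

#define MAX_EVENTS 65536

static double ev_time[MAX_EVENTS];
static int    ev_id[MAX_EVENTS];
static int    ev_count;

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Record an event locally: no communication, only a few memory writes. */
void trace_event(int id)
{
    if (ev_count < MAX_EVENTS) {
        ev_id[ev_count]   = id;
        ev_time[ev_count] = now_us();
        ev_count++;
    }
}

/* Flush this node's log after the timed region, one file per node. */
void trace_dump(int my_node)
{
    char name[64];
    snprintf(name, sizeof name, "trace.%d", my_node);
    FILE *f = fopen(name, "w");
    if (!f)
        return;
    for (int i = 0; i < ev_count; i++)
        fprintf(f, "%d %.1f\n", ev_id[i], ev_time[i]);
    fclose(f);
}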




Constant Problem Size Scaling

[Figure: constant problem size scaling results; chart axis values 4, 8, 16, 32, 64, 128, 256]


Communication Scaling

[Figures: normalized messages per processor and average message size]




Cache Working Sets: LU

[Figure: LU cache working sets; 8-fold reduction in miss rate from 4 to 8 processors]




MPI Internal Protocol

[Figure: message exchange between sender and receiver in the original MPI protocol]


Revised Protocol

[Figure: message exchange between sender and receiver in the revised protocol]



Conclusions

  • Run Time systems for Parallel Programs must deal with a host of architectural interactions

    • communication

    • computation

    • memory system

  • Build a performance model of your RTPP

    • only way to recognize anomalies

  • Build tools along with the RT to reflect characteristics and sensitivity back to PP

  • Much can lurk beneath a perfect speedup curve

