Architectural Interactions in High Performance Clusters

RTPP 98

David E. Culler

Computer Science Division

University of California, Berkeley

Run-Time Framework

[Diagram: several Parallel Programs, each running over its own RunTime layer on its own Machine Architecture; the node stacks repeat ° ° ° across the cluster and are connected by a Network.]

Two Example RunTime Layers
  • Split-C
    • thin global address space abstraction over Active Messages
    • get, put, read, write
  • MPI
    • thicker message passing abstraction over Active Messages
    • send, receive (a sketch contrasting the two call surfaces follows)
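
To make the thin/thick contrast concrete, here is a rough sketch of the two call surfaces. The gp_* prototypes are illustrative stand-ins for Split-C's global-pointer operations (the real Split-C spells these as dereferences of global pointers); the MPI calls are the standard ones from <mpi.h>.

    #include <mpi.h>      /* the thick layer's public interface */
    #include <stddef.h>

    /* Thin layer, illustrative prototypes: a global address is a
     * (node, local address) pair, and each call maps almost directly
     * onto an Active Message request/reply. */
    int  gp_read (int node, int *raddr);
    void gp_write(int node, int *raddr, int value);
    void gp_get  (void *local, int node, void *raddr, size_t nbytes);
    void gp_put  (int node, void *raddr, void *local, size_t nbytes);

    void example(int peer, int *remote_counter)
    {
        /* Thin: one-sided; the peer's processor is not involved beyond
         * running a short message handler. */
        int v = gp_read(peer, remote_counter);
        gp_write(peer, remote_counter, v + 1);

        /* Thick: two-sided; the peer must post a matching MPI_Recv, and
         * tag matching, buffering, and protocol choice all happen
         * underneath this one call. */
        MPI_Send(&v, 1, MPI_INT, peer, /*tag=*/0, MPI_COMM_WORLD);
    }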
Split-C over Active Messages
  • Read, Write, Get, Put built on small Active Message request / reply (RPC)
  • Bulk-Transfer (store & get)

[Diagram: the requester sends a Request that runs a request handler on the remote node; the handler sends back a Reply that runs a reply handler on the requester.]
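
A minimal sketch of how a blocking read maps onto that request/reply pair, assuming a generic Active Messages interface (am_request, am_reply, am_poll, and the handler signatures are illustrative names, not the exact GAM API):

    #include <stdint.h>

    /* Illustrative AM primitives: a request names a handler to run at
     * the destination; that handler may issue exactly one reply. */
    extern void am_request(int node, void (*h)(int, uintptr_t, uintptr_t),
                           uintptr_t a0, uintptr_t a1);
    extern void am_reply(int node, void (*h)(int, uintptr_t, uintptr_t),
                         uintptr_t a0, uintptr_t a1);
    extern void am_poll(void);

    static volatile int done;
    static volatile int result;

    /* Reply handler: runs back on the requesting node. */
    static void read_reply_handler(int src, uintptr_t val, uintptr_t unused)
    {
        result = (int)val;
        done = 1;
    }

    /* Request handler: runs on the node that owns the data. */
    static void read_req_handler(int src, uintptr_t addr, uintptr_t unused)
    {
        am_reply(src, read_reply_handler, (uintptr_t)*(int *)addr, 0);
    }

    /* Blocking Split-C-style read: one small-message round trip,
     * 2 x (2o + L) in LogP terms (see the model below). */
    int remote_read(int node, int *raddr)
    {
        done = 0;
        am_request(node, read_req_handler, (uintptr_t)raddr, 0);
        while (!done)
            am_poll();   /* keep servicing the network while waiting */
        return result;
    }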

Model Framework: LogP

  • L: Latency in sending a (small) message between modules
  • o: overhead felt by the processor on sending or receiving a msg
  • g: gap between successive sends or receives (1/rate)
  • P: number of Processors

[Diagram: P processor/memory (P, M) pairs ° ° ° attached to an Interconnection Network of limited volume, at most L/g messages in flight to a proc; o (overhead) is paid at each endpoint, g (gap) separates successive injections, and L (latency) is the network crossing time.]

Round trip time: 2 × (2o + L)
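
As a worked instance of the round-trip formula (parameter values illustrative, not measurements from this apparatus):

\[
T_{\mathrm{rtt}} \;=\; 2\,(2o + L)
\qquad \text{e.g. } o = 3\,\mu\mathrm{s},\ L = 5\,\mu\mathrm{s}
\ \Rightarrow\ T_{\mathrm{rtt}} = 2\,(6 + 5) = 22\,\mu\mathrm{s}.
\]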

Methodology
  • Apparatus:
    • 35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5)
    • M2F Lanai NICs + Myricom network in a fat-tree variant
    • GAM + Split-C
  • Modify the Active Message layer to inflate L, o, g, or G independently
  • Execute a diverse suite of applications and observe effect
  • Evaluate against natural performance models
Adjusting L, o, and g (and G) in situ

[Diagram: two Host Workstations, each running the AM lib, with Lanai interface cards between them on the Myrinet. The knobs act at these points:]

  • o: stall the Ultra on msg write (send side) and on msg read (receive side)
  • L: defer marking a msg as valid until Rx + L
  • g: delay the Lanai after msg injection (after each fragment, for bulk transfers)
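
A minimal sketch of how the o knob might look on the host side, assuming a cycle-counter read is available (am_send_raw, read_cycle_counter, and delta_o_cycles are illustrative names, not the actual GAM internals):

    #include <stdint.h>

    extern void     am_send_raw(int node, const void *msg, int len);
    extern uint64_t read_cycle_counter(void);  /* e.g. the UltraSPARC %tick */

    static uint64_t delta_o_cycles;   /* configured overhead inflation */

    /* Busy-wait for the configured delta: the CPU is stalled exactly as
     * a larger sending overhead o would stall it, without touching the
     * network's L or g. */
    static void burn_overhead(void)
    {
        uint64_t start = read_cycle_counter();
        while (read_cycle_counter() - start < delta_o_cycles)
            ;
    }

    void am_send_inflated(int node, const void *msg, int len)
    {
        am_send_raw(node, msg, len);  /* original injection path */
        burn_overhead();              /* inflate o on the send side; the
                                         receive path stalls symmetrically
                                         after the msg read */
    }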

Application Characteristics
  • Message Frequency
  • Write-based vs. Read-based
  • Short vs. Bulk Messages
  • Synchronization
  • Communication Balance
Modeling Effects of Overhead
  • Tpred = Torig + 2 × (max #msgs) × o
    • request / response
    • proc with most msgs limits overall time
  • Why does this model under-predict?
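
Restated, with m_max the message count on the busiest processor and Δo the added per-message overhead (the numeric example is illustrative):

\[
T_{\mathrm{pred}} \;=\; T_{\mathrm{orig}} \;+\; 2\, m_{\max}\, \Delta o
\]
% e.g. m_max = 10^6 request/response pairs and Δo = 5 µs add
% 2 · 10^6 · 5 µs = 10 s to the predicted run time.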
Modeling Effects of gap
  • Uniform communication model (m msgs at average msg interval I)

    Tpred = Torig,                 if g < I
    Tpred = Torig + m × (g − I),   otherwise

  • Bursty communication

    Tpred = Torig + m × g
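
A worked comparison of the two regimes (illustrative numbers):

% m = 10^5 msgs, average interval I = 8 µs, g inflated to 10 µs:
\[
\Delta T_{\text{uniform}} = m\,(g - I) = 10^{5} \cdot 2\,\mu\mathrm{s} = 0.2\,\mathrm{s},
\qquad
\Delta T_{\text{bursty}} = m\,g = 10^{5} \cdot 10\,\mu\mathrm{s} = 1\,\mathrm{s}.
\]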

Understanding Speedup

Speedup(p) = T1 / MAXp (Tcompute + Tcomm + Twait)

Tcompute = (work/p + extra) × efficiency
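
A worked instance (numbers illustrative) showing how a perfect-looking curve can hide communication and wait time:

% T1 = 320 s on one processor, p = 32.
% Per-processor: Tcompute = 8 s, Tcomm = 1.5 s, Twait = 0.5 s.
\[
\mathrm{Speedup}(32) = \frac{320}{\max_p(8 + 1.5 + 0.5)} = \frac{320}{10} = 32
\]
% A "perfect" 32 despite 2 s/proc of communication and wait: a
% superlinear compute term (e.g. the working set dropping into cache,
% as in the LU slide below) can mask the communication cost.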

Performance Tools for Clusters
  • Independent data collection on every node
    • Timing
    • Sampling
    • Tracing
  • Little perturbation of global effects
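
A sketch of the per-node collection idea: each node timestamps events into a local buffer and only writes it out after the run, so tracing perturbs global timing very little (read_cycle_counter, the record layout, and the file naming are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define TRACE_CAP (1 << 20)

    /* One fixed-size record per event: a counter read and a store at
     * log time, so local cost is tiny and nothing crosses the network. */
    struct trace_rec { uint64_t tick; uint16_t event; };

    static struct trace_rec trace_buf[TRACE_CAP];
    static size_t trace_n;

    extern uint64_t read_cycle_counter(void);  /* illustrative timer source */

    static void trace_event(uint16_t event)
    {
        if (trace_n < TRACE_CAP) {
            trace_buf[trace_n].tick  = read_cycle_counter();
            trace_buf[trace_n].event = event;
            trace_n++;
        }
    }

    /* After the run: each node dumps its own buffer; merging and
     * analysis happen offline, off the critical path. */
    void trace_dump(int my_node)
    {
        char name[64];
        snprintf(name, sizeof name, "trace.%d", my_node);
        FILE *f = fopen(name, "w");
        if (!f)
            return;
        for (size_t i = 0; i < trace_n; i++)
            fprintf(f, "%llu %u\n",
                    (unsigned long long)trace_buf[i].tick, trace_buf[i].event);
        fclose(f);
    }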
Communication Scaling

[Plots: normalized msgs per proc and average message size, as the machine scales.]

Cache Working Sets: LU

[Plot annotation: an 8-fold reduction in miss rate from 4 to 8 procs.]

MPI Internal Protocol

[Diagram: message timeline between Sender and Receiver.]
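
The slide's timeline is not recoverable here, but MPI layers over Active Messages commonly use a rendezvous exchange for long messages; a hedged sketch of that pattern (the handlers, helpers, and phases are illustrative, not the actual protocol on the slide):

    #include <stdint.h>

    /* Illustrative AM primitives and matching helpers: */
    extern void  am_request(int node, void (*h)(int, int, int),
                            int a0, int a1);
    extern void  am_reply(int node, void (*h)(int, int, uintptr_t),
                          int a0, uintptr_t a1);
    extern void  am_store(int node, void *dst, const void *src, int len);
    extern void  am_poll(void);
    extern void *match_posted_recv(int tag, int len);

    static volatile int ready_flag;
    static void *ready_dst;

    /* Phase 2, back on the sender: the receiver has a matching receive
     * posted and has told us where the data should land. */
    static void h_ready(int src, int tag, uintptr_t dst)
    {
        ready_dst  = (void *)dst;
        ready_flag = 1;
    }

    /* Phase 1 handler, on the receiver: match the request-to-send and
     * reply ready (a real implementation must queue the RTS if no
     * matching receive is posted yet). */
    static void h_rts(int src, int tag, int len)
    {
        void *dst = match_posted_recv(tag, len);
        am_reply(src, h_ready, tag, (uintptr_t)dst);
    }

    /* Sender side of a long message: RTS, wait for ready, bulk data.
     * The spin in phase 2 is one place sender time can leak into Twait. */
    void mpi_send_long(int dst_node, int tag, const void *buf, int len)
    {
        ready_flag = 0;
        am_request(dst_node, h_rts, tag, len);    /* 1. request-to-send */
        while (!ready_flag)
            am_poll();                            /* 2. receiver ready  */
        am_store(dst_node, ready_dst, buf, len);  /* 3. data transfer   */
    }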

Revised Protocol

[Diagram: revised message timeline between Sender and Receiver.]

Conclusions
  • Run Time systems for Parallel Programs must deal with a host of architectural interactions
    • communication
    • computation
    • memory system
  • Build a performance model of your RTPP
    • only way to recognize anomalies
  • Build tools along with the RT to reflect characteristics and sensitivity back to PP
  • Much can lurk beneath a perfect speedup curve