logp model n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
LogP Model PowerPoint Presentation
Download Presentation
LogP Model

Loading in 2 Seconds...

play fullscreen
1 / 11

LogP Model - PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on

LogP Model. Motivation BSP Model Limited to BW of Network (g) and Load of PE Requires large load per super steps. Need Better Models for Portable Algorithms Converging Hardware Independent from Network Topology Programming Models Assumption

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

LogP Model


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
logp model
LogP Model

Motivation

  • BSP Model

Limited to BW of Network (g) and Load of PE

Requires large load per super steps.

  • Need Better Models for Portable Algorithms
  • Converging Hardware
    • Independent from Network Topology
    • Programming Models
  • Assumption
    • Number of PE much bigger than data elements
parameters
Parameters
  • L: Latency
    • delay on the network
  • o: Overhead on PE
  • g: gap
    • minimum interval between consecutive messages (due to bandwidth)
  • P: Number of PEs

Note:

L,o,g : independent from P or node distances

Message length: short message

L,o,g are per word or per message of fixed length

k word message: k short messages (k*o overhead)

L independent from message length

parameters continue
Parameters (continue)
  • Bandwidth: 1/g * unit message length
  • Number of messages to send or receive for each PE: L/g
  • Send to Receive total time : L+2o
  • if o >> g, ignore o
    • Similar to BSP except no synchronization step
    • No communication computation overlapping
    • Speed-up factor at most two
broadcast

P0

P5

P2

P6

P7

g

p0

o

L

p1

Broadcast

Optimal Broad cast tree

0

P1

P3

P4

10

14

18

22

20

24

24

P=8, L=6, g=4, o=2

optimal sum
Optimal Sum
  • Given time T, how many items we can add?
  • Approach: recursive
    • At root, if T <= L+2o use a single PE (can add T+1 items)
    • If T > L+2o,
      • Root should have data ready at T,
      • and sender must have sum ready at T - L - 2o - 1
      • Recursively construct the sum tree at the sender
      • If T - g > L+2o, Root also can receive data, and compute the

sum with T-g as the root.

applications
Applications

FFT on the Butterfly network

  • Data Placement
    • cyclic layout - First log n/P local comm, last log P global
    • blocked layout - First log P global comm, remaining local
    • hybrid: After log (n/P) iteration, re-map to cyclic so that

remaining can be also local

Communication time: g* (n/P**2) (P-1) + L

each PE has n/P data, each of 1/P goes to each other PE

Total time is (1+g/logn) optimal

  • All to all Communication schedule
    • Approach 1: each PE sends PE1, PE2, … => bottle neck at PE1, PE2 in this order
    • Approach 2 (staggered re-map) -- no congestion
      • PE1 sends PE2, PE3,..
      • PE2 sends PE3, PE4, etc
implementation on cm5
Implementation on CM5
  • CM:
    • 33MHz
    • Fat Trees
    • Global Control for scan/prefix/broadcast
    • one CM-5 3.2 MFLOPs
    • FFT on local: 2.8 - 2.2 MFLOPs (cache effect)
  • each cycle:
    • multiply and add : 4.5 us
    • o: 2us
    • L: 6us
    • g: 4us
    • load ans store overhaed per cycle 1us
  • communication time : n/P max (1us + 2o, g) + L
  • bottleneck: processing and overhead, not bw
lu decomposition
LU decomposition
  • Data arrangement critical
matching machine with real machines
Matching machine with real machines

Average Distance

topology independent usually works for n=1024 nodes.

The difference between average distance and max distance are not such different

potential concerns
Potential Concerns
  • Algorithmic concern
    • Theory?
    • Too complex?
  • Communication concerns
    • how to use trivial comm such as local exchange
    • topology dependencies?
comparison with bsp
Comparison with BSP
  • Length of superstep
  • message not usable till next step
  • special hardware for sync
  • virtual/physical large, context switching may be expensive