LogP Model

jatin

Presentation Transcript


  1. LogP Model: Motivation
     • BSP model: performance is limited by the network bandwidth (g) and the load per PE, and it requires a large load per superstep.
     • Need better models for portable algorithms
     • Converging hardware
     • Independent of network topology
     • Programming models
     • Assumption
       • The number of data elements is much larger than the number of PEs

  2. Parameters
     • L: latency
       • delay on the network
     • o: overhead on the PE
     • g: gap
       • minimum interval between consecutive messages (due to bandwidth)
     • P: number of PEs
     Note: L, o, and g are independent of P and of node distances.
     Message length: short messages
       • L, o, and g are per word, or per message of fixed length
       • a k-word message is sent as k short messages (k*o overhead)
       • L is independent of message length

  3. Parameters (continued)
     • Bandwidth: (1/g) * unit message length
     • Number of messages in transit to or from each PE: at most ceil(L/g)
     • Send-to-receive total time: L + 2o
     • If o >> g, the overhead dominates the gap, so g can be ignored
     • Similar to BSP, except there is no synchronization step
     • No overlapping of communication and computation
     • Speed-up factor of at most two
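The parameter formulas on the two slides above can be collected in a small helper. This is a minimal sketch: the class name `LogP` and its method names are hypothetical, and `stream` assumes that consecutive short messages are injected every max(g, o) time units, which is an assumption rather than something the slides state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Hypothetical container for the four LogP parameters."""
    L: float  # network latency
    o: float  # send/receive overhead on a PE
    g: float  # minimum gap between consecutive messages
    P: int    # number of PEs

    def point_to_point(self):
        # Send overhead + latency + receive overhead.
        return self.L + 2 * self.o

    def capacity(self):
        # Upper bound on messages in transit to or from one PE.
        return math.ceil(self.L / self.g)

    def stream(self, k):
        # Assumption: a k-word message is sent as k short messages,
        # injected every max(g, o) time units; the last one still
        # needs a full point-to-point time to arrive.
        return (k - 1) * max(self.g, self.o) + self.point_to_point()


m = LogP(L=6, o=2, g=4, P=8)
print(m.point_to_point(), m.capacity(), m.stream(3))
```

With the values used on the broadcast slide (L=6, o=2, g=4), a single short message takes 10 time units end to end.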

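  4. Broadcast
     [Figure: optimal broadcast tree for P=8, L=6, g=4, o=2, with a timeline marking o, L, and g between p0 and p1. The root P0 starts at time 0; the other seven PEs (P1 to P7) receive the message at times 10, 14, 18, 20, 22, 24, and 24.]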
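The arrival times in the broadcast figure can be reproduced with a greedy simulation in which every PE that already holds the message keeps forwarding it to a new PE. This is a sketch, not the presenter's code: the function name `broadcast_times` is hypothetical, and it assumes a PE can start a new send every max(g, o) time units and that a message sent at time t is fully received at t + L + 2o.

```python
import heapq


def broadcast_times(P, L, o, g):
    """Return the time at which each of P PEs holds the message, in arrival order."""
    interval = max(g, o)      # assumed spacing between sends from one PE
    delivery = L + 2 * o      # send overhead + latency + receive overhead
    times = [0]               # the root holds the message at time 0
    # Heap entries: (arrival time at a new PE, time the sender started this send).
    heap = [(delivery, 0)]
    while len(times) < P:
        arrival, sent = heapq.heappop(heap)
        times.append(arrival)
        # The same sender can start its next send one interval later.
        heapq.heappush(heap, (sent + interval + delivery, sent + interval))
        # The new receiver can start forwarding as soon as it has the message.
        heapq.heappush(heap, (arrival + delivery, arrival))
    return times


print(broadcast_times(P=8, L=6, o=2, g=4))
# [0, 10, 14, 18, 20, 22, 24, 24]
```

The last PE is reached at time 24, matching the times listed for the figure.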

  5. Optimal Sum
     • Given time T, how many items can we add?
     • Approach: recursive
     • At the root, if T <= L + 2o, use a single PE (it can add T + 1 items)
     • If T > L + 2o:
       • the root should have its result ready at time T,
       • so a sender must have its partial sum ready at T - L - 2o - 1
       • Recursively construct the sum tree at the sender
     • If T - g > L + 2o, the root can also receive data from a further sender; build that sender's sum tree recursively, treating T - g as the root's deadline.
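The recursion on this slide can be sketched as a memoized function that counts how many items a sum tree of this shape can add by time T. Several things here are assumptions and not from the slides: the name `max_items`, an unlimited supply of PEs, one time unit per addition, g >= o + 1, and the accounting that each message received costs the root o units of overhead plus one addition.

```python
from functools import lru_cache


def max_items(T, L, o, g):
    """Items summable by time T under the assumptions stated above."""

    @lru_cache(maxsize=None)
    def f(t):
        best = t + 1          # single PE: t additions combine t + 1 items
        received = 0          # items contributed by senders so far
        k = 0                 # number of messages the root receives
        while True:
            # The k-th sender (counting back from the deadline) must have its
            # partial sum ready at t - 1 - L - 2o - k*g.
            deadline = t - 1 - L - 2 * o - k * g
            k += 1
            if deadline < 0 or t - k * (o + 1) < 0:
                break
            received += f(deadline)
            # The root spends o + 1 units per message; the rest are local adds.
            local = t - k * (o + 1) + 1
            best = max(best, local + received)
        return best

    return f(T)


print(max_items(T=9, L=5, o=2, g=4))
print(max_items(T=28, L=5, o=2, g=4))
```

For T <= L + 2o no sender can contribute in time, so the function falls back to the single-PE count T + 1, as the slide states.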

  6. Applications: FFT on the Butterfly network
     • Data placement
       • cyclic layout: the first log(n/P) steps use local communication, the last log P are global
       • blocked layout: the first log P steps use global communication, the remaining ones are local
       • hybrid: start cyclic and, after log(n/P) iterations, re-map to blocked so that the remaining steps are also local
         Communication time: g * (n/P**2) * (P-1) + L
         (each PE has n/P data items, and 1/P of them go to each other PE)
         Total time is within a factor of (1 + g/log n) of optimal
     • All-to-all communication schedule
       • Approach 1: each PE sends to PE1, PE2, … in the same order => bottleneck at PE1, then PE2, and so on
       • Approach 2 (staggered re-map): no congestion
         • PE1 sends to PE2, PE3, ...
         • PE2 sends to PE3, PE4, etc.
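The difference between the two all-to-all schedules can be checked by counting how many messages target the same PE in one step. This is an illustrative sketch with hypothetical function names; PEs are numbered from 0 here, whereas the slide numbers them from 1.

```python
from collections import Counter


def same_order_schedule(P):
    # Approach 1: in step d, every PE sends to PE d.
    return [[(src, dst) for src in range(P) if src != dst] for dst in range(P)]


def staggered_schedule(P):
    # Approach 2: in step s, PE i sends to PE (i + s) mod P.
    return [[(src, (src + step) % P) for src in range(P)] for step in range(1, P)]


def max_receiver_load(schedule):
    # Largest number of messages aimed at a single PE within one step.
    return max(max(Counter(dst for _, dst in step).values()) for step in schedule)


P = 8
print(max_receiver_load(same_order_schedule(P)))  # 7: all other PEs hit one target
print(max_receiver_load(staggered_schedule(P)))   # 1: every target gets one message
```

Both schedules deliver the same set of P*(P-1) messages; only the order differs, which is why the staggered version avoids the bottleneck.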

  7. Implementation on the CM-5
     • CM-5:
       • 33 MHz
       • Fat tree
       • Global control for scan/prefix/broadcast
       • one CM-5: 3.2 MFLOPS
     • Local FFT: 2.8 to 2.2 MFLOPS (cache effect)
     • each cycle:
       • multiply and add: 4.5 us
       • o: 2 us
       • L: 6 us
       • g: 4 us
       • load and store overhead per cycle: 1 us
     • communication time: (n/P) * max(1 us + 2o, g) + L
     • bottleneck: processing and overhead, not bandwidth
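Plugging the measured CM-5 values into the communication-time formula shows why overhead is the bottleneck and not bandwidth. This is a minimal sketch: the function name and the example sizes n=1024, P=8 are hypothetical, and the default arguments are the microsecond values listed on the slide.

```python
def cm5_comm_time_us(n, P, o=2.0, g=4.0, L=6.0, load_store=1.0):
    """Communication time in microseconds: (n/P) * max(load_store + 2o, g) + L."""
    per_item = max(load_store + 2 * o, g)  # per-item cost: overhead vs. gap
    return (n / P) * per_item + L


# With o = 2 us, the overhead term 1 + 2*2 = 5 us exceeds the gap g = 4 us,
# so each item is limited by processing overhead, not by bandwidth.
print(cm5_comm_time_us(n=1024, P=8))  # 128 * 5 + 6 = 646.0
```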

  8. LU decomposition
     • Data arrangement is critical

  9. Matching the model with real machines
     • Using the average distance (topology independent) usually works for n = 1024 nodes.
     • The average distance and the maximum distance do not differ much.

  10. Potential Concerns
      • Algorithmic concerns
        • Theory?
        • Too complex?
      • Communication concerns
        • how to exploit trivial communication such as local exchange
        • topology dependencies?

  11. Comparison with BSP
      • Length of superstep
      • a message is not usable until the next superstep
      • special hardware for synchronization
      • when the ratio of virtual to physical processors is large, context switching may be expensive
