
War of the Worlds -- Shared-memory vs. Distributed-memory



  1. War of the Worlds -- Shared-memory vs. Distributed-memory
  • In the distributed world, we have heavyweight processes (nodes) rather than threads
  • Nodes communicate by exchanging messages
  • We do not have shared memory
  • Communication is much more expensive
    • Sending a message takes much more time than sending data through a channel
    • Possibly non-uniform communication
  • We only have 1-to-1 communication (no many-to-many channels); see the sketch below
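  The contrast is easiest to see in code. The following is a minimal sketch of 1-to-1 message passing between two nodes, written with the low-level MPIGAP bindings that appear later in this deck (SerializeToNativeString, MPI_Binsend, MPI_Probe, MPI_Get_count, UNIX_MakeString, MPI_Recv, DeserializeNativeString) and the processId variable; the ranks 0/1, the example data and the variable names msg, buf, points are illustrative assumptions, not code from the original slides.

    # Node 0 ships a list of points to node 1: nothing is shared -- the data
    # is serialized, sent as a message, and deserialized on the other side.
    if processId = 0 then
      msg := SerializeToNativeString([1, 2, 3]);      # pack some example points
      MPI_Binsend(msg, 1, Length(msg));               # 1-to-1 send to node 1
    elif processId = 1 then
      MPI_Probe();                                    # block until a message arrives
      buf := UNIX_MakeString(MPI_Get_count());        # allocate a buffer of the right size
      MPI_Recv(buf);
      points := DeserializeNativeString(buf);         # unpack the points
    fi;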

  2. Initial Distributed-memory Settings
  • We consider settings where there is no multithreading within a single MPI node
  • We consider systems where communication latency between different nodes is
    • Low
    • Uniform

  3. Good Shared Memory Orbit Version
  [Diagram: three hash server threads, each owning a fragment of the hash table ({z1,z4,z5,…,zl1}, {z2,z3,z8,…,zl2}, {z6,z7,z9,…,zl3}) and an output queue O1–O3; a shared task pool [x1,x2,…,xm]; worker threads repeatedly take points from the pool and apply the generators f1,…,f5 to them.]

  4. Why is this version hard to port to MPI?
  • Single task pool!
    • Requires a shared structure that all of the hash servers write data to, and all of the workers read data from
    • Not easy to implement using MPI, where we only have 1-to-1 communication
  • We could have a dedicated node that holds the task queue (sketched below)
    • Workers send messages to it to request work
    • Hash servers send messages to it to push work
    • This would make that node a potential bottleneck, and would involve a lot of communication
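  For concreteness, here is a rough sketch of what such a dedicated task-queue node could look like, written with the OrbSendMessage/OrbGetMessage helpers shown later in the deck. The TaskPoolNode name, the "push"/"getwork"/"nowork" message tags and the reply format are assumptions made for illustration. Every hash-server push and every worker request funnels through this single node, which is exactly the bottleneck described above.

    # Hypothetical dedicated task-pool node (NOT used in the final design).
    TaskPoolNode := function()
      local pool, msg;
      pool := [];
      while true do
        msg := OrbGetMessage(true);                 # blocking receive
        if msg[1] = "push" then                     # a hash server pushes new points
          Append(pool, msg[2]);
        elif msg[1] = "getwork" then                # a worker asks for work
          if Length(pool) > 0 then
            OrbSendMessage(pool, msg[2]);           # hand over the current pool
            pool := [];
          else
            OrbSendMessage(["nowork"], msg[2]);     # simplified; termination handling comes later
          fi;
        fi;
      od;
    end;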

  5. MPI Version 1
  • Maybe merge workers and hash servers?
  • Each MPI node acts both as a hash server and as a worker
  • Each node has its own task pool
  • If the task pool of a node is empty, the node tries to steal work from some other node (sketched below)
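  A sketch of the stealing step, again using the OrbSendMessage/OrbGetMessage helpers from the later slides. The TryToSteal name, the nrNodes parameter, and the "steal"/"work" tags are illustrative assumptions rather than the actual MPIGAP protocol.

    # Called by a node whose own task pool has run dry.
    TryToSteal := function(nrNodes)
      local victim, msg;
      repeat                                        # pick a random victim other than ourselves
        victim := Random([0 .. nrNodes - 1]);
      until victim <> processId;
      OrbSendMessage(["steal", processId], victim);
      msg := OrbGetMessage(true);                   # wait for the victim's reply
      if msg[1] = "work" then
        return msg[2];                              # a chunk of the victim's task pool
      fi;
      return fail;                                  # the victim had nothing to spare
    end;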

  6. MPI Version 1
  [Diagram: three MPI nodes; each one combines a hash-table fragment ({z1,z4,z5,…,zl1}, {z2,z3,z8,…,zl2}, {z6,z7,z9,…,zl3}), its own task pool ([x11,x12,…,x1m1], [x21,x22,…,x2m2], [x31,x32,…,x3m3]) and the generator applications f1,…,f5 within a single node.]

  7. MPI Version 1 is Bad!
  • Bad performance, especially for smaller numbers of nodes
  • The same process does hash-table lookups and applies generator functions to points
    • It cannot do both at the same time => something has to wait
    • This creates contention

  8. MPI Version 2
  • Separate hash servers and workers, after all
  • Hash server nodes
    • Keep parts of the hash table
    • Also keep parts of the task pool
  • Worker nodes just apply generators to points
  • Workers obtain work from the hash server nodes using work stealing (the hash-server side is sketched below)
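  The later code slides (14–16) show only the worker side of this design, so the loop below is just a sketch of what a hash server node has to do: deduplicate incoming points against its hash-table fragment, turn genuinely new points into tasks, and answer "getwork" requests. The HashServerLoop name, the seen/pool variables and the reply conventions are assumptions; a plain sorted list stands in for the real hash table.

    HashServerLoop := function()
      local seen, pool, msg, x;
      seen := [];                      # stands in for this node's hash-table fragment
      pool := [];                      # this node's part of the task pool
      while true do
        msg := OrbGetMessage(true);
        if msg[1] = "getwork" then     # a worker asks for work; msg[2] is its node id
          if Length(pool) > 0 then
            OrbSendMessage(pool, msg[2]);
            pool := [];
          fi;                          # (else: defer the reply; see termination detection below)
        else                           # a batch of freshly computed points from a worker
          for x in msg do
            if not x in seen then      # a hash lookup in the real implementation
              AddSet(seen, x);
              Add(pool, x);            # every genuinely new point becomes a task
            fi;
          od;
        fi;
      od;
    end;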

  9. MPI Version 2
  [Diagram: hash server nodes holding the hash-table fragments ({z1,z4,z5,…,zl1}, {z2,z3,z8,…,zl2}, {z6,z7,z9,…,zl3}), output queues O1–O3 and task pools T1–T3 ([x11,x12,…,x1m1], [x21,x22,…,x2m2], [x31,x32,…,x3m3]); separate worker nodes apply the generators f1,…,f5 to points.]

  10. MPI Version 2
  • Much better performance than MPI Version 1 (on low-latency systems)
  • The key is that hash-table lookups and generator application happen in different nodes

  11. Big Issue with MPI Versions 1 and 2 -- Detecting Termination!
  • We need to detect the situation where all of the hash server nodes have empty task pools, and where no new work will be produced by hash servers!
  • Even detecting that all task pools are empty and all hash servers and all workers are idle is not enough, as there may be messages flying around that will create more work!
  • Woe unto me! What are we to do?
  • Good ol’ Dijkstra comes to the rescue -- we use a variant of the Dijkstra-Scholten termination detection algorithm

  12. Termination Detection Algorithm
  • Each hash server keeps two counters
    • Number of points sent (my_nr_points_sent)
    • Number of points received (my_nr_points_rcvd)
  • We enumerate the hash servers H0 … Hn
  • Hash server H0, when idle, sends a token to hash server H1
    • It attaches a token count (my_nr_points_sent, my_nr_points_rcvd) to the token
  • When a hash server Hi receives a token
    • If it is active (has tasks in its task pool), it sends the token back to H0
    • If it is idle, it increases each component of the count attached to the token and sends the token to Hi+1
      • If the received token count was (pts_sent, pts_rcvd), the new token count is (my_nr_points_sent + pts_sent, my_nr_points_rcvd + pts_rcvd)
  • If H0 receives the token, and its count (pts_sent, pts_rcvd) satisfies pts_rcvd = num_gens * pts_sent, then termination is detected (sketched below)
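  A sketch of this token pass in GAP, using the counter names from the slide (my_nr_points_sent, my_nr_points_rcvd, num_gens) together with the nrHashes, minHashId and processId variables from the code slides. The "token" message layout, the taskPool name, the "no_termination" reply and the helper names StartTokenRound/HandleToken/TerminationDetected are illustrative assumptions, not the actual MPIGAP code.

    # H0, when idle, opens a round by sending its own counts to H1.
    StartTokenRound := function()
      OrbSendMessage(["token", my_nr_points_sent, my_nr_points_rcvd], minHashId + 1);
    end;

    # Called by hash server Hi (i > 0) on receiving a token ["token", pts_sent, pts_rcvd].
    HandleToken := function(token)
      local next;
      if Length(taskPool) > 0 then
        # still active: this round cannot establish termination, tell H0 so
        OrbSendMessage(["no_termination"], minHashId);
      else
        # idle: add our own counters and pass the token on to H(i+1), wrapping to H0
        token[2] := token[2] + my_nr_points_sent;
        token[3] := token[3] + my_nr_points_rcvd;
        next := minHashId + ((processId - minHashId + 1) mod nrHashes);
        OrbSendMessage(token, next);
      fi;
    end;

    # At H0: a completed round detects termination when every point handed to a
    # worker has had all of its num_gens images delivered back to some hash server.
    TerminationDetected := function(token)
      return token[3] = num_gens * token[2];
    end;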

  13. MPIGAP Code for MPI Version 2
  • Not trivial (~400 lines of GAP code)
  • Explicit message passing using low-level MPI bindings
  • This version is hard to implement using task abstraction

  14. MPIGAP Code for MPI Version 2

    Worker := function(gens, op, f)
      local g, j, n, m, res, t, x;
      n := nrHashes;
      while true do
        t := GetWork();
        if IsIdenticalObj(t, fail) then return; fi;
        # one result bucket per hash server, pre-allocated
        m := QuoInt(Length(t) * Length(gens) * 2, n);
        res := List([1..n], x -> EmptyPlist(m));
        # apply every generator to every point; f picks the destination hash server
        for j in [1..Length(t)] do
          for g in gens do
            x := op(t[j], g);
            Add(res[f(x)], x);
          od;
        od;
        # ship each non-empty bucket to its hash server
        for j in [1..n] do
          if Length(res[j]) > 0 then
            OrbSendMessage(res[j], minHashId + j - 1);
          fi;
        od;
      od;
    end;

  15. MPIGAP Code for MPI Version 2

    GetWork := function()
      local msg, tid;
      # request work from the first hash server
      tid := minHashId;
      OrbSendMessage(["getwork", processId], tid);
      msg := OrbGetMessage(true);      # blocking receive
      if msg[1] <> "finish" then
        return msg;                    # a list of points to process
      else
        return fail;                   # termination has been detected
      fi;
    end;

  16. MPIGAP Code for MPI Version 2

    OrbGetMessage := function(blocking)
      local test, msg, tmp;
      if blocking then
        test := MPI_Probe();           # wait until a message is available
      else
        test := MPI_Iprobe();          # only check whether a message is waiting
      fi;
      if test then
        msg := UNIX_MakeString(MPI_Get_count());   # allocate a receive buffer
        MPI_Recv(msg);
        tmp := DeserializeNativeString(msg);
        return tmp;
      else
        return fail;
      fi;
    end;

    OrbSendMessage := function(raw, dest)
      local msg;
      msg := SerializeToNativeString(raw);
      MPI_Binsend(msg, dest, Length(msg));
    end;

  17. Work in Progress - Extending MPI Version 2 To Systems With Non-Uniform Latency
  • Communication latencies between nodes might be different
    • Where to place hash server nodes? And how many?
    • How to do work distribution?
    • Is work stealing still a good idea in a setting where communication distance between a worker and different hash servers is not uniform?
  • We can look at the shared memory + MPI world as a special case of this
    • Multithreading within MPI nodes
    • Threads from the same node can communicate fast
    • Nodes communicate much slower
