
Lecture 11: Parallel Processing of Irregular Computations & Load Balancing




  1. Lecture 11: Parallel Processing of Irregular Computations & Load Balancing. Shantanu Dutt, ECE Dept., UIC

  2. Discrete Event Simulation—Basics with VHDL Descriptions as an Example

VHDL Dataflow Description of a Circuit:

library IEEE;
use IEEE.STD_LOGIC_1164.all;

entity ckt1 is
  port (s1, s2 : in bit; Z : out bit);
end entity ckt1;

architecture data_flow of ckt1 is
  signal sbar1, sbar2, x, y : bit;
begin
  sbar1 <= not s1 after 2 ns;
  sbar2 <= not s2 after 2 ns;
  x <= s1 and sbar2 after 4 ns;
  y <= s2 and sbar1 after 4 ns;
  Z <= x or y after 4 ns;
end architecture data_flow;
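Not from the slides: a minimal Python sketch of the discrete event simulation such a dataflow description implies — each signal assignment acts as a process that, when an input changes, schedules a new value on its output after the stated delay. The gate table mirrors the VHDL above; the two-valued logic, the steady-state initialization, and the evaluate-at-schedule-time semantics are simplifying assumptions.

import heapq

# Gate function, delay (ns), and fanin list, one per signal assignment above.
gates = {
    'sbar1': (lambda v: 1 - v['s1'],          2, ('s1',)),
    'sbar2': (lambda v: 1 - v['s2'],          2, ('s2',)),
    'x':     (lambda v: v['s1'] & v['sbar2'], 4, ('s1', 'sbar2')),
    'y':     (lambda v: v['s2'] & v['sbar1'], 4, ('s2', 'sbar1')),
    'Z':     (lambda v: v['x'] | v['y'],      4, ('x', 'y')),
}

def simulate(stimuli):
    # steady-state values for s1 = s2 = 0
    vals = {'s1': 0, 's2': 0, 'sbar1': 1, 'sbar2': 1, 'x': 0, 'y': 0, 'Z': 0}
    events = list(stimuli)              # event = (time, signal, new value)
    heapq.heapify(events)
    while events:
        t, sig, val = heapq.heappop(events)
        if vals[sig] == val:
            continue                    # no value change => nothing to propagate
        vals[sig] = val
        print(f"{t:2d} ns: {sig} <= {val}")
        # an input of a gate changed: compute its new output now and schedule
        # it to take effect after the gate delay
        for out, (fn, delay, fanin) in gates.items():
            if sig in fanin:
                heapq.heappush(events, (t + delay, out, fn(vals)))

simulate([(0, 's1', 1), (10, 's2', 1)])   # Z rises at 8 ns, falls back at 20 ns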

  3. Discrete Event Simulation—Basics

  4. Discrete Event Simulation—Basics (cont’d)

  5. Discrete Event Simulation—Basics (cont’d)

  6. Parallel DES for Logic Simulation

  7. Correctness Issues in Parallel DES
• What happens if inter-processor messages are received out of simulation-time order, either from the same processor or from different processors? In other words, if a msg. w/ simulation time ti is received before a msg. w/ simulation time tj, where ti > tj, what happens? The sim. time ti and tj msgs. could be coming from the same or from different processors.
• If a proc. "blindly" processes all msgs. as they arrive, the simulation can become incorrect. E.g., the sim. time tj msg. can cause an output that affects the input to the process for the sim. time ti msg. in the above example. So if the earlier-arriving sim. time ti msg. is processed before the later-arriving sim. time tj msg., the former's simulation output will likely be incorrect.

  8. Correctness Issues in Parallel DES: Solutions
• For each msg. sent from processor Pk targeting a (simulation) process Qr (which is, say, on processor Pq), Pk records the sim. time tq of the latest such msg. When sending the next msg. targeting Qr, Pk includes that prev. sim. time along w/ the current one tj.
• So the msg. data looks like Mj = (input value, tj [curr. sim. time], tq [prev. sim. time]).
• The next msg. is Mi = (input value, ti, tj).
• The receiving proc. Pq also records the sim. time of the last msg. received for each input of Qr. If a new msg. meant for that input of Qr carries a prev. sim. time equal to the one Pq has recorded, then that msg. is correct in terms of timing order. Otherwise, Pq stores the msg. and waits for the earlier msg. of correct timing order that it has not yet recvd.
• So if, for input A of Qr, the recorded time of the prev. simulation is tq, and msg. Mi = (value, ti, tj) is recvd., it will not be processed. Only after msg. Mj = (value, tj, tq) is recvd. will Mj be processed, followed by the processing of msg. Mi (since the latest recorded sim. time for i/p A of Qr is then tj).
• With regard to msgs. from multiple processors, Pq will not perform any simulation until it has recvd. timing-correct msgs. (e.g., Mj above) from all procs. that are supposed to send it msgs. This issue underscores the importance of null msgs., without which simulation will not proceed further in this approach. (A sketch of this bookkeeping appears below.)
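A minimal Python sketch (illustrative names such as InputChannel, not from the slides) of the per-input bookkeeping described above: each msg. carries (value, curr. sim. time, prev. sim. time), and a msg. is processed only when its prev. time matches the last sim. time already seen on that input; otherwise it is buffered.

class InputChannel:
    def __init__(self):
        self.last_time = 0      # sim. time of the last msg. processed on this input
        self.pending = {}       # prev_time -> (value, curr_time), buffered msgs.

    def receive(self, value, curr_time, prev_time):
        """Buffer the msg.; return the msgs. now processable, in timing order."""
        self.pending[prev_time] = (value, curr_time)
        ready = []
        # chain forward: release every msg. whose prev_time matches last_time
        while self.last_time in self.pending:
            value, curr_time = self.pending.pop(self.last_time)
            ready.append((value, curr_time))
            self.last_time = curr_time
        return ready

# Out-of-order arrival: Mi = (v2, ti=9, prev=5) arrives before Mj = (v1, tj=5, prev=0)
ch = InputChannel()
print(ch.receive('v2', 9, 5))   # [] -- held back, the prev=5 msg. not yet seen
print(ch.receive('v1', 5, 0))   # [('v1', 5), ('v2', 9)] -- both released in order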

  9. Some examples of applications requiring DES

  10. Search Techniques

[Figure: a graph on vertices A–G with visit-order numbers for BFS, and for DFS (black arcs) and Soln_DFS (black + red arcs); soln found: the path (A,B,E,C,F), which meets some criterion.]

soln_dfs(v) /* used when nodes are basic elts of the problem and not partial-soln nodes, and a soln. is a path */
  v.mark = 1;
  if path to v is a soln, then return(1);
  for each (v,u) in E
    if (u.mark != 1) then
      soln_found = soln_dfs(u);
      if (soln_found = 1) then return(soln_found);
  end for;
  v.mark = 0; /* can visit v again to form another soln on a different path */
  return(0);

dfs(v) /* for basic graph visit, or for soln finding when nodes are partial or full solns */
  v.mark = 1;
  for each (v,u) in E
    if (u.mark != 1) then dfs(u);

Algorithm Depth_First_Search_Soln
  for each v in V
    v.mark = 0;
  if G has partial-soln nodes then
    for each v in V
      if v.mark = 0 then dfs(v);
    end for;
  else
    soln_dfs(root); /* root is a particular node in V from where we can start the solution search */
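Not from the slide: a minimal runnable Python version of soln_dfs, where the on-path membership check plays the role of v.mark, and the graph and soln criterion (any path from A ending at F) are illustrative assumptions.

def soln_dfs(graph, v, path, is_soln):
    path.append(v)                       # v.mark = 1 (v is on the current path)
    if is_soln(path):
        return list(path)
    for u in graph.get(v, []):
        if u not in path:                # u.mark != 1
            found = soln_dfs(graph, u, path, is_soln)
            if found:
                return found
    path.pop()                           # v.mark = 0: v may appear on another path
    return None

graph = {'A': ['B', 'C'], 'B': ['E'], 'C': ['F'], 'E': ['C'], 'F': []}
print(soln_dfs(graph, 'A', [], lambda p: p[-1] == 'F'))  # ['A', 'B', 'E', 'C', 'F']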

  11. Search Techniques—Exhaustive DFS

[Figure: the same graph on A–G, with visit-order numbers, for DFS (black arcs), Soln_DFS (black + red arcs), and Optimal_Soln_DFS (black + red + green arcs); soln found: (A,B,E,C,F); best soln. so far: (A,C,E,D,F,G).]

optimal_soln_dfs(v) /* used when nodes are basic elts of the problem and not partial-soln nodes, and a soln. is a path */
begin
  v.mark = 1;
  if path to v is a soln then begin
    if cost < best_cost then begin
      best_soln = soln; best_cost = cost;
    endif
    v.mark = 0;
    return;
  endif
  for each (v,u) in E
    if (u.mark != 1) then begin
      cost = cost + edge_cost(v,u); /* cost is a global var. */
      optimal_soln_dfs(u);
      cost = cost - edge_cost(v,u); /* undo on backtrack */
    endif
  end for;
  v.mark = 0; /* can visit v again to form another soln on a different path */
end

Algorithm Depth_First_Search_Opt_Soln
  for each v in V
    v.mark = 0;
  best_cost = infinity; cost = 0;
  optimal_soln_dfs(root);

  12. Best-First Search

Y = partial soln. = a path from root to the current "node" (a basic elt. of the problem, e.g., a city in TSP, a vertex in V0 or V1 in min-cut partitioning). We go from each such "node" u to the next one u' that is "reachable" from u in the problem "graph" (which is part of what you have to formulate).

[Figure: a search tree rooted at root; a node u of cost 10 is expanded into level-(1) children of costs 12, 15, 19, then level-(2) children of costs 16 and 18, then level-(3) children of costs 18 and 17.]

BeFS(root)
begin
  open = {root}; /* open is the list of generated but not yet expanded nodes—partial solns */
  best_soln_cost = infinity;
  while open != nullset do begin
    curr = first(open);
    if curr is a soln then return(curr) /* curr is an optimal soln */
    else children = Expand_&_est_cost(curr); /* generate all children of curr & estimate their costs---cost(u) should be a lower bound of the cost of the best soln reachable from u */
    for each child in children do begin
      if child is a soln then
        delete all nodes w in open s.t. cost(w) >= cost(child);
      endif
      store child in open in increasing order of cost;
    endfor
  endwhile
end /* BeFS */

Expand_&_est_cost(Y)
begin
  children = nullset;
  for each basic elt x of the problem "reachable" from Y that can be part of the current partial soln. Y do begin
    if x not in Y and if feasible then begin
      child = Y U {x};
      path_cost(child) = path_cost(Y) + cost(Y, x); /* cost(Y, x) is the cost of reaching x from Y */
      est(child) = lower-bound cost of the best soln reachable from child;
      cost(child) = path_cost(child) + est(child);
      children = children U {child};
    endif
  endfor
end /* Expand_&_est_cost */
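A minimal runnable Python sketch of the BeFS loop above (illustrative, not the slide's exact pseudocode): expand() and is_soln() are caller-supplied assumptions, and expand(cost, node) must return (child_cost, child) pairs where child_cost is a lower bound on the best soln reachable through child.

import heapq

def befs(root, expand, is_soln):
    open_list = [(0, root)]                 # "open": a min-heap keyed on LB cost
    while open_list:
        cost, curr = heapq.heappop(open_list)
        if is_soln(curr):
            return cost, curr               # first soln popped is optimal (cost is a LB)
        for child_cost, child in expand(cost, curr):
            heapq.heappush(open_list, (child_cost, child))
    return None

# Example: cheapest path from 'A' to 'G' in a small weighted graph, taking
# est() = 0 (so cost = path cost, which is still a valid lower bound).
edges = {'A': [('B', 2), ('C', 5)], 'B': [('G', 9)], 'C': [('G', 4)]}
expand = lambda cost, path: [(cost + w, path + (v,)) for v, w in edges.get(path[-1], [])]
print(befs(('A',), expand, lambda path: path[-1] == 'G'))   # (9, ('A', 'C', 'G'))

Note that the slide's deletion of dominated open-list nodes once a soln is generated is a memory optimization; with lower-bound costs and a min-heap, popping in cost order already guarantees the first soln popped is optimal.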

  13. Best-First Search—Proof of Optimality When Cost Is a LB
• The current set of nodes in "open" represents a complete front of generated nodes, i.e., the rest of the nodes in the search space are descendants of nodes in "open".
• Assuming the basic cost (the cost of adding an elt to a partial soln to construct another partial soln that is closer to a full soln) is non-negative, the cost is monotonic, i.e., cost of child >= cost of parent.
• If the first node curr in "open" is a soln, then cost(curr) <= cost(w) for each w in "open".
• The cost of any node in the search space that is not in "open" and not yet generated is >= the cost of its ancestor in "open", and thus >= cost(curr). Thus curr is the optimal (min-cost) soln.

[Figure: the same search tree as on the previous slide, with Y = partial soln.]

  14. Search Techs for a TSP Example

[Figure: a 6-city TSP graph on cities A–F with edge weights, and the exhaustive search tree rooted at A. Branches marked x are abandoned (backtracked); solution nodes, i.e., complete tours, have costs 27, 31, and 33.]

Exhaustive search using DFS (w/ backtrack) for finding an optimal solution.

  15. Search Techs for a TSP Example (cont'd)

[Figure: the same TSP graph and the BeFS tree for finding an optimal TSP solution. Each tree node is labeled path_cost + lower-bound est. (e.g., 5+15, 8+16, 11+14, 14+9, 21+6, 22+9, 23+8); e.g., the path cost for (A,E,F) is 8, and its LB est. is the cost of MST{F,A,B,C,D} = 16.]

• Lower-bound cost estimate: MST({unvisited cities} U {current city} U {start city}).
• This is a LB because the structure optimized over (a spanning tree) is a superset of the reqd. soln structure: let S be the set of all spanning trees of a graph G, and S' the subset of S consisting of all Hamiltonian paths (paths that visit each node exactly once) in G. Then min(metric M's values in set S) <= min(M's values in subset S'). Similarly for max??
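A Python sketch (hypothetical helper names) of the MST-based lower bound above, using Prim's algorithm over the unvisited cities together with the current and start cities; the distance matrix is made up for illustration.

def mst_cost(nodes, dist):
    """Prim's algorithm: cost of a minimum spanning tree over `nodes`."""
    nodes = list(nodes)
    in_tree, cost = {nodes[0]}, 0
    while len(in_tree) < len(nodes):
        w, v = min((dist[i][j], j) for i in in_tree
                   for j in nodes if j not in in_tree)
        in_tree.add(v)
        cost += w
    return cost

def tsp_lower_bound(path, all_cities, dist):
    """path_cost(path) + MST({unvisited} U {current city} U {start city})."""
    path_cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
    unvisited = set(all_cities) - set(path)
    return path_cost + mst_cost(unvisited | {path[-1], path[0]}, dist)

# 4-city example (symmetric distances, made up for illustration)
D = {'A': {'B': 2, 'C': 9, 'D': 10}, 'B': {'A': 2, 'C': 6, 'D': 4},
     'C': {'A': 9, 'B': 6, 'D': 3}, 'D': {'A': 10, 'B': 4, 'C': 3}}
print(tsp_lower_bound(('A', 'B'), 'ABCD', D))   # 2 + MST cost 9 = 11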

  16. BeFS for 0/1 ILP Solution

X = {x1, …, xm} are 0/1 vars.
• Choose vars xi = 0/1 as the next nodes in some order (random or heuristic-based).
• Cost relations (in the figure): C5 < C3 < C1 < C6; C2 < C1; C4 < C3.

[Figure: the BeFS tree. The root (no vars expanded) branches on x2: solve the LP relaxation w/ x2 = 0 (Cost = cost(LP) = C1) and w/ x2 = 1 (Cost = C2). The x2 = 1 node branches on x4 (C3 for x4 = 0, C4 for x4 = 1), and the (x2 = 1, x4 = 1) node branches on x5 (C5 for x5 = 0, C6 for x5 = 1); the (x2 = 1, x4 = 1, x5 = 0) node is the optimal soln.]
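A sketch of this LP-based BeFS/branch & bound in Python, assuming SciPy's linprog as the LP-relaxation solver (minimize c·x s.t. A_ub·x <= b_ub, x in {0,1}^m); the branching rule (first fractional variable), the feasible-root assumption, and the example problem are illustrative.

import heapq
from scipy.optimize import linprog

def bb_01_ilp(c, A_ub, b_ub):
    """Minimize c.x s.t. A_ub.x <= b_ub, x in {0,1}^len(c); root LP assumed feasible."""
    def lp(fixed):                       # LP relaxation with vars in `fixed` pinned
        bounds = [(fixed[i], fixed[i]) if i in fixed else (0, 1)
                  for i in range(len(c))]
        return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')

    root, tie = lp({}), 0
    open_list = [(root.fun, tie, {}, root.x)]   # min-heap keyed on LP cost (a LB)
    while open_list:
        cost, _, fixed, x = heapq.heappop(open_list)
        frac = [i for i, v in enumerate(x) if min(v, 1 - v) > 1e-6]
        if not frac:                            # integral LP soln popped first => optimal
            return cost, [round(v) for v in x]
        i = frac[0]                             # branch on a fractional variable
        for val in (0, 1):
            child = lp({**fixed, i: val})
            if child.success:                   # prune infeasible subproblems
                tie += 1
                heapq.heappush(open_list, (child.fun, tie, {**fixed, i: val}, child.x))
    return None

# Example: maximize x0 + 2*x1 + 3*x2 s.t. 4*x0 + 5*x1 + 6*x2 <= 9 (min. of the negation)
print(bb_01_ilp([-1, -2, -3], A_ub=[[4, 5, 6]], b_ub=[9]))  # cost -3.0, e.g. x = [1, 1, 0]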

  17. (Sub-optimal stopping rule: stop when a generated child is a soln. node whose cost is at most (1 + alpha)*cost(best(open)), where alpha is the given sub-optimality fraction.)

  18. Scalability—choosing the number of processors P for constant efficiency

  19. E(P) = S(P)/P = T(1)/(T_P(P)*P) = n*t_exp / (P * (n/P) * (t_exp + (P-1)*t_acc))
       = n*t_exp / (n*t_exp + n*(P-1)*t_acc) = const. C <= 1
   (here n = no. of node expansions, t_exp = time to expand a node, and t_acc = the access/communication overhead per expansion attributable to each of the other P-1 procs.)
• => 1 + (P-1)*(t_acc/t_exp) = 1/C => (P-1)*(t_acc/t_exp) = 1/C - 1
• Solving for P: P = 1 + (1/C - 1)*(t_exp/t_acc), i.e., for a fixed target efficiency C, P ~ t_exp/t_acc.
• Note that (P-1)*(t_acc/t_exp) is increasing in P (its derivative w.r.t. P is t_acc/t_exp > 0), so efficiency only degrades beyond this P; t_exp/t_acc thus sets the scale of the largest machine that sustains efficiency C.
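A quick numeric illustration (made-up t_exp/t_acc ratio, not from the slides) of the P ~ t_exp/t_acc conclusion, using the efficiency expression above:

# E(P) = 1/(1 + (P-1)*t_acc/t_exp) for a few P around t_exp/t_acc
t_exp, t_acc = 100.0, 1.0           # assumed: an expansion costs 100x an access
for P in (10, 50, 100, 200, 400):
    E = 1 / (1 + (P - 1) * t_acc / t_exp)
    print(f"P = {P:4d}  E(P) = {E:.2f}")
# E is ~0.92 at P = 10, ~0.50 at P = t_exp/t_acc = 100, and degrades beyond that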

  20. Nodes w/ cost >= that of the current best global soln. so far are discarded. Note that this can sometimes lead to idling, and at other times non-essential work may be done before such deletion of nodes takes place. Both contribute to the overhead of parallel B&B.

  21. Load Balancing

[Figure legend: arrows denote load info exchanges (LIEs) and load/work transfers.]

• Generic load-balance protocol (a sketch follows this list):
• Periodic LIEs between subsets of processors (generally neighbors, or small extended neighborhoods, e.g., processors at distance <= k apart for small k)
• Followed by work transfers as indicated by the LIE and the work-transfer policy
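A minimal Python sketch of one round of this generic protocol (illustrative policies, not the QE strategies of the next slide): each processor learns its neighbors' loads (the LIE step), then pulls work from a sufficiently more-loaded neighbor.

def balance_round(loads, neighbors, threshold=2):
    """loads: proc id -> work-unit count; neighbors: proc id -> list of proc ids."""
    transfers = []
    for p in loads:
        info = {q: loads[q] for q in neighbors[p]}        # the LIE step
        donor = max(info, key=info.get)                   # most-loaded neighbor
        if loads[donor] - loads[p] >= threshold:          # work-transfer policy
            grant = (loads[donor] - loads[p]) // 2        # equalize the pair
            loads[donor] -= grant
            loads[p] += grant
            transfers.append((donor, p, grant))
    return transfers

loads = {0: 12, 1: 0, 2: 6, 3: 2}                         # a 4-proc ring
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(balance_round(loads, ring), loads)                  # e.g. [(0, 1, 6), (0, 3, 2)]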

  22. Quality Equalizing (QE) Load Balancing Techniques
• Various techniques developed by my former Ph.D. student Prof. Nihar Mahapatra (MSU) and myself over a few years. The refs are:
• N.R. Mahapatra and S. Dutt, "An efficient delay-optimal distributed termination detection algorithm", Journal of Parallel and Distributed Computing, Oct. 2007, pp. 1047-1066.
• N.R. Mahapatra and S. Dutt, "Adaptive Quality Equalizing: High-Performance Load Balancing for Parallel Branch-and-Bound Across Applications and Computing Systems", Proc. Joint IEEE Parallel Processing Symposium/Symp. on Parallel and Distributed Processing, April 1998.
• N.R. Mahapatra and S. Dutt, "Random Seeking: A General, Efficient, and Informed Randomized Scheme for Dynamic Load Balancing", Proc. Tenth IEEE Parallel Processing Symposium, April 1996, pp. 881-885.
• N.R. Mahapatra and S. Dutt, "New anticipatory load balancing strategies for scalable parallel best-first search", American Mathematical Society's DIMACS Series on Discrete Mathematics and Theoretical Computer Science, Vol. 22, 1995, pp. 197-232.
• S. Dutt and N.R. Mahapatra, "Scalable load-balancing strategies for parallel A* algorithms", Journal of Parallel and Distributed Computing, Special Issue on Scalability of Parallel Algorithms and Architectures, Vol. 22, No. 3, Sept. 1994, pp. 488-505.
• S. Dutt and N.R. Mahapatra, "Parallel A* algorithms and their performance on hypercube multiprocessors", Proc. Seventh IEEE Parallel Processing Symposium, 1993, pp. 797-803.

  23. The donor processor grants very few nodes to the acceptor.
• For high-latency, low-bw platforms like NOWs (n/ws of workstations) and Beowulf clusters (like Argo):
• set s higher (s should be inversely proportional to bw; otherwise n/w saturation can occur)
• decrease the frequency of load info exchange (LIE)
