
CS 584




  1. CS 584

  2. Load Balancing • Goal: All processors working all the time • Efficiency of 1 • Distribute the load (work) to meet the goal • Two types of load balancing • Static • Dynamic

  3. Load Balancing • The load balancing problem can be reduced to the bin-packing problem • NP-complete • For simple cases, we can do well, but … • Heterogeneity • Different types of resources • Processor • Network, etc.

  4. Evaluation of load balancing • Efficiency • Are the processors always working? • How much processing overhead is associated with the load balance algorithm? • Communication • Does load balance introduce or affect the communication pattern? • How much communication overhead is associated with the load balance algorithm? • How many edges are cut in communication graph?

  5. Partitioning Techniques • Regular grids (-: Easy :-) • striping • blocking • use processing power to divide load more fairly • Generalized Graphs • Levelization • Scattered Decomposition • Recursive Bisection

  6. Levelization • Begin with a boundary • Number these nodes 1 • All nodes connected to a level 1 node are labeled 2, etc. • Partitioning is performed • determine the number of nodes per processor • count off the nodes of a level until exhausted • proceed to the next level
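The counting-off scheme above can be sketched in Python. This is a minimal illustration, not from the slides: the grid graph, the BFS-based level labeling, and the even node-per-processor split are all illustrative assumptions.

```python
from collections import deque

def levelize(adj, boundary):
    """Label nodes by level: boundary nodes get level 1, their
    unvisited neighbors level 2, and so on (a breadth-first sweep)."""
    level = {v: 1 for v in boundary}
    queue = deque(boundary)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level

def partition_by_levels(adj, boundary, p):
    """Assign nodes to p processors by counting off nodes level by
    level until each processor holds roughly n/p of them."""
    level = levelize(adj, boundary)
    order = sorted(level, key=lambda v: level[v])  # level 1 first, then 2, ...
    n = len(order)
    per_proc = (n + p - 1) // p                    # ceil(n / p)
    return {v: i // per_proc for i, v in enumerate(order)}

# Hypothetical example: a 3x4 grid graph, levelized from its left edge.
adj = {}
for r in range(3):
    for c in range(4):
        nbrs = []
        if r > 0: nbrs.append((r - 1, c))
        if r < 2: nbrs.append((r + 1, c))
        if c > 0: nbrs.append((r, c - 1))
        if c < 3: nbrs.append((r, c + 1))
        adj[(r, c)] = nbrs

part = partition_by_levels(adj, [(r, 0) for r in range(3)], 2)
```

On this grid each column is one level, so processor 0 receives the two leftmost columns and processor 1 the two rightmost.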

  7. Levelization

  8. Levelization • Want to ensure nearest-neighbor communication • Let p be the number of processors and n the number of nodes • Let ri be the sum of the number of nodes in contiguous levels i and i + 1 • Let r = max{r1, r2, … , rn} • Nearest-neighbor communication is assured if n/p > r
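The n/p > r condition can be checked with a few lines of code. The level sizes below are a hypothetical example, chosen only to show the arithmetic.

```python
def max_adjacent_pair(level_sizes):
    """r = max over i of (nodes in level i) + (nodes in level i+1)."""
    return max(a + b for a, b in zip(level_sizes, level_sizes[1:]))

# Hypothetical levelization: 100 nodes spread over 10 levels.
level_sizes = [6, 8, 10, 12, 14, 14, 12, 10, 8, 6]
n, p = sum(level_sizes), 3
r = max_adjacent_pair(level_sizes)   # the two middle levels: 14 + 14 = 28
nearest_neighbor_ok = n / p > r      # 100/3 > 28, so the condition holds
```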

  9. Scattered Decomposition • Used for highly irregular grids • Partition load into a large number r of rectangular clusters such that r >> p • Each processor is given a disjoint set of r/p clusters. • Communication overhead can be a problem for highly irregular problems.

  10. Recursive Bisection • Recursively divide the domain in two pieces at each step. • 3 Methods • Recursive Coordinate Bisection • Recursive Graph Bisection • Recursive Spectral Bisection

  11. Recursive Coordinate Bisection • Divide the domain based on the physical coordinates of the nodes. • Pick a dimension and divide in half. • RCB uses no connectivity information • lots of edges crossing boundaries • partitions may be disconnected • Some new research based on graph separators overcomes some problems.
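A minimal sketch of recursive coordinate bisection, using only node coordinates as the slide describes. The choice to split along the dimension of largest extent and the median split are common conventions, assumed here rather than taken from the slides.

```python
def rcb(points, depth):
    """Recursive coordinate bisection: at each level, split the point
    set in half along its physical dimension of largest extent."""
    if depth == 0:
        return [points]
    dims = len(points[0])
    # Pick the dimension with the largest spread of coordinates.
    d = max(range(dims),
            key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    return rcb(pts[:mid], depth - 1) + rcb(pts[mid:], depth - 1)

# Hypothetical example: 8 points on a line; two bisection levels
# yield 4 equal partitions.
parts = rcb([(float(x), 0.0) for x in range(8)], 2)
```

Note that no connectivity information is consulted anywhere, which is exactly why RCB can produce disconnected partitions with many cut edges.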

  12. Inertial Bisection • Often, coordinate bisection is susceptible to the orientation of the mesh • Solution: Find the principal axis of the communication graph

  13. Graph Theory Based Algorithms • Geometric algorithms are generally low quality • they don’t take into account connectivity • Graph theory algorithms apply what we know about generalized graphs to the partitioning problem • Hopefully, they reduce the cut size

  14. Greedy Bisection • Start with a vertex of the smallest degree • least number of edges • Mark all its neighbors • Mark all its neighbors' neighbors, etc. • The first n/p marked vertices form one subdomain • Apply the algorithm to the remaining vertices
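The marking procedure above amounts to growing a subdomain by breadth-first search from a minimum-degree seed. A sketch, with the degree measured within the still-unassigned vertices (an assumption; the slides do not specify how later seeds are chosen):

```python
from collections import deque

def greedy_bisection(adj, p):
    """Grow subdomains of ~n/p vertices by BFS from a minimum-degree
    seed, then repeat on whatever vertices remain unassigned."""
    remaining = set(adj)
    target = len(adj) // p
    parts = []
    while remaining:
        # Seed at a vertex of smallest degree among unassigned vertices.
        seed = min(remaining, key=lambda v: sum(w in remaining for w in adj[v]))
        part, queue, seen = [], deque([seed]), {seed}
        while queue and len(part) < target:
            v = queue.popleft()
            part.append(v)
            for w in adj[v]:
                if w in remaining and w not in seen:
                    seen.add(w)
                    queue.append(w)
        remaining -= set(part)
        parts.append(part)
    return parts

# Hypothetical example: a 6-vertex path, split across 2 processors.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = greedy_bisection(adj, 2)
```

On the path graph the degree-1 endpoints are the natural seeds, so the split lands in the middle and cuts a single edge.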

  15. Recursive Graph Bisection • Based on graph distance rather than coordinate distance. • Determine the two furthest separated nodes • Organize and partition nodes according to their distance from extremities. • Computationally expensive • Can use approximation methods.

  16. Recursive Spectral Bisection • Uses the discrete Laplacian • Let A be the adjacency matrix • Let D be the diagonal matrix where • D[i,i] is the degree of node i • LG = A - D

  17. Recursive Spectral Bisection • LG is negative semidefinite • Its largest eigenvalue is zero and the corresponding eigenvector is all ones. • The magnitude of the second largest eigenvalue gives a measure of the connectivity of the graph. • Its corresponding eigenvector gives a measure of distances between nodes.

  18. Recursive Spectral Bisection • The eigenvector corresponding to the second largest eigenvalue is the Fiedler vector. • Calculation of the Fiedler vector is computationally intensive. • RSB yields connected partitions that are very well balanced.
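A toy illustration of spectral bisection, using only the standard library. Real RSB codes compute the Fiedler vector with a Lanczos-type eigensolver; here we approximate it by power iteration on the shifted operator c*I + LG (whose dominant eigenvector is all ones, which we project out), then split vertices by the sign of their entries. The graph and iteration count are illustrative assumptions.

```python
def fiedler_bisection(adj, iters=2000):
    """Approximate the Fiedler vector of LG = A - D by power iteration
    on c*I + LG with the all-ones direction projected out, then split
    vertices by the sign of their vector entries."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    deg = [len(adj[v]) for v in nodes]
    c = 2 * max(deg) + 1          # shift so every eigenvalue is positive

    def mat_vec(x):
        # (c*I + LG) x = c*x - deg*x + A*x
        y = [(c - deg[i]) * x[i] for i in range(n)]
        for v in nodes:
            for w in adj[v]:
                y[idx[v]] += x[idx[w]]
        return y

    x = [float(i) for i in range(n)]          # arbitrary start vector
    for _ in range(iters):
        x = mat_vec(x)
        mean = sum(x) / n
        x = [xi - mean for xi in x]           # project out the ones vector
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    return ({v for v in nodes if x[idx[v]] < 0},
            {v for v in nodes if x[idx[v]] >= 0})

# Hypothetical example: two triangles joined by the single edge 2-3.
# The spectral split should recover the two triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
a, b = fiedler_bisection(adj)
```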

  19. Example

  20. Example Results • RSB: 299 edges cut • RCB: 529 edges cut • RGB: 618 edges cut

  21. Global vs Local Partitioning • Global methods produce a “good” partitioning • Local methods can then be used to improve the partitioning

  22. The Kernighan-Lin algorithm • Swap pairs of nodes to decrease the cut • Will allow intermediate increases in the cut size to avoid certain local minima • Loop • choose the pair of nodes with largest benefit of swapping • logically exchange them (not for real) • lock those nodes • until all nodes are locked • Find the sequence of swaps that yields the largest accumulated benefit • Perform the swaps for real
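A single Kernighan-Lin pass can be sketched as follows. This is a naive version: gains are recomputed from scratch for every candidate pair, whereas a real implementation maintains incremental gain values in bucket structures. The example graph and initial partition are illustrative assumptions.

```python
def kl_pass(adj, A, B):
    """One Kernighan-Lin pass: repeatedly pick the pair swap with the
    largest gain (negative gains allowed), logically swap and lock the
    pair, then keep only the prefix with the best cumulative gain."""
    A, B = set(A), set(B)

    def gain(v, own, other):
        # D(v) = external edges minus internal edges for v
        return sum(w in other for w in adj[v]) - sum(w in own for w in adj[v])

    locked = set()
    swaps, cum, best_k, best_gain = [], 0, 0, 0
    for _ in range(min(len(A), len(B))):
        best = None
        for a in A - locked:
            for b in B - locked:
                # gain of swapping a and b: D(a) + D(b) - 2*c(a,b)
                g = gain(a, A, B) + gain(b, B, A) - 2 * (b in adj[a])
                if best is None or g > best[0]:
                    best = (g, a, b)
        g, a, b = best
        A.remove(a); B.remove(b); A.add(b); B.add(a)   # logical swap
        locked |= {a, b}
        swaps.append((a, b))
        cum += g
        if cum > best_gain:
            best_gain, best_k = cum, len(swaps)
    # Undo the swaps past the best prefix; the rest become real.
    for a, b in swaps[best_k:]:
        A.remove(b); B.remove(a); A.add(a); B.add(b)
    return A, B

# Hypothetical example: two triangles joined by edge 2-3, starting
# from a poor partition. The pass swaps 3 and 2, cutting one edge
# instead of five.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
A, B = kl_pass(adj, {0, 1, 3}, {2, 4, 5})
```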

  23. The Kernighan-Lin Algorithm

  24. Helpful-Sets • Two Steps • Find a set of nodes in one partition and move it to the other partition to decrease the cut size • Rebalance the load • The set of nodes moved must be helpful • Helpfulness of node is equal to the change in cut size if the node is moved

  25. Helpful-Sets • All these sets are 2-helpful

  26. Helpful-Sets Algorithm

  27. The Helpful-Sets Algorithm • Theory • If there is a bisection and if its cut size is not “too small” then there exists a small 4-helpful set in one side or the other • This 4-helpful set can be moved and will reduce the cut by 4 • If imbalance is not “too large” and cut of unbalanced partition is not “too small” then it is possible to rebalance without increasing the cut size by more than 2 • Apply the theory iteratively until “too small” condition is met.

  28. Multi-level Hybrid Methods • For very large graphs, time to partition can be extremely costly • Reduce time by coarsening the graph • shrink a large graph to a smaller one that has similar characteristics • Coarsen by • heavy edge matching • simple partitioning heuristics

  29. Multi-level Hybrid Methods

  30. Comparisons • (x.xx): run time in seconds • ML: Multilevel (spectral on coarse, KL on intermediate) • IN: Inertial • Party: 5 or 6 different methods

  Graph (|v|, |e|) | Chaco ML | Chaco IN | Chaco IN+KL | Metis PMetis | Party all | Party all+HS
  airfoil (4253, 12289) | 85 (0.08) | 94 (0.00) | 83 (0.02) | 85 (0.04) | 94 (0.04) | 83 (0.15)
  crack (10240, 30380) | 211 (0.16) | 377 (0.01) | 218 (0.05) | 196 (0.14) | 243 (0.10) | 208 (0.44)
  wave (156317, 1059331) | 9542 (3.64) | 9834 (0.19) | 9660 (1.61) | 9801 (3.50) | 10361 (2.84) | 9614 (11.93)

  The remaining graphs report four cut (time) pairs each:
  lh (1443, 20148; total edge weight 487380): 36376 (0.33), 22579 (0.06), 13643 (0.06), 9897 (0.23)
  mat (73752, 1761718): 9359 (1.80), 9555 (2.04), 8869 (3.45), 8869 (11.52)
  DEBR (1048576, 2097149): 100286 (48.99), 101674 (988.39), 172204 (16.63), 94272 (577.97)

  31. Dynamic Load Balancing • Load is statically partitioned initially • Adjust load when an imbalance is detected. • Objectives • rebalance the load • keep edge cut minimized (communication) • avoid having too much overhead

  32. Dynamic Load Balancing • Consider adaptive algorithms • After an interval of computation • mesh is adjusted according to an estimate of the discretization error • coarsened in areas • refined in others • Mesh adjustment causes load imbalance

  33. Dynamic Load Balancing After refinement, node 1 ends up with more work

  34. Centralized DLB • Control of the load is centralized • Two approaches • Master-worker (Task scheduling) • Tasks are kept in central location • Workers ask for tasks • Requires that you have lots of tasks with weak locality requirements. No major communication between workers • Load Monitor • Periodically, monitor load on the processors • Adjust load to keep optimal balance

  35. Repartitioning • Consider: dynamic situation is simply a sequence of static situations • Solution: repartition the load after each • some partitioning algorithms are very quick • Issues • scalability problems • how different are current load distribution and new load distribution • data dependencies

  36. Decentralizing DLB • Generally focused on work pool • Two approaches • Hierarchy • Fully distributed

  37. Fully Distributed DLB • Lower overhead than centralized schemes. • No global information • Load is locally optimized • Propagation is slow • Load balance may not be as good as centralized load balance scheme • Three steps • Flow calculation (How much to move) • Mesh node selection (Which work to move) • Actual mesh node migration

  38. Flow calculation • View as a network flow problem on the processor communication graph • Add source and sink nodes • Connect source to all nodes • edge value is current load • Connect sink to all nodes • edge value is mean load

  39. Flow calculation • Many network flow algorithms • more intense than necessary • not parallel • Use simpler, more scalable algorithms • Random Matchings • pick random neighboring processes • exchange some load • eventually you may get there
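The random-matchings idea can be sketched directly: in each round, build a random matching of neighboring processors and average the load across every matched pair. The ring topology, round count, and initial loads below are illustrative assumptions.

```python
import random

def random_matching_balance(adj, load, rounds=200, seed=1):
    """Each round: greedily build a random matching of neighboring
    processors, then average load across every matched pair."""
    rng = random.Random(seed)
    load = dict(load)
    for _ in range(rounds):
        free = set(adj)
        nodes = list(adj)
        rng.shuffle(nodes)
        for v in nodes:
            if v not in free:
                continue
            nbrs = [w for w in adj[v] if w in free and w != v]
            if not nbrs:
                continue
            w = rng.choice(nbrs)          # match v with a random free neighbor
            free -= {v, w}
            avg = (load[v] + load[w]) / 2
            load[v] = load[w] = avg
    return load

# Hypothetical example: a ring of 4 processors with all load on one.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = random_matching_balance(adj, {0: 40.0, 1: 0.0, 2: 0.0, 3: 0.0})
```

As the slide warns, there is no hard bound on when balance is reached: the method only converges "eventually," as matchings happen to move load where it is needed.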

  40. Diffusion • Each processor balances its load with all its neighbors • How much work should I have? • How much to send on an edge? • Repeat in steps until the load is balanced

  41. Diffusion • Convergence to load balance can be slow • Can be improved with over-relaxation • Monitor what is sent in each step • Determine how much to send based on current imbalance and how much was sent in previous steps • Diffuses load in steps
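A first-order diffusion step sends a fixed fraction of the load difference across every edge. The sketch below uses a constant diffusion coefficient alpha (an assumption; it must be small enough for stability, e.g. less than 1/(max degree)) and omits the over-relaxation refinement the slide mentions.

```python
def diffuse(adj, load, alpha=0.25, steps=60):
    """First-order diffusion: each step, every edge (i, j) carries
    alpha * (l_j - l_i) of load from the heavier to the lighter side."""
    load = dict(load)
    for _ in range(steps):
        flow = {v: 0.0 for v in load}
        for v in load:
            for w in adj[v]:
                flow[v] += alpha * (load[w] - load[v])
        for v in load:
            load[v] += flow[v]
    return load

# Hypothetical example: a ring of 4 processors, all load on processor 0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = diffuse(adj, {0: 40.0, 1: 0.0, 2: 0.0, 3: 0.0})
```

Note the total load is conserved every step (each edge's flow is antisymmetric), while the imbalance decays geometrically, which is the slow convergence the slide refers to.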

  42. Dimension Exchange • Rather than communicate with all neighbors each round, only communicate with one • Comes from the dimensions of a hypercube • Use edge coloring for general graphs • Exchange load with the neighbor along a dimension • l = (li + lj)/2 • Will converge in d steps on a d-dimensional hypercube • Some graphs may need a different factor to converge faster • l = li * a + lj * (1 - a)
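The hypercube case is short enough to write out: processors are numbered so that the neighbor along dimension k is found by flipping bit k, and step k averages across exactly that edge. The d=3 example with all load on one processor is an illustrative assumption.

```python
def dimension_exchange(load, d):
    """Hypercube dimension exchange: in step k, every processor averages
    its load with the neighbor whose id differs in bit k."""
    load = list(load)
    for k in range(d):
        for i in range(len(load)):
            j = i ^ (1 << k)        # neighbor along dimension k
            if i < j:               # handle each pair once
                avg = (load[i] + load[j]) / 2
                load[i] = load[j] = avg
    return load

# Hypothetical example: a 3-dimensional hypercube (8 processors),
# all load initially on processor 0; balanced after exactly d = 3 steps.
final = dimension_exchange([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 3)
```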

  43. Diffusion & Dimension Exchange • Can view • diffusion as a Jacobi method • dimension exchange as Gauss-Seidel • Can use multi-level variants • Divide the processor communication graph in half • Determine the load to shift across the cut • Recursively rebalance each half

  44. Mesh node selection • Must identify which mesh nodes to migrate • minimize edge cut and overhead • Very dependent on problem • Shape & size of partition may play a role in accuracy • Aspect ratio maintenance • Move items that are further away from center of gravity.

  45. Load Balancing Schemes (Who do I request work from?) • Asynchronous Round Robin • each processor maintains its own target • ask the target for work, then increment the target • Global Round Robin • a single target is maintained by the master node • Random Polling • randomly select a donor • each processor has equal probability of being selected
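The donor-selection policies can be sketched as small classes; the class and method names below are illustrative, not from the slides. Asynchronous round robin keeps a per-processor pointer, while random polling draws each donor uniformly from the other processors.

```python
import random

class AsyncRoundRobin:
    """Asynchronous round robin: each processor keeps its own target
    pointer, asks that processor for work, then increments the pointer."""
    def __init__(self, me, p):
        self.me, self.p = me, p
        self.target = (me + 1) % p
    def next_donor(self):
        t = self.target
        self.target = (self.target + 1) % self.p
        if t == self.me:                 # never ask yourself for work
            return self.next_donor()
        return t

class RandomPolling:
    """Random polling: every other processor is an equally likely donor."""
    def __init__(self, me, p, seed=0):
        self.me, self.p = me, p
        self.rng = random.Random(seed)
    def next_donor(self):
        t = self.rng.randrange(self.p - 1)
        return t if t < self.me else t + 1   # uniform over all but me

# Processor 2 of 4 cycles through the others in order: 3, 0, 1, 3, 0, ...
arr = AsyncRoundRobin(me=2, p=4)
donors = [arr.next_donor() for _ in range(5)]
```

Global round robin is the same pointer kept at a single master node, which every requester must contact, a potential bottleneck the other two schemes avoid.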
