
CSE 8163

CSE 8163. Parallel & Distributed Scientific Computing. Dr. Ioana Banicescu. Parallel algorithms for unstructured & dynamically varying problems. Background and related work



Presentation Transcript

  1. CSE 8163 Parallel & Distributed Scientific Computing Dr. Ioana Banicescu

  2. Parallel algorithms for unstructured & dynamically varying problems

  3. Parallel algorithms for unstructured & dynamically varying problems • Background and related work - motivation (application, model, performance) - steps towards best solution - problem classification - discuss various aspects • Mapping algorithms onto architectures - grid-oriented problems (main issues) - structured grid-oriented problems - unstructured grid-oriented problems • Conclusions and future work

  4. Motivation • Unstructured and dynamic problems - abstractions of various phenomena - astrophysics, molecular biology, chemistry - fluid dynamics, electromagnetics, … - large, computationally intensive • Applications - solutions to boundary integral equations . wave propagation, fluid flow - transitions from 3-D to 2-D - prediction (Schrödinger's 1st, 2nd) - local density approximations - large least-squares problems - image processing, molecular biology, astrophysics

  5. Motivation (continued) • Goal: model real event - simulations, prediction, impact - accurate solutions • Need high performance solutions - performance evaluations . hard to compare (different parameters) . metrics (ExT, Sp, E, IsoE, ConvR) - simple, accurate, fast, effective

  6. Performance of Parallel Algorithms • Theoretical order analysis (not enough) - best solutions restricted applications - nonrealistic assumptions about resources . PE: number, speed; problem size • Empirical testing (optimal solution difficult) - need to consider various factors . [FlattKen89]: given overhead, problem size, parallel execution time minimum for a unique number of processors. studies conclude: convergence rate and parallel efficiency determine granularity and choice of algorithm. . [BarrHick93]: graph (best solution versus time) • Performance degradation (loss degree)
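
The [FlattKen89] observation above (for a given overhead and problem size, the parallel execution time is minimal at one particular number of processors) can be illustrated with a toy cost model. This is a minimal sketch, assuming T(p) = W/p + c*p, i.e. perfectly divisible work W plus an overhead that grows linearly with the PE count. The model and the names `parallel_time`, `best_pe_count`, `speedup` and `efficiency` are illustrative assumptions, not taken from the slides.

```python
def parallel_time(work, overhead_per_pe, p):
    """Assumed toy model: T(p) = work / p + overhead_per_pe * p."""
    return work / p + overhead_per_pe * p


def best_pe_count(work, overhead_per_pe, max_pe):
    """Return the PE count in 1..max_pe with the smallest modeled time."""
    return min(range(1, max_pe + 1),
               key=lambda p: parallel_time(work, overhead_per_pe, p))


def speedup(work, overhead_per_pe, p):
    """Sp = T(1) / T(p) under the same assumed model."""
    t1 = parallel_time(work, overhead_per_pe, 1)
    return t1 / parallel_time(work, overhead_per_pe, p)


def efficiency(work, overhead_per_pe, p):
    """E = Sp / p."""
    return speedup(work, overhead_per_pe, p) / p


if __name__ == "__main__":
    # With W = 1024 and c = 1 the modeled time is smallest at p = 32.
    print(best_pe_count(1024.0, 1.0, 128))
```

Under this model, adding PEs beyond the minimum makes the run slower, which is why order analysis alone (ignoring overhead) is not enough to pick a machine size.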

  7. Steps towards best performance solution • Depends upon best choice of - parallel algorithm (simple, accurate, fast) - parallel architecture (fast, effective) - mapping (detect, partitioning, scheduling) • Mapping – goals (minimize factors) - computation time (numerical efficiency) - communication (comm/comp ratio) - load imbalance (effective use of resources) - overhead (synch, comm, sched) . [identify stochastic behavior – algorithm design]

  8. Steps towards best performance solution (continued) • Best performance (choice of parameters) - find dominant component • Mapping – considerations - problems domain distribution (pattern, density, size) - interconnection topology, characteristics of each computation

  9. Problem classification • Pattern of data points distribution - structured (uniform) . synchronous - unstructured (nonuniform) . loosely synchronous . asynchronous • Density of data points - dense (high density) - sparse (low density) • Solutions: grids (math description, data struct) structured, unstructured, semi structured

  10. Problem Classification: Depending on the pattern of data points distribution

  11. Problem Classification: Depending on the density of data points distribution

  12. Structured Problems • Synchronous, regular in space & time • Uniform distribution of data points • Parallelism easy to detect, express, implement • Naturally expressed (vector, matrix) • Compiler: maps constructs, operations • e.g. QCD simulations, chemistry, …

  13. Unstructured Problems • Loosely synchronous (irregular in space, regular in time) • Asynchronous (irregular in space & time) • Nonuniform distribution of data points • Irregularities difficult to detect, express • Irregularities hard to implement (develop) • Need: flexible HW, fast communication

  14. Loosely Synchronous Problems • Dominant methodology irreg scientific simulation - irregular static, data parallel over sparse structure • Irregularity not too hard: detect, express - hierarchical data struct, sparse matrices - success express irregularity: performance gains • Irregularity hard to implement: - high-level data structure, geometric decomposition • Run better on MIMD than on SIMD

  15. Loosely Synchronous Problems (continued) • Different data points, distinct algorithm • Eg periods interact heterogeneous objects - time-driven simulation, statistical physics, particle dynamics - adaptive meshes, biology, image processing - Monte Carlo, clustering algorithm: N-body problem • Spatial structure may change dynamically - need synchronization each iteration step • Need: adaptive algorithms

  16. Asynchronous • Irregularity hard to detect, express, implement • Hard to parallelize (unless embarrassingly) • Eg event-driven simulation, chess, market analysis • Irregularities dynamic (cannot exploit) • Cannot use simple mappings (comm, decomp) • Object-oriented approach (flexible communication) • Statistical methods load balancing

  17. Grid-oriented problems • Structured grids - simple (each node same procedure) - low overhead - hard to create (complex domains) • Unstructured grids - easy to create, adapt, effective - no need to propagate local features - large overhead, more storage • Semistructured grids - domain: unstructured union of structured subdomains - use [GroppKeyes92], PAFMA [LeaBoa92]

  18. Grids

  19. Grids (continued)

  20. Grids (continued)

  21. Grids: Domain splitting strategies

  22. Grids: Domain splitting strategies (continued)

  23. Dense and sparse problems • Structured, unstructured - various approaches

  24. Dense problems • Matrix problems - well or not well determined solutions - indices not necessarily linear order . Vandermonde, Toeplitz, orthogonal … structure - well conditioned: accurate regardless of computation method - ill conditioned: can be highly accurate if algorithm computes small residuals, sol. det right hand side - role ill conditioned and condition estimators . [Edel93] paradox - improved: multiply, transpose, inverse

  25. Sparse problems • Undirected weighted graphs • Compressed rows - vector rows, nested pairs column-value • Different: multiply, transpose, inverse • Det eigenvalue, eigenvector of sparse matrix • Using O(n) matrix-vector multiply [Yau93]
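
The compressed-row layout mentioned above (each row stored as pairs of column and value) can be sketched as follows. This is an illustrative sketch only; the function names `to_compressed_rows` and `sparse_matvec` are assumptions, and the slides do not specify an implementation.

```python
def to_compressed_rows(dense):
    """Store each row as a list of (column, value) pairs, skipping zeros."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]


def sparse_matvec(rows, x):
    """Compute y = A x while touching only the stored nonzeros."""
    return [sum(v * x[j] for j, v in row) for row in rows]


if __name__ == "__main__":
    a = [[2, 0, 0],
         [0, 0, 3],
         [1, 0, 4]]
    rows = to_compressed_rows(a)
    print(rows)                            # [[(0, 2)], [(2, 3)], [(0, 1), (2, 4)]]
    print(sparse_matvec(rows, [1, 2, 3]))  # [2, 9, 13]
```

The cost of the multiply is proportional to the number of stored nonzeros, which is O(n) when every row holds a bounded number of entries.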

  26. Performance of PDE sparse solvers on hypercubes

  27. Dense approaches to sparse problems • Sparse problems contain dense blocks • Extract, process regular structure - e.g. FEBA algorithm for sparse matrix-vector multiplication [Agar92] • Direct matrix factorization - decomposition into smaller dense blocks, loss of performance, communication router [Kratzer92], mapping QR differs from mapping Cholesky

  28. Sparse approaches to dense problems • Recently successful, seem counterintuitive • General methods [Edel93], [Freu92], [Reev93] - access matrix through matrix-vector multiplication - look for preconditioners - replace O(n^2) matrix-vector multiplication with approximations (multipole, multigrid) • PAFMA [LeaBoa92] - nonuniform problem divided into uniform regions - take advantage of regular communication patterns - processor assignment in subregions with density variance. When: . load imbalance: assign dense regions . communication: assign sparse regions

  29. Stochastic nature of parallel algorithms • Variability run-time behavior algorithm - solution unpredictable, multiple paths - path nondeterministic, optimal path chosen run-time, results divergent: number pivots, solution time, alternate optima • Race conditions (time dependent decisions) - in algorithm’s design - same: problem, strategy, operating conditions - alternate optima, different: timing event, incoming variables, choices, sequence of points traversed • Eg. Self-scheduled nonuniform problems: - highly efficient machine utilization

  30. Stochastic nature of parallel algorithms (continued) • Some examples - parallel network simplex, good load balance, variability in run-time behavior - branch-and-bound for integer programs . good bounds affect portion of search tree explored - loops without dependencies among their iterations but sensitive to other iterations, OS, application . Factoring [HumSch, Fly92], scheduling: good load balance, overhead, scalable, resistant to iteration variance
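
The factoring scheme named above schedules loop iterations in batches of shrinking chunks. The following is a minimal sketch, assuming the commonly used rule in which each batch hands out roughly half of the remaining iterations as equal chunks, one per PE. That rule and the name `factoring_chunks` are assumptions for illustration; the slides do not give the formula.

```python
import math


def factoring_chunks(n_iterations, n_pes):
    """Chunk sizes in scheduling order.

    Assumed rule: every batch consists of n_pes chunks of size
    ceil(remaining / (2 * n_pes)), so each batch covers about half
    of the iterations that are still unscheduled.
    """
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = max(1, math.ceil(remaining / (2 * n_pes)))
        for _ in range(n_pes):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


if __name__ == "__main__":
    # 100 iterations on 4 PEs: batches of size 13, 6, 3, 2, 1.
    print(factoring_chunks(100, 4))
```

Large early chunks keep scheduling overhead low, while the small late chunks absorb variance in iteration times so that PEs finish at about the same moment.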

  31. Mapping algorithms onto architectures • Parallelism detection - depends upon algorithm and problem nature - independent of architecture - study data dependency: explicit, implicit • Partitioning (problem decomposition) - task into processes, identify sharing objects • Allocation (distribution of tasks to processors) - influenced by memory organization, interconnection • Scheduling (ordering task execution on processors) - depends upon interconnection and PE characteristics

  32. Mapping goals (partitioning, scheduling) • Minimize communication: - exploit locality • Minimize load imbalance: - adaptive refinement

  33. Partitioning • Granularity of process coarse enough for target machine without losing parallelism • Partitioning technique [Sarkar89] - start with initial fine granularity, use heuristics to merge processes until coarsest partition reached; use cost function (depends upon critical path, overhead)

  34. Scheduling • Goal: spread load evenly on PEs (efficiency) - static versus dynamic allocation . low overhead, inflexible versus high overhead, flexible - centralized versus distributed scheduling . centralized mediation [SmithSchnabel92] . hierarchical mediation strategy • Note: compiler, automatic tools for partitioning, scheduling - run-time (flexible, large overhead) - compile time (low overhead, need technology, good if easy to estimate)
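
The static versus dynamic trade-off above can be made concrete with a small simulation. This is a sketch under simplifying assumptions: task times are known to the simulator, scheduling overhead is ignored, and the dynamic policy is plain self-scheduling (the next task goes to the first PE that becomes free). The function names are hypothetical.

```python
import heapq


def static_block_makespan(task_times, n_pes):
    """Assign contiguous blocks of tasks to PEs ahead of time."""
    block = -(-len(task_times) // n_pes)  # ceiling division
    loads = [sum(task_times[i:i + block])
             for i in range(0, len(task_times), block)]
    return max(loads)


def dynamic_makespan(task_times, n_pes):
    """Self-scheduling: each task goes to whichever PE is free first."""
    free_at = [(0, pe) for pe in range(n_pes)]
    heapq.heapify(free_at)
    for t in task_times:
        ready, pe = heapq.heappop(free_at)
        heapq.heappush(free_at, (ready + t, pe))
    return max(ready for ready, _ in free_at)


if __name__ == "__main__":
    tasks = [9, 7, 1, 1, 1, 1, 1, 1]        # nonuniform work
    print(static_block_makespan(tasks, 2))  # 18
    print(dynamic_makespan(tasks, 2))       # 11
```

In a real system the dynamic policy pays a per-task scheduling cost that this sketch leaves out, which is the "high overhead, flexible" side of the trade-off.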

  35. The mapping problem • Assigning tasks to PEs such that Texec minimal • Model [HammondSchreiber92]: - each PE given equal work; G=(Vg, Eg) task graph (vertices = processors, edges = inter-processor communication); H=(Vh, Eh) PE graph; d = shortest path - find the surjection: such that communication load is minimized: - good results need partition, allocation, scheduling (if isolated: poor mappings, non-optimal time)
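
The objective formula for the mapping model above did not survive in this transcript. The sketch below therefore assumes a common formulation: the communication load of a mapping is the sum, over task-graph edges, of the edge weight times the shortest-path hop distance d between the PEs the two endpoints are mapped to. The function names, the dictionary-based graph encoding and the objective itself are assumptions, not the [HammondSchreiber92] definition.

```python
from collections import deque


def hop_distances(pe_graph, source):
    """Breadth-first search: shortest hop count from source to every PE."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in pe_graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def communication_load(task_edges, pe_graph, mapping):
    """Assumed objective: sum of weight * d(mapping[a], mapping[b])."""
    total = 0
    for (a, b), weight in task_edges.items():
        total += weight * hop_distances(pe_graph, mapping[a])[mapping[b]]
    return total


if __name__ == "__main__":
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4 PEs in a ring
    tasks = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "a"): 1}
    good = {"a": 0, "b": 1, "c": 2, "d": 3}
    bad = {"a": 0, "b": 2, "c": 1, "d": 3}
    print(communication_load(tasks, ring, good))  # 4
    print(communication_load(tasks, ring, bad))   # 6
```

A search method such as simulated annealing would explore candidate mappings and keep those that lower this value.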

  36. The mapping problem (continued) • Mapping of communication pattern strategy (efficient) [GuptaSchenfeld93] - switching locality (sparse nature of communication graphs) - each process switches its communication between a small set of other processes - ICN: PEs grouped in small clusters, intercluster and intracluster connectivity - identify this partitioning problem with bounded l-contraction of graph [RamKri92] (partitioning of vertex set into subsets such that: no subset contains more than l vertices and every subset has at least as many vertices as the number of subsets it is connected to) - simulated annealing; good results

  37. A mapping example

  38. Mapping of grid-oriented problems • Grid points geometrically adjacent - partitioning grid into subdomains (subgrids) - assigning them to PEs, each PE performs the computation associated with its subdomain • Dependency & communication restricted to perimeters of subdomains • Model [RooseDries93], structured grid: - time integration of finite-difference, finite-volume discretization of PDE on structured grid - 2D structured grid partitioned into subdomains of equal size - each PE performs local updates for interior grid points - boundary grid points need neighbor information

  39. Analysis (communication, load balance, numerical efficiency) • Communication overhead depends upon: - size of subdomains (large subdomains, small overhead); 2-D perimeter to surface, 3-D surface to volume ratio . communication volume proportional to subdomain perimeter: T_comm = t_startup + n * t_send . for fixed-size subdomain, perimeter is minimum if square . blockwise partitioning into square subregions leads to minimum communication volume . communication requirements not always isotropic - machine characteristics . influence communication to computation ratio - problem characteristics . dense problems: load imbalance predominates; . sparse problems: communication predominates
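
The perimeter argument above can be checked numerically. This is a rough sketch, assuming an n x n grid, a 5-point stencil with a one-point-deep halo, an interior subdomain, and p a perfect square that divides the grid evenly. It applies the per-message cost model from the slide (startup time plus n points times the per-point send time) once per neighbor. The function name `exchange_cost` and the layout labels are hypothetical.

```python
import math


def exchange_cost(n, p, layout, t_startup, t_send):
    """Modeled time for one boundary exchange of an interior subdomain.

    strips: p horizontal strips, 2 neighbors, each edge has n points.
    blocks: sqrt(p) x sqrt(p) square blocks, 4 neighbors,
            each edge has n / sqrt(p) points.
    """
    if layout == "strips":
        messages, points = 2, 2 * n
    elif layout == "blocks":
        side = n // math.isqrt(p)
        messages, points = 4, 4 * side
    else:
        raise ValueError(f"unknown layout: {layout}")
    return messages * t_startup + points * t_send


if __name__ == "__main__":
    # Moderate startup cost: square blocks move less data and win.
    print(exchange_cost(1024, 16, "strips", 100, 1))   # 2248
    print(exchange_cost(1024, 16, "blocks", 100, 1))   # 1424
    # Very high startup cost: fewer messages matter more, strips win.
    print(exchange_cost(1024, 16, "strips", 1000, 1))  # 4048
    print(exchange_cost(1024, 16, "blocks", 1000, 1))  # 5024
```

The second pair of numbers shows how machine characteristics shift the balance: square blocks minimize volume, but a machine with a large fixed cost per message can still favor a layout with fewer neighbors.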

  40. Analysis (continued) • Load imbalance: . minimal when block partitioning in square subdomains - work per grid may vary (cannot predict load imbalance) - mathematical models differ in various parts of the domain - boundary regions: work differs from interior cells, cells distributed almost equal among processors, achieved when subdomains square

  41. Analysis (continued) • Numerical efficiency: . accuracy results differ (depending algorithm, problem nature) - Numerical properties of inherently parallel algorithms not affected by partitioning strategy (e.g. Jacobi relaxation as opposed to Gauss-Seidel which has better convergence properties) - Runge-Kutta: update overlapping regions only after complete integration step (omitting some communication); high speedup, small convergence degradation; # of blocks is determined by the # of PEs and not domain geometry - Block tridiagonal systems: (Thomas sequential, pipelined); parallel solvers use Gaussian elimination, cyclic reduction: distribution of sequential parts over PEs at the expense of increased communication
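
The Jacobi versus Gauss-Seidel contrast above can be seen on a small model problem. This is a minimal sketch, assuming the 1-D system -u[i-1] + 2u[i] - u[i+1] = f[i] with fixed end values; the problem, tolerance and function names are illustrative choices, not taken from the slides.

```python
def jacobi_sweep(u, f):
    """One Jacobi sweep: every update reads only old values, so the
    points can be updated in any order (or on any PE) with the same result."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + f[i])
    return new


def gauss_seidel_sweep(u, f):
    """One Gauss-Seidel sweep: new values are used immediately,
    so the result depends on the update order."""
    new = u[:]
    for i in range(1, len(new) - 1):
        new[i] = 0.5 * (new[i - 1] + new[i + 1] + f[i])
    return new


def sweeps_to_converge(sweep, n, tol=1e-8, max_sweeps=100000):
    """Count sweeps until the largest change between iterates is below tol."""
    u = [0.0] * n
    f = [0.0] + [1.0] * (n - 2) + [0.0]
    for k in range(1, max_sweeps + 1):
        new = sweep(u, f)
        if max(abs(a - b) for a, b in zip(new, u)) < tol:
            return k
        u = new
    return max_sweeps


if __name__ == "__main__":
    print("Jacobi sweeps:      ", sweeps_to_converge(jacobi_sweep, 17))
    print("Gauss-Seidel sweeps:", sweeps_to_converge(gauss_seidel_sweep, 17))
```

On this problem Gauss-Seidel needs fewer sweeps, but its sequential dependence is what a partitioned implementation has to break or work around, whereas the Jacobi update is unaffected by how the points are split among PEs.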

  42. Analysis (continued) • Numerical efficiency (continued) . domain decomposition algorithms for PDEs (contain algorithmic overhead) - Schwarz domain decomposition: overlapping subdomains (iterative process, approx boundaries) - Schur complement: non-overlapping subdomains (borders computed first)

  43. Hierarchical nature of multigrid algorithms - Obtaining acceptable performance levels requires optimization techniques which address the characteristics of each architectural class [MathesonTarjan93] • Model study for each architectural class • Conclusions: - fine grain machines (high variable communication cost) . optimize domain to PE topology mapping - medium grain machines (high fixed communication cost) . optimize domain partition: (well-shaped, perimeter minimum, # of neighbors small) • Parallel algorithms require accurate subspace decomposition and large number of PEs to provide practical alternatives to standard algorithms

  44. Optimal Partitioning: structured grids - Structured problems (use structured grids) - Unstructured problems (structured on subdomains) • Regular computation 2-D mesh MIMD [Lee90]: - workload and communication pattern same at each point, - communication depends upon: total communication required, actual pattern of communication (number of communicating neighbors, underlying architecture) • CCR important factor in performance evaluation (given stencil, generate best shape) - CCR depends upon: stencil, partition shape, grid (e.g. diamond-5p, hexagon-7p, square-9p star) (CCR max does not guarantee minimum execution time; CCR better performance indicator in shared-memory than message passing; CCR proportional to aspect ratio of partition)

  45. Optimal partitioning: structured grids (continued) • Optimal partitioning [ReedAdamsPatrick87] - formalize relationship: stencil, shape, underlying architecture - isolated evaluation of components yields suboptimal performance - message-passing: small versus large packets have opposite order results (stencil, shape) - type of interconnection network important (grid to network mapping must support interpartition communication pattern; otherwise performance degradation)

  46. Stencil examples

  47. Stencil examples (continued)

  48. Stencil examples (continued)
