
Multi-Tasking Models and Algorithms



  1. Multi-Tasking Models and Algorithms General Concepts (Part I)

  2. Outline for Multi-Tasking Models Note: Items in black are in this slide set (Part I). • Preliminaries • Common Decomposition Methods • Characteristics of Tasks and Interactions • Mapping Techniques for Load Balancing • Some Parallel Algorithm Models • The Data-Parallel Model • The Task Graph Model • The Work Pool Model • The Master-Slave Model • The Pipeline or Producer-Consumer Model • Hybrid Models

  3. Outline (cont.) • Algorithm examples for most of the preceding algorithm models. • This part is currently missing and needs to be added in a later revision. • Some could be added as examples under the Task/Channel model • Task/Channel (Computational) Model • Asynchronous Communication and Performance Evaluation • Modeling Asynchronous Communication • Performance Metrics and Asynchronous Communications • The Isoefficiency Metric & Scalability • Future revision plans for the preceding material. • BSP (Computational) Model • Slides posted separately on the course website

  4. References • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004. • Particularly Chapters 3 and 7, plus algorithm examples. • Textbook slides for this book. • Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003. • Particularly Chapter 3 (available online). • Also Section 2.5 (Asynchronous Communications). • Slides by the authors. • Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Prentice Hall, 2005. • Ian Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison Wesley, 1995. Online at http://www-unix.mcs.anl.gov/dbpp/text/book.html

  5. Change in Chapter Title • This chapter consists of three sets of slides. • This chapter was formerly called Strictly Asynchronous Models • The name has now been changed to Multi-Tasking Models • However, the old name still occurs regularly in the internal slides.

  6. Specifying Asynchronous Algorithms • Identifying parts that can be done concurrently • Tasks • Mapping of the tasks onto multiple processors • Processes vs processors • Distributing the input, output, and intermediate results across different processors • Management of access to shared data • Either input or intermediate • Synchronization of the processors at various stages of the parallel execution

  7. Finding Concurrent Pieces of Work • Decomposition • The process of dividing the computation into smaller pieces of work called tasks • Tasks are programmer defined and are considered to be indivisible. • Tasks may be of arbitrary sizes • Simultaneous execution of multiple tasks is the key to reducing time required

  8. Example: Dense Matrix-Vector Multiplication • Tasks can be of different sizes • Granularity of a task (a sketch of the finest-grained decomposition follows below)
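
A minimal sketch (not from the original slides) of the finest-grained decomposition: the dot product for each row of the matrix is one task. The names n, A, x, and y are illustrative, and A is assumed to be a square matrix stored row-major. In C with OpenMP:

    /* Sketch: row-wise decomposition of dense matrix-vector multiplication.
       Each iteration of the outer loop is one fine-grained task (one row of A). */
    void mat_vec(long n, const double *A, const double *x, double *y)
    {
        #pragma omp parallel for            /* one task per row, mapped onto threads */
        for (long i = 0; i < n; i++) {
            double sum = 0.0;
            for (long j = 0; j < n; j++)
                sum += A[i * n + j] * x[j]; /* task i reads row i and all of x */
            y[i] = sum;                     /* task i owns output element y[i] */
        }
    }

Coarser-grained tasks can be obtained by assigning a block of several rows to each task instead of a single row.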

  9. Task-Dependency Graph • In most cases, there are dependencies between the different tasks • Certain task(s) can only start once some other task(s) have finished • Example: Producer-consumer relationships • These dependencies are represented using a DAG called a task-dependency graph

  10. Task-Dependency Graph (cont) • A task-dependency graph is a directed acyclic graph in which the nodes represent tasks and the directed edges indicate the dependencies between them • The task corresponding to a node can be executed when all tasks connected to this node by incoming edges have been completed. • The number and size of the tasks that the problem is decomposed into determine the granularity of the decomposition. • Called fine-grained for a large number of small tasks • Called coarse-grained for a small number of large tasks

  11. Task-Dependency Graph (cont) • Key Concepts Derived from Task-Dependency Graph • Degree of Concurrency • The number of tasks that can be executed concurrently • We are usually most concerned with the average degree of concurrency • Critical Path • The longest vertex-weighted path in the graph • The weights inside nodes represent the task sizes • Its length is the sum of the weights of the nodes along the path • The degree of concurrency normally increases, and the critical path length normally decreases, as the granularity of the decomposition becomes finer.
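
A worked example with hypothetical numbers (not taken from the slides): the average degree of concurrency equals the total amount of work divided by the critical path length. If a decomposition has 100 units of work in total and its critical path has a weighted length of 25 units, the average degree of concurrency is 100 / 25 = 4; on average four tasks can run concurrently, and no more than 4-fold speedup is possible regardless of how many processors are used.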

  12. Task-Interaction Graph • Captures the pattern of interaction between tasks • This graph usually contains the task-dependency graph as a subgraph • True since there may be interactions between tasks even when there are no dependencies between them • These interactions are usually due to accesses to shared data

  13. Task Dependency and Interaction Graphs • These graphs are important in developing effective mapping of the tasks onto the different processors • Need to maximize concurrency and minimize overheads.

  14. Processes vs Processors • Process vs Processor • Considered distinct concepts in this chapter. • Process: A logical computing agent that performs tasks. • Processor: A hardware unit that physically performs computations. • Usually there is a 1:1 correspondence between processors and processes. • However, this distinction provides additional flexibility. • In order to obtain any speedup over sequential programming, a parallel program must have several processes active at the same time, working on different tasks.

  15. Mapping Tasks to Processes • Mapping: The way that tasks are assigned to processes for execution. • Illustrated in Figures 3.5 and 3.7 • Good mappings attempt to • Maximize the use of concurrency by mapping independent tasks onto different processors. • Minimize the total completion time by ensuring that tasks on the critical path are executed as soon as they become available. • Map tasks with a high degree of mutual interaction to the same process.

  16. Decomposition Methods • Decomposition: The technique used to split the computation into a set of tasks. • Common decomposition techniques • Data Decomposition • Recursive Decomposition • Exploratory Decomposition • Speculative Decomposition • Hybrid Decomposition • Data and Recursive decompositions are general-purpose task decomposition methods. • Exploratory and Speculative decompositions are special-purpose task decomposition methods.

  17. Recursive Decomposition • Suitable for problems that can be solved using the divide and conquer paradigm • Each of the subproblems generated by the divide step becomes a new task. • Results in natural concurrency, as different subproblems can be solved concurrently

  18. Example: Quicksort
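
The original quicksort figure is not reproduced here, but a minimal sketch (assumed names, C with OpenMP tasks) shows how each subproblem produced by the divide (partition) step becomes an independent task:

    /* Sketch: recursive decomposition of quicksort.
       The two subarrays produced by partition() are independent tasks. */
    static long partition(int *a, long lo, long hi)    /* Lomuto scheme, pivot = a[hi] */
    {
        int pivot = a[hi];
        long i = lo - 1;
        for (long j = lo; j < hi; j++)
            if (a[j] <= pivot) { int t = a[++i]; a[i] = a[j]; a[j] = t; }
        int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
        return i + 1;
    }

    static void quicksort_rec(int *a, long lo, long hi)
    {
        if (lo >= hi) return;
        long p = partition(a, lo, hi);
        #pragma omp task shared(a)           /* left subproblem: a new task  */
        quicksort_rec(a, lo, p - 1);
        #pragma omp task shared(a)           /* right subproblem: a new task */
        quicksort_rec(a, p + 1, hi);
        #pragma omp taskwait
    }

    void quicksort(int *a, long n)
    {
        #pragma omp parallel
        #pragma omp single                   /* one thread seeds the task tree */
        quicksort_rec(a, 0, n - 1);
    }

In practice, small subarrays would be sorted sequentially so the task granularity does not become too fine.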

  19. Another Example: Finding the Minimum • Note that we can obtain divide-and-conquer algorithms for problems that are usually solved by using other methods.
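A minimal sketch (illustrative names, not from the slides): the divide-and-conquer formulation splits the range in half, and the two recursive calls are independent subproblems that could be executed as concurrent tasks, exactly as in the quicksort sketch above.

    /* Sketch: divide-and-conquer minimum over the range [lo, hi). */
    static int find_min(const int *a, long lo, long hi)
    {
        if (hi - lo == 1) return a[lo];
        long mid = lo + (hi - lo) / 2;
        int left  = find_min(a, lo, mid);    /* independent subproblem */
        int right = find_min(a, mid, hi);    /* independent subproblem */
        return left < right ? left : right;
    }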

  20. Recursive Decomposition • How good are the decompositions produced? • Average Concurrency? • Length of critical path? • How do the quicksort and min-finding decompositions measure up?

  21. Data Decomposition • Used to derive concurrency for problems that operate on large amounts of data • The idea is to derive the tasks by focusing on the multiplicity of data • Data decomposition is often performed in two steps: • Step 1: Partition the data • Step 2: Induce a computational partitioning from the data partitioning. • Which data should we partition? • Input / Output / Intermediate? • All of the above • This leads to different data decomposition methods • How do we induce a computational partitioning? • Use the “owner-computes” rule

  22. Example: Matrix-Matrix Multiplication

  23. Matrix-Matrix Example (cont) Note that the tasks created by the previous decomposition are not unique.

  24. Partitioning Intermediate Data • The decomposition of the matrix multiplication in Figure 3.10 into four tasks can be refined further by partitioning the intermediate data. • See the next slide • The matrices Di,j that are created are not computed by the sequential algorithm, so this decomposition requires a change to the sequential algorithm. • Additionally, creating the Di,j matrices requires additional storage space.

  25. “Owner-Computes” Rule • Used when data decomposition is used to partition the work into tasks. • This general principle requires that each partition performs all computations that involve the data it owns. • This is illustrated in the next two slides.
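
A minimal sketch (assumed names; n is assumed divisible by num_tasks) of the owner-computes rule applied to an output-data decomposition of C = A × B: each task owns a block of rows of C and performs every computation that writes into that block.

    /* Sketch: task 'task_id' of 'num_tasks' owns rows [row_lo, row_hi) of C
       and performs all computation that writes into those rows. */
    void matmul_owned_rows(int task_id, int num_tasks, int n,
                           const double *A, const double *B, double *C)
    {
        int rows_per_task = n / num_tasks;
        int row_lo = task_id * rows_per_task;
        int row_hi = row_lo + rows_per_task;

        for (int i = row_lo; i < row_hi; i++)        /* only the rows this task owns */
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;                  /* owner computes its own output */
            }
    }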

  26. Exploratory Decomposition • Used to decompose computations that correspond to a search of a space of solutions. • The search space is partitioned into smaller parts, and these are searched concurrently until the desired solution is found. • The next slide shows the initial configuration of the 15-puzzle and a sequence of moves leading to the final configuration. • The subsequent slide shows how a state-space search leads to the solution.

  27. Exploratory Decomposition • Not general purpose • After sufficient branches have been generated, each node (process) can be assigned the task of exploring further down one branch • As soon as one task finds a solution, the other tasks can be terminated. • It can result in speedup and slowdown anomalies • The work performed by the parallel formulation of an algorithm can be either smaller or greater than the work performed by the serial algorithm.

  28. Exploratory Decomposition • Not general purpose • Can result in speedup anomalies • Either engineered slow-down or superlinear speedup.

  29. Speculative Decomposition • Used to extract concurrency in problems in which the next step is one of several actions that can only be determined when the current task finishes. • While the current task is executing, other tasks can perform the computation of the multiple branches in parallel • This decomposition method guarantees some wasteful computation. • An alternate version is to explore only the most promising branch • Or most promising branches
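
A minimal sketch (illustrative stand-in functions, C with OpenMP sections) of speculative decomposition of a two-way branch: while the slow branch condition is being evaluated, both possible follow-up computations run speculatively, and the result of the branch not taken is discarded as wasted work.

    #include <math.h>

    static int    slow_predicate(double x) { return sin(x) > 0.0; }  /* decides the branch */
    static double branch_a(double x)       { return x * x; }         /* speculative work   */
    static double branch_b(double x)       { return x + 1.0; }       /* speculative work   */

    double speculative_branch(double x)
    {
        int take_a = 0;
        double ra = 0.0, rb = 0.0;

        #pragma omp parallel sections
        {
            #pragma omp section
            take_a = slow_predicate(x);   /* the current task: determines the branch */
            #pragma omp section
            ra = branch_a(x);             /* may turn out to be wasted work */
            #pragma omp section
            rb = branch_b(x);             /* may turn out to be wasted work */
        }
        return take_a ? ra : rb;          /* keep only the taken branch's result */
    }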

  30. Speculative Decomposition • Difference from exploratory decomposition • In speculative decomposition, the input at a branch leading to multiple tasks is unknown. • In exploratory decomposition, the output of the multiple tasks originating at the branch is unknown. • Speculative decomposition can lead to more, less, or the same amount of work compared to the serial program.

  31. Speculative Execution • If predictions are wrong • Work is wasted • Work may need to be undone • State-restoring overhead • Memory/computations • However, it may be the only way to extract concurrency!

  32. Characteristics of Tasks • Task Generation • Static: All tasks are known before execution of algorithm starts. • Data decomposition usually results in static tasks • Example: Matrix Multiplication • Task Sizes • Relative amount of time to complete it • Uniform tasks: All require the same time • Non-uniform tasks: Execution time varies significantly. • Size of Data needed by a Task • Data must be available to process performing task • The size & location of this data may determine best process to perform task.

  33. Some Task Interaction Characteristics • Static vs Dynamic Interactions • Static interactions occur at predetermined times and involve predetermined tasks. • Ex: Matrix multiplication • Otherwise, the interaction is dynamic • Ex: 15-puzzle – Tasks that finish their work can pick up an unexplored state from the queue of another busy task. • Regular vs Irregular Interactions • Regular if the interaction has some structure that can be exploited to obtain an efficient implementation • Otherwise, irregular. • Ex: In sparse matrix-vector multiplication, a task must scan its row of the matrix to find out which of the vector entries it needs

  34. Some Task Interaction Characteristics (cont) • Read-only vs Read-Write Data Sharing • Read-only: Tasks only need to read data shared with other tasks • Ex: Matrix multiplication in Fig. 3.10 • Read-Write: Multiple tasks need to read and write some shared data. • Ex: Using a heuristic search to solve the 15-puzzle.

  35. Mapping Tasks to Processors • A good mapping strives to achieve the following conflicting goals: • Reducing the amount of time processors spend interacting with each other. • Reducing the total amount of time that some processors are active while others are idle. • Good mappings attempt to reduce the parallel processing overheads • If Tp is the parallel runtime using p processors and Ts is the sequential runtime (for the same algorithm), then the total overhead To is p × Tp − Ts. • This is the work done by the parallel system beyond that required by the serial system. • A small worked example follows below.
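
For instance, with hypothetical numbers not taken from the slides: if Ts = 100 seconds and p = 4 processors finish in Tp = 30 seconds, then To = 4 × 30 − 100 = 20 seconds. Those 20 processor-seconds were spent on communication, synchronization, and idling rather than on the work the serial algorithm performs.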

  36. Mapping Tasks to Processors (cont) • Two main sources of overhead • Load imbalance • Results in process inactivity during execution • Inter-process communication • Coordination • Synchronization • Data-sharing • The goal of mapping tasks to processes is to minimize the overheads. • The goals of minimizing these two overheads are often in conflict with each other.

  37. Why Mappings can be Complicated • Mappings need to consider the task-dependency graph • Are tasks available a priori? • Static vs dynamic task generation • Computation requirements factors • Are they uniform or non-uniform? • Do we know them a priori? • How much data is associated with each task? • Mappings need to consider the task-interaction graph to determine the interactions between tasks • Are they static or dynamic? • Do we know about them a priori? • Are they data-instance dependent? • Are they regular or irregular? • Are they read-only or read-write? • Depending on the above characteristics, different mapping techniques with differing complexities and costs are required.

  38. Simple & Complex Task Interactions Example • Consider the task-interaction graph for image dithering • The color of each pixel is determined as a weighted average of its original color and the values of neighboring pixels • If we break the image up into square regions and assign a different task to each, we have simple task interactions • Consider sparse matrix-vector multiplication. • Assign the i-th row and the i-th vector value to the i-th task. • If the j-th entry in the i-th row is non-zero, then the i-th task must obtain the j-th vector value from the j-th task (unless i = j). • The result is a complex task-interaction graph (see the sketch after the next slide).

  39. Example: Simple & Complex Task Interactions
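
The original figures are not reproduced here, but a minimal sketch (standard CSR array names, assumed for illustration) makes the irregular interaction visible: the task that owns row i must read x[col_idx[k]] for every nonzero in that row, so which tasks it interacts with depends on the sparsity pattern.

    /* Sketch: sparse matrix-vector multiplication, one task per row (CSR format). */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {                /* task i computes y[i] */
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];       /* x entries owned by other tasks */
            y[i] = sum;
        }
    }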

  40. Mapping Techniques for Load Balancing • Problem: Assigning tasks whose total computational requirements are the same does not automatically ensure a balanced load. • Each processor below is assigned three tasks, but (a) is better than (b).

  41. Load Balancing Techniques • Static Mapping • The tasks are distributed among the processors prior to execution • Applicable for tasks that are • Generated statically • Of known and/or uniform computational requirements • Optimal mapping of non-uniform tasks is NP-hard, so heuristic mappings are required to obtain acceptable solutions • Dynamic Mapping • The tasks are distributed among the processors during the execution of the algorithm • i.e., tasks & data are migrated during execution • Applicable for tasks that are either • Generated dynamically, or • Of unknown computational requirements

  42. Static Mapping – Array Distribution • Suitable for algorithms that • Use data decomposition • Have underlying data in the form of arrays • i.e., input, output, or intermediate data • Block Distribution • Cyclic Distribution • Block-Cyclic Distribution • Randomized Distribution • Each can be applied in 1D, 2D, or 3D
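
As a hedged aside (these formulas are standard but are not spelled out in these slides), the owner of row i under the different 1D distributions of n rows over p processes can be written compactly; b is the block size of a block-cyclic distribution:

    /* Sketch: which process owns row i under common 1D distributions. */
    int owner_block(int i, int n, int p)        { return i / ((n + p - 1) / p); }  /* blocks of ceil(n/p) rows */
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }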

  43. 1D Block Distributions • Partitioning an n×m two-dimensional array along one dimension among p processes. • Process k can be given the k-th block of n/p consecutive rows. • i.e., rows k·(n/p), ..., (k+1)·(n/p) − 1 are given to process k. • If n/p is not an integer • All processes except the last can be given a block of ⌈n/p⌉ rows, and the last process the remaining rows • Alternatively, the initial processes could receive ⌈n/p⌉ rows and the rest ⌊n/p⌋ rows (a sketch of this variant follows below) • Similarly, process k can be given the k-th block of m/p consecutive columns.
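
A minimal sketch (assumed function name) of the second variant described above, in which the first n mod p processes receive one extra row:

    /* Sketch: 1D block distribution of n rows over p processes.
       The first (n mod p) processes get ceil(n/p) rows; the rest get floor(n/p). */
    void block_range(int n, int p, int k, int *first_row, int *num_rows)
    {
        int base = n / p;                        /* floor(n/p) rows for every process  */
        int rem  = n % p;                        /* first 'rem' processes get one more */
        *num_rows  = base + (k < rem ? 1 : 0);
        *first_row = k * base + (k < rem ? k : rem);
    }

For example, with n = 10 rows and p = 4 processes, processes 0-3 receive rows 0-2, 3-5, 6-7, and 8-9 respectively.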

  44. 2D Block Distributions • We could partition along more than one dimension. • With a d-dimensional array, we can partition along up to d dimensions. • If we have p processes and p = p1 × p2, we could partition an n×n array into p subblocks of size (n/p1) × (n/p2) and assign one subblock to each process. • The preceding 1D and 2D distributions are illustrated in the next slide.
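
A minimal sketch (assumed names; n assumed divisible by p1 and p2): the process with 2D coordinates (r, c), where 0 ≤ r < p1 and 0 ≤ c < p2, owns the (n/p1) × (n/p2) subblock whose top-left corner is computed below.

    /* Sketch: origin of the subblock owned by process (r, c) in a 2D block
       distribution of an n x n array over p = p1 * p2 processes. */
    void block2d_origin(int n, int p1, int p2, int r, int c,
                        int *first_row, int *first_col)
    {
        *first_row = r * (n / p1);   /* row blocks of height n/p1   */
        *first_col = c * (n / p2);   /* column blocks of width n/p2 */
    }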

  45. Example: Block Distributions
