
Mapping and Scheduling


Presentation Transcript


  1. Mapping and Scheduling W+A: Chapter 4 CSE 160/Berman

  2. Outline • Mapping and Scheduling • Static Mapping Strategies • Dynamic Mapping Strategies • Scheduling

  3. Mapping and Scheduling Models • Basic Models: • The program model is a task graph with dependencies • The platform model is a set of processors with an interconnection network

  4. Mapping and Scheduling • Mapping and scheduling involve the following activities: (1) Select a set of resources on which to schedule the task(s) of the application. (2) Assign application task(s) to compute resources. (3) Distribute data or co-locate data and computation. (4) Order tasks on compute resources. (5) Order communication between tasks.

  5. Mapping and Scheduling Terminology • The activities from the previous slide: (1) select a set of resources on which to schedule the task(s) of the application, (2) assign application task(s) to compute resources, (3) distribute data or co-locate data and computation, (4) order tasks on compute resources, (5) order communication between tasks • 1 = resource selection • 1–3 are generally termed mapping • 4–5 are generally termed scheduling • For many researchers, “scheduling” is also used to describe activities 1–5 • Mapping is an assignment of tasks in space; scheduling focuses on ordering in time
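
A tiny illustration (not from the slides) of the space-versus-time distinction: in the C sketch below, a mapping records only which processor each task goes to, while a schedule also records when each task starts. The task count, assignments, and start times are made-up values.

/* Illustrative only: a mapping places tasks in space (which processor);
 * a schedule additionally orders them in time (when they start). */
#include <stdio.h>

#define NTASKS 4

typedef struct { int proc; } Mapping;                 /* task -> processor */
typedef struct { int proc; double start; } Schedule;  /* task -> processor + start time */

int main(void) {
    Mapping  map[NTASKS]   = {{0}, {0}, {1}, {1}};
    Schedule sched[NTASKS] = {{0, 0.0}, {0, 3.0}, {1, 0.0}, {1, 2.0}};
    for (int i = 0; i < NTASKS; i++)
        printf("task %d: mapped to P%d, scheduled at t=%.1f on P%d\n",
               i, map[i].proc, sched[i].start, sched[i].proc);
    return 0;
}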

  6. Goals • Want the mapping and scheduling algorithms and models to promote the assignment/ordering with the smallest execution time • [Figure: “Accuracy vs. Ranking”: model estimates (A’, B’) compared with real execution times (A, B) for two candidate mappings, and where the optimum falls in each]

  7. What is the best mapping? • [Figure: a task graph with node and edge weights, to be mapped onto two processors P1 and P2]

  8. Static and Dynamic Mapping Strategies • Static methods generate the partitioning prior to execution • Static mapping strategies work well when we can reasonably predict the time to perform application tasks during execution • When it is not easy to predict task execution time, dynamic strategies may be more performance-efficient • Dynamic methods generate the partitioning during execution • For example, workqueue and master/slave (M/S) are dynamic methods (a minimal workqueue sketch follows)
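
The workqueue idea can be sketched in a few lines of C. This is an illustrative sketch only (the slides give no code): worker threads repeatedly pull the next task index from a shared counter, so the partitioning emerges at run time and faster workers naturally absorb more of the load. The thread count, task count, and simulated task costs are assumptions; compile with -pthread.

/* Minimal work-queue sketch: workers dynamically pull task indices from a
 * shared counter protected by a mutex. Task cost is simulated with sleeps. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTASKS   20
#define NWORKERS  4

static int next_task = 0;                     /* index of next unassigned task */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = next_task < NTASKS ? next_task++ : -1;   /* grab next task */
        pthread_mutex_unlock(&lock);
        if (t < 0) break;                                /* queue is empty */
        usleep((t % 5 + 1) * 1000);                      /* simulated, uneven work */
        printf("worker %ld finished task %d\n", id, t);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}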

  9. Static Mapping • [Figure: processes P2, P3, P5, P7, P11, P13, P17] • Static mapping can involve • partitioning of tasks (functional decomposition): the Sieve of Eratosthenes is an example • partitioning of data (data decomposition): the fixed decomposition of Mandelbrot (k blocks per processor) is an example of this (a small sketch follows)
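
For the data-decomposition case, a static assignment can be computed entirely before execution. The sketch below is illustrative only; the row count, processor count, and the round-robin order in which blocks are dealt out are all assumptions. It hands each processor k fixed blocks of rows, regardless of how expensive each row turns out to be.

/* Static data-decomposition sketch: rows are assigned to processors in fixed
 * blocks before execution begins, k blocks per processor. */
#include <stdio.h>

#define NROWS  16
#define NPROCS  4
#define K       2                              /* blocks handed to each processor */

int main(void) {
    int rows_per_block = NROWS / (NPROCS * K);
    for (int b = 0; b < NPROCS * K; b++) {
        int proc  = b % NPROCS;                /* deal blocks out round-robin */
        int first = b * rows_per_block;
        int last  = first + rows_per_block - 1;
        printf("rows %2d-%2d -> P%d\n", first, last, proc);
    }
    return 0;
}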

  10. Load Balancing • Load balancing = a strategy to partition the application so that • all processors perform an equivalent amount of work • (all processors finish in an equivalent amount of time; this is really time-balancing) • Processors may take different amounts of time to do equivalent amounts of work • Load balancing is an important technique in parallel processing • There are many ways to achieve a balanced load • Both dynamic and static load-balancing techniques exist

  11. Static and Dynamic Mapping for the N-body Problem • The N-body problem: Given n bodies in 3D space, determine the gravitational force F between each pair of bodies at any given point in time: F = G m_a m_b / r^2, where G is the gravitational constant, r is the distance between the two bodies, and m_a and m_b are the masses of the bodies

  12. Exact N-body serial pseudo-code • At each time step t, the velocity v and position x of body i may change • The real problem is a bit more complicated than this; see 4.2.3 in the book

  For (t = 0; t < tmax; t++) {
      For (i = 0; i < N; i++) {
          F = Force_routine(i);
          v[i]_new = v[i] + F * dt;
          x[i]_new = x[i] + v[i]_new * dt;
      }
      For (i = 0; i < N; i++) {
          x[i] = x[i]_new;
          v[i] = v[i]_new;
      }
  }
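
For concreteness, here is a compilable C rendering of the same step (my own sketch, not code from the book or slides). It uses 1D positions and made-up masses to stay short, and it updates velocity with the acceleration F/m rather than F as in the abbreviated pseudo-code above; the overall structure (compute all new values, then commit them) is the same. Compile with -lm.

/* Illustrative C version of the exact O(n^2) step (1D for brevity).
 * NBODIES, G, DT, and the body data are made-up values. */
#include <stdio.h>
#include <math.h>

#define NBODIES 4
#define G  6.674e-11
#define DT 0.01

static double m[NBODIES] = {1e10, 2e10, 1.5e10, 3e10};
static double x[NBODIES] = {0.0, 1.0, 2.5, 4.0};
static double v[NBODIES] = {0.0, 0.0, 0.0, 0.0};

/* Net (signed) force on body i: sum over j != i of G*m_i*m_j / r^2. */
static double Force_routine(int i) {
    double F = 0.0;
    for (int j = 0; j < NBODIES; j++) {
        if (j == i) continue;
        double d = x[j] - x[i];
        double r = fabs(d);
        F += G * m[i] * m[j] / (r * r) * (d / r);   /* sign gives direction */
    }
    return F;
}

int main(void) {
    double x_new[NBODIES], v_new[NBODIES];
    for (int t = 0; t < 100; t++) {
        for (int i = 0; i < NBODIES; i++) {          /* compute new state */
            double F = Force_routine(i);
            v_new[i] = v[i] + (F / m[i]) * DT;
            x_new[i] = x[i] + v_new[i] * DT;
        }
        for (int i = 0; i < NBODIES; i++) {          /* commit new state  */
            x[i] = x_new[i];
            v[i] = v_new[i];
        }
    }
    for (int i = 0; i < NBODIES; i++)
        printf("body %d: x=%.4f v=%.6f\n", i, x[i], v[i]);
    return 0;
}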

  13. Exact N-body and static partitioning • We can parallelize n-body by tagging the velocity and position for each body and updating bodies using correctly tagged information • This can be implemented as a data-parallel algorithm. What is the worst-case complexity of a single iteration? • How should we partition this? (a block-partitioning sketch follows) • Static partitioning can be a bad strategy for the n-body problem • The load can be very unbalanced for some configurations
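
A static partition of this computation would simply hand each processor a fixed block of bodies before the run starts, as in the short sketch below (illustrative; the body and processor counts are arbitrary). Every processor gets roughly the same number of bodies, but not necessarily the same amount of work.

/* Static partitioning sketch: bodies are divided into contiguous blocks,
 * one block per processor, before execution begins. */
#include <stdio.h>

#define N 10
#define P 3

int main(void) {
    for (int p = 0; p < P; p++) {
        int first = p * N / P;            /* block boundaries balance counts, */
        int last  = (p + 1) * N / P - 1;  /* not necessarily actual work      */
        printf("P%d updates bodies %d..%d\n", p, first, last);
    }
    return 0;
}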

  14. Improving the complexity of the N-body code • The complexity of the serial n-body algorithm is very large: O(n^2) for each iteration • The communication structure is not local – each body must gather data from all other bodies • The most interesting problems are those where n is large – it is not feasible to use the exact method for these • The Barnes-Hut algorithm is a well-known approximation to the exact n-body problem and can be efficiently parallelized

  15. Barnes-Hut Approximation • The Barnes-Hut algorithm is based on the observation that a cluster of distant bodies can be approximated as a single distant body • Total mass = aggregate mass of the bodies in the cluster • Distance to the cluster = distance to the center of mass of the cluster • This clustering idea can be applied recursively

  16. Barnes-Hut idea • Dynamic divide-and-conquer approach: • Each region (cube) of space is divided into 8 subcubes • If a subcube contains more than 1 body, it is recursively subdivided • If a subcube contains no bodies, it is removed from consideration • 2D example (pictured on the original slide): each 2D region is divided into 4 subregions

  17. Barnes-Hut idea • For a 3D decomposition, the result is an octree • For a 2D decomposition, the result is a quadtree (pictured on the original slide)

  18. Barnes-Hut Pseudo-code

  For (t = 0; t < tmax; t++) {
      Build octree;
      Compute the total mass and center of mass of each node;
      Traverse the tree, computing the forces;
      Update the position and velocity of all bodies;
  }

  • Notes: • The total mass and center of mass of each subcube are stored at its root • Tree traversal stops at a node when the clustering approximation can be used for a particular body • In the gravitational n-body problem described here, this can happen when r >= d/c, where r is the distance from the body to the center of mass of a subcube of side d and c is a constant.
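
To make the tree construction, mass summary, and the r >= d/c test concrete, here is a compilable 2D sketch (a quadtree rather than an octree, with made-up bodies; this is my own illustration, not code from the book or slides). Memory is deliberately not freed and only one force pass is shown, since the point is the structure of the algorithm. Compile with -lm.

/* Illustrative 2D Barnes-Hut sketch: build a quadtree by recursive insertion,
 * store total mass and center of mass at each node, and approximate the force
 * on a body from a whole subtree whenever r >= d/c. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define GCONST 6.674e-11
#define C_OPEN 1.0                  /* the constant c in the r >= d/c test */
#define NB 4

typedef struct Node {
    double x0, y0, size;            /* square region: corner and side d    */
    double mass, cx, cy;            /* total mass and center of mass       */
    int body;                       /* body index if leaf, else -1         */
    struct Node *child[4];
} Node;

static double bx[NB] = {0.10, 0.90, 0.20, 0.70};
static double by[NB] = {0.20, 0.80, 0.90, 0.30};
static double bm[NB] = {1e10, 2e10, 1e10, 3e10};

static Node *new_node(double x0, double y0, double size) {
    Node *n = calloc(1, sizeof(Node));
    n->x0 = x0; n->y0 = y0; n->size = size; n->body = -1;
    return n;
}

static int is_leaf(const Node *n) {
    return !n->child[0] && !n->child[1] && !n->child[2] && !n->child[3];
}

static Node *child_for(Node *n, double x, double y) {
    double h = n->size / 2;
    int q = (x >= n->x0 + h) + 2 * (y >= n->y0 + h);
    if (!n->child[q])
        n->child[q] = new_node(n->x0 + (q & 1) * h, n->y0 + (q >> 1) * h, h);
    return n->child[q];
}

static void insert(Node *n, int i) {
    if (is_leaf(n) && n->body == -1) { n->body = i; return; }
    if (n->body != -1) {                 /* occupied leaf: push old body down */
        int old = n->body; n->body = -1;
        insert(child_for(n, bx[old], by[old]), old);
    }
    insert(child_for(n, bx[i], by[i]), i);
}

/* Fill in mass and center of mass bottom-up. */
static void summarize(Node *n) {
    if (n->body != -1) {
        n->mass = bm[n->body]; n->cx = bx[n->body]; n->cy = by[n->body];
        return;
    }
    for (int q = 0; q < 4; q++) {
        if (!n->child[q]) continue;
        summarize(n->child[q]);
        n->mass += n->child[q]->mass;
        n->cx   += n->child[q]->mass * n->child[q]->cx;
        n->cy   += n->child[q]->mass * n->child[q]->cy;
    }
    n->cx /= n->mass; n->cy /= n->mass;
}

/* Force on body i: recurse into children unless the cluster is far enough. */
static void force(const Node *n, int i, double *fx, double *fy) {
    if (n->mass == 0.0 || (is_leaf(n) && n->body == i)) return;
    double dx = n->cx - bx[i], dy = n->cy - by[i];
    double r = sqrt(dx * dx + dy * dy);
    if (is_leaf(n) || r >= n->size / C_OPEN) {     /* clustering approximation */
        double f = GCONST * bm[i] * n->mass / (r * r);
        *fx += f * dx / r; *fy += f * dy / r;
    } else {
        for (int q = 0; q < 4; q++)
            if (n->child[q]) force(n->child[q], i, fx, fy);
    }
}

int main(void) {
    Node *root = new_node(0.0, 0.0, 1.0);          /* unit-square domain */
    for (int i = 0; i < NB; i++) insert(root, i);
    summarize(root);
    for (int i = 0; i < NB; i++) {
        double fx = 0.0, fy = 0.0;
        force(root, i, &fx, &fy);
        printf("body %d: F = (%.3e, %.3e)\n", i, fx, fy);
    }
    return 0;                        /* tree memory deliberately not freed */
}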

  19. Barnes-Hut Complexity • Partitioning is dynamic: the whole octree must be reconstructed at each time step because the bodies will have moved • Constructing the tree can be done in O(n log n) • Computing the forces can be done in O(n log n) • Barnes-Hut for one iteration is O(n log n) [compare to O(n^2) for one iteration with the exact solution]

  20. Generalizing the Barnes-Hut approach • The approach can be used for applications which repeatedly perform some calculation on particles/bodies/data indexed by position • Recursive Bisection: • Divide the region in half so that the particles are balanced at each step • Map the rectangular regions onto processors so that the load is balanced (a small sketch follows)
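
A one-dimensional recursive-bisection sketch (illustrative only; it assumes the coordinates are already sorted, and all data and counts are made up): each split divides the bodies into two equal-count halves, and recursion continues until there is one region per processor, so every processor receives a contiguous spatial region with a balanced number of bodies.

/* Recursive (coordinate) bisection sketch over sorted 1D positions. */
#include <stdio.h>

static double xs[] = {0.05, 0.1, 0.2, 0.35, 0.4, 0.7, 0.8, 0.95};  /* sorted */

static void bisect(int lo, int hi, int first_proc, int nprocs) {
    if (nprocs == 1) {                       /* one region per processor */
        printf("P%d gets bodies %d..%d (x in [%.2f, %.2f])\n",
               first_proc, lo, hi, xs[lo], xs[hi]);
        return;
    }
    int mid = lo + (hi - lo) / 2;            /* equal-count split point  */
    bisect(lo, mid, first_proc, nprocs / 2);
    bisect(mid + 1, hi, first_proc + nprocs / 2, nprocs - nprocs / 2);
}

int main(void) {
    bisect(0, 7, 0, 4);                      /* 8 bodies over 4 processors */
    return 0;
}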

  21. Recursive Bisection Programming Issues • How do we keep track of the regions mapped to each processor? • What should the density of each region be? [granularity!] • What is the complexity of performing the partitioning? How often should we repartition to optimize the load balance? • How can locality of communication or processor configuration be leveraged?

  22. Scheduling • Application scheduling: ordering and allocation of tasks/communication/data to processors • Application-centric performance measure, e.g. minimal execution time • Job scheduling: ordering and allocation of jobs on an MPP • System-centric performance measure, e.g. processor utilization, throughput

  23. Job Scheduling Strategies • Gang scheduling • Batch scheduling using backfilling

  24. Gang scheduling • Gang scheduling is a technique for allocating a collection of jobs on an MPP • One or more jobs are clustered as a gang • Gangs share time slices on the whole machine • The strategy combines time-sharing (gangs get time slices) and space-sharing (gangs partition space) approaches • There are many flavors of gang scheduling in the literature

  25. Gang Scheduling • Formal definition from Dror Feitelson: • Gang scheduling is a scheme that combines three features: • The threads of a set of jobs are grouped into gangs with the threads in a single job considered to be a single gang. • The threads in each gang execute simultaneously on distinct PEs, using a 1-1 mapping. • Time slicing is used, with all the threads in a gang being preempted and rescheduled at the same time.
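
A toy rendering of the definition (illustrative only, not Feitelson's code): each gang's threads occupy distinct PEs simultaneously, and whole gangs are preempted and rescheduled together at each time slice. The gang sizes, PE count, and slice count are assumptions.

/* Toy gang-scheduling sketch: round-robin time slices over whole gangs. */
#include <stdio.h>

#define NPES    4
#define NGANGS  3
#define NSLICES 6

static int gang_threads[NGANGS] = {4, 2, 3};   /* threads (PEs needed) per gang */

int main(void) {
    for (int slice = 0; slice < NSLICES; slice++) {
        int g = slice % NGANGS;                /* all threads of gang g run together */
        printf("slice %d: gang %d runs on PEs 0..%d",
               slice, g, gang_threads[g] - 1);
        if (gang_threads[g] < NPES)
            printf(" (PEs %d..%d idle)", gang_threads[g], NPES - 1);
        printf("\n");
    }
    return 0;
}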

  26. Why gang scheduling? • Gang scheduling promotes efficient performance of individual jobs as well as efficient utilization and fair allocation of machine resources • Gang scheduling leads to two desirable properties: • It promotes efficient fine-grain interactions among the threads of a gang, since they are executing simultaneously • Periodic preemption prevents long jobs from monopolizing system resources (the overhead of preemption can reduce performance and so preemption must be implemented efficiently) • Used as the scheduling policy for the CM-5, Meiko CS-2, Paragon, etc.

  27. Batch Job Scheduling • Problem: How to schedule jobs waiting in a queue to run on a multicomputer? • Each job requests some number n of nodes and some time t to run • Goal: promote utilization of the machine, fairness to jobs, and short queue wait times

  28. One approach: Backfilling • Main idea: pack the jobs in the processor/time space • Allow the job at the head of the queue to be scheduled in the first available slot • If other jobs in the queue can run without changing the start time of earlier jobs in the queue, schedule them • Promote jobs if they can start earlier • Many versions of backfilling: • EASY: promote jobs as long as they don’t delay the start time of the first job in the queue • Conservative: promote jobs as long as they don’t delay the start time of any job in the queue (a simplified EASY sketch follows)
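
The EASY rule can be sketched as a small simulation. The C sketch below is illustrative and slightly more restrictive than real EASY: here a job is backfilled only if it finishes by the head job's reservation, whereas real EASY also admits longer jobs that use only nodes the head job will not need. The machine size, job sizes, and runtimes are made-up values.

/* Simplified EASY-backfilling sketch. Jobs request (nodes, runtime). The job
 * at the head of the queue gets a reservation at the earliest time enough
 * nodes will be free; a later job is backfilled now only if it fits in the
 * currently free nodes and finishes by that reservation. */
#include <stdio.h>

#define TOTAL_NODES 8
#define NJOBS 5

typedef struct { int nodes, runtime, start; } Job;

static Job q[NJOBS] = {          /* arrival order; start = -1 means queued */
    {4, 10, -1}, {6, 4, -1}, {2, 3, -1}, {2, 8, -1}, {8, 2, -1},
};

static int free_nodes(int t) {                  /* nodes free at time t */
    int used = 0;
    for (int i = 0; i < NJOBS; i++)
        if (q[i].start >= 0 && q[i].start <= t && t < q[i].start + q[i].runtime)
            used += q[i].nodes;
    return TOTAL_NODES - used;
}

static int earliest_start(int t, int nodes) {   /* toy integer clock */
    while (free_nodes(t) < nodes) t++;
    return t;
}

int main(void) {
    int remaining = NJOBS;
    for (int t = 0; remaining > 0; t++) {
        int progress = 1;
        while (progress) {
            progress = 0;
            int head = -1;                       /* first job still queued */
            for (int i = 0; i < NJOBS; i++)
                if (q[i].start < 0) { head = i; break; }
            if (head < 0) break;

            if (free_nodes(t) >= q[head].nodes) {          /* head starts */
                q[head].start = t; remaining--; progress = 1;
                printf("t=%2d: job %d starts (head of queue)\n", t, head);
                continue;
            }
            int shadow = earliest_start(t + 1, q[head].nodes);  /* reservation */
            for (int i = head + 1; i < NJOBS; i++) {            /* backfill    */
                if (q[i].start >= 0) continue;
                if (free_nodes(t) >= q[i].nodes &&
                    t + q[i].runtime <= shadow) {
                    q[i].start = t; remaining--; progress = 1;
                    printf("t=%2d: job %d backfilled (done by t=%d)\n",
                           t, i, shadow);
                    break;
                }
            }
        }
    }
    for (int i = 0; i < NJOBS; i++)
        printf("job %d: nodes=%d start=%2d end=%2d\n",
               i, q[i].nodes, q[i].start, q[i].start + q[i].runtime);
    return 0;
}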

  29. Backfilling Example • Submitting five requests… • [Figure: the five requests placed in the processor × time plane]

  30. Backfilling Example • Submitting five requests… • Using backfilling... • [Figure: the schedule before and after backfilling, in the processor × time plane]

  31. Backfilling Example • [Figure: a further backfilling step in the processor × time plane]

  32. Backfilling Example • [Figure: the next backfilling step in the processor × time plane]

  33. Backfilling Example • An existing job finishes • Backfilling promotes the yellow job and then schedules the purple job • [Figure: the resulting schedule in the processor × time plane]

  34. Backfilling Scheduling • Backfilling is used in the Maui Scheduler at SDSC on the SP-2, PBS at NASA, the Computing Condominium Scheduler at Penn State, etc. • Backfilling issues: • What if the processors of the platform have different capacities (are not homogeneous)? • What if some jobs get priority over others? • Should parallel jobs be treated differently from serial jobs? • If multiple queues are used, how should they be administered? • Should users be charged to wait in the queue as well as to run on the machine?

  35. Optimizing Application Performance • Backfilling and MPP scheduling strategies typically optimize for throughput • Optimizing throughput and optimizing application performance (e.g. execution time) can often conflict • How can applications optimize performance in an MPP environment? • Moldable jobs = jobs which can run with more than one partition size • Question: What is the optimal partition size for moldable jobs? • We can answer this question when the MPP scheduler runs a conservative backfilling strategy and publishes the list of available nodes.
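
One way to see the partition-size decision (an illustrative sketch, not the SA scheduler itself): given the published availability list and an estimated runtime for each candidate partition size, pick the size that minimizes estimated turnaround time (wait + runtime). All numbers below are assumptions.

/* Choose the moldable job's partition size from a published availability list. */
#include <stdio.h>

#define NSLOTS 3
#define NSIZES 4

/* Availability list: from time `when`, `nodes` nodes are free (assumed data). */
static struct { int when, nodes; } avail[NSLOTS] = {{0, 4}, {20, 16}, {60, 64}};

/* Estimated runtime for partition sizes 4, 16, 32, 64 (assumed speedup data). */
static int sizes[NSIZES]   = {4, 16, 32, 64};
static int runtime[NSIZES] = {100, 30, 18, 12};

int main(void) {
    int best = -1, best_turnaround = 0;
    for (int i = 0; i < NSIZES; i++) {
        int wait = -1;
        for (int s = 0; s < NSLOTS; s++)             /* earliest slot that fits */
            if (avail[s].nodes >= sizes[i]) { wait = avail[s].when; break; }
        if (wait < 0) continue;                      /* never fits: skip size   */
        int turnaround = wait + runtime[i];
        printf("n=%2d: wait=%2d run=%3d turnaround=%3d\n",
               sizes[i], wait, runtime[i], turnaround);
        if (best < 0 || turnaround < best_turnaround) {
            best = i; best_turnaround = turnaround;
        }
    }
    if (best >= 0)
        printf("request %d nodes (estimated turnaround %d)\n",
               sizes[best], best_turnaround);
    return 0;
}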

  36. Optimizing Applications targeted to a Batch-scheduled MPP • SA = a generic AppLeS scheduler developed for jobs submitted to a backfilling MPP • SA uses the availability list of the MPP scheduler to determine the size of the partition to be requested by the application • Speedup curve known for Gas applications • Static = jobs submitted without SA • Workload taken from KTH (Swedish Royal Institute of Technology) • Experiments developed by Walfredo Cirne
