
A Survey on Scheduling Methods of Task-Parallel Processing




  1. A Survey on Scheduling Methods of Task-Parallel Processing • Chikayama and Taura Lab • M1 48-096415 Jun Nakashima

  2. Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary

  3. Motivation • Threads and tasks have much in common • Both are units of execution • Multiple threads/tasks may be executed simultaneously • Scheduling methods for tasks can therefore be useful for scheduling threads

  4. Background • Demand for exploiting dynamic and irregular parallelism • Simple parallelization (pthread, OpenMP, …) is not efficient • Few threads: load balancing is difficult • Many threads: good load balance, but the overhead is unacceptable • Examples: • N-Queens puzzle • Strassen's algorithm (matrix-matrix product) • LU factorization of sparse matrices

  5. Task-Parallel Processing • Decompose the entire process into tasks and execute them in parallel • Task: a unit of execution much lighter than a thread • Fairness of tasks is not considered • Tasks may be deferred or suspended • Representation of dependences • Task creation by a task • Wait for child tasks • Programming environments with task support: Cilk, X10, Intel TBB, OpenMP (3.0 and later), etc.

  6. Task-Parallel Processing (2) • A simple example:

    task task_fib(n) {
      if (n <= 1) return 1;
      t1 = create_task(task_fib(n-2));  // create task
      t2 = create_task(task_fib(n-1));
      ret1 = task_wait(t1);             // wait for children
      ret2 = task_wait(t2);
      return ret1 + ret2;
    }

  [Task graph: fib(n) spawns fib(n-1) and fib(n-2), which recursively spawn smaller subproblems; tasks of the same color can be executed in parallel]
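The pseudocode above can be made runnable. A minimal Python sketch, with `concurrent.futures` standing in for a task runtime (the mapping of the slide's `create_task` to `submit` and `task_wait` to `result`, as well as the pool size, are illustrative assumptions):

```python
# Runnable sketch of the slide's task_fib, with ThreadPoolExecutor as a
# stand-in task runtime: submit ~ create_task, result ~ task_wait.
from concurrent.futures import ThreadPoolExecutor

# Oversized pool: unlike a real task scheduler, a waiting parent here
# blocks a whole worker thread, so headroom is needed to avoid deadlock.
pool = ThreadPoolExecutor(max_workers=64)

def task_fib(n):
    if n <= 1:
        return 1
    t1 = pool.submit(task_fib, n - 2)   # create child task
    t2 = pool.submit(task_fib, n - 1)
    ret1 = t1.result()                  # wait for children
    ret2 = t2.result()
    return ret1 + ret2

print(task_fib(6))  # 13
```

Note that a real work-first runtime does not block a worker inside `task_wait`; that limitation is exactly what the scheduling techniques surveyed below address.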

  7. Basic execution model • The runtime forks threads up to the number of CPU cores • Each thread has a queue of tasks • Each task is assigned to one thread [Figure: two threads, each with its own queue of fib subtasks]

  8. Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary

  9. Basic scheduling strategies: Breadth-First and Work-First • Breadth-First: at task creation, enqueue the new task and continue the parent; the child runs when the parent suspends • Work-First: at task creation, the parent always suspends and the child runs immediately; the parent continues when the child task is finished [Figure: thread states (running / ready / waiting) under each strategy]

  10. Work stealing • Load-balancing technique for work-first schedulers • Idle threads steal runnable tasks from other threads • Basic strategy: FIFO • Steal the oldest task in the victim's task queue • The victim thread should be chosen at random [Figure: an idle thread sends a steal request and takes the oldest task]
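A single-threaded sketch of the per-thread queues and FIFO stealing described above (the `Worker` class and the task names are illustrative; the synchronization a real runtime needs is omitted):

```python
# Each worker owns a deque: the owner pushes/pops the newest task at one
# end, while a thief steals the oldest task from the other end (FIFO).
from collections import deque
import random

class Worker:
    def __init__(self):
        self.queue = deque()           # oldest task at the left end

    def push(self, task):
        self.queue.append(task)        # owner adds the newest task

    def pop(self):
        return self.queue.pop()        # owner takes the newest task

    def steal_from(self, victim):
        return victim.queue.popleft()  # thief takes the oldest task

workers = [Worker(), Worker()]
for t in ["fib(n-1)", "fib(n-2)", "fib(n-3)"]:
    workers[0].push(t)

thief = workers[1]
victim = random.choice([w for w in workers if w is not thief])
print(thief.steal_from(victim))  # fib(n-1) -- the oldest task is stolen
```

The random victim choice matters in practice: deterministic victim selection can make many idle threads hammer the same queue.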

  11. Effect of Work Stealing • An old task tends to create many tasks in the future • Especially under recursive parallelism [Figure: the task graph of the previous page, partitioned into thread 1's tasks and thread 2's tasks]

  12. Lazy Task Creation • Save the continuation of the parent task instead of creating a child task • A continuation is lighter than a task • At work stealing, create a task from the continuation and steal it [Figure: a steal request turns a continuation (≠ task) into a task, which is then stolen]

  13. Cut-off • Execute the child sequentially instead of creating a task • Avoids too fine-grained tasks • Basic cut-off criteria: • Number of tasks • Recursion depth [Figure: below the cut-off, the fib subtree is executed serially]
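The effect of a depth-based cut-off can be sketched as follows (the instrumentation is illustrative: it only counts the tasks that would be spawned, without creating any):

```python
# Depth-based cut-off: above CUTOFF depth the children would be spawned
# as tasks; below it they are executed serially. The counter simulates
# task creation so the savings can be observed.
import math

def fib(n, depth=0, cutoff=math.inf, counter=None):
    if counter is None:
        counter = {"spawned": 0}
    if n <= 1:
        return 1, counter
    if depth < cutoff:
        counter["spawned"] += 2    # would create two child tasks here
    a, _ = fib(n - 2, depth + 1, cutoff, counter)   # serial recursion
    b, _ = fib(n - 1, depth + 1, cutoff, counter)
    return a + b, counter

_, full = fib(10)                  # no cut-off: a task per child
_, cut = fib(10, cutoff=2)         # cut off below depth 2
print(full["spawned"], cut["spawned"])  # 176 6
```

With the cut-off at depth 2 only the top three internal calls spawn tasks, trimming 176 simulated task creations down to 6 while computing the same result.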

  14. Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary

  15. Challenges • Architecture-aware scheduling • Scalable implementation • Determination of cut-off threshold

  16. Architecture-aware scheduling • The basic methods do not take the architecture into account • On some architectures, performance is degraded • Example: NUMA architectures [Figure: cores connected to two memories via an interconnect]

  17. NUMA Architecture • NUMA = Non-Uniform Memory Access • Memory access cost depends on the CPU core and the address • Local memory access is fast; remote memory access is slow • Considering locality is very important! [Figure: two NUMA nodes, each with cores attached to their local memory]

  18. A bad case on NUMA • When a thread steals a task from a remote CPU • The stolen task incurs more remote memory accesses [Figure: the task's data stays in the original node's memory, so the thief accesses it remotely]

  19. Affinity Bubble-Scheduler • Scheduling Dynamic OpenMP Applications over Multicore Architectures (Broquedis et al.) • Locality-aware thread scheduler • Based on BubbleSched: • A framework for implementing schedulers on hierarchical architectures • Threads are grouped by bubbles • The scheduler uses bubbles as hints

  20. What is a bubble? • A group of tasks and bubbles • Describes affinities of tasks • Created by calling a library function • Grouped tasks use shared data [Figure: nested bubbles containing tasks]

  21. Initial task distribution • Explode bubbles hierarchically • Explode the root bubble, then divide bubbles to balance the load • Explode a bubble to distribute its tasks over two CPU cores [Figure: tasks spread over Core1–Core4 as bubbles are exploded]

  22. NUMA-aware Work Stealing • Idle threads steal tasks from the most local thread possible [Figure: a core steals from a core on its own NUMA node first]
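The "steal from as local a thread as possible" policy can be sketched as a victim-ordering function (the core-to-node map and core IDs are illustrative assumptions for a two-node machine):

```python
# Locality-aware victim selection: an idle core tries victims on its own
# NUMA node before falling back to remote-node victims.
NODE_OF = {0: 0, 1: 0, 2: 1, 3: 1}   # core id -> NUMA node (assumed layout)

def victim_order(core, cores=(0, 1, 2, 3)):
    others = [c for c in cores if c != core]
    # stable sort: same-node victims (key False) come before remote ones
    return sorted(others, key=lambda c: NODE_OF[c] != NODE_OF[core])

print(victim_order(2))  # [3, 0, 1] -- core 3 shares core 2's node
```

A real scheduler would walk this order, attempting a steal at each victim, and could extend the key to deeper hierarchies (shared caches, sockets, nodes).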

  23. Challenges • Architecture-aware scheduling • Affinity Bubble-scheduler • Scalable implementation • Determination of cut-off threshold

  24. Scalable implementation • When operating on task queues, threads have to acquire a lock • Because a task queue may be accessed by multiple threads • Task queue operations occur on every task creation and destruction • Locks may become a serious bottleneck! [Figure: a steal request forces locking the entire queue]

  25. A simple way to decrease locking • Two task queues per thread • One local queue and one public queue • Tasks are stolen only from the public queue • The local queue is lock-free • Only the public queue needs a lock [Figure: steal requests touch only the public queue]

  26. Problem of the double task queue • When a task is moved from the local queue to the public queue, a memory copy is required

  27. Split Task Queues • Scalable Work Stealing (Dinan et al.) • Split a single task queue with a "split pointer" • From the head to the split pointer: local portion (lock-free) • From the split pointer to the tail: public portion

  28. Split Task Queues (2) • Move the split pointer toward the head if the public portion gets empty • This operation is lock-free • Move the split pointer toward the tail if the local portion gets empty • No task copy is required
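A single-threaded sketch of the split task queue (the class and method names are illustrative, and the synchronization the public portion needs is omitted):

```python
# One queue with a split index: tasks[:split] is the owner's local
# portion, tasks[split:] is the public portion thieves steal from.
# Moving the split only re-labels tasks in place: no task copy.
class SplitQueue:
    def __init__(self):
        self.tasks = []               # head at index 0, tail at the end
        self.split = 0                # boundary between local and public

    def push(self, task):
        self.tasks.insert(0, task)    # owner pushes at the head (local)
        self.split += 1

    def release(self):
        # public portion empty: move the split toward the head, turning
        # half of the local tasks public (lock-free in the paper)
        if self.split == len(self.tasks) and self.split > 0:
            self.split //= 2

    def steal(self):
        # a thief takes the oldest public task, from the tail
        if len(self.tasks) > self.split:
            return self.tasks.pop()
        return None

q = SplitQueue()
for t in ["t1", "t2", "t3", "t4"]:
    q.push(t)                         # all four tasks start out local
q.release()                           # split moves: t1 and t2 become public
print(q.steal(), q.steal(), q.steal())  # t1 t2 None
```

The contrast with the double-queue design is the `release` step: it adjusts one index instead of copying tasks between two queues.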

  29. And more… • Also in "Scalable Work Stealing" (Dinan et al.): • Efficient task creation • Initialize the task queue entry directly • A better amount of tasks to steal • Half of the public queue

  30. Challenges • Architecture-aware scheduling • Affinity Bubble-Scheduler • Scalable implementation • Split Task Queues • Determination of cut-off threshold

  31. Determination of the cut-off threshold • An appropriate cut-off threshold cannot be determined simply • It depends on the algorithm, the scheduling method, and the input data • Too large: tasks become too coarse-grained • Leads to load imbalance • Too small: tasks become too fine-grained • Leads to large overhead

  32. Profile-based cut-off determination • An Adaptive Cut-off for Task Parallelism (Duran et al.) • Uses two profiling modes: • Full Mode • Minimal Mode • Estimates execution time and decides the cut-off

  33. Full Mode • Measures every task's execution time • Heavy overhead • Complete information [Figure: execution times are collected from every node of the fib task graph]

  34. Minimal Mode • Measures the execution time of "real tasks" only • Small overhead • Incomplete information • Cut-off tasks are not measured [Figure: serialized (cut-off) subtasks are not measured]

  35. Adaptive Profiling • Collects execution times for each recursion depth • Uses Full Mode until enough information is collected • After that, uses Minimal Mode [Figure: early executions are profiled in Full Mode; later ones may not be, in Minimal Mode]

  36. Cut-off strategy • Estimates a task's execution time from the collected information • The average of previous executions • If the estimated execution time is smaller than a threshold, apply the cut-off (execute serially) • Otherwise, create a new task and execute in parallel
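The estimation step can be sketched like this (the threshold, depths, and sample timings are purely illustrative):

```python
# Profile-based cut-off decision: estimate a task's run time as the mean
# of previously measured executions at the same recursion depth; if the
# estimate is below the threshold, run serially (cut off).
from collections import defaultdict

profile = defaultdict(list)            # recursion depth -> times in ms

def record(depth, elapsed_ms):
    profile[depth].append(elapsed_ms)

def should_cut_off(depth, threshold_ms=1.0):
    samples = profile[depth]
    if not samples:
        return False                   # no data yet: keep creating tasks
    estimate = sum(samples) / len(samples)
    return estimate < threshold_ms     # too short to be worth a task

record(3, 5.0); record(3, 7.0)         # shallow tasks measured as long
record(8, 0.2); record(8, 0.3)         # deep tasks measured as short
print(should_cut_off(3), should_cut_off(8))  # False True
```

Defaulting to "create a task" when there is no data matches the adaptive scheme above: unprofiled depths keep running in Full Mode until enough samples exist.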

  37. Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary

  38. Consideration • When adopting task-scheduling methods for thread scheduling, it is necessary to consider side effects • The main difference between tasks and threads is fairness • Fairness: runnable threads get equal CPU time (weighted by priority) • No thread keeps the CPU forever

  39. Consideration of fairness • Affinity Bubble-Scheduler • Originally designed for threads • Split task queues • A data structure that reduces locking and improves scalability • The basic idea does not impede fairness • Profile-based cut-off • Cut-off can be applied only to short-lived threads • Profiling makes it easier to decide when to apply the cut-off

  40. Summary • Basic scheduling methods • Challenges and solutions • Architecture-aware scheduling • Affinity Bubble-Scheduler • Scalable implementation • Split Task Queues • Determination of the cut-off threshold • Profile-based cut-off • Consideration • These solutions are not particularly harmful to fairness

  41. Thanks for your attention!
