1 / 62

Automatically Tuning Task-Based Programs for Multi-core Processors

Automatically Tuning Task-Based Programs for Multi-core Processors. Jin Zhou Brian Demsky Department of Electrical Engineering and Computer Science University of California, Irvine. Motivation. Recent microprocessor trends Number of cores increased rapidly Architectures vary widely

kaelem
Download Presentation

Automatically Tuning Task-Based Programs for Multi-core Processors

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatically Tuning Task-Based Programs for Multi-core Processors Jin Zhou Brian Demsky Department of Electrical Engineering and Computer Science University of California, Irvine

  2. Motivation • Recent microprocessor trends • Number of cores increased rapidly • Architectures vary widely • Challenges for software development • Parallelization is now key for performance • Current parallel programming model: threads + locks • Hard to develop correct and efficient parallel software • Hard to adapt software to changes in architectures

  3. Goals • Automatically generate parallel implementation • Automatically tune parallel implementation

  4. Overview Profile Data Program Processor Specification Bamboo Compiler Implementation Generator Candidate implementations Simulation-based Evaluator Leading implementations Implementation Optimizer Optimized implementation Tuned implementations Code Generator Optimized multi-core binary Multi-core Processor

  5. Example • MonteCarlo Example • Partitions problem into several simulations • Executes the simulations in parallel • Aggregates results of all simulations

  6. Bamboo Language • A hybrid language combines data-flow and Java • Programs are composed of tasks • Tasks compose with dataflow-like semantics • Tasks contain Java-like object-oriented code internally • Programs cannot explicitly invoke tasks • Runtime automatically invokes tasks • Supports standard object-oriented constructs including methods and classes

  7. Bamboo Language • Flags • Capture current role (type state) of object in computation • Each flag captures an aspect of the object’s state • Change as the object’s role evolves in program • Support orthogonal classifications of objects

  8. class Simulator { flag run; flag submit; flag finished; ... } task startup(StartupObject s in initialstate) { Aggregator aggr = new Aggregator(s.args[0]){merge:=true}; for(int i = 0; i < 4; i++) Simulator sim = new Simulator(aggr){run:=true}; taskexit(s: initialstate:=false); } task simulate(Simulator sim in run) { sim.runSimulate(); taskexit(sim: run:=false, submit:=true); } task aggregate(Aggregator aggr in merge, Simulator sim in submit) { boolean allprocessed = aggr.aggregateResult(sim); if (allprocessed) taskexit(aggr: merge:=false, finished:=true; sim: submit:=false, finished:=true); taskexit(sim: submit:=false, finished:=true); } class Aggregator { flag merge; flag finished; … }

  9. Bamboo Program Execution Global Flagged Object Space Runtime initialization new StartupObject StartupObject initialstate state finished state

  10. Bamboo Program Execution Global Flagged Object Space execute on startup task StartupObject StartupObject finished state initialstate state

  11. Bamboo Program Execution set Global Flagged Object Space startup task StartupObject new Aggregator Simulator Simulator Simulator Simulator StartupObject initialstate state finished state finished state merge state Aggregator Simulator submit state finished state run state

  12. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator execute on execute on simulate task simulate task simulate Simulator Simulator Simulator Simulator simulate task simulate task execute on execute on StartupObject initialstate state finished state merge state finished state Aggregator Simulator run state submit state finished state

  13. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator set set simulate task simulate task Simulator Simulator Simulator Simulator simulate task simulate task set set StartupObject initialstate state finished state finished state Aggregator merge state Simulator run state submit state finished state

  14. Bamboo Program Execution Global Flagged Object Space aggregate task StartupObject execute on Aggregator Simulator Simulator Simulator Simulator StartupObject initialstate state finished state Aggregator merge state finished state Simulator run state submit state finished state

  15. Bamboo Program Execution aggregate task Global Flagged Object Space StartupObject Aggregator set Simulator Simulator Simulator Simulator StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  16. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator execute on aggregate task Simulator Simulator Simulator Simulator StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  17. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator set aggregate task Simulator Simulator Simulator Simulator StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  18. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator Simulator Simulator aggregate task Simulator Simulator execute on StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  19. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator Simulator Simulator aggregate task set Simulator Simulator StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  20. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator Simulator Simulator aggregate task Simulator Simulator execute on StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  21. Bamboo Program Execution Global Flagged Object Space StartupObject Aggregator Simulator Simulator aggregate task Simulator Simulator set StartupObject finished state initialstate state finished state merge state Aggregator Simulator submit state run state finished state

  22. Implementation Generation Profile Data Bamboo Program Processor Specification Bamboo Compiler Implementation Generator Candidate implementations Simulation-based Evaluator Leading implementations Implementation Optimizer Optimized implementation Tuned implementations Code Generator Optimized multi-core binary Multi-core Processor

  23. Implementation Generation • Dependence Analysis:analyzes data dependence between tasks • Parallelism Exploration: extracts potential parallelism • Mapping to Cores: maps the program to real processor

  24. Flag State Transition Graph (FSTG) Simulator run simulate:32Mcyc; 100% submit aggregate:2Mcyc; 100% finished

  25. Combined Flag State Transition Graph (CFSTG) StartupObject initialstate Number of new objects startup:3Mcyc; 100% finished 1 4 Simulator Aggregator run merge aggregate:2Mcyc; 75% simulate:32Mcyc; 100% aggregate:2Mcyc; 25% submit finished aggregate:2Mcyc; 100% finished

  26. Initial Mapping Core Group StartupObject initialstate startup:3Mcyc; 100% finished 1 4 Simulator Aggregator run merge simulate:32Mcyc; 100% aggregate:2Mcyc; 75% aggregate:2Mcyc; 25% submit finished aggregate:2Mcyc; 100% finished

  27. Preprocessing Phase • Identifies strongly connected components (SCC) and merges them into a single core group • Converts CFSTG into a tree of core groups by replicating core groups as necessary

  28. Data Locality Rule StartupObject • Default rule • Maximize data locality to improve performance • Minimizes inter-core communications • Improves cache behavior initialstate startup:3Mcyc; 100% finished 1 4 Aggregator merge Simulator aggregate:2Mcyc; 75% run aggregate:2Mcyc; 25% finished StartupObject 4 1 Simulator Aggregator

  29. Data Parallelization Rule StartupObject • To explore potential data parallelism initialstate startup:3Mcyc; 100% finished 1 4 Aggregator merge Simulator aggregate:2Mcyc; 75% run aggregate:2Mcyc; 25% finished 1 StartupObject Simulator 1 1 StartupObject Aggregator 4 Simulator 1 Simulator Aggregator Simulator Simulator 1 1

  30. Rate Matching Rule Producer • If the producer executes multiple times in a cycle, how many consumers are required? • Match two rates to estimate the number of consumers • Peak new object creation rate • Object consumption rate produce init Consumer produce run … Consumer Producer … … Consumer

  31. Mapping to Processor • Extended CFSTG • Constraint: limited cores 1 StartupObject Simulator 1 1 Aggregator Simulator Simulator Simulator 1 1 Core 1 Core 2 • Map CFSTG core groups to physical cores

  32. Mapping to Cores • One possible mapping Core 1 1 StartupObject Simulator 1 1 Aggregator Simulator Core 2 Simulator Simulator 1 1

  33. Mapping to Cores • Isomorphic mappings: have same performance Core 1 Core 1 1 1 StartupObject StartupObject Simulator Simulator 1 1 1 1 Core 2 Aggregator Aggregator Simulator Simulator Core 2 Simulator Simulator Simulator Simulator 1 1 1 1 • Backtracking-based search: to generate non-isomorphic implementations

  34. Implementation Generation Profile Data Bamboo Program Processor Specification Bamboo Compiler Implementation Generator Candidate implementations Simulation-based Evaluator Leading implementations Implementation Optimizer Optimized implementation Tuned implementations Code Generator Optimized multi-core binary Multi-core Processor

  35. Simulation-Based Evaluation • To select the best candidate implementation • High-level simulation • Does NOT actually execute the program • Constructs abstract execution trace with similar statistics • Compare the execution time or throughput and core usage Simulator Core Core Task Task Task Task

  36. Simulation-Based Evaluation • Markov model • Built from profile data • For each task estimates: • The destination state • The execution time • A count of each type of new objects StartupObject Simulator initialstate 1 startup:3Mcyc; 100% fnished Simulator 1 1 Aggregator 1 merge aggregate:2Mcyc; 75% Simulator aggregate:2Mcyc; 25% finished 1 Simulator run simulate:32Mcyc; 100% submit aggregate:2Mcyc; 100% finished

  37. Simulated Execution Trace core 0 core 1 StartupObject(1) 0 startuptask 3 Aggregator(1), Simulator (4) transfer a Simulator Simulator(1) 4 simulatetask simulatetask Aggregator(1), Simulator(1), Simulator(2) 35 Simulator(1) 36 Aggregator(1), Simulator(2), Simulator(2) transfer a Simulator 37 simulatetask Aggregator(1), Simulator(3), Simulator(1) 67 simulatetask 99 Aggregator(1), Simulator(4) Aggregator(1), Simulator(4) aggregatetask Aggregator(1), Simulator(3) 101 1 Aggregator in the initial state and 4 Simulators in the submit state aggregatetask Aggregator(1), Simulator(2) 103 aggregatetask Aggregator(1), Simulator(1) 105 aggregatetask 107 empty

  38. Problem of Exhaustive Searching • The search space expands quickly • Exhaustive search is not feasible for complicated applications

  39. Random Search? • Very low chance to find the best implementation Chance to find the best implementation

  40. Developer Optimization Process • Create an initial implementation • Evaluate it and identify performance bottlenecks • Heuristically develop new implementations to remove bottlenecks • Iteratively repeat evaluation and optimization

  41. Directed Simulated Annealing (DSA) Randomly generate candidate implementations Directed Simulated Annealing High-level Simulator Leading candidate implementations As-built Critical Path Analysis Potential bottlenecks New candidate implementations Implementation Generator Tuned candidate implementation

  42. As-Built Critical Path (ABCP) core 0 core 1 • Provide post-mortem analysis of project management StartupObject(1) 0 startuptask 3 Aggregator(1), Simulator (4) transfer a Simulator Simulator(1) 4 simulatetask simulatetask Aggregator(1), Simulator(1), Simulator(2) 35 Simulator(1) 36 Aggregator(1), Simulator(2), Simulator(2) transfer a Simulator 37 simulatetask Aggregator(1), Simulator(3), Simulator(1) 67 simulatetask 1 Simulator StartupObject 1 99 Aggregator(1), Simulator(4) Simulator aggregatetask 1 1 Aggregator Aggregator(1), Simulator(3) 101 Simulator aggregatetask 1 Aggregator(1), Simulator(2) 103 aggregatetask Simulator Aggregator(1), Simulator(1) 105 aggregatetask 107 empty

  43. As-Built Critical Path Analysis core 0 core 1 StartupObject(1) 0 startuptask • Compute the time when a task invocation’s data dependences are resolved 0 3 Aggregator(1), Simulator (4) transfer a Simulator Simulator(1) 4 simulatetask 3 simulatetask Aggregator(1), Simulator(1), Simulator(2) 35 Simulator(1) 36 transfer a Simulator Aggregator(1), Simulator(2), Simulator(2) 37 simulatetask 3 Aggregator(1), Simulator(3), Simulator(1) 67 simulatetask 3 99 Aggregator(1), Simulator(4) aggregatetask 35 Aggregator(1), Simulator(3) 101 aggregatetask 101 Aggregator(1), Simulator(2) 103 aggregatetask 103 Aggregator(1), Simulator(1) 105 aggregatetask 105 107 empty

  44. Waiting Task Optimization • Waiting tasks: • Tasks whose real invocation time is later than the time when all its data dependences are resolved • Delayed because of resource conflicts • Bottlenecks, remove them from ABCP • Optimization • Migrate waiting tasks to spare cores • Shorten the ABCP to improve performance

  45. Critical Task Optimization • There may not exist spare cores to move waiting tasks to • Identify critical tasks: tasks that produce data that is consumed immediately • Attempt to execute critical tasks as early as possible • Migrate other tasks which blocked some critical task to other cores core 0 core 1 simulatetask Aggregator(1), Simulator(1), Simulator(2) 35 Simulator(1) 36 simulatetask simulatetask 1 Aggregator(1), Simulator(3), Simulator(1) 67 Simulator(2) simulatetask 3 2 99 Aggregator(1), Simulator(4) aggregatetask 35 Aggregator(1), Simulator(3) 101 aggregatetask

  46. Code Generator Profile Data Bamboo Program Processor Specification Bamboo Compiler Implementation Generator Candidate implementations Simulation-based Evaluator Leading implementations Implementation Optimizer Optimized implementation Tuned implementations Code Generator Intermediate C code Optimized multi-core binary Multi-core Processor

  47. Evaluation • MIT RAW simulator • Cycle accurate simulator configured for 16 cores • RAW chip: tiled chip, shared memory, on-chip network • Benchmarks: • Series: Java Grande benchmark suite • MonteCarlo: Java Grande benchmark suite • FilterBank: StreamIt benchmark suite • Fractal

  48. Speedups on 16 cores • Successfully generated implementations with good performance

  49. Comparison to Hand-Written C Code • Overhead of Bamboo: • Small for Series and Fractal • Larger overhead for MonteCarlo and FilterBank: • GCC cannot reorder instructions to fill floating-point delay slots for Bamboo implementations due to imprecise alias results • Easy to add alias information to facilitate the reordering

  50. Comparison of Estimation and Real Execution • The simulation estimations are close to the real execution time

More Related