
Automatic Parallelization of Simulation Code from Equation Based Simulation Languages



  1. Automatic Parallelization of Simulation Code from Equation Based Simulation Languages Peter Aronsson, Industrial PhD student, PELAB, SaS, IDA, Linköping University, Sweden. Based on the Licentiate presentation & the CPC’03 presentation.

  2. Outline • Introduction • Task Graphs • Related work on Scheduling & Clustering • Parallelization Tool • Contributions • Results • Conclusion & Future Work

  3. Introduction • Modelica • Object-oriented, equation-based modeling language • Modelica enables modeling and simulation of large and complex multi-domain systems • Large need for parallel computation • To decrease the execution time of simulations • To make large models possible to simulate at all • To meet hard real-time demands in hardware-in-the-loop simulations

  4. Examples of large complex systems in Modelica

  5. Modelica Example - DCMotor

  6. Modelica example

model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;

  7. Example – Flat set of Equations

R1.v = -R1.n.v + R1.p.v
0 = R1.n.i + R1.p.i
R1.i = R1.p.i
R1.i*R1.R = R1.v
I1.v = -I1.n.v + I1.p.v
0 = I1.n.i + I1.p.i
I1.i = I1.p.i
I1.L*I1.der(i) = I1.v
emf.v = -emf.n.v + emf.p.v
0 = emf.n.i + emf.p.i
emf.i = emf.p.i
emf.w = emf.flange_b.der(phi)
emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k
ground.p.v = 0
step.v = -step.n.v + step.p.v
0 = step.n.i + step.p.i
step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
step.v = step.signalSource.outPort.signal[1]
load.flange_a.phi = load.phi
load.flange_b.phi = load.phi
load.w = load.der(phi)
load.a = load.der(w)
load.a*load.J = load.flange_a.tau + load.flange_b.tau
R1.n.v = I1.p.v
I1.p.i + R1.n.i = 0
I1.n.v = emf.p.v
emf.p.i + I1.n.i = 0
emf.n.v = step.n.v
step.n.v = ground.p.v
emf.n.i + ground.p.i + step.n.i = 0
emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau + load.flange_a.tau = 0
step.p.v = R1.p.v
R1.p.i + step.p.i = 0
load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal

  8. Plot of Simulation result [figure: plots of load.flange_a.tau and load.w]

  9. Task Graphs • Directed Acyclic Graph (DAG) G = (V, E, t, c) • V – set of nodes, representing computational tasks • E – set of edges, representing communication of data between tasks • t(v) – execution cost of node v • c(i,j) – communication cost of edge (i,j) • Referred to as the delay model (macro dataflow model)
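To make the delay model concrete, here is a minimal C sketch of one possible representation of G = (V, E, t, c); the struct layout, capacities, and names are illustrative assumptions, not taken from the actual parallelization tool.

#include <stdio.h>

#define MAX_NODES 64
#define MAX_EDGES 256

typedef struct { int src, dst; double c; } Edge;  /* c(i,j): communication cost */

typedef struct {
    int    n_nodes, n_edges;
    double t[MAX_NODES];   /* t(v): execution cost of task v            */
    Edge   e[MAX_EDGES];   /* precedence edges with communication costs */
} TaskGraph;

int main(void) {
    /* Two tasks: v0 feeds v1 over an edge with communication cost 5. */
    TaskGraph g = { 2, 1, { 1.0, 2.0 }, { { 0, 1, 5.0 } } };
    printf("t(v1) = %.1f, c(0,1) = %.1f\n", g.t[1], g.e[0].c);
    return 0;
}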

  10. Small Task Graph Example [figure: task graph with eight nodes, execution costs 1–2 and edge communication costs 5–10]

  11. Task Scheduling Algorithms • Multiprocessor Scheduling Problem • For each task, assign • Starting time • Processor assignment (P1, ..., PN) • Goal: minimize execution time, given • Precedence constraints • Execution cost • Communication cost • Algorithms in the literature • List scheduling approaches (ERT, FLB) • Critical-path scheduling approaches (TDS, MCP) • Categories: fixed number of processors, fixed c and/or t, ...

  12. Granularity • Granularity g = min(t(v)) / max(c(i,j)); a small g means a fine-grained graph (computed in the sketch below) • Affects the scheduling result • E.g. TDS works best for high values of g, i.e. low communication cost • Solutions: • Clustering algorithms • Idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor • Merging algorithms • Merge tasks to increase the computational cost per task
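A minimal sketch of computing g over the cost arrays of such a graph; the arrays and example values below are invented for illustration.

#include <stdio.h>

/* g = min(t(v)) / max(c(i,j)); g << 1 indicates a fine-grained graph. */
double granularity(const double *t, int n_nodes, const double *c, int n_edges) {
    double tmin = t[0], cmax = c[0];
    for (int v = 1; v < n_nodes; v++) if (t[v] < tmin) tmin = t[v];
    for (int e = 1; e < n_edges; e++) if (c[e] > cmax) cmax = c[e];
    return tmin / cmax;
}

int main(void) {
    double t[] = { 1, 2, 2, 1 };   /* execution costs t(v)       */
    double c[] = { 10, 5, 5 };     /* communication costs c(i,j) */
    printf("g = %.2f\n", granularity(t, 4, c, 3));  /* prints g = 0.10 */
    return 0;
}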

  13. Task Clustering/Merging Algorithms • Task Clustering Problem: • Build clusters of nodes such that the parallel time decreases, where PT(n) = tlevel(n) + blevel(n) (see the sketch below) • By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost • Literature: Sarkar's internalization alg., Yang's DSC alg. • Task Merging Problem • Transform the task graph by merging nodes • Literature: e.g. the Grain Packing alg.
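To make the metric concrete, here is a small self-contained C sketch computing tlevel, blevel, and the parallel time (the maximum of PT(n) over all nodes) on an invented three-node chain. In a clustering algorithm, zeroing c[i][j] models putting i and j in the same cluster.

#include <stdio.h>

#define N 3
static double t[N]    = { 1, 2, 1 };     /* execution costs             */
static double c[N][N] = { { 0, 5, 0 },   /* c[i][j] > 0 means edge i->j */
                          { 0, 0, 5 },
                          { 0, 0, 0 } };

/* tlevel(n): length of the longest path ending just before n. */
static double tlevel(int n) {
    double best = 0;
    for (int p = 0; p < N; p++)
        if (c[p][n] > 0) {
            double v = tlevel(p) + t[p] + c[p][n];
            if (v > best) best = v;
        }
    return best;
}

/* blevel(n): length of the longest path starting at n, including t(n). */
static double blevel(int n) {
    double best = 0;
    for (int s = 0; s < N; s++)
        if (c[n][s] > 0) {
            double v = c[n][s] + blevel(s);
            if (v > best) best = v;
        }
    return t[n] + best;
}

int main(void) {
    double pt = 0;
    for (int n = 0; n < N; n++) {
        double v = tlevel(n) + blevel(n);  /* PT(n) */
        if (v > pt) pt = v;
    }
    printf("PT = %.1f\n", pt);  /* 1+5+2+5+1 = 14 for this chain */
    return 0;
}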

  14. Clustering vs. Merging [figure: the example task graph as a clustered task graph (edges within a cluster zeroed to zero communication cost) and as a merged task graph (merged nodes become single larger tasks)]

  15. DSC algorithm • Initially, put each node in a separate cluster • Traverse the task graph, merging clusters as long as the parallel time does not increase • Low complexity: O((n + e) log n) • Previously used by Andersson in ObjectMath (PELAB)

  16. Modelica Compilation

Modelica model (.mo) → [Modelica semantics] → Flat Modelica (.mof) → Equation system (DAE) → Optimized RHS calculations → C code, linked with a numerical solver

Structure of the simulation code (a runnable toy version follows below):

for t = 0; t < stopTime; t += stepSize {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
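As a hedged, runnable illustration of that loop structure, here is a toy driver with a one-state ODE and an explicit Euler step standing in for the ODE solver; f, the state, and the step parameters are invented, not compiler output.

#include <stdio.h>

/* Toy right-hand side x' = f(x, t); the real rhs code is generated. */
static double f(double x, double t) { (void)t; return -x; }

int main(void) {
    double x = 1.0;
    double stopTime = 1.0, stepSize = 0.001;
    for (double t = 0.0; t < stopTime; t += stepSize) {
        double x_dot = f(x, t);   /* rhs calculation (the part to parallelize) */
        x += stepSize * x_dot;    /* one explicit Euler solver step            */
    }
    printf("x(%g) = %g (exact: exp(-1) = 0.367879)\n", stopTime, x);
    return 0;
}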

  17. Optimizations on equations • Simplification of equations, e.g. a = b, b = c: eliminate and keep a single variable b • BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form) [figure: BLT matrix over variables a–e] • Index reduction; the index is the number of times an equation needs to be differentiated in order to solve the equation system • Mixed Mode / Inline Integration, methods of optimizing equations by reducing the size of equation systems

  18. Generated C Code Content • Assignment statements • Arithmetic expressions (+, -, *, /), if-expressions • Function calls • Standard math functions: sin, cos, log • Modelica functions: user-defined, side-effect free • External Modelica functions: in external libraries, written in Fortran or C • Calls to functions solving subsystems of equations: linear or non-linear • Example application: a robot simulation has 27 000 lines of generated C code (an invented fragment is sketched below)
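For flavor, here is a hand-written sketch of such straight-line rhs code; the equations, names, and constants are invented for a DC-motor-like model and are not actual compiler output.

#include <stdio.h>

/* Sketch of generated rhs code: assignments, arithmetic, an if-expression. */
static void rhs(const double *x, double *der_x, double time) {
    double u = (time < 0.5 ? 0.0 : 10.0);  /* if-expression: step voltage */
    double i = x[0];                        /* state: inductor current     */
    double w = x[1];                        /* state: angular velocity     */
    double v = u - 10.0 * i - 5.4 * w;      /* Kirchhoff voltage balance   */
    der_x[0] = v / 0.1;                     /* L * der(i) = v              */
    der_x[1] = (5.4 * i) / 2.25;            /* J * der(w) = k * i          */
}

int main(void) {
    double x[2] = { 0.0, 0.0 }, der_x[2];
    rhs(x, der_x, 1.0);
    printf("der(i) = %g, der(w) = %g\n", der_x[0], der_x[1]);
    return 0;
}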

  19. Parallelization Tool Overview

Model (.mo) → Modelica Compiler → C code → C compiler (+ solver lib) → sequential exe
C code → Parallelizer → parallel C code → C compiler (+ solver lib, MPI lib) → parallel exe

  20. Parallelization Tool Internal Structure

Sequential C code → Parser → Task Graph Builder → Scheduler → Code Generator → Parallel C code, built around a shared Symbol Table, with Debug & Statistics output

  21. Task Graph building • First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code • Second graph: clusters of tasks from the first task graph [figure: expression-level task graph over defs a, b, c, d with nodes such as +, -, *, / and a call to foo, clustered into larger tasks]

  22. Investigated Scheduling Algorithms • Parallelization Tool • TDS (Task Duplication Scheduling algorithm) • Pre-Clustering Method • Full Task Duplication Method • Experimental Framework (Mathematica) • ERT • DSC • TDS • Full Task Duplication Method • Task Merging approaches (Graph Rewrite Systems)

  23. Method 1: Pre-Clustering algorithm • buildCluster(n: node, l: list of nodes, size: Integer) • Adds n to a new cluster • Repeatedly adds nodes until size(cluster) = size: • Children of n • Children with in-degree one into the cluster • Siblings of n • Parents of n • Arbitrary nodes • (a partial C sketch follows below)
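A partial, hedged sketch of buildCluster in C: it starts a cluster at n and greedily pulls in children with in-degree one until the requested size is reached. The sibling, parent, and arbitrary-node phases are omitted, and the graph data is invented.

#include <stdio.h>

#define N 5
static int edge[N][N] = { { 0, 1, 1, 0, 0 },   /* edge[i][j] = 1: i -> j */
                          { 0, 0, 0, 1, 0 },
                          { 0, 0, 0, 0, 1 },
                          { 0, 0, 0, 0, 0 },
                          { 0, 0, 0, 0, 0 } };

static int indegree(int v) {
    int d = 0;
    for (int p = 0; p < N; p++) d += edge[p][v];
    return d;
}

int main(void) {
    int n = 0, size = 3;            /* buildCluster(n, size)          */
    int in_cluster[N] = { 0 };
    int count = 1;
    in_cluster[n] = 1;              /* n starts a new cluster         */
    int changed = 1;                /* add in-degree-one children of  */
    while (changed && count < size) {  /* cluster members until full */
        changed = 0;
        for (int p = 0; p < N; p++)
            if (in_cluster[p])
                for (int ch = 0; ch < N && count < size; ch++)
                    if (edge[p][ch] && !in_cluster[ch] && indegree(ch) == 1) {
                        in_cluster[ch] = 1; count++; changed = 1;
                    }
    }
    printf("cluster:");
    for (int v = 0; v < N; v++) if (in_cluster[v]) printf(" %d", v);
    printf("\n");                   /* prints: cluster: 0 1 2         */
    return 0;
}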

  24. Managing cycles • When adding a node to a cluster, the resulting graph might have cycles • The resulting graph when clustering a and b is cyclic, since {a,b} can be reached from c [figure: example graph on nodes a–e] • The resulting graph is then not a DAG • Standard scheduling algorithms can therefore not be used

  25. Pre-Clustering Results • Did not produce speedup • Introduced far too many dependencies in the resulting task graph • Sequentialized the schedule • Conclusion: for fine-grained task graphs, such an algorithm needs task duplication to succeed

  26. Method 2: Full Task Duplication • For each node n with succ(n) = {}: put n and all its (transitive) predecessors in one cluster • Repeat for all nodes in the cluster • Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and cluster sizes stay reasonably small • Works well when communication cost >> execution cost (a sketch follows below)
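A hedged C sketch of the core idea: every sink node gets a cluster containing the sink plus all of its transitive predecessors, so tasks shared between sinks are duplicated across clusters. The tiny graph below is invented (two sinks share the predecessors 0, 1, 2).

#include <stdio.h>

#define N 5
static int edge[N][N] = { { 0, 0, 1, 0, 0 },   /* edge[i][j] = 1: i -> j */
                          { 0, 0, 1, 0, 0 },
                          { 0, 0, 0, 1, 1 },
                          { 0, 0, 0, 0, 0 },
                          { 0, 0, 0, 0, 0 } };

/* Pull n and all of its transitive predecessors into the cluster. */
static void collect_preds(int n, int *in_cluster) {
    in_cluster[n] = 1;
    for (int p = 0; p < N; p++)
        if (edge[p][n] && !in_cluster[p])
            collect_preds(p, in_cluster);
}

int main(void) {
    for (int n = 0; n < N; n++) {
        int is_sink = 1;                       /* succ(n) == {} ?         */
        for (int s = 0; s < N; s++) if (edge[n][s]) is_sink = 0;
        if (!is_sink) continue;
        int cluster[N] = { 0 };
        collect_preds(n, cluster);             /* tasks 0,1,2 end up      */
        printf("cluster for sink %d:", n);     /* duplicated in both      */
        for (int v = 0; v < N; v++) if (cluster[v]) printf(" %d", v);
        printf("\n");
    }
    return 0;
}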

  27. Full Task Duplication (2) • Merging clusters • Merge clusters using a load-balancing strategy, without increasing the maximum cluster size • Merge the clusters with the greatest number of common nodes • Repeat until the required number of processors is met

  28. Full Task Duplication Results • Computed measurements: execution cost of the largest cluster + communication cost • Measured speedup: executed on a Linux PC cluster with an SCI network interface, using ScaMPI

  29. Robot Example, Computed Speedup • Mixed Mode / Inline Integration [figure: computed speedup, with MM/II and without MM/II]

  30. Thermofluid pipe executed on PC Cluster • PressureWaveDemo in the ThermoFluid package, 50 discretization points

  31. Thermofluid pipe executed on PC Cluster • PressureWaveDemo in the ThermoFluid package, 100 discretization points

  32. Task Merging using GRS • Idea: a set of simple rules to transform a task graph, increasing its granularity (and decreasing its parallel time) • Use top level (and bottom level) as the metric: Parallel Time = max over all nodes n of tlevel(n) + blevel(n)

  33. Rule 1 • Merge a single child with its only parent [figure: parent p and single child c merged into p'] • Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase (a sketch of the matching condition follows below)
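A minimal C sketch of the matching condition, under one reading of the rule's precondition (c is its parent p's only child, and p is c's only parent, matching the figure); the graph is invented.

#include <stdio.h>

#define N 4
static int edge[N][N] = { { 0, 1, 0, 0 },   /* edge[i][j] = 1: i -> j */
                          { 0, 0, 1, 1 },
                          { 0, 0, 0, 0 },
                          { 0, 0, 0, 0 } };

int main(void) {
    for (int c = 0; c < N; c++) {
        int parent = -1, indeg = 0;
        for (int p = 0; p < N; p++)
            if (edge[p][c]) { parent = p; indeg++; }
        if (indeg != 1) continue;           /* c must have exactly one parent */
        int outdeg = 0;
        for (int s = 0; s < N; s++) outdeg += edge[parent][s];
        if (outdeg == 1)                    /* ... and be p's only child      */
            printf("rule 1: merge %d into %d (zero edge (%d,%d))\n",
                   c, parent, parent, c);   /* here: merge 1 into 0           */
    }
    return 0;
}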

  34. Rule 2 • Merge all parents of a node together with the node itself [figure: parents p1 ... pn merged with child c into c'] • Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity

  35. Rule 3 • Duplicate the parent and merge it into each child node [figure: parent p duplicated into children c1 ... cn, giving c1' ... cn'] • Motivation: as long as each child's tlevel does not increase, duplicating p into the child will reduce the number of nodes and increase granularity

  36. Rule 4 • Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded [figure: siblings p1 ... pk merged into p', leaving pk+1 ... pn] • Motivation: this rule can be useful if several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. Does not guarantee a decrease of PT

  37. Results – Example • Task graph from Modelica simulation code • Small example from the mechanical domain • About 100 nodes built at the expression level, originating from 84 equations & variables

  38. Result, Task Merging example • B = 1, L = 1 (bandwidth B, latency L)

  39. Result, Task Merging example • B = 1, L = 10 • B = 1, L = 100

  40. Conclusions • The pre-clustering approach did not work well for the fine-grained task graphs produced by our parallelization tool • The FTD method works reasonably well for some examples • However, in general there is a need for better scheduling/clustering algorithms for fine-grained task graphs

  41. Conclusions (2) • The simple delay model may not be enough; more advanced models require more complex scheduling and clustering algorithms • Simulation code from equation-based models is hard to extract parallelism from • New optimization methods on DAEs or ODEs are needed to increase parallelism

  42. Conclusions, Task Merging using GRS • A task merging algorithm using GRS has been proposed • Four rules with simple patterns => fast pattern matching • Can easily be integrated into existing scheduling tools • Successfully merges tasks, considering bandwidth & latency and task duplication • Merging criterion: decrease the parallel time (PT) by decreasing tlevel • Tested on examples from simulation code

  43. Future Work • Designing and implementing better scheduling and clustering algorithms • Support for more advanced task graph models • Work better for fine-grained task graphs (low granularity values) • Try larger examples • Test on different architectures • Shared-memory machines • Dual-processor machines

  44. Future Work (2) • Heterogeneous multiprocessor systems • Mixed DSP processors, RISC, CISC, etc. • Enhancing the Modelica language with data parallelism • E.g. parallel loops, vector operations • Parallelizing e.g. combined PDE and ODE problems in Modelica • Using e.g. ScaLAPACK for solving subsystems of linear equations; how to integrate this into scheduling algorithms?
