
High performance computing for a family of smooth trajectories using parallel environments

Bologna, March 23 - 26, 2004. Gianluca Argentini, Advanced Computing Laboratory. gianluca.argentini@riellogroup.com


Presentation Transcript


  1. High performance computing for a family of smooth trajectories using parallel environments
     Bologna, March 23 - 26, 2004
     Gianluca Argentini, Advanced Computing Laboratory, gianluca.argentini@riellogroup.com

  2. Introduction - 1
     The company:
     • products for heating and conditioning
     • development and production of residential and industrial burners
     • a Center of Excellence for the study of combustion and flame processes
     • an R&D Department with extensive CAD (Catia from IBM-Dassault Systemes) and FLUENT computations

  3. Introduction - 2
     Industrial and power burners have particular requirements:
     • customized study of the combustion head
     • accurate study of the geometry of the combustion chamber (shape of the flame, flow of gas or oil and oxygen)
     • ventilation and air circulation fans for a correct oxygen supply, the right pressurization and continuous cooling
     • reduction of vibrations and noise

  4. Introduction - 3
     Rapid prototyping for the optimal shape of the combustion head and combustion chamber involves Computational Fluid Dynamics:
     • tracing the streamlines of air or gas flow particles
     • shape of the flow in a generic geometry
     High graphic resolution requires a large number of particle paths:
     • heavy computational effort, expensive in both memory and CPU
     • distribution of the paths on a multiprocessor environment

  5. The problem
     Focus on the numerical simulation of flows (in the combustion head, in the chamber or in the fan mechanism).
     The large numerical output of the simulation is generated by Navier-Stokes models (FLUENT package) or Cellular Automaton models (MATLAB package).
     From the data, we want to obtain:
     • path tracking of fluid particles, useful for the customized design of combustion heads and chambers
     • smooth 3D visualization of particle trajectories, possibly with continuous slope and curvature (analytically: class $C^2$)

  6. About problem treatment
     Step 1. The data obtained from the simulation model are processed by an algorithm that computes the algebraic curves (cubic polynomials) associated with the particle paths:
     • block-data distribution for parallel computing
     • need for continuous reallocation in RAM
     Step 2. Evaluation of the polynomials on a large set of values for fine resolution:
     • very expensive CPU computation
     • distribution of sets of curves on the processors, no communication
     [Diagram: Data, Algebraic curves, Massive computing]

  7. Fitting the trajectories
     From the simulation, a single particle trajectory is a set of 3D points:
     • S is the number of points
     • M is the number of trajectories
     Interpolation of the points:
     • Bezier-like interpolation is not realistic in case of twist or divergence of the velocity field
     • Chebyshev or least-squares-like interpolation is too rigid for a customized application
     • polynomial fitting is simple but often shows spurious effects such as the Runge-Gibbs phenomenon
     We think a splines-based technique is more useful.

  8. The splines-based algorithm
     Let $S = 4 \times N$: the path is divided into groups of four points.
     For every group the points are interpolated by three cubic polynomials, imposing four analytical conditions:
     • passage through the point $P_k$, $1 \le k \le 3$
     • passage through the point $P_{k+1}$
     • continuous slope at the point $P_k$
     • continuous curvature at the point $P_k$
     For smooth rendering and to avoid excessive twisting of the trajectories, the cubics $u_k$ are added to the Bezier curve $b$ associated with the four points:
     $$v = \alpha b + \beta u_k, \qquad 0 < \alpha, \beta < 1$$
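The blending rule on this slide can be illustrated with a minimal sketch for one scalar coordinate. The function names `bezier`, `cubic` and `blend` are hypothetical, and the slide does not state how the Bezier parameter s is matched to the spline parameter t, so the blend below simply combines two values that the caller has already evaluated at corresponding parameters.

```python
def bezier(ctrl, s):
    """Cubic Bezier value for four scalar control values, s in [0, 1]."""
    p1, p2, p3, p4 = ctrl
    r = 1.0 - s
    return r**3 * p1 + 3 * r**2 * s * p2 + 3 * r * s**2 * p3 + s**3 * p4


def cubic(coeffs, t):
    """Value of a*t^3 + b*t^2 + c*t + d, computed with Horner's rule."""
    a, b, c, d = coeffs
    return ((a * t + b) * t + c) * t + d


def blend(bezier_value, spline_value, alpha=0.5, beta=0.5):
    """Weighted sum alpha*b + beta*u_k; the slide requires 0 < alpha, beta < 1."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    return alpha * bezier_value + beta * spline_value
```

For a 3D trajectory the same functions would be applied to each coordinate separately.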

  9. Finding the splines
     We have chosen $\alpha = \beta = 0.5$.
     Let $b = As^3 + Bs^2 + Cs + D$ ($0 \le s \le 1$) be the Bezier curve with control points $P_1, \dots, P_4$. For every spline $u_k = at^3 + bt^2 + ct + d$ ($0 \le t \le 1$) the coefficients are computed by
     $$
     \begin{aligned}
     a &= P_{k+1} - P_k - 3B - C - 6 \\
     b &= B + 3 \\
     c &= 2B + C + 3 \\
     d &= P_k
     \end{aligned}
     \tag{1}
     $$
     These formulas hold for $2 \le k \le 3$; for $k = 1$ they are slightly different but of the same algebraic form. Here $a, b, c, d$ are 3-dimensional Cartesian vectors.

  10. A matrix for splines
     The system (1) can be represented as a matrix-vector multiplication $\mathbf{c} = T\,\mathbf{b}$, where $\mathbf{c} = (a, b, c, d)$, $\mathbf{b} = (P_{k+1}, P_k, B, C, 1)$ and
     $$
     T = \begin{pmatrix}
     1 & -1 & -3 & -1 & -6 \\
     0 & 0 & 1 & 0 & 3 \\
     0 & 0 & 2 & 1 & 3 \\
     0 & 1 & 0 & 0 & 0
     \end{pmatrix}
     $$
     For every spline, only the vector $\mathbf{b}$ is variable; for a single trajectory, it must be reassigned in RAM for every group of two points, after the computation of the corresponding Bezier curve.
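As a sketch of the product c = T b, the following code applies the matrix T from this slide to one scalar coordinate. The function name `spline_coeffs` is hypothetical, and treating the computation coordinate by coordinate is an assumption, since the slide does not say how the constant entries combine with 3D vectors.

```python
# Matrix T as given on the slide (4 rows, 5 columns).
T = [
    [1, -1, -3, -1, -6],
    [0,  0,  1,  0,  3],
    [0,  0,  2,  1,  3],
    [0,  1,  0,  0,  0],
]


def spline_coeffs(p_next, p_k, B, C):
    """Return [a, b, c, d] = T * (P_{k+1}, P_k, B, C, 1) for one coordinate."""
    b_vec = [p_next, p_k, B, C, 1.0]
    return [sum(t * x for t, x in zip(row, b_vec)) for row in T]


coeffs = spline_coeffs(2.0, 1.0, 0.5, 0.25)
# d equals P_k, and a + b + c + d equals P_{k+1}, so the cubic
# passes through both data points at t = 0 and t = 1.
```

The last row of T reproduces P_k, and the four rows sum to the selector of P_{k+1}, which is why the resulting cubic interpolates both end points.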

  11. A global matrix for splines
     We define a global matrix Æ as
     $$
     \text{Æ} = \begin{pmatrix}
     T & 0 & \cdots & 0 \\
     0 & T & \cdots & 0 \\
     \vdots & & \ddots & \vdots \\
     0 & 0 & \cdots & T
     \end{pmatrix}
     $$
     where $0$ is the $4 \times 5$ zero matrix. This gives a $4M \times 5M$ sparse matrix (optimization of memory storage in MATLAB).
     With $B = (P_{k+1}, P_k, B_1, C_1, 1, \dots, P_{k+1}, P_k, B_M, C_M, 1)$ we can compute, for every two-point group, the coefficients of the cubic splines for all the M trajectories: $C = \text{Æ}\,B$
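Because the global matrix is block diagonal, the product with the stacked vector can be computed block by block without storing the zero blocks. The sketch below illustrates that idea in plain Python for one scalar coordinate; it is not the MATLAB sparse-matrix code used in the talk, and the name `apply_global` is hypothetical.

```python
# Matrix T from the previous slide (4 rows, 5 columns).
T = [
    [1, -1, -3, -1, -6],
    [0,  0,  1,  0,  3],
    [0,  0,  2,  1,  3],
    [0,  1,  0,  0,  0],
]


def apply_global(B_global):
    """Multiply the block-diagonal matrix diag(T, ..., T) by a stacked vector.

    B_global holds M consecutive blocks (P_{k+1}, P_k, B_i, C_i, 1);
    the result holds M consecutive blocks (a_i, b_i, c_i, d_i).
    """
    if len(B_global) % 5 != 0:
        raise ValueError("length must be a multiple of 5")
    out = []
    for i in range(0, len(B_global), 5):
        block = B_global[i:i + 5]
        out.extend(sum(t * x for t, x in zip(row, block)) for row in T)
    return out


# Two trajectories (M = 2) stacked in one vector.
C_global = apply_global([2.0, 1.0, 0.5, 0.25, 1.0,
                         4.0, 3.0, 0.0, 0.0, 1.0])
```

Each block of the output depends only on the matching block of the input, which is what makes the later distribution of trajectories to processes communication-free.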

  12. Computational complexity analysis
     For every four-point group and for the M trajectories, the number of flops (floating point operations) needed to compute the spline coefficients is:
     • for the Bezier curves (customized Matlab script): 316M
     • for the Æ matrix-vector multiplication (upper estimate): 324M
     Every trajectory has N four-point groups, so the total number of flops for Step 1 is about 640MN.
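As a worked check of the estimate, the two per-group counts add up to the stated total. The numerical values M = 5000 and N = 1600 are taken from the example on the final slide and are used here only for illustration.

```latex
\begin{aligned}
\text{flops per group} &= 316M + 324M = 640M \\
\text{flops for Step 1} &\approx 640\,M N \\
M = 5000,\ N = 1600:\quad 640 \cdot 5000 \cdot 1600 &= 5.12 \times 10^{9}
\end{aligned}
```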

  13. A parallel distribution for splines
     With P, the number of processes, a divisor of M, the method used is the distribution of M/P trajectories (rows of the Æ matrix) to every process; no communication is involved. The value of M is important for the occupation of RAM at every computational node.
     [Figure: the M trajectories assigned to processes p1, p2, ..., pP; linear execution along N for every process]
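The block distribution described here can be sketched as a simple index partition. The function name `partition` is hypothetical; the actual runs used MPI or simultaneous MATLAB engines, which this sketch does not reproduce.

```python
def partition(M, P):
    """Split trajectory indices 0..M-1 into P contiguous blocks of M/P each.

    P must divide M, as assumed on the slide.
    """
    if P <= 0 or M % P != 0:
        raise ValueError("P must be a positive divisor of M")
    size = M // P
    return [range(p * size, (p + 1) * size) for p in range(P)]


# Example: 8 trajectories on 4 processes, 2 trajectories per process.
blocks = partition(8, 4)
```

Each process would then loop over its own block for all N groups, with no data exchanged between processes.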

  14. Computing splines: hardware and software
     Bezier curves and splines were computed on:
     • a Linux cluster of IBM x330 nodes, dual-processor Pentium III 1.133 GHz, at CINECA (2003); C routines and MPI (for parallel startup and data distribution)
     • 2 nodes IBM x440 with Windows2000 / Linux RedHat, dual-processor Xeon 2.4 GHz with Hyper Threading, 2 GB RAM, at Riello (2003); MATLAB rel. 6.5 scripts (startup of simultaneous multiple engines)

  15. Computing splines: performance results
     Beowulf CINECA: the measured speedup is quasi-linear; for high values of P the cost of distributing the data (variable M) among the processes becomes more significant.
     x440 cluster: with Intel HT technology, Win2k performs better than Linux (linear speedup).

  16. Post-processing for splines
     We now want a fast method for computing the spline values on a finely sampled set of parameter ticks.
     The CFD packages have some limits in the post-processing phase:
     • resolution based on the pre-processing mesh
     • rigid (when possible at all) load distribution among the available processors
     For good graphic visualization, the interval between two data-points may be divided into a suitable number of ticks.

  17. Evaluating the splines
     Let $V + 1$ be the number of ticks for the evaluation of each cubic spline; then the ticks are $(0, 1/V, 2/V, \dots, (V-1)/V, 1)$ and the values of the spline parameter used in the computation are their powers of degree 0, 1, 2 and 3.
     The value of a cubic at $t_0$ can be viewed as a dot product:
     $$a t_0^3 + b t_0^2 + c t_0 + d = (a, b, c, d) \cdot (t_0^3, t_0^2, t_0, 1)$$
     Let Ó be the pre-allocable constant $4 \times (V+1)$ matrix:
     $$
     \text{Ó} = \begin{pmatrix}
     0 & (1/V)^3 & \cdots & ((V-1)/V)^3 & 1 \\
     0 & (1/V)^2 & \cdots & ((V-1)/V)^2 & 1 \\
     0 & (1/V)^1 & \cdots & ((V-1)/V)^1 & 1 \\
     1 & 1 & \cdots & 1 & 1
     \end{pmatrix}
     $$
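A minimal sketch of the constant matrix Ó and of the dot-product evaluation follows. The names `sampling_matrix` and `evaluate` are hypothetical, and plain Python lists stand in for the MATLAB matrices used in the talk.

```python
def sampling_matrix(V):
    """Build the 4 x (V+1) matrix whose rows are t^3, t^2, t, 1 at t = j/V."""
    ticks = [j / V for j in range(V + 1)]
    return [
        [t**3 for t in ticks],
        [t**2 for t in ticks],
        list(ticks),
        [1.0] * (V + 1),
    ]


def evaluate(coeffs, S, j):
    """Value of the cubic (a, b, c, d) at tick j, as a dot product with column j."""
    return sum(c * row[j] for c, row in zip(coeffs, S))


S = sampling_matrix(4)                       # ticks 0, 0.25, 0.5, 0.75, 1
mid = evaluate([2.0, 0.0, 0.0, 1.0], S, 2)   # 2*t^3 + 1 at t = 0.5
```

The matrix depends only on V, so with a uniform sampling interval it can be built once and reused for every spline.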

  18. An eulerian view
     Let Â be the $M \times 4$ matrix (each row a spline, one for each trajectory):
     $$
     \text{Â} = \begin{pmatrix}
     a_1 & b_1 & c_1 & d_1 \\
     a_2 & b_2 & c_2 & d_2 \\
     \vdots & \vdots & \vdots & \vdots \\
     a_M & b_M & c_M & d_M
     \end{pmatrix}
     $$
     Then the $M \times (V+1)$ matrix product Õ = Â Ó contains in each row the values of a cubic between two data-points, for all the M trajectories (eulerian method).
     For one product the flops are $21M(V+1)$; the number of matrices Õ is $3N$, so the total number of flops is $63NM(V+1)$.
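The eulerian product Â Ó can be sketched as an ordinary matrix multiplication. This is a plain-Python illustration for one scalar coordinate with hypothetical names (`sampling_matrix`, `matmul`); the talk itself relied on MATLAB's built-in matrix product.

```python
def sampling_matrix(V):
    """4 x (V+1) matrix with rows t^3, t^2, t, 1 at the ticks t = j/V."""
    ticks = [j / V for j in range(V + 1)]
    return [[t**3 for t in ticks], [t**2 for t in ticks],
            list(ticks), [1.0] * (V + 1)]


def matmul(A, B):
    """Naive product of an m x n matrix A and an n x p matrix B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Two trajectories (M = 2): the cubics u(t) = t and u(t) = t^3, sampled with V = 2.
A = [
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
values = matmul(A, sampling_matrix(2))  # one row of sampled values per trajectory
```

Each output entry costs 4 multiplications and 3 additions; repeating the product for the three coordinates would give 21 flops per entry, which matches the count on the slide under that assumption.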

  19. A lagrangian view
     Let Ï be the $3N \times 4$ matrix (each row a spline along one single trajectory):
     $$
     \text{Ï} = \begin{pmatrix}
     a_1 & b_1 & c_1 & d_1 \\
     a_2 & b_2 & c_2 & d_2 \\
     \vdots & \vdots & \vdots & \vdots \\
     a_{3N} & b_{3N} & c_{3N} & d_{3N}
     \end{pmatrix}
     $$
     Then the $3N \times (V+1)$ matrix product Ô = Ï Ó contains in each row the values of a cubic between two data-points, for a single trajectory (lagrangian method).
     For one product the flops are $63N(V+1)$; the number of matrices Ô is $M$, so the total number of flops is $63NM(V+1)$.
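The two views organize the same work differently, so their totals coincide. The worked check below uses M = 5000, N = 1600 and V = 100 from the example on the final slide, purely as illustrative values.

```latex
\begin{aligned}
\text{eulerian:}\quad & 3N \cdot 21M(V+1) = 63\,NM(V+1) \\
\text{lagrangian:}\quad & M \cdot 63N(V+1) = 63\,NM(V+1) \\
M = 5000,\ N = 1600,\ V = 100:\quad & 63 \cdot 1600 \cdot 5000 \cdot 101 = 50\,904\,000\,000 \approx 5.09 \times 10^{10}
\end{aligned}
```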

  20. Data distribution: eulerian case
     With P, the number of processes, a divisor of 3N (the number of two-point groups), the method used is the distribution of 3N/P Â matrices to every process; no communication is involved. The value of N is important for the total computation time, N and M for the RAM allocation of each process.
     [Figure: axes 3N (CPU) and M (RAM)]

  21. Data distribution: lagrangian case
     With P, the number of processes, a divisor of M (the number of trajectories), the method used is the distribution of M/P Ï matrices to every process; no communication is involved. The value of N is important for the total computation time, N and M for the RAM allocation of each process.
     [Figure: axes 3N (RAM) and M (CPU)]

  22. Hardware and software
     Hardware: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB }, at Riello
     Software: Windows2000 / Linux RH 8.1, MATLAB 6.5, parallelism through simultaneous Matlab engines
     • for matrix multiplication, Matlab 6.5 uses internal LAPACK Level 3 BLAS routines (good performance)
     • the Ó matrix is computed only once (in the case of a uniform and constant sampling interval); its values are probably always cached during the matrix multiplications

  23. Performance results
     Performance of a single Matlab process for the Â Ó product with V = 100: as predicted by theory, the execution time is linear in M.
     Performance of multiprocess products (case 3N = 4200P): for P ≤ 8, the total computation time depends on NM (Gustafson's law), as expected.

  24. Performance results: considerations
     • Linear speedup up to P = 8 (the number of virtual Hyper Threading processors); for P ≥ 8, reallocations of RAM and caches have a negative effect
     • For large data sets, the amount of RAM in the cluster nodes is a critical factor, while CPU performance is good thanks to the LAPACK routines
     • First results with a technique using “global M-N” matrices, an MPI-multithreaded version of MATLAB (Cornell Toolbox) and parallel matrix multiplication algorithms show an overhead, for large data, due to communications

  25. Performance results: Hyper Threading
     Performance of the Intel Hyper Threading Technology of the Xeon processors; the vertical unit is the execution time in the case of 8 processes (M = 5000, 3N = 4200P); up to 8 processes, the time seems to be quadratic in the number of processes.
     Similar results have been obtained:
     • using Win2k or Linux
     • using High Performance Linpack benchmarking

  26. Examples
     [Figure: red = trajectory computation with V = 100; black = least squares method, 3rd-degree polynomials; gray = data-points from simulation]
     [Figure: forced injection of air in the combustion head; the ribbons show some particle trajectories; data-points from simulation, path computation with V = 100, M = 5000, N = 1600, P = 8; computation and rendering by Matlab; total computation time 85 s]
     Thanks
