High performance computing for a family of smooth trajectories using parallel environments

Bologna, March 23 - 26, 2004

Gianluca Argentini

Advanced Computing Laboratory


Introduction - 1

  • The company:
  • products for heating and conditioning
  • development and production of residential and industrial burners
  • presence of a Center of Excellence for the study of combustion and flame processes
  • R&D Department, extensive CAD (Catia from IBM-Dassault Systemes) and FLUENT computations



Introduction - 2

  • Industrial and power burners have particular requirements:
  • customized study of combustion head
  • study of accurate geometry of combustion chamber (shape of the flame, flow of gas or oil and oxygen)
  • ventilation and air circulation fans for a correct oxygen supply, correct pressurization and continuous cooling
  • reduction of vibrations and noise



Introduction - 3

  • Rapid prototyping of the optimal shape of the combustion head and combustion chamber involves Computational Fluid Dynamics:
  • tracing the streamlines of air or gas flow particles
  • shape of the flow in a generic geometry
  • High graphic resolution requires a large number of particle paths:
  • heavy computational effort, expensive in both memory and CPU time
  • distribution of the paths over a multiprocessor environment



The problem

Focus on numerical simulation of flows (in the combustion head, in the chamber or in the fan mechanisms)

The large numerical output of the simulation is generated by Navier-Stokes models (FLUENT package) or Cellular Automaton models (MATLAB package)

  • From the data, we want to obtain:
  • path tracking of fluid particles, useful for the customized design of combustion heads and chambers
  • smooth 3D visualization of particle trajectories, possibly with continuous slope and curvature (analytically: class $C^2$)



About problem treatment

  • Step 1. The data obtained from the simulation model are processed by an algorithm that computes the algebraic curves (cubic polynomials) associated with the particle paths:
  • block-data distribution for parallel computing
  • need for continuous reallocation in RAM
  • Step 2. Evaluation of the polynomials on a large set of values for fine resolution:
  • very expensive CPU computation
  • distribution of sets of curves among processors, no communication


Algebraic curves

Massive Computing



Fitting the trajectories

From simulation, a single particle trajectory is a set of 3D points:

  • S is the number of points
  • M is the number of trajectories

Interpolation of the points:

  • Bezier-like interpolation is not realistic when the velocity field twists or diverges
  • Chebyshev or least-squares-like methods are too rigid for a customized application
  • polynomial fitting is simple but often shows spurious effects such as the Runge-Gibbs phenomenon

We think a spline-based technique is more useful
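The spurious oscillations of a single high-degree interpolating polynomial can be illustrated with a small sketch. It uses the classical Runge test function 1/(1 + 25x^2) on equally spaced nodes, which is an assumption chosen for illustration and not data from the simulation; the function names are hypothetical.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def runge(x):
    """Classical test function for the Runge phenomenon (illustrative choice)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def max_interp_error(n_nodes, n_probe=2001):
    """Largest deviation on [-1, 1] between runge() and its interpolant
    built on n_nodes equally spaced nodes."""
    xs = [-1.0 + 2.0 * i / (n_nodes - 1) for i in range(n_nodes)]
    ys = [runge(x) for x in xs]
    probes = [-1.0 + 2.0 * i / (n_probe - 1) for i in range(n_probe)]
    return max(abs(lagrange_eval(xs, ys, x) - runge(x)) for x in probes)


if __name__ == "__main__":
    for n in (5, 11):
        print(n, "nodes, max error:", max_interp_error(n))
```

Adding nodes makes the error near the ends of the interval larger rather than smaller, which is the behaviour a piecewise cubic approach avoids.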



The splines-based algorithm

Let S = 4 x N: the path is divided into groups of four points

For every group, the points are interpolated by three cubic polynomials, imposing four analytical conditions:

  • passage through point $P_k$, $1 \le k \le 3$
  • passage through point $P_{k+1}$
  • continuous slope at point $P_k$
  • continuous curvature at point $P_k$

For smooth rendering and to avoid excessive twisting of the trajectories, the cubics $u_k$ are added to the Bezier curve $b$ associated with the four points:

$$v = \alpha b + \beta u_k, \qquad 0 < \alpha, \beta < 1$$



Finding the splines

We have chosen $\alpha = \beta = 0.5$

Let $b = As^3 + Bs^2 + Cs + D$ ($0 \le s \le 1$) be the Bezier curve with control points $P_1, \ldots, P_4$; for every spline $u_k = at^3 + bt^2 + ct + d$ ($0 \le t \le 1$) the coefficients are computed by the following formulas (for $2 \le k \le 3$; for $k = 1$ the formulas are slightly different but of the same algebraic form; $a$, $b$, $c$, $d$ are 3-dimensional Cartesian vectors):

$$
\begin{aligned}
a &= P_{k+1} - P_k - 3B - C - 6 \\
b &= B + 3 \\
c &= 2B + C + 3 \\
d &= P_k
\end{aligned}
\tag{1}
$$
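A minimal sketch of these formulas, applied component by component, is given here. The power-basis coefficients of the Bezier curve follow from the standard Bernstein form; the function names and the sample control points are assumptions made for illustration. The check confirms that the resulting cubic passes through $P_k$ at t = 0 and through $P_{k+1}$ at t = 1, since the constant terms and the B and C terms cancel in a + b + c.

```python
def bezier_power_coeffs(p1, p2, p3, p4):
    """Power-basis coefficients (A, B, C, D) of the cubic Bezier curve
    b(s) = A s^3 + B s^2 + C s + D with control points p1..p4."""
    A = tuple(-x1 + 3 * x2 - 3 * x3 + x4 for x1, x2, x3, x4 in zip(p1, p2, p3, p4))
    B = tuple(3 * x1 - 6 * x2 + 3 * x3 for x1, x2, x3 in zip(p1, p2, p3))
    C = tuple(-3 * x1 + 3 * x2 for x1, x2 in zip(p1, p2))
    D = tuple(p1)
    return A, B, C, D


def spline_coeffs(pk, pk1, B, C):
    """Coefficients (a, b, c, d) of the cubic u_k from system (1),
    evaluated separately for each Cartesian component."""
    a = tuple(q - p - 3 * bb - cc - 6 for p, q, bb, cc in zip(pk, pk1, B, C))
    b = tuple(bb + 3 for bb in B)
    c = tuple(2 * bb + cc + 3 for bb, cc in zip(B, C))
    d = tuple(pk)
    return a, b, c, d


def cubic(coeffs, t):
    """Evaluate a t^3 + b t^2 + c t + d with Horner's rule, per component."""
    a, b, c, d = coeffs
    return tuple(((ai * t + bi) * t + ci) * t + di
                 for ai, bi, ci, di in zip(a, b, c, d))


if __name__ == "__main__":
    # Hypothetical four-point group.
    P1, P2, P3, P4 = (0, 0, 0), (1, 2, 0), (2, 3, 1), (4, 3, 2)
    _, B, C, _ = bezier_power_coeffs(P1, P2, P3, P4)
    u2 = spline_coeffs(P2, P3, B, C)
    print(cubic(u2, 0), cubic(u2, 1))
```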



A matrix for splines

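The system (1) can be represented as $\mathbf{c} = T\,\mathbf{b}$ (matrix-vector multiplication), where

$$
\mathbf{c} = (a, b, c, d), \qquad \mathbf{b} = (P_{k+1}, P_k, B, C, 1)
$$

$$
T = \begin{pmatrix}
1 & -1 & -3 & -1 & -6 \\
0 & 0 & 1 & 0 & 3 \\
0 & 0 & 2 & 1 & 3 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}
$$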

For every spline, only the vector b varies; for a single trajectory, it must be reassigned in RAM for every two-point group, after the computation of the corresponding Bezier curve.
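The matrix form can be sketched as follows: T is stored once and only the 5-entry vector changes from group to group. The function name and the sample values are hypothetical; each Cartesian coordinate is treated as a separate matrix-vector product.

```python
# Constant 4 x 5 matrix T of system (1).
T = [
    [1, -1, -3, -1, -6],
    [0,  0,  1,  0,  3],
    [0,  0,  2,  1,  3],
    [0,  1,  0,  0,  0],
]


def spline_coeffs_matrix(pk1, pk, B, C):
    """Return [a, b, c, d] = T * (pk1, pk, B, C, 1), one coordinate at a time."""
    dims = len(pk)
    coeffs = [[0] * dims for _ in range(4)]
    for axis in range(dims):
        vec = (pk1[axis], pk[axis], B[axis], C[axis], 1)
        for row in range(4):
            coeffs[row][axis] = sum(t * v for t, v in zip(T[row], vec))
    return [tuple(r) for r in coeffs]


if __name__ == "__main__":
    # Hypothetical values: two data points and the Bezier coefficients B, C.
    print(spline_coeffs_matrix((2, 3, 1), (1, 2, 0), (0, -3, 3), (3, 6, 0)))
```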



A global matrix for splines

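If we define a global matrix Æ as

$$
\text{Æ} = \begin{pmatrix}
T & 0 & \cdots & 0 \\
0 & T & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & T
\end{pmatrix}
$$

with 0 denoting the 4 x 5 zero matrix, we have a 4M x 5M sparse matrix (optimization of memory storage in MATLAB),

and with $\mathbf{B} = (P_{k+1}, P_k, B_1, C_1, 1, \ldots, P_{k+1}, P_k, B_M, C_M, 1)$ we can compute, for every two-point group, the coefficients of the cubic splines for all the M trajectories:

$$\mathbf{C} = \text{Æ}\,\mathbf{B}$$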
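The block-diagonal structure can be sketched in plain Python. The presentation relies on MATLAB sparse storage; the version here is only an illustration, with hypothetical function names, showing that applying T to each 5-entry slice gives the same result as the full 4M x 5M product without storing any zero block.

```python
T = [
    [1, -1, -3, -1, -6],
    [0,  0,  1,  0,  3],
    [0,  0,  2,  1,  3],
    [0,  1,  0,  0,  0],
]


def block_diag(block, M):
    """Dense matrix with M copies of `block` on the diagonal, zeros elsewhere."""
    rows, cols = len(block), len(block[0])
    G = [[0] * (cols * M) for _ in range(rows * M)]
    for m in range(M):
        for i in range(rows):
            for j in range(cols):
                G[m * rows + i][m * cols + j] = block[i][j]
    return G


def matvec(A, x):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def blockwise_matvec(block, x):
    """Same product as block_diag(block, M) * x, without storing the zeros."""
    cols = len(block[0])
    out = []
    for start in range(0, len(x), cols):
        out.extend(matvec(block, x[start:start + cols]))
    return out


if __name__ == "__main__":
    # One coordinate of a hypothetical vector for M = 3 trajectories.
    x = [2, 1, 0, 3, 1,   5, 4, -1, 2, 1,   7, 7, 2, 0, 1]
    print(blockwise_matvec(T, x))
```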



Computational complexity analysis

  • For every four-point group, the number of flops (floating point operations) needed to compute the spline coefficients of the M trajectories is:
  • for the Bezier curves (customized MATLAB script): 316M
  • for the Æ matrix-vector multiplication (upper estimate): 324M
  • Each trajectory has N four-point groups: the total number of flops for Step 1 is about 640MN
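As a quick check of the estimate, the count can be written as a small helper. The per-group costs 316 and 324 come from the slide; the function name is hypothetical, and the sample values M = 5000 and N = 1600 are the ones quoted for the final rendering example of the presentation.

```python
def step1_flops(M, N, bezier_per_traj=316, matvec_per_traj=324):
    """Estimated flops of Step 1: (316 + 324) flops per trajectory
    and per four-point group, for M trajectories and N groups."""
    return (bezier_per_traj + matvec_per_traj) * M * N


if __name__ == "__main__":
    print(step1_flops(5000, 1600))  # 640 * 5000 * 1600
```

For M = 5000 and N = 1600 this gives about 5.12 x 10^9 flops.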



A parallel distribution for splines

With P, the number of processes, a divisor of M, the method used is the distribution of M/P trajectories (the corresponding rows of the Æ matrix) to every process; no communication is involved.

The value of M is important for the RAM occupied on every computational node.
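The block distribution described above can be sketched as an index partition. The function name is hypothetical, and the code only computes which trajectories each process would own; it does not start any process.

```python
def split_trajectories(M, P):
    """Assign M trajectories to P processes in contiguous blocks of M/P.
    P must divide M, as assumed in the presentation."""
    if P <= 0 or M % P != 0:
        raise ValueError("P must be a positive divisor of M")
    size = M // P
    return [range(p * size, (p + 1) * size) for p in range(P)]


if __name__ == "__main__":
    for rank, block in enumerate(split_trajectories(5000, 8)):
        print(rank, block.start, block.stop - 1)
```

Because every block is independent, each process can run its part from start to finish without exchanging data with the others.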








linear execution for every process



Computing splines: hardware and software

  • Bezier curve and spline computations were run on:
  • Linux cluster IBM x330, dual-processor Pentium III 1.133 GHz, at CINECA (2003); C routines and MPI (for parallel startup and data distribution)
  • 2 nodes Windows2000 / Linux RedHat IBM x440, dual-processor Xeon 2.4 GHz with Hyper Threading, 2 GB RAM, at Riello (2003); MATLAB rel. 6.5 scripts (startup of simultaneous multiple engines)



Computing splines: performance results

Beowulf CINECA:

The measured speedup is nearly linear; for high values of P, the cost of distributing the data (which depends on M) among the processes becomes more significant.

X440 cluster:

With Intel HT technology, Win2k performs better than Linux (linear speedup)



Post-processing for splines

  • Now we want a fast method for computing the spline values on a finely sampled set of parameter ticks.
  • The CFD packages have some limits in the post-processing phase:
  • resolution based on the pre-processing mesh
  • rigid (when possible at all) load distribution among the available processors

For good graphic visualization, the interval between two data points should be divided into a suitable number of ticks.



Evaluating the splines

Let V + 1 be the number of ticks for the evaluation of each cubic spline; then the ticks are

$$(0, 1/V, 2/V, \ldots, (V-1)/V, 1)$$

and the computation uses their powers of degree 0, 1, 2 and 3. The value of a cubic at $t_0$ can be viewed as a dot product:

$$a t_0^3 + b t_0^2 + c t_0 + d = (a, b, c, d) \cdot (t_0^3, t_0^2, t_0, 1)$$

Let Ó be the pre-allocable constant 4 x (V+1) matrix:

$$
\text{Ó} = \begin{pmatrix}
0 & (1/V)^3 & \cdots & ((V-1)/V)^3 & 1 \\
0 & (1/V)^2 & \cdots & ((V-1)/V)^2 & 1 \\
0 & 1/V & \cdots & (V-1)/V & 1 \\
1 & 1 & \cdots & 1 & 1
\end{pmatrix}
$$
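A minimal sketch of the constant matrix and of the dot-product evaluation follows. The names `power_matrix` and `eval_cubic_on_ticks` are hypothetical, and the coefficients are scalars here, that is, one Cartesian coordinate of a spline.

```python
def power_matrix(V):
    """4 x (V+1) matrix whose rows hold t^3, t^2, t and 1
    for the ticks t = 0, 1/V, ..., 1. It can be built once and reused."""
    ticks = [i / V for i in range(V + 1)]
    return [
        [t ** 3 for t in ticks],
        [t ** 2 for t in ticks],
        list(ticks),
        [1.0] * (V + 1),
    ]


def eval_cubic_on_ticks(coeffs, powers):
    """Values of a t^3 + b t^2 + c t + d on every tick, each one computed
    as the dot product of (a, b, c, d) with a column of the power matrix."""
    a, b, c, d = coeffs
    n_ticks = len(powers[0])
    return [a * powers[0][j] + b * powers[1][j] + c * powers[2][j] + d * powers[3][j]
            for j in range(n_ticks)]


if __name__ == "__main__":
    powers = power_matrix(4)
    print(eval_cubic_on_ticks((1, 0, 0, 2), powers))  # samples of t^3 + 2
```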



An eulerian view

Let Â be the M x 4 matrix (each row holds a spline, one for each trajectory):

$$
\text{Â} = \begin{pmatrix}
a_1 & b_1 & c_1 & d_1 \\
a_2 & b_2 & c_2 & d_2 \\
\vdots & \vdots & \vdots & \vdots \\
a_M & b_M & c_M & d_M
\end{pmatrix}
$$

Then the M x (V+1) matrix product Õ = Â Ó contains in each row the values of a cubic between two data points, for all the M trajectories (eulerian method). Each product costs 21M(V+1) flops and there are 3N matrices Õ, so the total number of flops is 63NM(V+1).
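The eulerian product can be sketched with a plain matrix multiplication. The helper names are hypothetical and the coefficients are scalars (one coordinate); the presentation uses MATLAB's built-in matrix product, which is not reproduced here.

```python
def power_matrix(V):
    """4 x (V+1) matrix of the powers t^3, t^2, t, 1 on the ticks 0, 1/V, ..., 1."""
    ticks = [i / V for i in range(V + 1)]
    return [[t ** 3 for t in ticks], [t ** 2 for t in ticks],
            list(ticks), [1.0] * (V + 1)]


def matmul(A, B):
    """Plain product of an (m x k) matrix by a (k x n) matrix."""
    columns = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in A]


def eulerian_values(coeff_rows, V):
    """coeff_rows[m] = (a, b, c, d) of trajectory m for one two-point group.
    Row m of the result holds that cubic sampled on the V+1 ticks."""
    return matmul(coeff_rows, power_matrix(V))


if __name__ == "__main__":
    # Two hypothetical trajectories: t^3 and t + 2.
    print(eulerian_values([(1, 0, 0, 0), (0, 0, 1, 2)], 2))
```

One product of this kind is needed for each two-point group, which is where the count of 3N products comes from.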



A lagrangian view

Let Ï be the 3N x 4 matrix (each row holds a spline along one single trajectory):

$$
\text{Ï} = \begin{pmatrix}
a_1 & b_1 & c_1 & d_1 \\
a_2 & b_2 & c_2 & d_2 \\
\vdots & \vdots & \vdots & \vdots \\
a_{3N} & b_{3N} & c_{3N} & d_{3N}
\end{pmatrix}
$$

Then the 3N x (V+1) matrix product Ô = Ï Ó contains in each row the values of a cubic between two data points, for a single trajectory (lagrangian method). Each product costs 63N(V+1) flops and there are M matrices Ô, so the total number of flops is 63NM(V+1).
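The two views group the same dot products differently, so they produce the same values at the same total cost. A small sketch under stated assumptions: scalar coefficients (one coordinate), hypothetical helper names, and a layout `coeffs[m][g]` for trajectory m and two-point group g that is chosen here only for illustration.

```python
def power_matrix(V):
    """4 x (V+1) matrix of the powers t^3, t^2, t, 1 on the ticks 0, 1/V, ..., 1."""
    ticks = [i / V for i in range(V + 1)]
    return [[t ** 3 for t in ticks], [t ** 2 for t in ticks],
            list(ticks), [1.0] * (V + 1)]


def matmul(A, B):
    """Plain product of an (m x k) matrix by a (k x n) matrix."""
    columns = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in A]


def eulerian(coeffs, V):
    """One product per two-point group: result[g][m] = samples of trajectory m."""
    powers = power_matrix(V)
    n_groups = len(coeffs[0])
    return [matmul([traj[g] for traj in coeffs], powers) for g in range(n_groups)]


def lagrangian(coeffs, V):
    """One product per trajectory: result[m][g] = samples of group g."""
    powers = power_matrix(V)
    return [matmul(traj, powers) for traj in coeffs]


def total_flops(M, N, V):
    """Total count quoted on the slides for either view: 63 N M (V + 1)."""
    return 63 * N * M * (V + 1)


if __name__ == "__main__":
    coeffs = [[(m + 1, g, 0, 1) for g in range(3)] for m in range(2)]
    E, L = eulerian(coeffs, 4), lagrangian(coeffs, 4)
    print(all(E[g][m] == L[m][g] for m in range(2) for g in range(3)))
    print(total_flops(5000, 1600, 100))
```

With M = 5000, N = 1600 and V = 100, the values quoted for the final example of the presentation, the count is about 5.09 x 10^10 flops.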



Data distribution: eulerian case

With P, the number of processes, a divisor of 3N (the number of two-point groups), the method used is the distribution of 3N/P Â matrices to every process; no communication is involved.

The value of N is important for the total computation time, N and M for the RAM allocation of each process.








Data distribution: lagrangian case

With P, the number of processes, a divisor of M (the number of trajectories), the method used is the distribution of M/P Ï matrices to every process; no communication is involved.

The value of N is important for the total computation time, N and M for the RAM allocation of each process.
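The lagrangian distribution can be sketched as independent chunks of trajectories, each evaluated without exchanging data. In this illustration a thread pool from the Python standard library stands in for the simultaneous MATLAB engines of the presentation; that substitution, the function names and the scalar coefficients are assumptions, and Python threads will not reproduce the measured speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def power_matrix(V):
    """4 x (V+1) matrix of the powers t^3, t^2, t, 1 on the ticks 0, 1/V, ..., 1."""
    ticks = [i / V for i in range(V + 1)]
    return [[t ** 3 for t in ticks], [t ** 2 for t in ticks],
            list(ticks), [1.0] * (V + 1)]


def eval_trajectory(rows, powers):
    """Samples of every cubic (a, b, c, d) of one trajectory on all ticks."""
    columns = list(zip(*powers))
    return [[sum(c * p for c, p in zip(row, col)) for col in columns] for row in rows]


def eval_chunk(chunk, powers):
    """Work of one process: evaluate its own block of trajectories."""
    return [eval_trajectory(rows, powers) for rows in chunk]


def eval_distributed(trajectories, V, P):
    """Split M trajectories into P blocks of M/P and evaluate them independently."""
    M = len(trajectories)
    if P <= 0 or M % P != 0:
        raise ValueError("P must be a positive divisor of M")
    size = M // P
    powers = power_matrix(V)  # built once, shared read-only by all workers
    chunks = [trajectories[p * size:(p + 1) * size] for p in range(P)]
    with ThreadPoolExecutor(max_workers=P) as pool:
        parts = list(pool.map(eval_chunk, chunks, [powers] * P))
    return [traj for part in parts for traj in part]


if __name__ == "__main__":
    trajectories = [[[m, 0, 1, g] for g in range(3)] for m in range(4)]
    print(eval_distributed(trajectories, 2, 2)[1][2])
```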








Hardware and software

Hardware: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB }, at Riello

Software: Windows2000 / Linux RH 8.1, MATLAB 6.5, parallelism through simultaneous MATLAB engines

  • for matrix multiplication, MATLAB 6.5 uses internal LAPACK Level 3 BLAS routines (good performance)
  • the Ó matrix is computed only once (for a uniform and constant sampling interval); its values probably stay cached during the matrix multiplications



Performance results

Performance of a single MATLAB process for the Ó product with V = 100; as the theory predicts, the execution time is linear in M.

Performance of the multiprocess products (case 3N = 4200P); for $P \le 8$, the total computation time depends on NM (Gustafson's law), as expected.
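Gustafson's law, in its usual form, gives the scaled speedup S(P) = P - s(P - 1), where s is the serial fraction of the run time on P processes. The case 3N = 4200P is a scaled workload of this kind, because the amount of work grows with P. A minimal sketch, with a hypothetical function name and illustrative serial fractions that are not measurements from the presentation:

```python
def gustafson_speedup(P, serial_fraction):
    """Scaled speedup S(P) = P - s * (P - 1) for a workload that grows with P;
    s is the serial fraction of the parallel run time."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return P - serial_fraction * (P - 1)


if __name__ == "__main__":
    for s in (0.0, 0.1):
        print(s, [gustafson_speedup(P, s) for P in (1, 2, 4, 8)])
```

With no communication and a negligible serial part (s close to 0), the scaled speedup is close to P, which is consistent with the linear behaviour reported up to P = 8.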



Performance results: considerations

  • Linear speedup up to P = 8 (the number of virtual Hyper Threading processors); for $P \ge 8$, reallocations of RAM and caches have a negative effect
  • For large data sets, the amount of RAM in the cluster nodes is a critical factor, while CPU performance is good thanks to the LAPACK routines
  • First results with a technique using “global M-N” matrices, an MPI-multithreaded version of MATLAB (Cornell Toolbox), and parallel matrix multiplication algorithms show an overhead due to communication in the case of large data sets



Performance results: Hyper Threading

Performance of the Intel Hyper Threading Technology of Xeon processors; the vertical unit is the execution time in the case of 8 processes (M = 5000, 3N = 4200P); up to 8 processes, the time appears to be quadratic in the number of processes.

  • Similar results have been obtained:
  • using Win2k or Linux
  • using High Performance Linpack benchmarking




red = trajectory computation with V = 100; black = least squares method, 3rd-degree polynomials; gray = data points from simulation

Forced injection of air into the combustion head; the ribbons show some particle trajectories; data points from simulation, path computation with V = 100, M = 5000, N = 1600, P = 8; computation and rendering by MATLAB; total computation time 85 s