High performance computing for a family of smooth trajectories using parallel environments

Bologna, March 23 - 26, 2004

Gianluca Argentini

Advanced Computing Laboratory

[email protected]


Introduction - 1

  • The company:

    • heating and air-conditioning products

    • development and production of residential and industrial burners

    • a Center of Excellence for the study of combustion and flame processes

    • R&D department with extensive use of CAD (CATIA, IBM/Dassault Systèmes) and FLUENT computations

Introduction - 2

  • Industrial and power burners have particular requirements:

    • customized study of the combustion head

    • accurate study of the combustion chamber geometry (flame shape, flow of gas or oil and of oxygen)

    • ventilation and air-circulation fans for correct oxygen supply, proper pressurization and continuous cooling

    • reduction of vibrations and noise

Introduction - 3

  • Rapid prototyping of the optimal shape of the combustion head and combustion chamber involves Computational Fluid Dynamics (CFD):

    • tracing the streamlines of air or gas flow particles

    • shape of the flow in a generic geometry

  • High graphic resolution requires a large number of particle paths:

    • a heavy computational effort, both memory-intensive and CPU-intensive

    • distribution of the paths over a multiprocessor environment

The problem

Focus on the numerical simulation of flows (in the combustion head, in the combustion chamber or in the fan mechanism).

The large numerical output of the simulation is generated by Navier-Stokes models (FLUENT package) or Cellular Automaton models (MATLAB package).

  • From these data we want to obtain:

    • path tracking of fluid particles, useful for the customized design of combustion heads and chambers

    • smooth 3D visualization of particle trajectories, possibly with continuous slope and curvature (analytically: class $C^2$)

Treatment of the problem

  • Step 1. The data obtained from the simulation model are processed by an algorithm that computes the algebraic curves (cubic polynomials) associated with the particle paths:

    • block-data distribution for parallel computing

    • continuous reallocation in RAM is necessary

  • Step 2. Evaluation of the polynomials on a large set of parameter values for fine resolution:

    • very expensive CPU computation

    • sets of curves distributed over the processors, with no communication

(Diagram: Data → Algebraic curves → Massive computing)

Fitting the trajectories

From the simulation, a single particle trajectory is a set of 3D points, where:

  • S is the number of points

  • M is the number of trajectories

Interpolation of the points:

  • Bézier-like interpolation is not realistic in case of twisting or divergence of the velocity field

  • Chebyshev or least-squares approximations are too rigid for a customized application

  • polynomial fitting is simple but often shows spurious effects such as the Runge-Gibbs phenomenon

We think a spline-based technique is more useful.
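The spurious oscillations of plain polynomial fitting can be reproduced with the classic Runge example: interpolating $f(x) = 1/(1 + 25x^2)$ at equispaced nodes on $[-1, 1]$ with polynomials of increasing degree makes the error near the interval ends grow instead of shrink. This is a generic, self-contained demonstration with hypothetical helper names, not data from the burner simulations.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x (Lagrange form)."""
    total = 0.0
    for i, xi in enumerate(xs):
        w = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                w *= (x - xj) / (xi - xj)
        total += ys[i] * w
    return total

def runge(x):
    """Runge's function, the textbook case for equispaced interpolation."""
    return 1.0 / (1.0 + 25.0 * x * x)

def max_error(n, samples=2001):
    """Max |p_n(x) - f(x)| on a fine grid of [-1, 1], p_n through n+1 equispaced nodes."""
    xs = [-1.0 + 2.0 * i / n for i in range(n + 1)]
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(lagrange_eval(xs, ys, x) - runge(x)) for x in grid)

# Raising the degree from 5 to 10 increases the worst-case error.
print(max_error(5), max_error(10))
```

Piecewise cubics avoid this because the degree stays fixed at 3 no matter how many points the trajectory has.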

The spline-based algorithm

Let $S = 4N$: the path is divided into groups of four points.

For every group the points are interpolated by three cubic polynomials $u_k$, each imposing four analytical conditions:

  • passage through the point $P_k$, $1 \le k \le 3$

  • passage through the point $P_{k+1}$

  • continuous slope at $P_k$

  • continuous curvature at $P_k$

For smooth rendering and to avoid excessive twisting of the trajectories, the cubics $u_k$ are combined with the Bézier curve $b$ associated with the four points:

$$
v = \alpha b + \beta u_k, \qquad 0 < \alpha, \beta < 1
$$
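The Bézier curve $b$ of four control points is the standard cubic Bernstein combination; expanding it gives the power form $b = As^3 + Bs^2 + Cs + D$ in which its coefficients enter the spline formulas. The sketch below is plain Python with hypothetical function names (it is not the author's MATLAB script): it computes $A, B, C, D$ from $P_1, \dots, P_4$ and checks them against the Bernstein form.

```python
def bezier_bernstein(P, s):
    """Cubic Bezier curve with control points P[0..3] (3D tuples), Bernstein form."""
    w = ((1 - s) ** 3, 3 * s * (1 - s) ** 2, 3 * s ** 2 * (1 - s), s ** 3)
    return tuple(sum(w[i] * P[i][j] for i in range(4)) for j in range(3))

def bezier_power_coeffs(P):
    """Return (A, B, C, D) such that b(s) = A s^3 + B s^2 + C s + D."""
    P1, P2, P3, P4 = P
    A = tuple(-P1[j] + 3 * P2[j] - 3 * P3[j] + P4[j] for j in range(3))
    B = tuple(3 * P1[j] - 6 * P2[j] + 3 * P3[j] for j in range(3))
    C = tuple(-3 * P1[j] + 3 * P2[j] for j in range(3))
    D = tuple(P1)
    return A, B, C, D

def eval_power(coeffs, s):
    """Evaluate the power form with Horner's scheme: ((A s + B) s + C) s + D."""
    A, B, C, D = coeffs
    return tuple(((A[j] * s + B[j]) * s + C[j]) * s + D[j] for j in range(3))

# Illustrative control points (not simulation data).
P = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 3.0, 1.0), (4.0, 0.0, 2.0)]
print(bezier_power_coeffs(P))
```

The power form is what makes the later matrix formulations possible: once $A, B, C, D$ are known, every quantity is a linear combination of vectors.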

Finding the splines

We have chosen $\alpha = \beta = 0.5$.

Let $b = As^3 + Bs^2 + Cs + D$ ($0 \le s \le 1$) be the Bézier curve with control points $P_1, \dots, P_4$. For every spline $u_k = at^3 + bt^2 + ct + d$ ($0 \le t \le 1$) the coefficients are computed by the following system, valid for $2 \le k \le 3$ (for $k = 1$ the formulas are slightly different but of the same algebraic form; $a, b, c, d$ are 3-dimensional Cartesian vectors):

$$
\begin{aligned}
a &= P_{k+1} - P_k - 3B - C - 6\\
b &= B + 3\\
c &= 2B + C + 3\\
d &= P_k
\end{aligned}
\tag{1}
$$

A matrix for splines

The system (1) can be represented as $\boldsymbol{\chi} = T\boldsymbol{\beta}$ (matrix-vector multiplication), where

$$
\boldsymbol{\chi} = (a, b, c, d)^{\mathsf T}, \qquad
\boldsymbol{\beta} = (P_{k+1}, P_k, B, C, 1)^{\mathsf T}, \qquad
T = \begin{pmatrix}
1 & -1 & -3 & -1 & -6\\
0 & 0 & 1 & 0 & 3\\
0 & 0 & 2 & 1 & 3\\
0 & 1 & 0 & 0 & 0
\end{pmatrix}
$$

For every spline only the vector $(P_{k+1}, P_k, B, C, 1)$ changes; along a single trajectory it must be reassigned in RAM for every two-point group, after computing the corresponding Bézier curve.
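A minimal sketch of this matrix form, taking $T$ exactly as transcribed and assuming it is applied to each Cartesian coordinate separately (so the constants in its last column are added to every coordinate); the function names are hypothetical. The rows of $T$ sum to $(1, 0, 0, 0, 0)$ and the last row picks out $P_k$, so $a + b + c + d = P_{k+1}$ and $d = P_k$: the spline passes through both end points, which the check confirms.

```python
# chi = T * beta with beta = (P_{k+1}, P_k, B, C, 1) and chi = (a, b, c, d).
# Assumption: T acts on each Cartesian coordinate independently.
T = (
    (1, -1, -3, -1, -6),
    (0,  0,  1,  0,  3),
    (0,  0,  2,  1,  3),
    (0,  1,  0,  0,  0),
)

def spline_coeffs(P_next, P_k, B, C):
    """Return (a, b, c, d) as 3D tuples for u_k(t) = a t^3 + b t^2 + c t + d."""
    # One beta vector per coordinate x, y, z.
    betas = [(P_next[j], P_k[j], B[j], C[j], 1.0) for j in range(3)]
    return tuple(
        tuple(sum(w * v for w, v in zip(row, beta)) for beta in betas)
        for row in T
    )

def eval_cubic(coeffs, t):
    """Evaluate u_k at parameter t with Horner's scheme, per coordinate."""
    a, b, c, d = coeffs
    return tuple(((a[j] * t + b[j]) * t + c[j]) * t + d[j] for j in range(3))

# Arbitrary illustrative inputs (not simulation data).
P_k, P_next = (1.0, 2.0, 0.0), (2.0, 2.5, 1.0)
B, C = (0.5, -1.0, 2.0), (1.0, 0.0, -0.5)
coeffs = spline_coeffs(P_next, P_k, B, C)
print(eval_cubic(coeffs, 0.0), eval_cubic(coeffs, 1.0))
```

Keeping $T$ fixed and only rebuilding the input vector mirrors the remark that, for every spline, only that vector changes.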

A global matrix for splines

If we define the global block-diagonal matrix

$$
\text{Æ} = \begin{pmatrix}
T & 0 & \cdots & 0\\
0 & T & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & T
\end{pmatrix},
$$

with $0$ the $4 \times 5$ zero matrix, we have a $4M \times 5M$ sparse matrix (which optimizes memory storage in MATLAB). With $\mathbf{B} = (P_{k+1}, P_k, B_1, C_1, 1, \dots, P_{k+1}, P_k, B_M, C_M, 1)$ we can compute, for every two-point group, the coefficients of the cubic splines for all the M trajectories:

$$
\mathbf{C} = \text{Æ}\,\mathbf{B}
$$
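To see why the block-diagonal form is convenient, the sketch below builds Æ explicitly for a small M and checks that the full product equals applying $T$ separately to each 5-entry slice of the stacked vector, which is all the nonzero work a sparse product has to do. It assumes one Cartesian coordinate at a time and uses dense Python lists and hypothetical helper names; MATLAB's sparse storage is not reproduced.

```python
# Block-diagonal global matrix, one coordinate (assumption), dense lists for
# clarity; a sparse format would store only the M copies of T.
T = [
    [1, -1, -3, -1, -6],
    [0,  0,  1,  0,  3],
    [0,  0,  2,  1,  3],
    [0,  1,  0,  0,  0],
]

def block_diag(block, m):
    """Dense (r*m x c*m) matrix with m copies of `block` on the diagonal."""
    r, c = len(block), len(block[0])
    G = [[0] * (c * m) for _ in range(r * m)]
    for k in range(m):
        for i in range(r):
            for j in range(c):
                G[k * r + i][k * c + j] = block[i][j]
    return G

def matvec(A, x):
    """Plain dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def blockwise(block, x, m):
    """Apply `block` to each slice of x: the only nonzero work in Æ x."""
    c = len(block[0])
    out = []
    for k in range(m):
        out.extend(matvec(block, x[k * c:(k + 1) * c]))
    return out

M = 3
# Stacked vectors (P_{k+1}, P_k, B_m, C_m, 1) of M trajectories, illustrative values.
x = [2.0, 1.0, 0.5, 1.0, 1.0,
     3.0, 2.0, -1.0, 0.0, 1.0,
     0.0, 0.0, 2.0, 2.0, 1.0]
G = block_diag(T, M)
print(matvec(G, x) == blockwise(T, x, M))
```

The dense version touches $20M^2$ entries, while the blockwise one touches only $20M$, which is the memory and time saving that sparse storage provides.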

Computational complexity analysis

  • For every four-point group and for all the M trajectories, the number of flops (floating-point operations) needed to compute the spline coefficients is:

    • for the Bézier curves (customized MATLAB script): 316M

    • for the Æ matrix-vector multiplication (upper estimate): 324M

  • Each trajectory has N four-point groups, so the total number of flops for Step 1 is about 640MN
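The total follows by adding the two per-group contributions and multiplying by the N groups. As a purely illustrative order of magnitude, plugging in the values M = 5000 and N = 1600 used in the final example gives about five billion flops for Step 1:

$$
(316M + 324M)\,N = 640MN, \qquad 640 \cdot 5000 \cdot 1600 = 5.12 \times 10^{9}\ \text{flops}.
$$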

A parallel distribution for splines

With the number of processes P a divisor of M, the method distributes M/P trajectories (rows of the Æ matrix) to every process; no communication is involved.

The value of M determines the RAM occupation on every computational node.

(Diagram: the M trajectories split into blocks assigned to processes p1, p2, …, pP, with N along the other axis; linear execution for every process.)
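A minimal sketch of this static distribution, with hypothetical names: process $p$ receives the contiguous block of M/P trajectories starting at index $p \cdot M/P$. Because the blocks are disjoint, each process can run on its own block without exchanging data. The same helper would apply unchanged to splitting the 3N two-point groups in the Eulerian case.

```python
def block_ranges(M, P):
    """Trajectory indices assigned to each of P processes (P must divide M)."""
    if P <= 0 or M % P != 0:
        raise ValueError("P must be a positive divisor of M")
    size = M // P
    return [range(p * size, (p + 1) * size) for p in range(P)]

# Example with the sizes of the final example (M = 5000, P = 8).
for p, r in enumerate(block_ranges(5000, 8)):
    print(f"process p{p + 1}: trajectories {r.start}..{r.stop - 1}")
```

A contiguous split keeps each process's slice of the data in one piece of memory, which matters here since M drives the RAM occupation per node.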

Computing splines: hardware and software

  • Bézier curves and splines were computed on:

    • an IBM x330 Linux cluster with dual-processor Pentium III 1.133 GHz nodes, at CINECA (2003); C routines and MPI (for parallel startup and data distribution)

    • 2 IBM x440 nodes with Windows 2000 / Red Hat Linux, dual-processor Xeon 2.4 GHz with Hyper-Threading, 2 GB RAM, at Riello (2003); MATLAB 6.5 scripts (simultaneous startup of multiple engines)

Computing splines: performance results

CINECA Beowulf cluster:

The measured speedup is quasi-linear; for high values of P the cost of distributing the data (variable M) among processes becomes more significant.

x440 cluster:

With Intel HT technology, Windows 2000 shows better performance (linear speedup) than Linux.
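For reference, the usual definitions behind these statements (standard definitions, not stated on the slides), with $T_P$ the execution time on $P$ processes:

$$
S(P) = \frac{T_1}{T_P}, \qquad E(P) = \frac{S(P)}{P},
$$

so a linear speedup means $S(P) \approx P$, i.e. an efficiency $E(P)$ close to 1.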

Post-processing for splines

  • We now want a fast method for computing the spline values on a finely sampled set of parameter ticks.

  • CFD packages have some limitations in the post-processing phase:

    • resolution tied to the pre-processing mesh

    • rigid load distribution (when available at all) among the available processors

For good graphic visualization, the interval between two data points should be divided into a suitable number of ticks.

Evaluating the splines

Let $V + 1$ be the number of ticks for each cubic spline evaluation; the ticks are then

$$
\left(0, \tfrac{1}{V}, \tfrac{2}{V}, \dots, \tfrac{V-1}{V}, 1\right)
$$

and the values of the spline parameter used in the computation are their powers of degree 0, 1, 2 and 3. The value of a cubic at $t_0$ can be viewed as a dot product:

$$
a t_0^3 + b t_0^2 + c t_0 + d = (a, b, c, d) \cdot (t_0^3, t_0^2, t_0, 1)
$$

Let Ó be the pre-allocable constant $4 \times (V+1)$ matrix:

$$
\text{Ó} = \begin{pmatrix}
0 & (1/V)^3 & \cdots & ((V-1)/V)^3 & 1\\
0 & (1/V)^2 & \cdots & ((V-1)/V)^2 & 1\\
0 & (1/V)^1 & \cdots & ((V-1)/V)^1 & 1\\
1 & 1 & \cdots & 1 & 1
\end{pmatrix}
$$
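A small sketch of the tick matrix and of evaluating one coordinate of a cubic through it, in plain Python with hypothetical names (the original works with MATLAB matrix products). Row $i$ of Ó holds the $(3-i)$-th powers of the ticks, so the row vector $(a, b, c, d)$ times Ó gives the cubic at every tick at once.

```python
def ticks_matrix(V):
    """4 x (V+1) matrix: rows are the 3rd, 2nd, 1st and 0th powers of k/V."""
    ticks = [k / V for k in range(V + 1)]
    # 0.0 ** 0 == 1.0 in Python, so the last row is all ones as required.
    return [[t ** p for t in ticks] for p in (3, 2, 1, 0)]

def eval_on_ticks(coeffs, O):
    """Row vector (a, b, c, d) times O: one scalar cubic evaluated at all ticks."""
    return [sum(coeffs[i] * O[i][k] for i in range(4)) for k in range(len(O[0]))]

O = ticks_matrix(4)  # ticks 0, 0.25, 0.5, 0.75, 1
values = eval_on_ticks((2.0, -1.0, 0.5, 3.0), O)
print(values)
```

Since Ó depends only on V, it can be built once and reused for every spline, which is what makes it pre-allocable.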

An Eulerian view

Let Â be the $M \times 4$ matrix whose rows are the splines of the M trajectories (one spline per trajectory):

$$
\text{Â} = \begin{pmatrix}
a_1 & b_1 & c_1 & d_1\\
a_2 & b_2 & c_2 & d_2\\
\vdots & \vdots & \vdots & \vdots\\
a_M & b_M & c_M & d_M
\end{pmatrix}
$$

Then the $M \times (V+1)$ matrix product Õ = Â Ó contains in each row the values of a cubic between two data points, for all the M trajectories (Eulerian method). The product costs $21M(V+1)$ flops and there are $3N$ matrices Õ, so the total number of flops is $63NM(V+1)$.
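One plausible reading of the count $21M(V+1)$, offered as an assumption since the slide does not break it down: each entry of Â Ó is a dot product of length 4, i.e. 4 multiplications and 3 additions, and the product is repeated for the three Cartesian coordinates:

$$
\underbrace{(4 + 3)}_{\text{flops per entry}} \cdot \underbrace{M(V+1)}_{\text{entries}} \cdot 3 = 21M(V+1), \qquad 21M(V+1) \cdot 3N = 63NM(V+1).
$$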

A Lagrangian view

Let Ï be the $3N \times 4$ matrix whose rows are the successive splines along one single trajectory:

$$
\text{Ï} = \begin{pmatrix}
a_1 & b_1 & c_1 & d_1\\
a_2 & b_2 & c_2 & d_2\\
\vdots & \vdots & \vdots & \vdots\\
a_{3N} & b_{3N} & c_{3N} & d_{3N}
\end{pmatrix}
$$

Then the $3N \times (V+1)$ matrix product Ô = Ï Ó contains in each row the values of a cubic between two data points, for a single trajectory (Lagrangian method). The product costs $63N(V+1)$ flops and there are M matrices Ô, so the total number of flops is again $63NM(V+1)$.
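The two views perform the same dot products, only grouped differently. The sketch below (one coordinate, plain Python lists, hypothetical names, with `coeffs[m][j]` holding the coefficients $(a, b, c, d)$ of spline $j$ on trajectory $m$) builds both the $3N$ Eulerian products and the $M$ Lagrangian products and checks that they contain the same values. Only the shape of the data each product, and hence each process, must hold changes.

```python
def ticks_matrix(V):
    """4 x (V+1) matrix of the 3rd, 2nd, 1st and 0th powers of the ticks k/V."""
    ticks = [k / V for k in range(V + 1)]
    return [[t ** p for t in ticks] for p in (3, 2, 1, 0)]

def matmul(X, Y):
    """Dense product: each row of X (length 4) times the 4 x n matrix Y."""
    return [[sum(x * Y[i][k] for i, x in enumerate(row)) for k in range(len(Y[0]))]
            for row in X]

def eulerian(coeffs, O):
    """For each spline index j: the M x 4 matrix Â_j times O (3N products)."""
    M, S = len(coeffs), len(coeffs[0])
    return [matmul([coeffs[m][j] for m in range(M)], O) for j in range(S)]

def lagrangian(coeffs, O):
    """For each trajectory m: the 3N x 4 matrix Ï_m times O (M products)."""
    return [matmul(list(spline_rows), O) for spline_rows in coeffs]

# Tiny illustrative case: M = 2 trajectories, 3N = 3 splines each, V = 4.
coeffs = [
    [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 2.0)],
    [(2.0, -1.0, 0.5, 3.0), (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 5.0)],
]
O = ticks_matrix(4)
E = eulerian(coeffs, O)    # E[j][m]: values of spline j on trajectory m
L = lagrangian(coeffs, O)  # L[m][j]: the same values, grouped by trajectory
print(E[1][1], L[1][1])
```

Since the arithmetic is identical, the choice between the two views is driven by memory layout and distribution (M-sized versus 3N-sized blocks), not by flop count.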

Data distribution: Eulerian case

With the number of processes P a divisor of 3N (the number of two-point groups), the method distributes 3N/P Â matrices to every process; no communication is involved.

The value of N determines the total computation time, while N and M together determine the RAM allocation of each process.

(Diagram: data layout with the 3N dimension associated with CPU and the M dimension with RAM.)

Data distribution: Lagrangian case

With the number of processes P a divisor of M (the number of trajectories), the method distributes M/P Ï matrices to every process; no communication is involved.

The value of N determines the total computation time, while N and M together determine the RAM allocation of each process.

(Diagram: data layout with the 3N dimension associated with RAM and the M dimension with CPU.)

Hardware and software

Hardware: 2 × {IBM x440, 2 Xeon 2.4 GHz HT, 2 GB RAM}, at Riello

Software: Windows 2000 / Linux RH 8.1, MATLAB 6.5, parallelism through simultaneous MATLAB engines

  • for matrix multiplication, MATLAB 6.5 uses internal LAPACK and Level 3 BLAS routines (good performance)

  • the Ó matrix is computed only once (for a uniform and constant sampling interval), and its values probably remain cached during the matrix multiplications

Performance results

Performance of a single MATLAB process for the products with Ó, with V = 100: as predicted by theory, the execution time is linear in M.

Performance of the multiprocess products (case 3N = 4200P): for $P \le 8$ the total computation time depends on NM (Gustafson's law), as expected.
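With $3N = 4200P$ the problem size grows with the number of processes, which is the weak-scaling setting of Gustafson's law. Assuming perfect load balance and no communication, the work per process derived from the flop count $63NM(V+1)$ does not depend on $P$ for fixed M and V:

$$
\frac{63NM(V+1)}{P} = 21 \cdot \frac{3N}{P} \cdot M(V+1) = 21 \cdot 4200 \cdot M(V+1) = 88\,200\, M(V+1).
$$

Ideally, then, the wall-clock time stays constant while the total work grows linearly with $P$. Gustafson's scaled speedup $S(P) = P - s\,(P - 1)$, with $s$ the serial fraction of the run, reduces to $S(P) = P$ when $s = 0$.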

Performance results: considerations

  • Linear speedup up to P = 8 (the number of virtual Hyper-Threading processors); for $P \ge 8$ RAM and cache reallocations have a negative effect

  • For large data sets, the amount of RAM in the cluster nodes is a critical factor, while CPU performance is good thanks to the LAPACK routines

  • First results with a technique using “global M-N” matrices, an MPI/multithreaded version of MATLAB (Cornell Toolbox) and parallel matrix multiplication algorithms show a communication overhead for large data sets

Performance results: Hyper-Threading

Performance of the Intel Hyper-Threading Technology of the Xeon processors; the vertical unit is the execution time in the case of 8 processes (M = 5000, 3N = 4200P). Up to 8 processes, the time appears to be quadratic in the number of processes.

  • Similar results have been obtained:

    • using Windows 2000 or Linux

    • using the High Performance Linpack benchmark

Examples

Red: trajectory computed with V = 100; black: least-squares method with third-degree polynomials; gray: data points from the simulation.

Forced injection of air into the combustion head; the ribbons show some particle trajectories. Data points come from the simulation; paths computed with V = 100, M = 5000, N = 1600, P = 8; computation and rendering in MATLAB; total computation time 85 s.

Thanks

