Loading in 5 sec....

High performance computing for a family of smooth trajectories using parallel environmentsPowerPoint Presentation

High performance computing for a family of smooth trajectories using parallel environments

Download Presentation

High performance computing for a family of smooth trajectories using parallel environments

Loading in 2 Seconds...

- 253 Views
- Uploaded on
- Presentation posted in: Internet / Web

High performance computing for a family of smooth trajectories using parallel environments

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

High performance computing for a family of smooth trajectories using parallel environments

Bologna, March 23 - 26, 2004

Gianluca Argentini

Advanced Computing Laboratory

gianluca.argentini@riellogroup.com

Introduction - 1

- The company:
- products for heating and conditioning
- development and production of residential and industrial burners
- presence of a Center of Excellence for the study of combustion and flame processes
- R&D Department, extensive CAD (Catia from IBM-Dassault Systemes) and FLUENT computations

1

Introduction - 2

- Industrial and power burners have particular requirements:
- customized study of combustion head
- study of accurate geometry of combustion chamber (shape of the flame, flow of gas or oil and oxygen)
- ventilation and air circulation fans for a correct oxygen supply, right pressurization and continous cooling
- reduction of vibrations and noise

2

Introduction - 3

- Rapid prototyping for optimal shape of combustion head and combustion chamber involves Computational Fluid Dynamics:
- tracing of air or gas flow particles streamlines
- shape of the flow in a generic geometry

- High graphic resolution requires a large amount of particles paths:
- strong computational memory-expensive and cpu-based effort
- distribution of paths on a multiprocessor environment

3

The problem

Focus on numerical simulation of flows (in combustion head, chamber or in fans mechanism)

The large numerical output of simulation is generated by Navier-Stokes (use of FLUENT package) or Cellular Automaton models (MATLAB package)

- From data, we would obtain:
- path tracking of fluid particles, useful for customized design of combustion heads and chambers
- smooth 3D visualization of particles trajectories, possibly with continuous slope and curvature (analitically: class C2)

4

About problem treatment

- Step 1. The data obtained from simulation model are treated by an algorithm for the computation of algebric curves (cubic polynomials) associated to particles paths:
- block-data distribution for parallel computing
- necessity of continuous reallocation in RAM
- Step 2. Evaluation of polynomials on a large set of values for fine resolution:
- very expensive CPU computation
- sets of curves distribution on processors, no communication

Data

Algebric curves

Massive Computing

5

Fitting the trajectories

From simulation, a single particle trajectory is a set of 3D points:

- S is the number of points

- M is the number of trajectories

Interpolation of the points:

- Bezier-like is not realistic in case of twist or divergence of speeds field

- Chebychev or Least-Squares-like are too rigid for a customized application

- polinomial fitting is simple but often shows spurious effects as Runge-Gibbs phenomenon

We think a splines-based technique is more useful

6

The splines-based algorithm

Let S = 4 x N : path is divided into four-points groups

For every group the points are interpolated by three cubic polynomials imposing four analytical conditions:

- passage at Pk point, 1 £ k £ 3

- passage at Pk+1 point

- continuous slope at Pk point

- continuous curvature at Pk point

For smooth rendering and for avoiding excessive twisting of trajectories, the cubics uk are added to the Bezier curve b associated to the four points:

v = ab + buk 0 < a, b < 1

7

Finding the splines

We have choosea = b = 0.5

Let b = As3 + Bs2 + Cs + D (0 £ s £ 1) the Bezier curve of control points P1,…,P4; for every spline uk = at3 + bt2 + ct + d (0 £ t £ 1) the coefficients are computed by (2 £ k £ 3, for k = 1 the formulas are slightly different but of the same algebraic form; a, b, c, d are 3-dimensional cartesian vector)

a = Pk+1 - Pk - 3B - C - 6

b = B + 3 (1)

c = 2B + C + 3

d = Pk

8

A matrix for splines

The system (1) can be represented asc = T b (matrix-vector multiplication) where

c = (a, b, c, d)

b = (Pk+1, Pk, B, C, 1)

1 -1 -3 -1 -6

T = 0 0 1 0 3

0 0 2 1 3

0 1 0 0 0

For every spline, only the vector b is variable; for a single trajectory, it must be reassigned in RAM every group of two points, after the computation of the relative Bezier curve.

9

A global matrix for splines

If we define a global matrixÆas

T 0 . . . 0

with0as 4 x 5 zero-matrix, we have a 4M x 5Msparse matrix (optimization of memory storage in MATLAB)

0 T . . . 0

Æ=

.

.

0 0 . . 0 T

and with B = (Pk+1, Pk, B1, C1, 1, . . ., Pk+1, Pk, BM, CM, 1) we can compute for every two-points group the coefficients of cubic splines for all the M trajectories:

C = Æ B

10

Computational complexity analysis

- Every four-points group, for the M trajectories the flops (floating point operations) number for computing the splines coefficients is:
- for Bezier curves (customized Matlab script): 316M
- for Æ matrix-vector multiplication (upper estimate):324M
- We have N groups of four-points at every trajectory: the total flops number of the Step 1is about 640MN

11

A parallel distribution for splines

With P, number of processes, divisor of M,the method used is the distribution of M/P trajectories (rows of Æ matrix) to every process; no communication is involved.

The value of M is important for the occupation of RAM at every computational node.

M

pP

.

.

p2

p1

N

linear execution for every process

12

Computing splines: hardware and software

- Bezier curves and splines computation on
- Linux cluster IBM x330, biprocessor Pentium III 1.133 GHz, at CINECA (2003); C routines and MPI (for parallel startup and data distribution)
- 2 nodes Windows2000 / Linux RedHat IBM x440, biprocessor Xeon 2.4 GHz Hyper Threading, 2 GB RAM, at Riello (2003); MATLAB rel. 6.5 scripts (startup of simultaneous multi-engine)

13

Computing splines: performance results

Beowulf CINECA:

The registered speedup is quasi-linear; for high value of P the amount of data distribution (M variable) among processes is more intrusive.

X440 cluster:

Better performances of Win2k (linear speedup) - compared with Linux - with Intel HT technology

14

Post-processing for splines

- Now we would a fast method for computing the splines values in a set of parameter ticks with fine sampling.
- The CFD packages have some limits in the post-processing phase:
- resolution based on pre-processing mesh
- rigid (when possible) load distribution among available processors

For good graphic visualization, the interval between two data-points might be divided in a suitable number of ticks:

15

Valuating the splines

Let V + 1 the number of ticks for each cubic spline valuation; then the ticks are

(0, 1/ V, 2/ V, . . ., (V -1)/ V , 1)

and the values of splines parameter in the computation are their (0, 1, 2, 3)-th degree powers. The value of a cubic at t0can be view as a dot product:

at03+ bt02+ ct0 + d = (a, b, c, d)·(t03, t02, t0, 1)

0 (1/ V)3 . . . . ((V -1)/ V)3 1

LetÓthe pre-allocable constant 4x(V+1) matrix:

0 (1/ V)2 . . . . ((V -1)/ V)2 1

0 (1/ V)1 . . . . ((V -1)/ V)1 1

1 1 . . . . 1 1

16

An eulerian view

LetÂthe M x 4 matrix (each row a spline for each trajectory):

a1 b1 c1 d1

a2 b2 c2 d2

. . . .

aM bM cM dM

Then the Mx (V+1) matrix productÕ= Â Ócontains in each row the values of a cubic between two data-points, for all the M trajectories (eulerian method). For the product, the flops are 21M(V+1), the number of matrices Õ is 3N; the total number of flops are 63NM(V+1).

17

A lagrangian view

LetÏthe 3N x 4 matrix (each row a spline along one single trajectory):

a1 b1 c1 d1

a2 b2 c2 d2

. . . .

a3N b3N c3N d3N

Then the 3Nx (V+1) matrix productÔ= Ï Ócontains in each row the values of a cubic between two data-points, for a single trajectory (lagrangian method). For the product, the flops are 63N(V+1), the number of matrices Ô is M; the total number of flops are 63NM(V+1).

18

Data distribution: eulerian case

With P, number of processes, divisor of 3N (amount of two-points groups),the method used is the distribution of 3N/P Â matrices to every process; no communication is involved.

The value of N is important for the total computation time, N and M for the RAM allocation of each process.

3N

CPU

. . . . .

M

RAM

19

Data distribution: lagrangian case

With P, number of processes, divisor of M (amount of trajectories),the method used is the distribution of M/PÏ matrices to every process; no communication is involved.

The value of N is important for the total computation time, N and M for the RAM allocation of each process.

3N

RAM

.

M

CPU

20

Hardware and software

Hardware: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB }, at Riello

Software: Windows2000 / Linux RH 8.1, MATLAB 6.5, parallelism of simultaneous Matlab engines

- for matrix multiplication, Matlab 6.5 uses internal LAPACK Level 3 BLAS routines (good performances)
- the Ó matrix is computed only one time (in case of uniform and costant sampling interval), its values are probably always cached during matrices multiplication

21

Performance results

Performances of a single Matlab process for the Â Óproduct with V = 100; as theory, the execution time is linear on M variable.

Performances of multiprocess products (case 3N = 4200P); for P £ 8, the total computation time depends on NM (Gustafson law), as expected.

22

Performance results: considerations

- Linear speedup until P=8 (= number of virtual Hyper Threads processors); for P³8 reallocations of RAM and caches have a negative effect

- For large data sets, the amount of RAM in the nodes of cluster is a critical factor, while the CPUs performances are good with the use of LAPACK routines

- First results with a technique using “global M-N” matrices, an MPI-multithreads version of MATLAB (Cornell Toolbox), and parallel matrix multiplication algorithms, show an overhead, in case of large data, due to communications

23

Performance results: Hyper Threading

Performance of Intel Hyper Threading Technology of Xeon processors; the vertical unit is time execution in the case of 8 processes (M=5000,3N = 4200P); until 8, the time seems to be quadratic on processes number.

- Similar results have been obtained
- using Win2k or Linux
- using High Performance Linpack benchmarking

24

Examples

red = trajectory computation with V = 100; black = least squares method, 3° degree polynomials; gray = data-points from simulation

Forced injection of air in combustion head; the ribbons show some particles trajectories; data-points from simulation, paths computation with V=100, M=5000, N=1600, P=8; computation and rendering by Matlab; total computation time 85 secs

Thanks