preconditioned iterative linear solvers for unstructured grids on the earth simulator

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Kengo Nakajima

[email protected]

Supercomputing Division, Information Technology Center,

The University of Tokyo

Japan Agency for Marine-Earth Science and Technology

April 22, 2008

overview
Overview
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
large scale computing
Large-Scale Computing
  • Scalar Processor
    • Big gap between clock rate and memory bandwidth.
    • Very low sustained/peak performance ratio (<10%)


scalar processors cpu cache memory hierarchical structure
Scalar Processors: CPU-Cache-Memory, Hierarchical Structure

[Figure: memory hierarchy. Registers and cache sit close to the CPU: FAST, expensive, small capacity (MB), large transistor budget (~100M transistors). Main memory: SLOW, cheap, large capacity (GB).]

large scale computing1
Large-Scale Computing
  • Scalar Processor
    • Big gap between clock rate and memory bandwidth.
    • Very low sustained/peak performance ratio (<10%)
  • Vector Processor
    • Very high sustained/peak performance ratio
      • e.g. 35 % for FEM applications on the Earth Simulator
    • requires …
      • very special tuning
      • sufficiently long loops (= large-scale problem size) for certain performance
    • Suitable for simple computation.


vector processors vector register fast memory
Vector Processors: Vector Register & Fast Memory
  • Parallel Processing for Simple DO Loops.
  • Suitable for Simple & Large Computation

    do i= 1, N
      A(i)= B(i) + C(i)
    enddo

[Figure: vector processor with vector registers and a very fast main memory.]

large scale computing2
Large-Scale Computing
  • Scalar Processor
    • Big gap between clock rate and memory bandwidth.
    • Very low sustained/peak performance ratio (<10%)
  • Vector Processor
    • Very high sustained/peak performance ratio
      • e.g. 35 % for FEM applications on the Earth Simulator
    • requires …
      • very special tuning
      • sufficiently long loops (= large-scale problem size) for certain performance
    • Suitable for simple computation.


slide8

                              Earth        Hitachi       IBM SP3      Hitachi
                              Simulator    SR8000/MPP    (LBNL)       SR11000/J2
                                           (U.Tokyo)                  (U.Tokyo)
Peak Performance (GFLOPS)     8.00         1.80          1.50         9.20
Measured Memory BW (STREAM)
 (GB/sec/PE)                  26.6         2.85          0.623        8.00
Estimated Performance/Core
 GFLOPS (% of peak)           2.31-3.24    .291-.347     .072-.076    .880-.973
                              (28.8-40.5)  (16.1-19.3)   (4.8-5.1)    (9.6-10.6)
Measured Performance/Core     2.93         .335          .122         1.34
                              (36.6)       (18.6)        (8.1)        (14.5)
BYTE/FLOP                     3.325        1.583         0.413        0.870
Comm. BW (GB/sec/Node)        12.3         1.60          1.00         12.0
MPI Latency (μsec)            5.6-7.7      6-20          16.3         4.7 *

* IBM p595, J.T. Carter et al.

typical behavior
Typical Behavior …

[Figure: sustained performance vs. problem size on a single node.]
  • IBM-SP3: about 8 % of peak; performance is good for small problems due to the cache effect.
  • Earth Simulator: about 40 % of peak; performance is good for large-scale problems due to long vector length.

parallel computing strong scaling fixed prob size
Parallel Computing: Strong Scaling (Fixed Prob. Size)

[Figure: performance vs. PE# compared with the ideal line.]
  • IBM-SP3: super-scalar effect for a small number of PEs; performance decreases for many PEs due to communication overhead.
  • Earth Simulator: performance decreases for many PEs due to communication overhead and short vector length.

improvement of memory performance on ibm sp3 hitachi sr11000 j2
Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2

[Figure: effect of memory performance (bandwidth, latency, etc.); ■ Flat-MPI/DCRS, □ Hybrid/DCRS.]
  • IBM SP-3, POWER3: 375 MHz, 1.0 GB/sec, 8 MB L2 cache/PE
  • Hitachi SR11000/J2, POWER5+: 2.3 GHz, 8.0 GB/sec, 18 MB L3 cache/PE

my history with vector computers
My History with Vector Computers
  • Cray-1S (1985-1988)
    • Mitsubishi Research Institute (MRI)
  • Cray-YMP (1988-1995)
    • MRI, University of Texas at Austin
  • Fujitsu VP, VPP series (1985-1999)
    • JAERI, PNC
  • NEC SX-4 (1997-2001)
    • CCSE/JAERI
  • Hitachi SR2201 (1997-2004)
    • University of Tokyo, CCSE/JAERI
  • Hitachi SR8000 (2000-2007)
    • University of Tokyo
  • Earth Simulator (2002-)
slide13
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
geofem fy 1998 2002 http geofem tokyo rist or jp
GeoFEM: FY.1998-2002, http://geofem.tokyo.rist.or.jp/
  • Parallel FEM platform for solid earth simulation.
    • parallel I/O, parallel linear solvers, parallel visualization
    • solid earth: earthquake, plate deformation, mantle/core convection, etc.
  • Part of national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
  • Strong collaborations between HPC and natural science (solid earth) communities.
results on solid earth simulation
Results on Solid Earth Simulation

[Figure: example results. Complicated plate model around the Japan Islands; magnetic field of the Earth (MHD code, Dh=5.00 and Dh=1.25); simulation of the earthquake generation cycle in Southwestern Japan (T=100 to T=500); transportation by groundwater flow through heterogeneous porous media; TSUNAMI !!]

features of fem applications 1 2
Features of FEM applications (1/2)
  • Local "element-by-element" operations
    • sparse coefficient matrices
    • suitable for parallel computing
  • HUGE "indirect" accesses
    • IRREGULAR sparse matrices
    • memory intensive

    do i= 1, N
      jS= index(i-1) + 1
      jE= index(i)
      do j= jS, jE
        in= item(j)
        Y(i)= Y(i) + AMAT(j)*X(in)
      enddo
    enddo
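For reference, here is a minimal self-contained version of this compressed-row (CRS) matrix-vector loop; the driver program, sizes and sample values below are illustrative assumptions, not GeoFEM data:

    ! Minimal CRS sparse matrix-vector product, Y = A*X.
    ! index(0:N) holds row pointers, item(:) the column indices, AMAT(:) the values.
    program crs_matvec_demo
      implicit none
      integer, parameter :: N = 4, NNZ = 8
      integer :: index(0:N) = (/0, 2, 4, 6, 8/)
      integer :: item(NNZ)  = (/1, 2, 1, 2, 3, 4, 3, 4/)
      real(8) :: AMAT(NNZ)  = (/2.d0,-1.d0,-1.d0,2.d0, 2.d0,-1.d0,-1.d0,2.d0/)
      real(8) :: X(N) = 1.d0, Y(N) = 0.d0
      integer :: i, j, jS, jE, in

      do i= 1, N
        jS= index(i-1) + 1
        jE= index(i)
        do j= jS, jE
          in= item(j)
          Y(i)= Y(i) + AMAT(j)*X(in)
        enddo
      enddo
      write(*,*) Y          ! expected: 1.0  1.0  1.0  1.0
    end program crs_matvec_demo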

features of fem applications 2 2
Features of FEM applications (2/2)
  • In parallel computation …
    • communication with ONLY neighbors (except "dot products" etc.)
    • the amount of messages is relatively small because only values on the domain boundary are exchanged.
    • communication (MPI) latency is critical
slide20
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
earth simulator es http www es jamstec go jp
Earth Simulator (ES), http://www.es.jamstec.go.jp/
  • 640×8= 5,120 Vector Processors
    • SMP Cluster-Type Architecture
    • 8 GFLOPS/PE
    • 64 GFLOPS/Node
    • 40 TFLOPS/ES
  • 16 GB Memory/Node, 10 TB/ES
  • 640×640 Crossbar Network
    • 16 GB/sec×2
  • Memory bandwidth: 32 GB/sec/PE
  • 35.6 TFLOPS for LINPACK (March 2002)
  • 26 TFLOPS for AFES (Climate Simulation)
motivations
Motivations
  • GeoFEM Project (FY.1998-2002)
  • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)
    • Implicit Linear Solvers
    • Hybrid vs. Flat MPI Parallel Programming Model
flat mpi vs hybrid
Flat MPI vs. Hybrid

[Figure: SMP nodes, each with its own memory and eight PEs.]
  • Flat-MPI: each PE is an independent MPI process.
  • Hybrid: hierarchical structure (MPI between SMP nodes, threads inside each node).

slide24
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
direct iterative methods for linear equations
Direct/Iterative Methods for Linear Equations
  • Direct Methods
    • Gaussian Elimination / LU Factorization
      • compute A⁻¹ directly.
    • Robust for a wide range of applications.
    • More expensive than iterative methods (memory, CPU).
    • Not suitable for parallel and vector computation due to their global operations.
  • Iterative Methods
    • CG, GMRES, BiCGSTAB
    • Less expensive than direct methods, especially in memory.
    • Suitable for parallel and vector computing.
    • Convergence strongly depends on the problem and boundary conditions (condition number etc.)
      • Preconditioning is required
preconditioning for iterative methods
Preconditioning for Iterative Methods
  • The convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
  • A preconditioner M transforms the linear system into one with more favorable spectral properties.
    • In "ill-conditioned" problems, the "condition number" (ratio of max/min eigenvalue if A is symmetric) is large.
    • M transforms the original equation Ax=b into A'x=b', where A'=M⁻¹A and b'=M⁻¹b.
  • ILU (Incomplete LU Factorization) and IC (Incomplete Cholesky Factorization) are well-known preconditioners.
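To show where the preconditioner enters, here is a minimal preconditioned CG sketch (not GeoFEM's solver); for brevity the matrix is stored densely and M is just the diagonal of A (point-Jacobi), standing in for the ILU/IC preconditioners discussed here:

    ! Preconditioned CG sketch for SPD A: the preconditioner appears only in the
    ! solve M z = r; here M = diag(A) as a simple stand-in.
    subroutine pcg_jacobi(N, A, b, x, tol, maxit)
      implicit none
      integer, intent(in)  :: N, maxit
      real(8), intent(in)  :: A(N,N), b(N), tol
      real(8), intent(out) :: x(N)
      real(8) :: r(N), z(N), p(N), q(N), rho, rho_old, alpha, beta
      integer :: iter, i

      x = 0.d0
      r = b                              ! r = b - A x  (x = 0)
      rho_old = 1.d0
      do iter = 1, maxit
        do i = 1, N                      ! preconditioning step: z = M^{-1} r
          z(i) = r(i) / A(i,i)
        enddo
        rho = dot_product(r, z)
        if (iter == 1) then
          p = z
        else
          beta = rho / rho_old
          p = z + beta * p
        endif
        q = matmul(A, p)                 ! q = A p
        alpha = rho / dot_product(p, q)
        x = x + alpha * p
        r = r - alpha * q
        if (sqrt(dot_product(r, r)) < tol) return
        rho_old = rho
      enddo
    end subroutine pcg_jacobi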
strategy in geofem
Strategy in GeoFEM
  • Iterative methods are the ONLY choice for large-scale parallel computing.
  • Preconditioning is important
    • general methods, such as ILU(0)/IC(0), cover a wide range of applications.
    • problem-specific methods.
slide28
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
block ic 0 cg solver on the earth simulator
Block IC(0)-CG Solver on the Earth Simulator
  • 3D Linear Elastic Problems (SPD)
  • Parallel Iterative Linear Solver
    • Node-based Local Data Structure
    • Conjugate Gradient Method (CG): SPD
    • Localized Block IC(0) Preconditioning (Block Jacobi)
      • Modified IC(0): off-diagonal components of the original [A] are kept
    • Additive Schwarz Domain Decomposition (ASDD)
    • http://geofem.tokyo.rist.or.jp/
  • Hybrid Parallel Programming Model
    • OpenMP+MPI
    • Re-Ordering for Vector/Parallel Performance
    • Comparison with Flat MPI
flat mpi vs hybrid1
Flat MPI vs. Hybrid

[Figure: SMP nodes with their memories and eight PEs each, repeated from the earlier slide.]
  • Flat-MPI: each PE is an independent MPI process.
  • Hybrid: hierarchical structure (MPI between SMP nodes, threads inside each node).

local data structure node based partitioning internal nodes elements external nodes
Local Data Structure: Node-based Partitioning
internal nodes - elements - external nodes

[Figure: a small 2D mesh partitioned among PE#0-PE#3; each partition stores its internal nodes, the elements that include them, and the external nodes on neighboring partitions referenced by those elements.]
basic strategy for parallel programming on the earth simulator
Basic Strategy for Parallel Programming on the Earth Simulator
  • Hypothesis
    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.
ilu 0 ic 0 factorization
ILU(0)/IC(0) Factorization

    do i= 2, n
      do k= 1, i-1
        if ((i,k) ∈ NonZero(A)) then
          aik := aik / akk
        endif
        do j= k+1, n
          if ((i,j) ∈ NonZero(A)) then
            aij := aij - aik*akj
          endif
        enddo
      enddo
    enddo

ilu ic preconditioning
ILU/IC Preconditioning

    M = (L+D) D⁻¹ (D+U)

We need to solve M z = r, split as (L+D) z1 = r and D⁻¹(D+U) z = z1:

  Forward substitution:   (L+D) z1 = r  :  z1 = D⁻¹(r - L z1)
  Backward substitution:  (I + D⁻¹U) z_new = z1  :  z = z - D⁻¹U z

ilu ic preconditioning1
ILU/IC Preconditioning

    M = (L+D) D⁻¹ (D+U),   L, D, U taken from A

Forward Substitution:  (L+D) z = r  :  z = D⁻¹(r - L z)

    do i= 1, N
      WVAL= R(i)
      do j= 1, INL(i)
        WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
      enddo
      Z(i)= WVAL / D(i)
    enddo

Backward Substitution:  (I + D⁻¹U) z_new = z_old  :  z = z - D⁻¹U z

    do i= N, 1, -1
      SW= 0.0d0
      do j= 1, INU(i)
        SW= SW + AU(i,j) * Z(IAU(i,j))
      enddo
      Z(i)= Z(i) - SW / D(i)
    enddo

Dependency: you need the most recent value of "z" of the connected nodes, so vectorization/parallelization is difficult.

basic strategy for parallel programming on the earth simulator1
Basic Strategy for Parallel Programming on the Earth Simulator
  • Hypothesis
    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.
  • Re-Ordering for Highly Parallel/Vector Performance
    • Local operation and no global dependency
    • Continuous memory access
    • Sufficiently long loops for vectorization
ilu ic preconditioning2
ILU/IC Preconditioning

(Same forward/backward substitution loops as on the previous slide.)

Dependency: you need the most recent value of "z" of the connected nodes, so vectorization/parallelization is difficult.

Reordering: directly connected nodes do not appear on the RHS.

basic strategy for parallel programming on the earth simulator2
Basic Strategy for Parallel Programming on the Earth Simulator
  • Hypothesis
    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.
  • Re-Ordering for Highly Parallel/Vector Performance
    • Local operation and no global dependency
    • Continuous memory access
    • Sufficiently long loops for vectorization
  • 3-Way Parallelism for Hybrid Parallel Programming (see the sketch below)
    • Inter Node : MPI
    • Intra Node : OpenMP
    • Individual PE : Vectorization
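A minimal sketch of these three levels (an illustrative toy program, not GeoFEM source): MPI between SMP nodes, OpenMP across the PEs of a node, and a long dependency-free innermost loop that each PE can run on its vector pipelines:

    program hybrid_sketch
      use mpi
      implicit none
      integer, parameter :: N = 1000000
      real(8), allocatable :: x(:), y(:), a(:)
      integer :: ierr, myrank, i

      call MPI_Init(ierr)                          ! level 1: MPI (inter-node)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

      allocate(x(N), y(N), a(N))
      x = 1.d0;  y = 2.d0;  a = 0.d0

    !$omp parallel do                              ! level 2: OpenMP (intra-node)
      do i= 1, N                                   ! level 3: long, independent loop,
        a(i)= x(i) + 2.d0*y(i)                     !          vectorized on each PE
      enddo
    !$omp end parallel do

      if (myrank == 0) write(*,*) 'a(1)=', a(1)
      call MPI_Finalize(ierr)
    end program hybrid_sketch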
slide40
Re-Ordering Technique for Vector/Parallel Architectures
Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

  1. RCM (Reverse Cuthill-McKee)
  2. CMC (Cyclic Multicolor)
  3. DJDS re-ordering
  4. Cyclic DJDS for SMP unit

[Figure annotations: Vector, SMP Parallel.] These processes can be substituted by traditional multi-coloring (MC).

reordering coloring
Reordering = Coloring
  • COLOR: unit of independent sets.
  • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible.
  • Many colors provide faster convergence but shorter vector length: trade-off !!

[Figure: example orderings: RCM (Reverse CM), Red-Black (2 colors), 4 colors.]
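One simple way to build such independent sets is greedy multicoloring of the node graph; the routine below is a hedged sketch under that assumption, not the RCM/CMC ordering actually used on the ES:

    ! Greedy multicoloring sketch: nodes connected by a nonzero a_ij never share a
    ! color, so all nodes of one color can be processed in parallel.
    ! Graph in CRS form: index(0:N), item(:) list the neighbors of each node.
    subroutine greedy_coloring(N, index, item, color, ncolor)
      implicit none
      integer, intent(in)  :: N, index(0:N), item(*)
      integer, intent(out) :: color(N), ncolor
      logical :: used(N)
      integer :: i, j, c

      color = 0
      ncolor = 0
      do i= 1, N
        used = .false.
        do j= index(i-1)+1, index(i)          ! mark colors of colored neighbors
          if (color(item(j)) > 0) used(color(item(j))) = .true.
        enddo
        c = 1
        do while (used(c))                    ! smallest color not used by a neighbor
          c = c + 1
        enddo
        color(i) = c
        ncolor = max(ncolor, c)
      enddo
    end subroutine greedy_coloring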

large scale sparse matrix storage for unstructured grids
Large-Scale Sparse Matrix Storage for Unstructured Grids

[Figure: two storage layouts.]
  • 2D storage: long vector length, but many ZEROs.
  • 1D storage (CRS): memory saved, but short vector length.
re ordering in each color according to non zero off diag component
Re-Ordering in each Color according to Non-Zero Off-Diagonal Component #

Elements in the same color are independent, therefore intra-hyperplane re-ordering does not affect the results.

DJDS: Descending-Order Jagged Diagonal Storage
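A small sketch of this re-ordering step (hypothetical arrays, not the GeoFEM routine): within one color, rows are sorted by descending off-diagonal count so that jagged "diagonals" of nearly equal length can be traversed with long vector loops:

    ! For one color holding rows istart..iend, sort the rows by descending
    ! off-diagonal count NL(i) and return the permutation perm(1:iend-istart+1).
    subroutine djds_sort_color(istart, iend, NL, perm)
      implicit none
      integer, intent(in)  :: istart, iend, NL(*)
      integer, intent(out) :: perm(*)
      integer :: i, j, key, n

      n = 0
      do i= istart, iend
        n = n + 1
        perm(n) = i
      enddo
      do i= 2, n                               ! insertion sort, descending NL
        key = perm(i)
        j = i - 1
        do while (j >= 1)
          if (NL(perm(j)) >= NL(key)) exit
          perm(j+1) = perm(j)
          j = j - 1
        enddo
        perm(j+1) = key
      enddo
    end subroutine djds_sort_color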

cyclic djds mc cm rcm cyclic re ordering for smp units load balancing among pes
Cyclic DJDS (MC/CM-RCM)
Cyclic Re-Ordering for SMP units, Load-balancing among PEs

    npLX1= NLmax * PEsmpTOT
    INL(0:NLmax*PEsmpTOT*NCOLORS)

    do iv= 1, NCOLORS
    !$omp parallel do
      do ip= 1, PEsmpTOT
        iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1)
        do j= 1, NLhyp(iv)
          iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
          iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
    !cdir nodep
          do i= iv0+1, iv0+iE-iS
            k= i+iS - iv0
            kk= IAL(k)
            (important computations)
          enddo
        enddo
      enddo
    enddo

difference between flat mpi hybrid
Difference between Flat MPI & Hybrid
  • Most of the efforts of re-ordering are for vectorization.
  • If you have a long vector, just divide it and distribute the segments to PEs on SMP nodes.
  • Source codes of Hybrid and Flat MPI are not so different.
    • Flat MPI corresponds to Hybrid where PE/SMP node=1.
    • In other words, Flat MPI code is sufficiently complicated.
simple 3d cubic model
Simple 3D Cubic Model

[Figure: cube of (Nx-1)×(Ny-1)×(Nz-1) elements. Boundary conditions: uniform distributed force in z-direction @ z=Zmin; Ux=0 @ x=Xmin; Uy=0 @ y=Ymin; Uz=0 @ z=Zmin.]
effect of re ordering
Effect of Re-Ordering

[Figure: three matrix/loop structures compared: long loops with continuous access; short loops with irregular access; short loops with continuous access.]
matrix storage loops
Matrix Storage, Loops

DJDS
  • DJDS (Descending-order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors.

    do iv= 1, NVECT
      iv0= STACKmc(iv-1)
      do j= 1, NLhyp(iv)
        iS= index_L(NL*(iv-1)+ j-1)
        iE= index_L(NL*(iv-1)+ j  )
        do i= iv0+1, iv0+iE-iS
          k= i+iS - iv0
          kk= item_L(k)
          Z(i)= Z(i) - AL(k)*Z(kk)
        enddo
      enddo
      iS= STACKmc(iv-1) + 1
      iE= STACKmc(iv  )
      do i= iS, iE
        Z(i)= Z(i)/DD(i)
      enddo
    enddo

DCRS
  • The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operation.

    do i= 1, N
      SW= WW(i,Z)
      isL= index_L(i-1)+1
      ieL= index_L(i)
      do j= isL, ieL
        k= item_L(j)
        SW= SW - AL(j)*Z(k)
      enddo
      Z(i)= SW/DD(i)
    enddo
effect of re ordering results on 1 smp node color 99 fixed re ordering is really required
Effect of Re-Ordering: Results on 1 SMP node, Color #: 99 (fixed). Re-Ordering is REALLY required !!!

[Figure: GFLOPS vs. problem size for three cases (●■▲).]
  • Effect of vector length: ×10
  • + Re-ordering: ×100
  • 22 GFLOPS, 34% of the peak (ideal performance for a single CPU: 40%-45%)

effect of re ordering results on 1 smp node color 99 fixed re ordering is really required1
Effect of Re-Ordering: Results on 1 SMP node, Color #: 99 (fixed). Re-Ordering is REALLY required !!!

80×80×80 case (1.5M DOF):
  ● 212 iterations, 11.2 sec.
  ■ 212 iterations, 143.6 sec.
  ▲ 203 iterations, 674.2 sec.

3d elastic simulation problem size gflops earth simulator 1 smp node 8 pe s
3D Elastic Simulation: Problem Size vs. GFLOPS
Earth Simulator, 1 SMP node (8 PEs)

[Figure: GFLOPS vs. problem size.]
  • Flat-MPI is better; nice intra-node MPI.
  • Flat-MPI: 23.4 GFLOPS, 36.6 % of peak
  • Hybrid (OpenMP): 21.9 GFLOPS, 34.3 % of peak

slide54
(The platform comparison table from the earlier slide is shown again here.)

slide55
3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR8000/MPP with Pseudo-Vectorization, 1 SMP node (8 PEs)

[Figure: GFLOPS vs. problem size.]
  • Hybrid is better; low intra-node MPI performance.
  • Flat-MPI: 2.17 GFLOPS, 15.0 % of peak
  • Hybrid (OpenMP): 2.68 GFLOPS, 18.6 % of peak

3d elastic simulation problem size gflops ibm sp3 nersc 1 smp node 8 pe s
3D Elastic Simulation: Problem Size vs. GFLOPS
IBM SP3 (NERSC), 1 SMP node (8 PEs)

[Figure: GFLOPS vs. problem size for Flat-MPI and Hybrid (OpenMP).]
  • Cache is well utilized in Flat-MPI.

3d elastic simulation problem size gflops hitachi sr11000 j2 u tokyo 1 smp node 8 pe s
3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)

[Figure: GFLOPS vs. problem size for Flat-MPI and Hybrid (OpenMP).]

smp node 10 up to 176 nodes 1408 pes problem size for each smp node is fixed pdjds cm rcm color 99
SMP node # > 10, up to 176 nodes (1,408 PEs). Problem size per SMP node is fixed (weak scaling). PDJDS-CM/RCM, Color #: 99
3d elastic model large case 256x128x128 smp node up to 2 214 592 512 dof
3D Elastic Model (Large Case)
256×128×128/SMP node, up to 2,214,592,512 DOF

[Figure: GFLOPS rate and parallel work ratio vs. node count; ●: Flat MPI, ○: Hybrid.]
  • 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak)

3d elastic model small case 64x64x64 smp node up to 125 829 120 dof
3D Elastic Model (Small Case)
64×64×64/SMP node, up to 125,829,120 DOF

[Figure: GFLOPS rate and parallel work ratio vs. node count; ●: Flat MPI, ○: Hybrid.]

hybrid outperforms flat mpi
Hybrid outperforms Flat-MPI
  • when ...
    • the number of SMP nodes (PEs) is large.
    • the problem size per node is small.
  • because Flat-MPI has ...
    • 8 times as many communicating processes
    • a communication/computation ratio about TWICE as large (splitting each node's domain into 8 smaller sub-domains roughly doubles the total boundary surface, and hence the exchanged data, compared with one hybrid sub-domain per node).
  • The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
  • Performance estimation by D. Kerbyson (LANL)
    • LA-UR-02-5222
    • relatively large communication latency of ES
slide63
(The platform comparison table from the earlier slide is shown again here.)

why communication overhead
Why Communication Overhead ?
  • latency of the network
  • finite bandwidth of the network
  • synchronization at SEND/RECV, ALLREDUCE, etc.
  • memory performance in boundary communications (memory COPY)
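A back-of-the-envelope model makes the relative weight of these terms visible: the time of one boundary exchange is roughly latency + (message size)/(network bandwidth) + (copy size)/(memory bandwidth). The sketch below is illustrative only; the constants are rough values taken from the comparison table earlier, not new measurements:

    ! Toy cost model for one boundary exchange (illustrative assumptions only).
    program comm_cost_sketch
      implicit none
      real(8), parameter :: latency   = 7.0d-6      ! sec per message (assumed)
      real(8), parameter :: bandwidth = 12.3d9      ! bytes/sec, network (assumed)
      real(8), parameter :: copy_bw   = 26.6d9      ! bytes/sec, memory copy (assumed)
      real(8) :: nbytes, t
      integer :: k

      do k = 10, 24, 2                              ! message sizes 1 KB .. 16 MB
        nbytes = 2.d0**k
        t = latency + nbytes/bandwidth + 2.d0*nbytes/copy_bw   ! pack + unpack
        write(*,'(a,es10.3,a,es10.3,a)') ' size=', nbytes, ' bytes  t=', t, ' sec'
      enddo
    end program comm_cost_sketch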
domain to domain communication exchange boundary information send recv
Domain-to-Domain Communication: Exchange Boundary Information (SEND/RECV)

    subroutine SOLVER_SEND_RECV                          &
         (N, NEIBPETOT, NEIBPE,                          &
          IMPORT_INDEX, IMPORT_NODE,                     &
          EXPORT_INDEX, EXPORT_NODE,                     &
          WS, WR, X, SOLVER_COMM, my_rank)
      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      parameter (KREAL= 8)
      integer NEIBPE(NEIBPETOT)
      integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
      integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
      integer SOLVER_COMM, my_rank
      integer req1(NEIBPETOT), req2(NEIBPETOT)
      integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
      integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
      real(kind=KREAL) X(N), WS(N), WR(N)

SEND

      do neib= 1, NEIBPETOT
        istart= EXPORT_INDEX(neib-1)
        inum  = EXPORT_INDEX(neib  ) - istart
        do k= istart+1, istart+inum
          WS(k)= X(EXPORT_NODE(k))
        enddo
        call MPI_ISEND                                   &
             (WS(istart+1), inum, MPI_DOUBLE_PRECISION,  &
              NEIBPE(neib), 0, SOLVER_COMM,              &
              req1(neib), ierr)
      enddo

RECEIVE

      do neib= 1, NEIBPETOT
        istart= IMPORT_INDEX(neib-1)
        inum  = IMPORT_INDEX(neib  ) - istart
        call MPI_IRECV                                   &
             (WR(istart+1), inum, MPI_DOUBLE_PRECISION,  &
              NEIBPE(neib), 0, SOLVER_COMM,              &
              req2(neib), ierr)
      enddo
      call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

      do neib= 1, NEIBPETOT
        istart= IMPORT_INDEX(neib-1)
        inum  = IMPORT_INDEX(neib  ) - istart
        do k= istart+1, istart+inum
          X(IMPORT_NODE(k))= WR(k)
        enddo
      enddo

      call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)
      return
      end

domain to domain communication exchange boundary information send recv1
(The SOLVER_SEND_RECV routine from the previous slide is shown again here.)

communication overhead
Communication Overhead

[Figure: breakdown of communication overhead into memory copy, communication latency, and communication bandwidth.]

communication overhead1
Communication Overhead

[Figure: the same breakdown; the memory-copy and communication-bandwidth components depend on the message size.]

communication overhead weak scaling earth simulator
Communication Overhead, Weak Scaling: Earth Simulator

[Figure: communication time vs. node count.]
  • The effect of message size is small; the effect of latency is large.
  • Memory copy is very fast.

communication overhead weak scaling ibm sp 3
Communication Overhead, Weak Scaling: IBM SP-3

[Figure: communication time vs. node count.]
  • The effect of message size is more significant.

summary
Summary
  • Hybrid Parallel Programming Model on an SMP Cluster Architecture with Vector Processors for Unstructured Grids
  • Nice parallel performance for both inter- and intra-SMP-node parallelism on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak) in a 3D linear-elastic problem using the BIC(0)-CG method.
    • N. Kushida (student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G DOF problem.
  • Re-Ordering is really required.
summary cont
Summary (cont.)
  • Hybrid vs. Flat MPI
    • Flat-MPI is better for a small number of SMP nodes.
    • Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
    • Flat MPI is bound by communication; Hybrid is bound by memory.
    • depends on application, problem size, etc.
    • Hybrid is much more sensitive to the number of colors than Flat MPI due to the synchronization overhead of OpenMP.
  • In matrix-vector operations, the difference is not so significant.
slide79
  • Background
    • Vector/Scalar Processors
    • GeoFEM Project
    • Earth Simulator
    • Preconditioned Iterative Linear Solvers
  • Optimization Strategy on the Earth Simulator
    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
    • Matrix Assembling
  • Summary & Future Works
cube benchmark
"CUBE" Benchmark
  • 3D linear elastic applications on cubes for a wide range of problem size.
  • Hardware
    • Single CPU
    • Earth Simulator
    • AMD Opteron (1.8GHz)
time for 3x64 3 786 432 dof
Time for 3×64³ = 786,432 DOF

                              DCRS            DJDS original
                              sec. (MFLOPS)   sec. (MFLOPS)
ES (8.0 GFLOPS)
  Matrix                      28.6 (291)      34.2 (240)
  Solver                      360  (171)      21.7 (3246)
  Total                       389             55.9
Opteron 1.8GHz (3.6 GFLOPS)
  Matrix                      10.2 (818)      12.4 (663)
  Solver                      225  (275)      271  (260)
  Total                       235             283
matrix solver
[Figure: time for Matrix, Solver, and Matrix+Solver vs. problem size: DJDS on ES (original) and DJDS on Opteron (original).]
computation time vs problem size
Computation Time vs. Problem Size

[Figure: Matrix, Solver, and Total time vs. problem size for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).]

matrix assembling formation part is rather expensive
Matrix assembling/formation is rather expensive
  • This part should also be optimized for vector processors.
  • For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, matrices must be updated at every nonlinear iteration.
  • This part strongly depends on the application/physics, therefore it is very difficult to develop general libraries, such as those for iterative linear solvers.
    • It also includes complicated processes which are difficult to vectorize.
typical procedure for calculating coefficient matrix in fem
Typical Procedure for Calculating the Coefficient Matrix in FEM
  • Apply Galerkin's method on each element.
  • Integrate over each element to obtain the element matrix.
  • Element matrices are accumulated to each node, and the global matrices are obtained
    => global linear equations (a minimal sketch follows below).
  • Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
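As a toy illustration of this element-by-element assembly (a 1D linear-element example invented for this sketch, not the 3D solid-mechanics code of the talk):

    ! Assemble the global stiffness matrix (dense here for clarity) for a 1D bar
    ! of NE linear elements of length h: element matrix (1/h)*[[1,-1],[-1,1]].
    program assemble_1d_demo
      implicit none
      integer, parameter :: NE = 4, NN = NE + 1
      real(8), parameter :: h = 1.d0
      real(8) :: Aglobal(NN,NN), emat(2,2)
      integer :: nodes(2), icel, ie, je

      Aglobal = 0.d0
      emat = reshape( (/ 1.d0, -1.d0, -1.d0, 1.d0 /), (/2,2/) ) / h

      do icel= 1, NE                      ! loop over elements
        nodes(1)= icel                    ! global node numbers of this element
        nodes(2)= icel + 1
        do ie= 1, 2                       ! accumulate the element matrix
          do je= 1, 2                     ! into the global matrix
            Aglobal(nodes(ie), nodes(je)) = Aglobal(nodes(ie), nodes(je)) + emat(ie,je)
          enddo
        enddo
      enddo

      write(*,'(5f6.1)') (Aglobal(ie,:), ie=1, NN)   ! tridiagonal 1/-1 pattern
    end program assemble_1d_demo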
element by element operations
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node

[Figure: a rectangular mesh of elements numbered 1-24.]

element by element operations1
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node

[Figure: the same mesh with elements 1-24 and nodes 1-35.]

element by element operations2
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node

[Figure: nonzero pattern (marked "X") of the resulting global sparse matrix.]
element by element operations3
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node
element by element operations4
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node

[Figure: the mesh with nodes numbered 1-35.]

element by element operations5
Element-by-Element Operations
  • Integration over each element => element matrix
  • Element matrices are accumulated to each node => global matrix
  • Linear equations for each node
element by element operations6
Element-by-Element Operations

[Figure: elements 13 and 14 share the edge between nodes 16 and 23.]
  • If you calculate a23,16 and a16,23, you have to consider contributions from both the 13th and 14th elements.
current approach
Current Approach

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

[Figure: elements 13 and 14 around nodes 16 and 23.]

current approach1
Current Approach

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

[Figure: local node IDs 1-4 of each bi-linear 4-node element, and the corresponding global node IDs.]

current approach2
Current Approach

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

  • Nice for cache reuse because of localized operations.
  • Not suitable for vector processors:
    • a16,23 and a23,16 might not be calculated properly (see the sketch below).
    • Short innermost loops.
    • There are many "if-then-else"s inside the loop.
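Why the straightforward element loop cannot simply be vectorized (a hedged toy illustration with made-up values; only the node numbers come from the figure): two elements that share a node accumulate into the same global entry, so treating the icel loop as dependency-free would let the updates collide. The element coloring introduced below removes exactly this conflict:

    program assembly_conflict_sketch
      implicit none
      integer, parameter :: NELEM = 2
      integer :: irow(NELEM) = (/16, 16/)        ! both elements contribute to the
      integer :: jcol(NELEM) = (/23, 23/)        ! same global entry a(16,23)
      real(8) :: contrib(NELEM) = (/0.5d0, 0.5d0/)
      real(8) :: A(35,35)
      integer :: icel

      A = 0.d0
      ! Correct when executed sequentially; unsafe if this loop is forcibly
      ! vectorized/parallelized over icel. Element coloring guarantees that
      ! elements inside one color never share a node, so the conflict disappears.
      do icel= 1, NELEM
        A(irow(icel), jcol(icel)) = A(irow(icel), jcol(icel)) + contrib(icel)
      enddo
      write(*,*) 'A(16,23) =', A(16,23)          ! expected 1.0 when run sequentially
    end program assembly_conflict_sketch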
inside the loop integration at gaussian quadrature points
Inside the loop: integration at Gaussian quadrature points

    do jpn= 1, 2
      do ipn= 1, 2
        coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
        PNXi= PNX(ipn,jpn,ie)
        PNYi= PNY(ipn,jpn,ie)
        PNXj= PNX(ipn,jpn,je)
        PNYj= PNY(ipn,jpn,je)
        a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
        a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
        a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
        a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
      enddo
    enddo

remedy
Remedy

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring of the elements: elements which do not share any nodes are in the same color.
coloring of elements
Coloring of Elements

[Figure: the mesh with elements grouped into colors.]

coloring of elements1
Coloring of Elements

Elements sharing the 16th node are assigned to different colors.

[Figure: the elements around node 16 shown in different colors.]

remedy1
Remedy

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring of the elements: elements which do not share any nodes are in the same color.
  • Short innermost loops
    • loop exchange
remedy2
Remedy

    do icel= 1, ICELTOT
      do ie= 1, 4
        do je= 1, 4
          - assemble element matrix
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring of the elements: elements which do not share any nodes are in the same color.
  • Short innermost loops
    • loop exchange
  • There are many "if-then-else"s
    • define an ELEMENT-to-MATRIX array
define element to matrix array
Define ELEMENT-to-MATRIX array

[Figure: elements 13 and 14 with their local node IDs and the corresponding global node IDs.]

    ELEMmat(icel, ie, je)        (Element ID, Local Node ID, Local Node ID)

    if kkU= index_U(16-1+k) and item_U(kkU)= 23 then
      ELEMmat(13,2,3)= +kkU
      ELEMmat(14,1,4)= +kkU
    endif

    if kkL= index_L(23-1+k) and item_L(kkL)= 16 then
      ELEMmat(13,3,2)= -kkL
      ELEMmat(14,4,1)= -kkL
    endif

define element to matrix array1
Define ELEMENT-to-MATRIX array

[Figure: same elements and node IDs as on the previous slide.]

    if kkU= index_U(16-1+k) and item_U(kkU)= 23 then
      ELEMmat(13,2,3)= +kkU
      ELEMmat(14,1,4)= +kkU
    endif

    if kkL= index_L(23-1+k) and item_L(kkL)= 16 then
      ELEMmat(13,3,2)= -kkL
      ELEMmat(14,4,1)= -kkL
    endif

"ELEMmat" specifies the relationship between the node pairs (local node IDs ie, je) of each element and the address of the corresponding entry of the global coefficient matrix.
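Following the sign convention shown above (positive entries addressing the upper-triangular arrays, negative ones the lower-triangular arrays), the accumulation step could look roughly like the sketch below; the array names AU, AL, DD and the diagonal handling are assumptions for this sketch, not the actual GeoFEM code:

    ! Sketch: scatter one element-matrix coefficient into the global matrix arrays
    ! using the precomputed ELEMmat address kk.
    ! kk > 0 -> upper-triangular array AU, kk < 0 -> lower-triangular array AL;
    ! kk == 0 is assumed here to mean the diagonal entry of node inode.
    subroutine scatter_coef(kk, inode, a_ieje, AU, AL, DD)
      implicit none
      integer, intent(in)    :: kk, inode
      real(8), intent(in)    :: a_ieje
      real(8), intent(inout) :: AU(*), AL(*), DD(*)

      if (kk > 0) then
        AU( kk)= AU( kk) + a_ieje
      else if (kk < 0) then
        AL(-kk)= AL(-kk) + a_ieje
      else
        DD(inode)= DD(inode) + a_ieje        ! diagonal entry (assumed convention)
      endif
    end subroutine scatter_coef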

optimized procedure
Optimized Procedure

    do icol= 1, NCOLOR_E_tot
      do ie= 1, 4
        do je= 1, 4
          do ic0= index_COL(icol-1)+1, index_COL(icol)
            icel= item_COL(ic0)
            - define "ELEMmat" array
          enddo
        enddo
      enddo
    enddo

    do icol= 1, NCOLOR_E_tot
      do ie= 1, 4
        do je= 1, 4
          do ic0= index_COL(icol-1)+1, index_COL(icol)
            icel= item_COL(ic0)
            - assemble element matrix
          enddo
        enddo
      enddo
      do ie= 1, 4
        do je= 1, 4
          do ic0= index_COL(icol-1)+1, index_COL(icol)
            icel= item_COL(ic0)
            - accumulate element matrix into global matrix
          enddo
        enddo
      enddo
    enddo

  • Extra computation for ELEMmat
  • Extra storage for
    • the ELEMmat array
    • element-matrix components for the elements in each color
  • < 10% increase

optimized procedure1
Optimized Procedure

PART I: "Integer" operations for "ELEMmat". In nonlinear cases, this part is done just once (before the initial iteration), as long as the mesh connectivity does not change.

    do icol= 1, NCOLOR_E_tot
      do ie= 1, 4
        do je= 1, 4
          do ic0= index_COL(icol-1)+1, index_COL(icol)
            icel= item_COL(ic0)
            - define "ELEMmat" array
          enddo
        enddo
      enddo
    enddo

(The assembly/accumulation loops are as on the previous slide.)

optimized procedure2
Optimized Procedure

PART I: "Integer" operations for "ELEMmat". In nonlinear cases, this part is done just once (before the initial iteration), as long as the mesh connectivity does not change.

PART II: "Floating-point" operations for matrix assembling/accumulation. In nonlinear cases, this part is repeated at every nonlinear iteration.

(The corresponding loops are as on the previous slides.)

time for 3x64 3 786 432 dof1
Time for 3×64³ = 786,432 DOF

                              DCRS            DJDS original   DJDS improved
                              sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)
ES (8.0 GFLOPS)
  Matrix                      28.6 (291)      34.2 (240)      12.5 (643)
  Solver                      360  (171)      21.7 (3246)     21.7 (3246)
  Total                       389             55.9            34.2
Opteron 1.8GHz (3.6 GFLOPS)
  Matrix                      10.2 (818)      12.4 (663)      21.2 (381)
  Solver                      225  (275)      271  (260)      271  (260)
  Total                       235             283             292

time for 3x64 3 786 432 dof2
Time for 3×64³ = 786,432 DOF

(Same table as on the previous slide.)

On the Opteron, the improved DJDS version is slower than the original because of the long innermost loops: data locality has been lost.

matrix solver1
[Figure: time for Matrix, Solver, and Matrix+Solver vs. problem size: DJDS on ES (original and improved) and DJDS on Opteron (original and improved).]

computation time vs problem size1
Computation Time vs. Problem Size

[Figure: Matrix, Solver, and Total time vs. problem size for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).]

optimization of matrix assembling formation on es
Optimization of "Matrix" assembling/formation on ES
  • DJDS has been much improved compared with the original version, but it is still slower than the DCRS version on the Opteron.
  • The "integer" operation part is slower.
  • But the "floating-point" operation part is much faster than on the Opteron.
  • In nonlinear simulations, the "integer" operation part is executed only once (just before the initial iteration); therefore ES outperforms the Opteron if the number of nonlinear iterations is more than 2.
suppose virtual mode where
Suppose a "virtual" mode where …
  • the "integer" operation part runs on a scalar processor, and
  • the "floating-point" operation part and the linear solvers run on a vector processor.
  • The scalar performance of ES (500 MHz) is lower than that of a Pentium III.
time for 3x64 3 786 432 dof3
Time for 3×64³ = 786,432 DOF

                              DCRS            DJDS improved   DJDS virtual
                              sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)
ES (8.0 GFLOPS)
  Matrix                      28.6 (291)      12.5 (643)      1.88 (4431)
  Solver                      360  (171)      21.7 (3246)     21.7 (3246)
  Total                       389             34.2            23.6
Opteron 1.8GHz (3.6 GFLOPS)
  Matrix                      10.2 (818)      21.2 (381)
  Solver                      225  (275)      271  (260)
  Total                       235             292

summary vectorization of fem appl
Summary: Vectorization of FEM applications
  • NOT so easy.
  • FEM's good feature of local operations is not necessarily suitable for vector processors.
    • Preconditioned iterative solvers can be vectorized rather more easily, because their target is the "global" matrix.
  • Sometimes a major revision of the original code is required.
    • Usually: more memory, more lines, additional operations …
  • Performance of code optimized for vector processors is not necessarily good on scalar processors (e.g. matrix assembling in FEM).