
Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Kengo Nakajima

[email protected]

Supercomputing Division, Information Technology Center,

The University of Tokyo

Japan Agency for Marine-Earth Science and Technology

April 22, 2008


Overview

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Work


Large-Scale Computing

  • Scalar Processor

    • Big gap between clock rate and memory bandwidth.

    • Very low sustained/peak performance ratio (<10%)



Scalar Processors: CPU-Cache-Memory Hierarchical Structure

(Diagram: registers and cache next to the CPU are FAST, expensive, small in capacity (MB), and built from ~100M transistors; main memory is SLOW, cheap, and large in capacity (GB).)


Large-Scale Computing

  • Scalar Processor

    • Big gap between clock rate and memory bandwidth.

    • Very low sustained/peak performance ratio (<10%)

  • Vector Processor

    • Very high sustained/peak performance ratio

      • e.g. 35 % for FEM applications on the Earth Simulator

    • requires …

      • very special tuning

      • sufficiently long loops (i.e. a large problem size) to reach a certain level of performance

    • Suitable for simple computation.



Vector Processors: Vector Register & Fast Memory

  • Parallel processing of simple DO loops.
  • Suitable for simple & large computation.

do i= 1, N
  A(i)= B(i) + C(i)
enddo

(Diagram: vector registers and a very fast main memory feed the vector pipelines.)




Hardware Comparison

|                                                  | Earth Simulator       | IBM SP3 (LBNL)       | Hitachi SR8000/MPP (U.Tokyo) | Hitachi SR11000/J2 (U.Tokyo) |
| Peak Performance (GFLOPS)                        | 8.00                  | 1.50                 | 1.80                         | 9.20                         |
| Measured Memory BW, STREAM (GB/sec/PE)           | 26.6                  | 0.623                | 2.85                         | 8.00                         |
| Estimated Performance/Core, GFLOPS (% of peak)   | 2.31-3.24 (28.8-40.5) | .072-.076 (4.8-5.1)  | .291-.347 (16.1-19.3)        | .880-.973 (9.6-10.6)         |
| Measured Performance/Core, GFLOPS (% of peak)    | 2.93 (36.6)           | .122 (8.1)           | .335 (18.6)                  | 1.34 (14.5)                  |
| BYTE/FLOP                                        | 3.325                 | 0.413                | 1.583                        | 0.870                        |
| Comm. BW (GB/sec/Node)                           | 12.3                  | 1.00                 | 1.60                         | 12.0                         |
| MPI Latency (μsec)                               | 5.6-7.7               | 16.3                 | 6-20                         | 4.7 *                        |

* measured on IBM p595, J.T. Carter et al.


Typical Behavior …

(Plot: performance vs. problem size on one node.)

  • IBM-SP3: performance is good for small problems (around 8 % of peak) due to the cache effect.
  • Earth Simulator: performance is good for large-scale problems (around 40 % of peak) due to long vector length.


Parallel Computing: Strong Scaling (Fixed Problem Size)

(Plots: performance vs. number of PEs, with the ideal speed-up line.)

  • IBM-SP3: super-scalar effect for a small number of PEs; performance decreases for many PEs due to communication overhead.
  • Earth Simulator: performance decreases for many PEs due to communication overhead and short vector length.


Improvement of Memory Performance on IBM SP3 ⇒ Hitachi SR11000/J2

(Plot: performance vs. problem size; ■ Flat-MPI/DCRS, □ Hybrid/DCRS.)

  • IBM SP-3 (POWER3): 375 MHz, 1.0 GB/sec memory bandwidth, 8 MB L2 cache/PE
  • Hitachi SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec memory bandwidth, 18 MB L3 cache/PE
  • Memory performance (bandwidth, latency etc.) has improved substantially.


My History with Vector Computers

  • Cray-1S (1985-1988)

    • Mitsubishi Research Institute (MRI)

  • Cray-YMP (1988-1995)

    • MRI, University of Texas at Austin

  • Fujitsu VP, VPP series (1985-1999)

    • JAERI, PNC

  • NEC SX-4 (1997-2001)

    • CCSE/JAERI

  • Hitachi SR2201 (1997-2004)

    • University of Tokyo, CCSE/JAERI

  • Hitachi SR8000 (2000-2007)

    • University of Tokyo

  • Earth Simulator (2002-)




GeoFEM: FY.1998-2002 (http://geofem.tokyo.rist.or.jp/)

  • Parallel FEM platform for solid earth simulation.

    • parallel I/O, parallel linear solvers, parallel visualization

    • solid earth: earthquake, plate deformation, mantle/core convection, etc.

  • Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.

  • Strong collaborations between HPC and natural science (solid earth) communities.


System Configuration of GeoFEM


Results on Solid Earth Simulation

(Figure captions:)
  • Complicated plate model around the Japan Islands
  • Magnetic field of the Earth (MHD code)
  • Dh=5.00 / Dh=1.25 (panel labels)
  • Simulation of the earthquake generation cycle in Southwestern Japan (T=100 … T=500)
  • Transportation by groundwater flow through heterogeneous porous media
  • TSUNAMI !!


Results by GeoFEM


Features of FEM applications (1/2)

  • Local “element-by-element” operations

    • sparse coefficient matrices

    • suitable for parallel computing

  • HUGE “indirect” accesses

    • IRREGULAR sparse matrices

    • memory intensive

do i= 1, N
  jS= index(i-1) + 1
  jE= index(i)
  do j= jS, jE
    in= item(j)
    Y(i)= Y(i) + AMAT(j)*X(in)    ! indirect access through item(j)
  enddo
enddo


Features of FEM applications (2/2)

  • In parallel computation …

    • comm. with ONLY neighbors (except “dot products” etc.)

    • the amount of communication is relatively small because only values on the domain boundary are exchanged.

    • communication (MPI) latency is critical




Earth Simulator (ES): http://www.es.jamstec.go.jp/

  • 640×8= 5,120 Vector Processors

    • SMP Cluster-Type Architecture

    • 8 GFLOPS/PE

    • 64 GFLOPS/Node

    • 40 TFLOPS/ES

  • 16 GB Memory/Node, 10 TB/ES

  • 640×640 Crossbar Network

    • 16 GB/sec×2

  • Memory bandwidth: 32 GB/sec per PE.

  • 35.6 TFLOPS for LINPACK (2002-March)

  • 26 TFLOPS for AFES (Climate Simulation)


Motivations

  • GeoFEM Project (FY.1998-2002)

  • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)

    • Implicit Linear Solvers

    • Hybrid vs. Flat MPI Parallel Programming Model


Flat MPI vs. Hybrid

  • Flat-MPI: each PE is independent.
  • Hybrid: hierarchical structure.

(Diagram: in Flat-MPI every PE works on its own memory image; in Hybrid the PEs of an SMP node share memory and communicate with other nodes via MPI.)




Direct/Iterative Methods for Linear Equations

  • Direct Methods

    • Gaussian Elimination/LU Factorization.

      • compute A^-1 directly.

    • Robust for a wide range of applications.

    • More expensive than iterative methods (memory, CPU)

    • Not suitable for parallel and vector computation due to its global operations.

  • Iterative Methods

    • CG, GMRES, BiCGSTAB

    • Less expensive than direct methods, especially in memory.

    • Suitable for parallel and vector computing.

    • Convergence strongly depends on the problem and boundary conditions (condition number etc.)

      • Preconditioning is required


Preconditioning for Iterative Methods

  • Convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.

  • A preconditioner M transforms the linear system into one with more favorable spectral properties

    • In "ill-conditioned" problems, "condition number" (ratio of max/min eigenvalue if A is symmetric) is large.

    • M transforms the original equation Ax=b into A'x=b', where A'=M^-1 A and b'=M^-1 b

  • ILU (Incomplete LU Factorization) or IC (Incomplete Cholesky Factorization) are well-known preconditioners.
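For reference, a minimal preconditioned CG skeleton (a sketch only, not the GeoFEM solver; the matrix-vector product "matvec" and the preconditioner solve "precond" for M z = r are assumed to be supplied by the application):

subroutine pcg (N, X, B, EPS, MAXIT)
! Sketch of preconditioned CG: matvec (q = A p) and precond (solve M z = r,
! e.g. by IC(0) forward/backward substitution) are assumed external routines.
  implicit none
  integer, intent(in)    :: N, MAXIT
  real(8), intent(in)    :: B(N), EPS
  real(8), intent(inout) :: X(N)
  real(8) :: R(N), Z(N), P(N), Q(N)
  real(8) :: rho, rho0, alpha, beta, bnrm
  integer :: iter

  call matvec (N, X, Q)                 ! Q = A X
  R(:)= B(:) - Q(:)
  bnrm= sqrt(dot_product(B,B))

  do iter= 1, MAXIT
    call precond (N, R, Z)              ! solve M Z = R (preconditioning step)
    rho= dot_product(R,Z)
    if (iter.eq.1) then
      P(:)= Z(:)
    else
      beta= rho/rho0
      P(:)= Z(:) + beta*P(:)
    endif
    call matvec (N, P, Q)               ! Q = A P
    alpha= rho/dot_product(P,Q)
    X(:)= X(:) + alpha*P(:)
    R(:)= R(:) - alpha*Q(:)
    if (sqrt(dot_product(R,R)).le.EPS*bnrm) exit
    rho0= rho
  enddo
end subroutine pcg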


Strategy in GeoFEM

  • Iterative method is the ONLY choice for large-scale parallel computing.

  • Preconditioning is important

    • general methods, such as ILU(0)/IC(0), cover a wide range of applications.

    • problem specific methods.




Block IC(0)-CG Solver on the Earth Simulator

  • 3D Linear Elastic Problems (SPD)

  • Parallel Iterative Linear Solver

    • Node-based Local Data Structure

    • Conjugate-Gradient Method (CG): SPD

    • Localized Block IC(0) Preconditioning (Block Jacobi)

      • Modified IC(0): Non-diagonal components of original [A] are kept

    • Additive Schwarz Domain Decomposition (ASDD)

    • http://geofem.tokyo.rist.or.jp/

  • Hybrid Parallel Programming Model

    • OpenMP+MPI

    • Re-Ordering for Vector/Parallel Performance

    • Comparison with Flat MPI




Local Data Structure: Node-based Partitioning (internal nodes - elements - external nodes)

(Figure: a small quadrilateral mesh partitioned into four domains PE#0-PE#3; each domain stores its internal nodes, the elements that include them, and the external nodes on neighboring domains referenced by those elements.)


1 SMP node => 1 domain for the Hybrid Programming Model. MPI communication among domains.


Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.


ILU(0)/IC(0) Factorization

do i= 2, n
  do k= 1, i-1
    if ((i,k) ∈ NonZero(A)) then
      aik := aik/akk
    endif
    do j= k+1, n
      if ((i,j) ∈ NonZero(A)) then
        aij := aij - aik*akj
      endif
    enddo
  enddo
enddo


ILU/IC Preconditioning

M = (L+D) D^-1 (D+U),  where L, D, U are taken from A.

We need to solve M z = r, i.e. (L+D) D^-1 (D+U) z = r.  Setting z1 := D^-1 (D+U) z:

Forward substitution:  (L+D) z1 = r,  i.e.  z1 = D^-1 (r - L z1)
Backward substitution: (I + D^-1 U) z_new = z1,  i.e.  z = z1 - D^-1 U z


ILU/IC Preconditioning

M = (L+D) D^-1 (D+U);  L, D, U are taken from A.

Forward substitution: (L+D) z = r, i.e. z = D^-1 (r - L z)

do i= 1, N
  WVAL= R(i)
  do j= 1, INL(i)
    WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
  enddo
  Z(i)= WVAL / D(i)
enddo

Backward substitution: (I + D^-1 U) z_new = z_old, i.e. z = z - D^-1 U z

do i= N, 1, -1
  SW= 0.0d0
  do j= 1, INU(i)
    SW= SW + AU(i,j) * Z(IAU(i,j))
  enddo
  Z(i)= Z(i) - SW / D(i)
enddo

Dependency: each row needs the most recent value of "z" at its connected nodes, so vectorization/parallelization is difficult.


Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.

  • Re-Ordering for Highly Parallel/Vector Performance

    • Local operation and no global dependency

    • Continuous memory access

    • Sufficiently long loops for vectorization


ILU/IC Preconditioning (with re-ordering)

(Same forward/backward substitution loops as above.)

Dependency: each row needs the most recent value of "z" at its connected nodes, so vectorization/parallelization is difficult.

Re-ordering: after re-ordering, directly connected nodes do not appear on the right-hand side within the same color, so the loops over each color can be vectorized/parallelized.


Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.

  • Re-Ordering for Highly Parallel/Vector Performance

    • Local operation and no global dependency

    • Continuous memory access

    • Sufficiently long loops for vectorization

  • 3-Way Parallelism for Hybrid Parallel Programming

    • Inter Node : MPI

    • Intra Node : OpenMP

    • Individual PE : Vectorization


Re-Ordering Technique for Vector/Parallel Architectures: Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

  1. RCM (Reverse Cuthill-McKee)
  2. CMC (Cyclic Multicolor)
  3. DJDS re-ordering
  4. Cyclic DJDS for the SMP unit

The RCM+CMC steps (1-2) can be substituted by traditional multi-coloring (MC). The re-ordering targets both vectorization and SMP parallelism.


Reordering = Coloring

  • COLOR: Unit of independent sets.

  • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible.
  • Many colors provide faster convergence, but shorter vector lengths: a trade-off !!

(Figures: RCM (Reverse CM), Red-Black (2 colors), and 4-color orderings of a small mesh.)
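To make the idea concrete, here is a minimal greedy multicoloring sketch (an illustration only, not the RCM/CMC routines used in GeoFEM; the node adjacency is assumed to be given in CRS form by index(0:N)/item(:) and to be symmetric). It only guarantees that directly connected nodes never share a color:

subroutine greedy_coloring (N, index, item, COLOR, NCOLORS)
! Assign each node the smallest color not used by any already-colored neighbor.
  implicit none
  integer, intent(in)  :: N
  integer, intent(in)  :: index(0:N), item(*)
  integer, intent(out) :: COLOR(N), NCOLORS
  integer :: i, j, icol
  logical :: used(N)

  COLOR(1:N)= 0
  NCOLORS   = 0
  do i= 1, N
    used(:)= .false.
    do j= index(i-1)+1, index(i)          ! mark colors taken by neighbors
      if (COLOR(item(j)).gt.0) used(COLOR(item(j)))= .true.
    enddo
    icol= 1                               ! smallest free color
    do while (used(icol))
      icol= icol + 1
    enddo
    COLOR(i)= icol
    NCOLORS = max(NCOLORS, icol)
  enddo
end subroutine greedy_coloring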


Large-Scale Sparse Matrix Storage for Unstructured Grids

  • 2D storage: long vector length, but many ZEROs.
  • 1D storage (CRS): memory is saved, but vector length is short.


Re-Ordering in each Color according to Non-Zero Off-Diagonal Component #

Elements in the same color are independent, therefore intra-hyperplane re-ordering does not affect the results.

DJDS : Descending-Order Jagged Diagonal Storage
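A sketch of the DJDS idea for one color (simplified, not the GeoFEM implementation; NCOL(i) is assumed to hold the number of off-diagonal nonzeros of row i): the rows of the color are permuted so that the longest row comes first, which makes the jagged diagonals shrink monotonically and keeps the innermost loops long:

subroutine djds_permute (nS, nE, NCOL, PERM)
! Order the rows nS..nE of one color by descending off-diagonal count
! (simple insertion sort; PERM returns the new row ordering).
  implicit none
  integer, intent(in)  :: nS, nE
  integer, intent(in)  :: NCOL(*)
  integer, intent(out) :: PERM(*)
  integer :: i, j, itmp, n

  n= 0
  do i= nS, nE
    n= n + 1
    PERM(n)= i
  enddo
  do i= 2, n
    itmp= PERM(i)
    j= i - 1
    do while (j.ge.1)
      if (NCOL(PERM(j)).ge.NCOL(itmp)) exit
      PERM(j+1)= PERM(j)
      j= j - 1
    enddo
    PERM(j+1)= itmp
  enddo
end subroutine djds_permute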


Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units, Load Balancing among PEs

do iv= 1, NCOLORS
!$omp parallel do
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!cdir nodep
      do i= iv0+1, iv0+iE-iS
        k= i+iS - iv0
        kk= IAL(k)
        (important computations)
      enddo
    enddo
  enddo
enddo

npLX1= NLmax * PEsmpTOT
INL(0:NLmax*PEsmpTOT*NCOLORS)


Difference between Flat MPI & Hybrid

  • Most of the efforts of re-ordering are for vectorization.

  • If you have a long vector, just divide it and distribute the segments to PEs on SMP nodes.

  • Source codes of Hybrid and Flat MPI are not so different.

    • Flat MPI corresponds to Hybrid where PE/SMP node=1.

    • In other words, the Flat MPI code is already just as complicated.


Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization


Simple 3D Cubic Model

(Figure: a cube of (Nx-1)×(Ny-1)×(Nz-1) elements. Boundary conditions: Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin; uniform distributed force in the z-direction @ z=Zmin.)


Effect of Ordering


Effect of Re-Ordering

(Plots comparing three cases: long loops with continuous access; short loops with irregular access; short loops with continuous access.)


Matrix Storage, Loops

DJDS

do iv= 1, NVECT
  iv0= STACKmc(iv-1)
  do j= 1, NLhyp(iv)
    iS= index_L(NL*(iv-1)+ j-1)
    iE= index_L(NL*(iv-1)+ j  )
    do i= iv0+1, iv0+iE-iS
      k= i+iS - iv0
      kk= item_L(k)
      Z(i)= Z(i) - AL(k)*Z(kk)
    enddo
  enddo
  iS= STACKmc(iv-1) + 1
  iE= STACKmc(iv  )
  do i= iS, iE
    Z(i)= Z(i)/DD(i)
  enddo
enddo

  • DJDS (Descending-order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors.

DCRS

do i= 1, N
  SW= WW(i,Z)
  isL= index_L(i-1)+1
  ieL= index_L(i)
  do j= isL, ieL
    k= item_L(j)
    SW= SW - AL(j)*Z(k)
  enddo
  Z(i)= SW/DD(i)
enddo

  • The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operations.


Effect of Re-Ordering: Results on 1 SMP node, Color #: 99 (fixed). Re-Ordering is REALLY required !!!

(Plot ●■▲: GFLOPS vs. problem size.)

  • Effect of vector length: ×10; with re-ordering: ×100.
  • Best case: 22 GFLOPS, 34% of peak (ideal performance for a single CPU: 40%-45% of peak).


Effect of Re-Ordering: Results on 1 SMP node, Color #: 99 (fixed). Re-Ordering is REALLY required !!!

80x80x80 case (1.5M DOF):
  ● 212 iterations, 11.2 sec.
  ■ 212 iterations, 143.6 sec.
  ▲ 203 iterations, 674.2 sec.


3D Elastic Simulation, Problem Size vs. GFLOPS: Earth Simulator, 1 SMP node (8 PEs)

  • Flat-MPI: 23.4 GFLOPS, 36.6 % of peak
  • Hybrid (OpenMP): 21.9 GFLOPS, 34.3 % of peak
  • Flat-MPI is better: intra-node MPI performs well.


(Hardware comparison table repeated — see the earlier slide.)


3D Elastic Simulation, Problem Size vs. GFLOPS: Hitachi SR8000/MPP with Pseudo-Vectorization, 1 SMP node (8 PEs)

  • Flat-MPI: 2.17 GFLOPS, 15.0 % of peak
  • Hybrid (OpenMP): 2.68 GFLOPS, 18.6 % of peak
  • Hybrid is better: intra-node MPI performance is low.


3D Elastic Simulation, Problem Size vs. GFLOPS: IBM SP-3 (NERSC), 1 SMP node (8 PEs)

(Plot: Flat-MPI vs. Hybrid (OpenMP).)

  • The cache is well utilized in Flat-MPI.


3D Elastic Simulation, Problem Size vs. GFLOPS: Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)

(Plot: Flat-MPI vs. Hybrid (OpenMP).)


SMP node # > 10, up to 176 nodes (1,408 PEs). Problem size for each SMP node is fixed. PDJDS-CM/RCM, Color #: 99


3D Elastic Model (Large Case): 256x128x128/SMP node, up to 2,214,592,512 DOF

(Plots: GFLOPS rate and parallel work ratio; ●: Flat MPI, ○: Hybrid.)

  • 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8 % of peak).


3D Elastic Model (Small Case): 64x64x64/SMP node, up to 125,829,120 DOF

(Plots: GFLOPS rate and parallel work ratio; ●: Flat MPI, ○: Hybrid.)


Hybrid outperforms Flat-MPI

  • when ...
    • the number of SMP nodes (PEs) is large.
    • the problem size per node is small.
  • because Flat-MPI has ...
    • 8 times as many communicating processes
    • twice as large a communication/computation ratio
  • The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.

  • Performance Estimation by D.Kerbyson (LANL)

    • LA-UR-02-5222

    • relatively large communication latency of the ES


Flat-MPI and Hybrid

(Plot: parallel performance of Hybrid vs. Flat-MPI.)


(Hardware comparison table repeated — see the earlier slide.)


Why Communication Overhead ?

  • latency of network

  • finite bandwidth of network

  • synchronization at SEND/RECV, ALLREDUCE etc.

  • memory performance in boundary communications (memory COPY)
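A simple cost model (standard form, not taken from the slides) that separates these contributions for a message of m bytes:

  T_comm(m) ≈ T_copy(m) + α + m/β + T_sync

where α is the network latency, β the network bandwidth, T_copy(m) the memory-copy time for packing/unpacking the boundary buffers, and T_sync the synchronization cost at SEND/RECV, ALLREDUCE etc.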


Domain-to-Domain Communication: Exchange Boundary Information (SEND/RECV)

subroutine SOLVER_SEND_RECV                        &
           (N, NEIBPETOT, NEIBPE,                  &
            IMPORT_INDEX, IMPORT_NODE,             &
            EXPORT_INDEX, EXPORT_NODE,             &
            WS, WR, X, SOLVER_COMM, my_rank)
  implicit REAL*8 (A-H,O-Z)
  include 'mpif.h'
  parameter (KREAL= 8)
  integer NEIBPE(NEIBPETOT)
  integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
  integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
  integer SOLVER_COMM, my_rank
  integer req1(NEIBPETOT), req2(NEIBPETOT)
  integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
  integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
  real(kind=KREAL) X(N), WS(N), WR(N)

!-- SEND: pack boundary values and start non-blocking sends
  do neib= 1, NEIBPETOT
    istart= EXPORT_INDEX(neib-1)
    inum  = EXPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      WS(k)= X(EXPORT_NODE(k))
    enddo
    call MPI_ISEND                                        &
         (WS(istart+1), inum, MPI_DOUBLE_PRECISION,       &
          NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
  enddo

!-- RECEIVE: start non-blocking receives, wait, then unpack
  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    call MPI_IRECV                                        &
         (WR(istart+1), inum, MPI_DOUBLE_PRECISION,       &
          NEIBPE(neib), 0, SOLVER_COMM, req2(neib), ierr)
  enddo

  call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      X(IMPORT_NODE(k))= WR(k)
    enddo
  enddo

  call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)

  return
end



Communication Overhead

(Chart: components of communication overhead — memory copy, communication latency, communication bandwidth.)


Communication Overhead

(Chart: the memory-copy and bandwidth components depend on the message size; the latency component does not.)


Communication Overhead: Earth Simulator

(Chart: the dominant component is communication latency.)


Communication Overhead: Hitachi SR11000, IBM SP3 etc.

(Chart: the dominant components are memory copy and communication bandwidth.)


Communication Overhead = Synchronization Overhead


Communication Overhead = Synchronization Overhead

(Chart: components — memory copy, communication latency, communication bandwidth.)


Communication Overhead = Synchronization Overhead: Earth Simulator

(Chart: the dominant component is communication latency.)


Communication Overhead, Weak Scaling: Earth Simulator

  • The effect of message size is small; the effect of latency is large.
  • Memory copy is very fast.


Communication Overhead, Weak Scaling: IBM SP-3

  • The effect of message size is more significant.


Communication Overhead, Weak Scaling: Hitachi SR11000/J2 (8 cores/node)


Summary

  • Hybrid Parallel Programming Model on SMP Cluster Architecture with Vector Processors for Unstructured Grids

  • Nice parallel performance both inter- and intra-SMP-node on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8 % of peak) in a 3D linear-elastic problem using the BIC(0)-CG method.
    • N.Kushida (student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G DOF problem.
  • Re-ordering is really required.


Summary (cont.)

  • Hybrid vs. Flat MPI

    • Flat-MPI is better for a small number of SMP nodes.
    • Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
    • Flat MPI is limited by communication; Hybrid by memory.
    • It depends on the application, problem size etc.
    • Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP.
  • In Mat-Vec operations, the difference is not so significant.




“CUBE” Benchmark

  • 3D linear elastic applications on cubes for a wide range of problem sizes.

  • Hardware

    • Single CPU

    • Earth Simulator

    • AMD Opteron (1.8GHz)


Time for 3x64^3 = 786,432 DOF

|                                  |        | DCRS sec. (MFLOPS) | DJDS original sec. (MFLOPS) |
| ES, 8.0 GFLOPS peak              | Matrix | 28.6 (291)         | 34.2 (240)                  |
|                                  | Solver | 360 (171)          | 21.7 (3246)                 |
|                                  | Total  | 389                | 55.9                        |
| Opteron 1.8 GHz, 3.6 GFLOPS peak | Matrix | 10.2 (818)         | 12.4 (663)                  |
|                                  | Solver | 225 (275)          | 271 (260)                   |
|                                  | Total  | 235                | 283                         |


(Plots: "Matrix", "Solver", and "Matrix+Solver" times for the original DJDS code on the ES and on the Opteron.)


Computation Time vs. Problem Size

(Plots: "Matrix", "Solver", and "Total" times vs. problem size for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).)


Matrix assembling/formation is rather expensive

  • This part should also be optimized for vector processors.
  • For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, matrices must be updated at every nonlinear iteration.
  • This part strongly depends on the application/physics, so it is very difficult to develop general libraries like those for iterative linear solvers.
    • It also includes complicated processes that are difficult to vectorize.


Typical Procedure for Calculating Coefficient Matrix in FEM

  • Apply Galerkin's method to each element.
  • Integrate over each element to get the element matrix.
  • Element matrices are accumulated at each node, and the global matrix is obtained
    => Global linear equations
  • Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
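In matrix form (standard FEM notation, not from the slides), with P_e the boolean local-to-global "gather" matrix of element e:

  A = Σ_e (P_e)^T A^(e) P_e ,   b = Σ_e (P_e)^T b^(e)

so each element matrix A^(e) is computed independently; only the accumulation into shared nodes couples neighboring elements.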


Element-by-Element Operations

  • Integration over each element => element matrix
  • Element matrices are accumulated at each node => global matrix
  • Linear equations for each node

(Figure: a structured mesh of 24 elements, numbered 1-24.)


Element-by-Element Operations

(Figure: the same mesh with its 35 nodes numbered.)


Element-by-Element Operations

(Figure: sparsity pattern of the assembled global matrix; each "X" is a non-zero block.)








Element-by-Element Operations

(Figure: two neighboring elements, 13 and 14, sharing the edge between nodes 16 and 23.)

  • If you calculate a23,16 and a16,23, you have to consider the contributions from both the 13th and the 14th elements.


Current Approach

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

(Figure: the same two elements, 13 and 14, sharing nodes 16 and 23.)


Current Approach

(Same loop structure as above.)

(Figure: local node IDs 1-4 for each bi-linear 4-node element, versus the global node IDs of the mesh.)


Current Approach

(Same loop structure as above.)

  • Nice for cache reuse because of localized operations.
  • Not suitable for vector processors:
    • a16,23 and a23,16 might not be calculated properly (elements sharing a node update the same global entry).
    • Short innermost loops.
    • There are many "if-then-else"s.


Inside the loop: integration at Gaussian quadrature points

do jpn= 1, 2
  do ipn= 1, 2
    coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
    PNXi= PNX(ipn,jpn,ie)
    PNYi= PNY(ipn,jpn,ie)
    PNXj= PNX(ipn,jpn,je)
    PNYj= PNY(ipn,jpn,je)
    a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
    a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
    a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
    a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
  enddo
enddo


Remedy

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring the elements: elements which do not share any nodes are in the same color.


Coloring of Elements

(Figure: the mesh with its 35 nodes; the elements are to be grouped into colors.)


Coloring of Elements

  • Elements sharing the 16th node are assigned to different colors.

(Figure: the four elements around node 16, each in a different color.)
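A minimal greedy element-coloring sketch (illustration only, not the GeoFEM routine; ICELNOD(icel,1:4) is assumed to list the four nodes of element icel, and MAXCOL is assumed to be large enough). Elements sharing any node are forced into different colors, which removes the write conflict when the accumulation loop over the elements of one color is vectorized:

subroutine color_elements (ICELTOT, N, ICELNOD, COLOR_E, NCOLOR_E_tot)
! Give each element the smallest color not yet used by any element that
! touches one of its nodes (simplified, memory-hungry bookkeeping).
  implicit none
  integer, parameter   :: MAXCOL= 100
  integer, intent(in)  :: ICELTOT, N
  integer, intent(in)  :: ICELNOD(ICELTOT,4)
  integer, intent(out) :: COLOR_E(ICELTOT), NCOLOR_E_tot
  logical :: NODUSED(MAXCOL,N)          ! color already used at this node?
  integer :: icel, k, icol
  logical :: free

  NODUSED= .false.
  NCOLOR_E_tot= 0
  do icel= 1, ICELTOT
    do icol= 1, MAXCOL                  ! smallest color free at all 4 nodes
      free= .true.
      do k= 1, 4
        if (NODUSED(icol,ICELNOD(icel,k))) free= .false.
      enddo
      if (free) exit
    enddo
    COLOR_E(icel)= icol
    NCOLOR_E_tot = max(NCOLOR_E_tot, icol)
    do k= 1, 4
      NODUSED(icol,ICELNOD(icel,k))= .true.
    enddo
  enddo
end subroutine color_elements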


Remedy

(Same loop structure as above.)

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring the elements: elements which do not share any nodes are in the same color.
  • Short innermost loops.
    • Loop exchange.


Remedy

(Same loop structure as above.)

  • a16,23 and a23,16 might not be calculated properly.
    • Coloring the elements: elements which do not share any nodes are in the same color.
  • Short innermost loops.
    • Loop exchange.
  • There are many "if-then-else"s.
    • Define an ELEMENT-to-MATRIX array.


Define ELEMENT-to-MATRIX array

ELEMmat(icel, ie, je):  icel = element ID, ie/je = local node IDs.

(Figure: elements 13 and 14 sharing nodes 16 and 23; local node IDs 1-4 within each element.)

if  kkU= index_U(16-1)+k  and  item_U(kkU)= 23  then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if  kkL= index_L(23-1)+k  and  item_L(kkL)= 16  then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif


Define ELEMENT-to-MATRIX array

(Same figure and pseudocode as above.)

"ELEMmat" specifies the relationship between the node pairs (local node IDs ie, je) of each element and the address of the corresponding entry of the global coefficient matrix.
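A sketch of how such an address table could be used in the accumulation loop (illustrative only; the array names EMAT, AL, AU, D and the convention that ELEMmat = 0 marks a diagonal (ie = je) contribution are assumptions, not the actual GeoFEM code). Within one color no two elements touch the same matrix address, so the innermost loop carries no dependency:

subroutine accum_color (icol, N, ICELTOT, index_COL, item_COL, &
                        ICELNOD, ELEMmat, EMAT, AL, AU, D)
! Accumulate the (ie,je) entries of the element matrices of one color
! into the global matrix through the ELEMmat address table.
  implicit none
  integer, intent(in)    :: icol, N, ICELTOT
  integer, intent(in)    :: index_COL(0:*), item_COL(*)
  integer, intent(in)    :: ICELNOD(ICELTOT,4), ELEMmat(ICELTOT,4,4)
  real(8), intent(in)    :: EMAT(ICELTOT,4,4)
  real(8), intent(inout) :: AL(*), AU(*), D(N)
  integer :: ie, je, ic0, icel, k

  do ie= 1, 4
  do je= 1, 4
!cdir nodep
    do ic0= index_COL(icol-1)+1, index_COL(icol)
      icel= item_COL(ic0)
      k   = ELEMmat(icel,ie,je)
      if (k.gt.0) then
        AU( k)= AU( k) + EMAT(icel,ie,je)          ! upper triangle
      else if (k.lt.0) then
        AL(-k)= AL(-k) + EMAT(icel,ie,je)          ! lower triangle
      else
        D(ICELNOD(icel,ie))= D(ICELNOD(icel,ie)) + EMAT(icel,ie,je)
      endif
    enddo
  enddo
  enddo
end subroutine accum_color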


Optimized Procedure

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - define "ELEMmat" array
      enddo
    enddo
  enddo
enddo

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - assemble element matrix
      enddo
    enddo
  enddo
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
enddo

  • Extra computation for ELEMmat.
  • Extra storage for
    • the ELEMmat array
    • element-matrix components for the elements in each color
    • < 10% increase


Optimized Procedure

(Same loops as above.)

PART I: "integer" operations defining the "ELEMmat" array. In nonlinear cases this part needs to be done just once (before the initial iteration), as long as the mesh connectivity does not change.


Optimized Procedure

(Same loops as above.)

PART I: "integer" operations defining the "ELEMmat" array. In nonlinear cases this part needs to be done just once (before the initial iteration), as long as the mesh connectivity does not change.

PART II: "floating-point" operations for matrix assembling/accumulation. In nonlinear cases this part is repeated at every nonlinear iteration.


Time for 3x64^3 = 786,432 DOF

|                                  |        | DCRS sec. (MFLOPS) | DJDS original sec. (MFLOPS) | DJDS improved sec. (MFLOPS) |
| ES, 8.0 GFLOPS peak              | Matrix | 28.6 (291)         | 34.2 (240)                  | 12.5 (643)                  |
|                                  | Solver | 360 (171)          | 21.7 (3246)                 | 21.7 (3246)                 |
|                                  | Total  | 389                | 55.9                        | 34.2                        |
| Opteron 1.8 GHz, 3.6 GFLOPS peak | Matrix | 10.2 (818)         | 12.4 (663)                  | 21.2 (381)                  |
|                                  | Solver | 225 (275)          | 271 (260)                   | 271 (260)                   |
|                                  | Total  | 235                | 283                         | 292                         |


Time for 3x64^3 = 786,432 DOF

(Same table as above.) The improved DJDS "Matrix" time on the Opteron (21.2 sec.) is slower than the original (12.4 sec.) because of the long innermost loops: data locality has been lost.


(Plots: "Matrix", "Solver", and "Matrix+Solver" times for the original and improved DJDS codes on the ES and on the Opteron.)


Computation Time vs. Problem Size

(Plots: "Matrix", "Solver", and "Total" times vs. problem size for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).)


“Matrix” computation time for the improved version of DJDS

(Plot: breakdown into the "integer" and "floating-point" parts on the ES and on the Opteron.)


Optimization of “Matrix” assembling/formation on the ES

  • The DJDS version has been much improved compared with the original one, but it is still slower than the DCRS version on the Opteron.
  • The "integer" operation part is slower.
  • But the "floating-point" operation part is much faster than on the Opteron.
  • In nonlinear simulations the "integer" part is executed only once (just before the initial iteration), so the ES outperforms the Opteron if the number of nonlinear iterations is more than 2.


Suppose a “virtual” mode where …

  • On the scalar processor:
    • the "integer" operation part
  • On the vector processor:
    • the "floating-point" operation part
    • the linear solvers
  • The scalar performance of the ES (500 MHz) is lower than that of a Pentium III.


Time for 3x64^3 = 786,432 DOF

|                                  |        | DCRS sec. (MFLOPS) | DJDS improved sec. (MFLOPS) | DJDS virtual sec. (MFLOPS) |
| ES, 8.0 GFLOPS peak              | Matrix | 28.6 (291)         | 12.5 (643)                  | 1.88 (4431)                |
|                                  | Solver | 360 (171)          | 21.7 (3246)                 | 21.7 (3246)                |
|                                  | Total  | 389                | 34.2                        | 23.6                       |
| Opteron 1.8 GHz, 3.6 GFLOPS peak | Matrix | 10.2 (818)         | 21.2 (381)                  | -                          |
|                                  | Solver | 225 (275)          | 271 (260)                   | -                          |
|                                  | Total  | 235                | 292                         | -                          |


Summary: Vectorization of FEM appl.

  • NOT so easy.
  • FEM's good feature of local operations is not necessarily suitable for vector processors.
    • Preconditioned iterative solvers can be vectorized rather more easily, because they operate on the global matrix.
  • Sometimes a major revision of the original code is required.
    • Usually more memory, more lines of code, additional operations …
  • Code optimized for vector processors does not necessarily perform well on scalar processors (e.g. matrix assembling in FEM).

