Preconditioned iterative linear solvers for unstructured grids on the earth simulator

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Kengo Nakajima

[email protected]

Supercomputing Division, Information Technology Center,

The University of Tokyo

Japan Agency for Marine-Earth Science and Technology

April 22, 2008


08-APR22

Overview

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


Large-Scale Computing

  • Scalar Processor

    • Big gap between clock rate and memory bandwidth.

    • Very low sustained/peak performance ratio (<10%)

08-APR22


Scalar Processors: CPU-Cache-Memory Hierarchical Structure

(Figure: memory hierarchy. Registers and cache sit next to the CPU: fast, small capacity (MB), expensive, built from 100M+ transistors. Main memory is slow, of large capacity (GB), and cheap.)

08-APR22


Large-Scale Computing

  • Scalar Processor

    • Big gap between clock rate and memory bandwidth.

    • Very low sustained/peak performance ratio (<10%)

  • Vector Processor

    • Very high sustained/peak performance ratio

      • e.g. 35 % for FEM applications on the Earth Simulator

    • requires …

      • very special tuning

      • sufficiently long loops (i.e., large problem sizes) to reach a reasonable fraction of peak performance

    • Suitable for simple computation.

08-APR22


Vector Processors: Vector Registers & Fast Memory

  • Parallel processing of simple DO loops.
  • Suitable for simple & large computations.

(Figure: a vector processor streams data between very fast main memory and vector registers, e.g. for the loop below.)

do i= 1, N
  A(i)= B(i) + C(i)
enddo

08-APR22




Hardware comparison:

                                  Earth        IBM SP3      Hitachi       Hitachi
                                  Simulator    (LBNL)       SR8000/MPP    SR11000/J2
                                                            (U.Tokyo)     (U.Tokyo)
Peak performance (GFLOPS)            8.00        1.50          1.80          9.20
Measured memory BW, STREAM
  (GB/sec/PE)                       26.6         0.623         2.85          8.00
BYTE/FLOP                            3.325       0.413         1.583         0.870
Estimated performance/core,
  GFLOPS (% of peak)              2.31-3.24    .072-.076     .291-.347     .880-.973
                                 (28.8-40.5)   (4.8-5.1)    (16.1-19.3)   (9.6-10.6)
Measured performance/core,
  GFLOPS (% of peak)              2.93 (36.6)  .122 (8.1)    .335 (18.6)   1.34 (14.5)
Comm. BW (GB/sec/node)              12.3         1.00          1.60         12.0
MPI latency (μsec)                   5.6-7.7    16.3           6-20          4.7 *

* value for an IBM p595, J.T. Carter et al.

08-APR22


Typical Behavior …

(Figure: performance vs. problem size.)

  • Earth Simulator: performance is good for large-scale problems (around 40% of peak) thanks to the long vector length.
  • IBM-SP3: performance is good only for small problems, due to the cache effect; for large problems it stays around 8% of peak.

08-APR22


Parallel Computing: Strong Scaling (Fixed Problem Size)

(Figure: performance vs. number of PEs, compared with the ideal line.)

  • IBM-SP3: super-linear speed-up for small numbers of PEs (cache effect); performance falls off for many PEs due to communication overhead.
  • Earth Simulator: performance falls off for many PEs due to communication overhead and short vector length.

08-APR22


Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2

(Figure: performance of ■ Flat-MPI/DCRS and □ Hybrid/DCRS improves with memory performance: bandwidth, latency, etc.)

  • IBM SP-3: POWER3, 375 MHz, 1.0 GB/sec memory BW, 8 MB L2 cache/PE
  • Hitachi SR11000/J2: POWER5+, 2.3 GHz, 8.0 GB/sec memory BW, 18 MB L3 cache/PE

08-APR22


08-APR22

My History with Vector Computers

  • Cray-1S (1985-1988)

    • Mitsubishi Research Institute (MRI)

  • Cray-YMP (1988-1995)

    • MRI, University of Texas at Austin

  • Fujitsu VP, VPP series (1985-1999)

    • JAERI, PNC

  • NEC SX-4 (1997-2001)

    • CCSE/JAERI

  • Hitachi SR2201 (1997-2004)

    • University of Tokyo, CCSE/JAERI

  • Hitachi SR8000 (2000-2007)

    • University of Tokyo

  • Earth Simulator (2002-)


08-APR22

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


08-APR22

GeoFEM: FY.1998-2002, http://geofem.tokyo.rist.or.jp/

  • Parallel FEM platform for solid-earth simulation:
    • parallel I/O, parallel linear solvers, parallel visualization
    • solid earth: earthquakes, plate deformation, mantle/core convection, etc.
  • Part of a national project by STA/MEXT for large-scale earth-science simulations using the Earth Simulator.
  • Strong collaboration between the HPC and natural-science (solid earth) communities.


08-APR22

System Configuration of GeoFEM


08-APR22

Results on Solid Earth Simulation

(Figures: complicated plate model around the Japan Islands; magnetic field of the Earth (MHD code); simulation of the earthquake generation cycle in southwestern Japan; transportation by groundwater flow through heterogeneous porous media; tsunami propagation.)


08-APR22

Results by GeoFEM


08-APR22

Features of FEM applications (1/2)

  • Local “element-by-element” operations

    • sparse coefficient matrices

    • suitable for parallel computing

  • HUGE “indirect” accesses

    • IRREGULAR sparse matrices

    • memory intensive

do i= 1, N
  jS= index(i-1) + 1
  jE= index(i)
  do j= jS, jE
    in= item(j)                      ! indirect access through item()
    Y(i)= Y(i) + AMAT(j)*X(in)
  enddo
enddo


08-APR22

Features of FEM applications (2/2)

  • In parallel computation …

    • comm. with ONLY neighbors (except “dot products” etc.)

    • the amount of data per message is relatively small, because only values on the domain boundaries are exchanged.

    • communication (MPI) latency is critical


08-APR22

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


08-APR22

Earth Simulator (ES), http://www.es.jamstec.go.jp/

  • 640×8= 5,120 Vector Processors

    • SMP Cluster-Type Architecture

    • 8 GFLOPS/PE

    • 64 GFLOPS/Node

    • 40 TFLOPS/ES

  • 16 GB Memory/Node, 10 TB/ES

  • 640×640 Crossbar Network

    • 16 GB/sec×2

  • Memory bandwidth: 32 GB/sec per PE

  • 35.6 TFLOPS for LINPACK (2002-March)

  • 26 TFLOPS for AFES (Climate Simulation)


08-APR22

Motivations

  • GeoFEM Project (FY.1998-2002)

  • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)

    • Implicit Linear Solvers

    • Hybrid vs. Flat MPI Parallel Programming Model


08-APR22

Flat MPI vs. Hybrid

(Figure: in Flat MPI, every PE runs its own MPI process with its own partition of memory; in the Hybrid model, the PEs of one SMP node share memory and one MPI process per node communicates externally.)

  • Flat MPI: each PE is independent.
  • Hybrid: hierarchical structure (MPI between nodes, threads within a node).


08-APR22

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


08-APR22

Direct/Iterative Methods for Linear Equations

  • Direct Methods

    • Gaussian Elimination/LU Factorization.

      • compute A^{-1} directly.

    • Robust for wide range of applications.

    • More expensive than iterative methods (memory, CPU)

    • Not suitable for parallel and vector computation due to its global operations.

  • Iterative Methods

    • CG, GMRES, BiCGSTAB

    • Less expensive than direct methods, especially in memory.

    • Suitable for parallel and vector computing.

    • Convergence strongly depends on problems, boundary conditions (condition number etc.)

      • Preconditioning is required


08-APR22

Preconditioning for Iterative Methods

  • Convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.

  • A preconditioner M transforms the linear system into one with more favorable spectral properties

    • In "ill-conditioned" problems, "condition number" (ratio of max/min eigenvalue if A is symmetric) is large.

    • M transforms the original equation Ax=b into A'x=b', where A'=M^{-1}A and b'=M^{-1}b.

  • ILU (Incomplete LU Factorization) or IC (Incomplete Cholesky Factorization) are well-known preconditioners.
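
To make concrete where M enters the iteration, here is a minimal sketch of a preconditioned CG loop in Fortran. This is an illustration only, not GeoFEM code: a small SPD tridiagonal system is solved, and simple diagonal scaling stands in for the Block IC(0)/ILU(0) preconditioner discussed later; the names pcg_sketch and matvec are made up for this example.

! Minimal preconditioned CG sketch (not GeoFEM code): solves Ax=b for a
! small SPD tridiagonal matrix; diagonal scaling stands in for M.
program pcg_sketch
  implicit none
  integer, parameter :: n = 100
  real(8) :: d(n), l(n), u(n)          ! tridiagonal A: lower/diag/upper
  real(8) :: x(n), b(n), r(n), z(n), p(n), q(n)
  real(8) :: rho, rho_old, alpha, beta, rnorm
  integer :: iter

  d = 2.0d0; l = -1.0d0; u = -1.0d0    ! 1D Laplacian-like SPD matrix
  b = 1.0d0; x = 0.0d0

  call matvec(n, l, d, u, x, q)
  r = b - q
  rho_old = 1.0d0
  do iter = 1, 1000
     z = r / d                         ! z = M^{-1} r  (preconditioning step)
     rho = dot_product(r, z)
     if (iter == 1) then
        p = z
     else
        beta = rho / rho_old
        p = z + beta * p
     end if
     call matvec(n, l, d, u, p, q)     ! q = A p
     alpha = rho / dot_product(p, q)
     x = x + alpha * p
     r = r - alpha * q
     rho_old = rho
     rnorm = sqrt(dot_product(r, r))
     if (rnorm < 1.0d-8) exit
  end do
  print *, 'iterations:', iter, '  residual:', rnorm

contains
  subroutine matvec(n, l, d, u, v, w)  ! w = A v for the tridiagonal A
    integer, intent(in)  :: n
    real(8), intent(in)  :: l(n), d(n), u(n), v(n)
    real(8), intent(out) :: w(n)
    integer :: i
    do i = 1, n
       w(i) = d(i) * v(i)
       if (i > 1) w(i) = w(i) + l(i) * v(i-1)
       if (i < n) w(i) = w(i) + u(i) * v(i+1)
    end do
  end subroutine matvec
end program pcg_sketch

The only place the preconditioner appears is the single line z = M^{-1} r; replacing diagonal scaling by the IC(0) forward/backward substitution described later changes only that line.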


08-APR22

Strategy in GeoFEM

  • Iterative method is the ONLY choice for large-scale parallel computing.

  • Preconditioning is important

    • general methods, such as ILU(0)/IC(0), cover a wide range of applications.

    • problem-specific methods.


08-APR22

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


08-APR22

Block IC(0)-CG Solver on the Earth Simulator

  • 3D Linear Elastic Problems (SPD)

  • Parallel Iterative Linear Solver

    • Node-based Local Data Structure

    • Conjugate-Gradient Method (CG): SPD

    • Localized Block IC(0) Preconditioning (Block Jacobi)

      • Modified IC(0): Non-diagonal components of original [A] are kept

    • Additive Schwarz Domain Decomposition (ASDD)

    • http://geofem.tokyo.rist.or.jp/

  • Hybrid Parallel Programming Model

    • OpenMP+MPI

    • Re-Ordering for Vector/Parallel Performance

    • Comparison with Flat MPI


08-APR22

Flat MPI vs. Hybrid (the same figure as before: Flat MPI = an independent MPI process per PE; Hybrid = hierarchical structure with shared memory inside each SMP node).


08-APR22

Local Data Structure: Node-based Partitioning (internal nodes - elements - external nodes)

(Figure: a square mesh partitioned into four domains PE#0-PE#3; each domain stores its internal nodes, the elements that touch them, and the external nodes of neighbouring domains that those elements reference.)


08-APR22

1 SMP node => 1 domain for the Hybrid programming model; MPI communication among domains.


08-APR22

Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.


08-APR22

ILU(0)/IC(0) Factorization

do i= 2, n
  do k= 1, i-1
    if ((i,k) ∈ NonZero(A)) then
      a(i,k) := a(i,k) / a(k,k)
    endif
    do j= k+1, n
      if ((i,j) ∈ NonZero(A)) then
        a(i,j) := a(i,j) - a(i,k)*a(k,j)
      endif
    enddo
  enddo
enddo


08-APR22

ILU/IC Preconditioning

M = (L+D) D^{-1} (D+U)

We need to solve Mz = r at every iteration. Writing (L+D) D^{-1}(D+U) z = r and setting D^{-1}(D+U) z = z1:

Forward substitution:   (L+D) z1 = r            :  z1 = D^{-1}(r - L z1)
Backward substitution:  (I + D^{-1}U) z_new = z1 :  z = z - D^{-1}U z


08-APR22

ILU/IC Preconditioning

Forward substitution   (L+D) z = r  :  z = D^{-1}(r - Lz)

do i= 1, N
  WVAL= R(i)
  do j= 1, INL(i)
    WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
  enddo
  Z(i)= WVAL / D(i)
enddo

Backward substitution   (I + D^{-1}U) z_new = z_old  :  z = z - D^{-1}U z

do i= N, 1, -1
  SW= 0.0d0
  do j= 1, INU(i)
    SW= SW + AU(i,j) * Z(IAU(i,j))
  enddo
  Z(i)= Z(i) - SW / D(i)
enddo

M = (L+D) D^{-1} (D+U);  L, D, U are taken from A.

Dependency: you need the most recent value of "z" at connected nodes, so vectorization/parallelization is difficult.


08-APR22

Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.

  • Re-Ordering for Highly Parallel/Vector Performance

    • Local operation and no global dependency

    • Continuous memory access

    • Sufficiently long loops for vectorization


08-APR22

ILU/IC Preconditioning (with re-ordering)

(The forward/backward substitution loops are the same as on the previous slide.)

Dependency: you need the most recent value of "z" at connected nodes, so vectorization/parallelization is difficult.

Reordering: after re-ordering, directly connected nodes do not appear on the right-hand side within a color, so each color can be processed as a vectorizable loop.


08-APR22

Basic Strategy for Parallel Programming on the Earth Simulator

  • Hypothesis

    • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.

  • Re-Ordering for Highly Parallel/Vector Performance

    • Local operation and no global dependency

    • Continuous memory access

    • Sufficiently long loops for vectorization

  • 3-Way Parallelism for Hybrid Parallel Programming

    • Inter Node : MPI

    • Intra Node : OpenMP

    • Individual PE : Vectorization
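
As an illustration of the three levels working together, the following minimal sketch (hypothetical code, not GeoFEM source) uses MPI across nodes and an OpenMP parallel do inside a node; the parallel do splits one long, dependence-free loop across the PEs, and each PE's chunk is left to the vectorizing compiler.

! Minimal 3-level sketch: MPI between SMP nodes, OpenMP inside a node,
! vectorization of the long stride-1 loop (illustration, not GeoFEM).
program three_level
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: x(:), y(:)
  real(8) :: local_sum, global_sum
  integer :: i, ierr, rank, nprocs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(x(n), y(n))
  x = 1.0d0; y = 2.0d0
  local_sum = 0.0d0

!$omp parallel do reduction(+:local_sum)
  do i = 1, n                      ! long, dependence-free loop:
     y(i) = y(i) + 0.5d0 * x(i)    ! vectorized on each PE
     local_sum = local_sum + y(i)
  end do
!$omp end parallel do

  ! inter-node level: MPI reduction, as in the dot products of CG
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum =', global_sum

  call MPI_Finalize(ierr)
end program three_level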


08-APR22

Re-Ordering Technique for Vector/Parallel Architectures: Cyclic DJDS (RCM+CMC) Re-Ordering
(Doi, Washio, Osoda and Maruyama (NEC))

  1. RCM (Reverse Cuthill-McKee)
  2. CMC (Cyclic Multicolor)
  3. DJDS re-ordering
  4. Cyclic DJDS for SMP units

(Figure labels "Vector" and "SMP Parallel" indicate which level each step targets.) These re-ordering processes can be substituted by traditional multi-coloring (MC).


08-APR22

Reordering = Coloring

  • COLOR: unit of independent sets.
  • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible.
  • Many colors give faster convergence but shorter vector lengths: a trade-off! (A greedy coloring sketch follows below.)

(Figure: example orderings - RCM (Reverse CM), Red-Black (2 colors), 4 colors.)
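
One simple way to build such independent sets is greedy coloring of the node graph. The sketch below is an illustration only, not the CM-RCM/MC routines used in GeoFEM: it colors an 8x8 grid with 5-point connectivity and, as expected, ends up with the red-black (2-color) ordering shown in the figure.

! Greedy multicoloring sketch (illustration only): each node gets the
! smallest color not used by an already-colored neighbour, so all nodes
! of one color are mutually independent (one vector sweep per color).
program greedy_coloring
  implicit none
  integer, parameter :: nx = 8, ny = 8, nnode = nx*ny
  integer :: color(nnode), used(16)
  integer :: i, j, node, c, ncolor

  color = 0
  ncolor = 0
  do node = 1, nnode
     used = 0
     i = mod(node-1, nx) + 1
     j = (node-1)/nx + 1
     ! mark colors already taken by the 4 grid neighbours
     if (i > 1 ) call mark(color(node-1 ), used)
     if (i < nx) call mark(color(node+1 ), used)
     if (j > 1 ) call mark(color(node-nx), used)
     if (j < ny) call mark(color(node+nx), used)
     ! pick the smallest free color
     c = 1
     do while (used(c) /= 0)
        c = c + 1
     end do
     color(node) = c
     ncolor = max(ncolor, c)
  end do
  print *, 'colors used:', ncolor        ! 2 for this grid (red-black)

contains
  subroutine mark(c, used)
    integer, intent(in)    :: c
    integer, intent(inout) :: used(:)
    if (c > 0) used(c) = 1
  end subroutine mark
end program greedy_coloring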


08-APR22

Large-Scale Sparse Matrix Storage for Unstructured Grids

  • 2D storage: long vector length, but many ZEROs.
  • 1D storage (CRS): memory saved, but short vector length.


08-APR22

Re-Ordering in each Color according to the Number of Non-Zero Off-Diagonal Components

Elements in the same color are independent, therefore intra-hyperplane re-ordering does not affect the results.

DJDS: Descending-order Jagged Diagonal Storage


08-APR22

Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units, Load Balancing among PEs

do iv= 1, NCOLORS
!$omp parallel do
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!cdir nodep
      do i= iv0+1, iv0+iE-iS
        k= i+iS - iv0
        kk= IAL(k)
        (Important Computations)
      enddo
    enddo
  enddo
enddo

npLX1= NLmax * PEsmpTOT
INL(0:NLmax*PEsmpTOT*NCOLORS)


08-APR22

Difference betweenFlat MPI & Hybrid

  • Most of the re-ordering effort is for vectorization.

  • If you have a long vector, you can simply divide it and distribute the segments to the PEs of an SMP node.

  • The source codes of the Hybrid and Flat MPI versions are not so different:

    • Flat MPI corresponds to Hybrid with one PE per SMP node.

    • In other words, the Flat MPI code is already complicated enough.


08-APR22

Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization


08-APR22

Simple 3D Cubic Model

(Figure: a cube discretized into Nx x Ny x Nz nodes. Boundary conditions: Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin; a uniform distributed force in the z-direction is applied on one face of the cube.)


08-APR22

Effect of Re-Ordering

(Figures: the three cases compared are long loops with continuous access, short loops with irregular access, and short loops with continuous access.)


08-APR22

Matrix Storage, Loops

  • DJDS (Descending-order Jagged Diagonal Storage), with long innermost loops, is suitable for vector processors:

do iv= 1, NVECT
  iv0= STACKmc(iv-1)
  do j= 1, NLhyp(iv)
    iS= index_L(NL*(iv-1)+ j-1)
    iE= index_L(NL*(iv-1)+ j  )
    do i= iv0+1, iv0+iE-iS          ! long, vectorizable loop over one jagged diagonal
      k = i+iS - iv0
      kk= item_L(k)
      Z(i)= Z(i) - AL(k)*Z(kk)
    enddo
  enddo
  iS= STACKmc(iv-1) + 1             ! diagonal scaling, once per color
  iE= STACKmc(iv  )
  do i= iS, iE
    Z(i)= Z(i)/DD(i)
  enddo
enddo

  • The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operation:

do i= 1, N
  SW= WW(i,Z)
  isL= index_L(i-1)+1
  ieL= index_L(i)
  do j= isL, ieL
    k= item_L(j)
    SW= SW - AL(j)*Z(k)
  enddo
  Z(i)= SW/DD(i)
enddo


08-APR22

Effect of Vector Length / Effect of Re-Ordering
Results on 1 SMP node; number of colors fixed at 99. Re-ordering is REALLY required!

(Figure: performance vs. problem size for three cases ●, ■, ▲; longer vector length gives roughly a 10x speed-up and re-ordering on top of that roughly 100x. The best case reaches 22 GFLOPS, 34% of peak, against an ideal 40%-45% for a single CPU.)

80x80x80 case (1.5M DOF):
  ● 212 iterations, 11.2 sec.
  ■ 212 iterations, 143.6 sec.
  ▲ 203 iterations, 674.2 sec.

08-APR22

3D Elastic Simulation, Problem Size vs. GFLOPS
Earth Simulator, 1 SMP node (8 PEs)

  • Flat MPI: 23.4 GFLOPS, 36.6% of peak
  • Hybrid (OpenMP): 21.9 GFLOPS, 34.3% of peak
  • Flat MPI is better here: intra-node MPI is fast.


08-APR22

(The hardware comparison table from the beginning of the talk is shown again here for reference.)


08-APR22

3D Elastic Simulation, Problem Size vs. GFLOPS
Hitachi SR8000/MPP with pseudo-vectorization, 1 SMP node (8 PEs)

  • Flat MPI: 2.17 GFLOPS, 15.0% of peak
  • Hybrid (OpenMP): 2.68 GFLOPS, 18.6% of peak
  • Hybrid is better here: intra-node MPI performance is low.


08-APR22

3D Elastic Simulation, Problem Size vs. GFLOPS
IBM SP3 (NERSC), 1 SMP node (8 PEs)

(Figure: Flat MPI vs. Hybrid; the cache is well utilized in Flat MPI.)


08-APR22

3D Elastic Simulation, Problem Size vs. GFLOPS
Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)

(Figure: Flat MPI vs. Hybrid.)


08-APR22

SMP node # > 10: up to 176 nodes (1,408 PEs). The problem size per SMP node is fixed (weak scaling). PDJDS-CM/RCM, number of colors: 99.


08-APR22

3D Elastic Model (Large Case): 256x128x128 grid points per SMP node, up to 2,214,592,512 DOF

(Figures: parallel work ratio and GFLOPS rate vs. node count; ● Flat MPI, ○ Hybrid.)

  • 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).


08-APR22

3D Elastic Model (Small Case): 64x64x64 grid points per SMP node, up to 125,829,120 DOF

(Figures: parallel work ratio and GFLOPS rate vs. node count; ● Flat MPI, ○ Hybrid.)


08-APR22

Hybrid outperforms Flat-MPI

  • when ...

    • number of SMP node (PE) is large.

    • problem size/node is small.

  • because Flat MPI has ...

    • 8 times as many communicating MPI processes

    • about twice as large a communication/computation ratio (splitting each sub-domain into 8 smaller ones halves the edge length and therefore doubles the surface-to-volume ratio)

  • The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.

  • Performance estimation by D. Kerbyson (LANL)

    • LA-UR-02-5222

    • relatively large communication latency of the ES


08-APR22

Flat MPI and Hybrid

(Figure: Hybrid vs. Flat MPI.)


08-APR22

(The hardware comparison table is shown once more.)


08-APR22

Why Communication Overhead ?

  • latency of network

  • finite bandwidth of network

  • synchronization at SEND/RECV, ALLREDUCE etc.

  • memory performance in boundary communications (memory COPY)


08-APR22

Domain-to-Domain Communication: Exchange Boundary Information (SEND/RECV)

subroutine SOLVER_SEND_RECV                              &
          (N, NEIBPETOT, NEIBPE,                         &
           IMPORT_INDEX, IMPORT_NODE,                    &
           EXPORT_INDEX, EXPORT_NODE,                    &
           WS, WR, X, SOLVER_COMM, my_rank)

  implicit REAL*8 (A-H,O-Z)
  include 'mpif.h'
  parameter (KREAL= 8)

  integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
  integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
  integer NEIBPE(NEIBPETOT)
  integer SOLVER_COMM, my_rank
  integer req1(NEIBPETOT), req2(NEIBPETOT)
  integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
  integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
  real(kind=KREAL) X(N), WS(N), WR(N)

! SEND: gather boundary values into WS and post non-blocking sends
  do neib= 1, NEIBPETOT
    istart= EXPORT_INDEX(neib-1)
    inum  = EXPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      WS(k)= X(EXPORT_NODE(k))
    enddo
    call MPI_ISEND (WS(istart+1), inum, MPI_DOUBLE_PRECISION,   &
                    NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
  enddo

! RECEIVE: post non-blocking receives, wait, then scatter into X
  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    call MPI_IRECV (WR(istart+1), inum, MPI_DOUBLE_PRECISION,   &
                    NEIBPE(neib), 0, SOLVER_COMM, req2(neib), ierr)
  enddo
  call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      X(IMPORT_NODE(k))= WR(k)
    enddo
  enddo

  call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)
  return
  end


08-APR22

(The SOLVER_SEND_RECV routine shown above appears again on this slide.)


08-APR22

Communication Overhead

(Figures: the communication overhead is broken down into memory copy, communication latency, and communication bandwidth; the memory-copy and bandwidth parts depend on the message size.)

  • On the Earth Simulator the latency part dominates.
  • On the Hitachi SR11000, IBM SP3, etc., the memory-copy and bandwidth parts dominate.

Communication Overhead = Synchronization Overhead

(Figures: the same breakdown, interpreted as synchronization overhead; on the Earth Simulator the latency part again dominates.)
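
A back-of-the-envelope way to see these effects is the usual linear model T ≈ latency + message_size / bandwidth. The snippet below plugs in the MPI latency and inter-node bandwidth figures quoted in the comparison table; it considers a single message and ignores memory copies and synchronization, so it is only a rough sketch.

! Back-of-the-envelope communication model T = latency + bytes/bandwidth,
! using per-node figures from the comparison table (Earth Simulator:
! ~6.7e-6 s latency, 12.3 GB/s; IBM SP3: ~16.3e-6 s, 1.0 GB/s).
program comm_model
  implicit none
  real(8), parameter :: lat_es  = 6.7d-6,  bw_es  = 12.3d9
  real(8), parameter :: lat_sp3 = 16.3d-6, bw_sp3 = 1.0d9
  real(8) :: m
  integer :: k

  do k = 10, 20, 2                    ! message sizes 1 KB .. 1 MB
     m = 2.0d0**k
     print '(a,f10.1,a,2es12.3)', 'bytes=', m, '  T(ES), T(SP3) [s] =', &
           lat_es + m/bw_es, lat_sp3 + m/bw_sp3
  end do
end program comm_model

For the small boundary messages typical of domain decomposition (a few KB), the latency term dominates; on the slower SP3 network the bandwidth term takes over much sooner as messages grow, which matches the weak-scaling observations on the next slides.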


08-APR22

Communication Overhead, Weak Scaling: Earth Simulator

(Figure: the effect of message size is small and the effect of latency is large; the memory copy is very fast.)


08-APR22

Communication Overhead, Weak Scaling: IBM SP-3

(Figure: here the effect of message size is more significant.)


08-APR22

Communication Overhead, Weak Scaling: Hitachi SR11000/J2 (8 cores/node)


08-APR22

Summary

  • Hybrid Parallel Programming Model on SMP Cluster Architecture with Vector Processors for Unstructured Grids

  • Good parallel performance both within and across SMP nodes on the ES: 3.8 TFLOPS for a 2.2G-DOF 3D linear elastic problem on 176 nodes (33.8% of peak) using the BIC(0)-CG method.

    • N. Kushida (a student of Prof. Okuda) attained >10 TFLOPS on 512 nodes for a >3G-DOF problem.

  • Re-Ordering is really required


08-APR22

Summary (cont.)

  • Hybrid vs. Flat MPI

    • Flat MPI is better for small numbers of SMP nodes.

    • Hybrid is better for large numbers of SMP nodes, especially when the problem size is rather small.

    • Flat MPI is bound mainly by communication, Hybrid mainly by memory.

    • The outcome depends on the application, problem size, etc.

    • Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP.

  • In matrix-vector operations the difference is not so significant.


08-APR22

  • Background

    • Vector/Scalar Processors

    • GeoFEM Project

    • Earth Simulator

    • Preconditioned Iterative Linear Solvers

  • Optimization Strategy on the Earth Simulator

    • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications

    • Matrix Assembling

  • Summary & Future Works


08-APR22

“CUBE” Benchmark

  • 3D linear elastic applications on cubes over a wide range of problem sizes.

  • Hardware

    • Single CPU

    • Earth Simulator

    • AMD Opteron (1.8GHz)


08-APR22

Time for 3x64^3 = 786,432 DOF

                           |  DCRS           |  DJDS original
                           |  sec. (MFLOPS)  |  sec. (MFLOPS)
ES (8.0 GFLOPS)    Matrix  |   28.6 (291)    |   34.2 (240)
                   Solver  |  360   (171)    |   21.7 (3246)
                   Total   |  389            |   55.9
Opteron 1.8 GHz    Matrix  |   10.2 (818)    |   12.4 (663)
(3.6 GFLOPS)       Solver  |  225   (275)    |  271   (260)
                   Total   |  235            |  283


08-APR22

(Figure: Matrix, Solver, and Matrix+Solver times for the original DJDS kernel on the ES and on the Opteron.)


08-APR22

Computation Time vs. Problem Size

(Figure: Matrix, Solver, and Total times as functions of problem size for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).)


08-APR22

Matrix assembling/formation is rather expensive

  • This part should also be optimized for vector processors.

  • For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, the matrices must be updated at every nonlinear iteration.

  • This part strongly depends on the application/physics, so it is very difficult to develop general libraries for it, unlike iterative linear solvers.

    • It also includes complicated processes that are difficult to vectorize.


08-APR22

Typical Procedure for Calculating Coefficient Matrix in FEM

  • Apply Galerkin's method on each element.

  • Integrate over each element to get the element matrix.

  • Element matrices are accumulated at each node, giving the global matrix

    => global linear equations (see the sketch below)

  • Matrix assembling/formation is an embarrassingly parallel procedure thanks to its element-by-element nature.
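
To make the accumulation step concrete, here is a minimal 1D toy sketch (linear elements for a 1D Laplace operator, dense global matrix; an illustration only, not GeoFEM code): each 2x2 element matrix is added into the rows and columns of the element's two nodes. In a real FEM code the global matrix is sparse (CRS/DJDS), which is exactly where the addressing issues discussed next come from.

! Toy element-by-element assembly (1D Laplace, linear elements):
! each element contributes a 2x2 matrix that is accumulated into the
! global matrix at the rows/columns of its two nodes.
program assemble_1d
  implicit none
  integer, parameter :: nelem = 4, nnode = nelem + 1
  real(8) :: h, ke(2,2), A(nnode,nnode)
  integer :: nodes(2), icel, ie, je

  h  = 1.0d0 / nelem
  ke = reshape( (/ 1.0d0, -1.0d0, -1.0d0, 1.0d0 /), (/2,2/) ) / h
  A  = 0.0d0

  do icel = 1, nelem                 ! loop over elements
     nodes = (/ icel, icel+1 /)      ! global node ids of this element
     do ie = 1, 2
        do je = 1, 2                 ! accumulate element matrix
           A(nodes(ie), nodes(je)) = A(nodes(ie), nodes(je)) + ke(ie,je)
        end do
     end do
  end do

  print '(5f8.2)', (A(ie,1:nnode), ie = 1, nnode)   ! tridiagonal result
end program assemble_1d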


08-APR22

Element-by-Element Operations

  • Integration over each element => element-matrix

  • Element matrices are accumulated to each node

    => global-matrix

  • Linear equations for each node

(Figure: a rectangular mesh of 24 elements, numbered 1-24.)


08-APR22

Element-by-Element Operations (continued)

(Figure: the same mesh, now showing both the element numbers 1-24 and the node numbers 1-35.)

08-APR22

Element-by-Element Operations (continued)

(Figures: the element matrices are accumulated node by node into the global sparse matrix; the X pattern shows the resulting non-zero structure, and one set of linear equations is obtained for each node.)

08-APR22

Element-by-Element Operations

(Figure: zoom on the patch of elements 13 and 14 around nodes 16 and 23.)

  • If you calculate a(23,16) and a(16,23), you have to take into account the contributions from both the 13th and the 14th elements.


08-APR22

Current Approach

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

(Figure: the same patch of elements and nodes.)


08-APR22

Current Approach (continued)

(Figure: the same loop and patch; each bi-linear 4-node element has local node IDs 1-4, which are mapped to the global node IDs.)


08-APR22

Current Approach (continued)

(The same element loop as above.)

  • Nice for cache reuse because of its localized operations.
  • Not suitable for vector processors:
    • a(16,23) and a(23,16) might not be calculated properly (two elements update the same global entry).
    • Short innermost loops.
    • There are many "if-then-else"s (see the sketch below).
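
The "if-then-else"s come from locating, for every local pair (ie, je), the slot of the global sparse matrix that holds a(inode, jnode). A typical scalar-style pattern is a short search over the CRS row of inode, sketched below on a hypothetical 4-node toy graph (not GeoFEM code; the names index/item/AMAT only mimic the style used in these slides).

! Sketch of the address search behind the "if-then-else"s: for a node
! pair (inode, jnode), find the slot k in the CRS arrays (index/item/
! AMAT) that stores a(inode,jnode) and add the element contribution.
program crs_search
  implicit none
  ! 4-node square: 1-2, 2-3, 3-4, 4-1 connected (off-diagonals only)
  integer, parameter :: n = 4
  integer :: index(0:n), item(8)
  real(8) :: AMAT(8)
  integer :: k

  index = (/ 0, 2, 4, 6, 8 /)
  item  = (/ 2, 4,  1, 3,  2, 4,  1, 3 /)
  AMAT  = 0.0d0

  k = find_slot(3, 2)                    ! address of a(3,2)
  if (k > 0) AMAT(k) = AMAT(k) + 1.0d0   ! accumulate element contribution
  print *, 'slot of a(3,2):', k, '  AMAT(k)=', AMAT(k)

contains
  integer function find_slot(inode, jnode)
    integer, intent(in) :: inode, jnode
    integer :: j
    find_slot = 0
    do j = index(inode-1)+1, index(inode)   ! search the CRS row of inode
       if (item(j) == jnode) then           ! the "if" inside the loop
          find_slot = j
          exit
       end if
    end do
  end function find_slot
end program crs_search

It is precisely this search and its conditional that the ELEMENT-to-MATRIX array introduced below precomputes, so that the floating-point accumulation loop can run without branches.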


08-APR22

Inside the loop: integration at Gaussian quadrature points

do jpn= 1, 2
  do ipn= 1, 2
    coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
    PNXi= PNX(ipn,jpn,ie)
    PNYi= PNY(ipn,jpn,ie)
    PNXj= PNX(ipn,jpn,je)
    PNYj= PNY(ipn,jpn,je)
    a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
    a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
    a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
    a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
  enddo
enddo


08-APR22

Remedy

(The same element loop and patch as above.)

  • a(16,23) and a(23,16) might not be calculated properly:
    • color the elements, so that elements which do not share any nodes are in the same color.


08-APR22

Coloring of Elements

(Figure: the mesh of 24 elements and 35 nodes.)


08-APR22

Coloring of Elements (continued)

  • Elements sharing the 16th node are assigned to different colors.

(Figure: the colored mesh.)


08-APR22

Remedy (continued)

  • a(16,23) and a(23,16) might not be calculated properly:
    • color the elements, so that elements which do not share any nodes are in the same color.
  • Short innermost loops:
    • loop exchange.


08-APR22

Remedy (continued)

  • a(16,23) and a(23,16) might not be calculated properly:
    • color the elements, so that elements which do not share any nodes are in the same color.
  • Short innermost loops:
    • loop exchange.
  • There are many "if-then-else"s:
    • define an ELEMENT-to-MATRIX array.


08-APR22

Define the ELEMENT-to-MATRIX Array

ELEMmat(icel, ie, je):  element ID, local node IDs

(Figure: elements 13 and 14 share the edge between nodes 16 and 23.)

if kkU = index_U(16-1)+k and item_U(kkU) = 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL = index_L(23-1)+k and item_L(kkL) = 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif

(Positive entries point into the upper-triangular arrays, negative ones into the lower-triangular arrays.)


08-APR22

Define the ELEMENT-to-MATRIX Array (continued)

"ELEMmat" specifies, for each local node pair (ie, je) of each element icel, the address of the corresponding entry of the global coefficient matrix.


08-APR22

Optimized Procedure

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - define the "ELEMmat" array
      enddo
    enddo
  enddo
enddo

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - assemble element matrix
      enddo
    enddo
  enddo
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
enddo

  • Extra computation for ELEMmat.
  • Extra storage for:
    • the ELEMmat array
    • element-matrix components for the elements in each color
    • < 10% increase.


08-APR22

Optimized Procedure (continued)

PART I ("integer" operations, defining ELEMmat): in nonlinear cases this part needs to be done only once, before the initial iteration, as long as the mesh connectivity does not change.

(The two loop nests are the same as on the previous slide.)


08-APR22

Optimized Procedure (continued)

PART I ("integer" operations for ELEMmat): executed only once in nonlinear cases, before the initial iteration, as long as the mesh connectivity does not change.

PART II ("floating-point" operations for matrix assembly/accumulation): in nonlinear cases this part is repeated at every nonlinear iteration.


08-APR22

Time for 3x64^3 = 786,432 DOF

                           |  DCRS           |  DJDS original  |  DJDS improved
                           |  sec. (MFLOPS)  |  sec. (MFLOPS)  |  sec. (MFLOPS)
ES (8.0 GFLOPS)    Matrix  |   28.6 (291)    |   34.2 (240)    |   12.5 (643)
                   Solver  |  360   (171)    |   21.7 (3246)   |   21.7 (3246)
                   Total   |  389            |   55.9          |   34.2
Opteron 1.8 GHz    Matrix  |   10.2 (818)    |   12.4 (663)    |   21.2 (381)
(3.6 GFLOPS)       Solver  |  225   (275)    |  271   (260)    |  271   (260)
                   Total   |  235            |  283            |  292


08-APR22

(The same table as above. Note: on the Opteron, the improved DJDS matrix assembly is slower than the original because of its long innermost loops - data locality has been lost.)


08-APR22

(Figure: Matrix, Solver, and Matrix+Solver times for the original and improved DJDS kernels on the ES and on the Opteron.)


08-APR22

Computation Time vs. Problem Size

(Figure: Matrix, Solver, and Total times vs. problem size for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).)


08-APR22

"Matrix" computation time for the improved DJDS version

(Figure: the time is split into the "integer" part and the "floating-point" part, on the ES and on the Opteron.)


08-APR22

Optimization of “Matrix” assembling/formation on ES

  • The DJDS matrix assembly has been much improved compared to the original version, but it is still slower than the DCRS version on the Opteron.

  • The "integer" operation part is slower on the ES.

  • But the "floating-point" part is much faster than on the Opteron.

  • In nonlinear simulations the "integer" part is executed only once (just before the initial iteration), so the ES outperforms the Opteron whenever the number of nonlinear iterations is more than two.


08-APR22

Suppose a "virtual" mode where …

  • the "integer" operation part runs on a scalar processor, and

  • the "floating-point" operation part and the linear solver run on the vector processor.

  • The scalar performance of the ES (500 MHz) is lower than that of a Pentium III.


08-APR22

Time for 3x64^3 = 786,432 DOF

                           |  DCRS           |  DJDS improved  |  DJDS virtual
                           |  sec. (MFLOPS)  |  sec. (MFLOPS)  |  sec. (MFLOPS)
ES (8.0 GFLOPS)    Matrix  |   28.6 (291)    |   12.5 (643)    |   1.88 (4431)
                   Solver  |  360   (171)    |   21.7 (3246)   |  21.7  (3246)
                   Total   |  389            |   34.2          |  23.6
Opteron 1.8 GHz    Matrix  |   10.2 (818)    |   21.2 (381)    |
(3.6 GFLOPS)       Solver  |  225   (275)    |  271   (260)    |
                   Total   |  235            |  292            |


08-APR22

Summary: Vectorization of FEM appl.

  • NOT so easy.

  • FEM's nice feature of local operations is not necessarily suitable for vector processors.

    • Preconditioned iterative solvers can be vectorized rather more easily, because they operate on the global matrix.

  • Sometimes a major revision of the original code is required.

    • Usually more memory, more lines of code, additional operations …

  • Code optimized for vector processors does not necessarily perform well on scalar processors (e.g. the matrix assembly in FEM).

