
Optimization and tuning techniques of lattice QCD for Blue Gene




  1. Optimization and tuning techniques of lattice QCD for Blue Gene — Jun Doi, Tokyo Research Laboratory, IBM Japan

  2. Agenda • Part I: • Optimization of lattice QCD program using double FPU instructions • Part II: • Parallelization of lattice QCD and optimization of communication

  3. Part I: • Optimization of lattice QCD program using double FPU instructions

  4. Optimization of lattice QCD for Blue Gene • Our lattice QCD program • Wilson's method • The original program is written in C++ • Optimization using double FPU instructions • We used inline assembly to optimize the complex arithmetic • We have to schedule the instructions ourselves instead of relying on the compiler

  5. Wilson-Dirac operator • Rotating the colors by multiplying the color components by the 3x3 gauge matrix, for each of the 4 spins, for the 8 directions x+, x-, y+, y-, z+, z-, t+ and t- • ψ: 4x3 spinor • U: 3x3 gauge matrix • U†: Hermitian conjugate of U • (1+γ), (1-γ): 4x4 projector matrices • (Figure: stencil of neighboring sites ψ(x-1,y,z,t), ψ(x+1,y,z,t), ψ(x,y-1,z,t), ψ(x,y+1,z,t) around ψ(x,y,z,t), connected by the links Ux(x-1,y,z,t), Ux(x,y,z,t), Uy(x,y-1,z,t), Uy(x,y,z,t))

  6. uxp: part of the Wilson-Dirac operator for the x plus direction • Projector: multiplying by the symmetric projector (1-γ), we can merge the 4 spinors into 2 spinors to calculate uxp • I. Merging 4 spinors into 2 • II. Multiplying the 2 spinors by the gauge matrix • III. Adding to the 4 spinors
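The three steps above can be sketched in plain C. The struct names and the particular sign combination (h0 = s0 + i·s3, h1 = s1 + i·s2) are illustrative assumptions — the real combination depends on the gamma-matrix basis the program uses:

```c
#include <complex.h>
#include <assert.h>

/* Sketch of step I for the x+ direction. Names and the sign convention
   (h0 = s0 + i*s3, h1 = s1 + i*s2) are illustrative assumptions; the
   real combination depends on the gamma-matrix basis. */
typedef struct { double complex c[3]; } su3_vector;  /* one color triplet  */
typedef struct { su3_vector s[4]; } spinor;          /* 4 spins x 3 colors */
typedef struct { su3_vector h[2]; } halfspinor;      /* 2 merged spinors   */

static halfspinor project_xplus(const spinor *in)
{
    halfspinor out;
    for (int c = 0; c < 3; ++c) {
        /* each merge is one FXCXNPMA on the real hardware */
        out.h[0].c[c] = in->s[0].c[c] + I * in->s[3].c[c];
        out.h[1].c[c] = in->s[1].c[c] + I * in->s[2].c[c];
    }
    return out;
}
```

Because the projector has only two independent rows, only the two merged triplets need to be multiplied by the gauge matrix in step II, halving the matrix work.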

  7. Floating point register usage for u?p, u?m • 4x3 spinor to accumulate the result = 12 registers • 4x3 spinor of the neighboring lattice point (input) = 12 registers • 3x3 gauge matrix = 9 registers • = 33 registers are needed, plus additional registers for constant values — more than the 32 available, so registers are reused: • FP0 to FP11: 12 registers to accumulate the result (kept loaded across the other directions) • FP12 to FP17: 6 registers for the gauge matrix (the last 3 elements are loaded after the first 3 have been multiplied) • FP18 to FP29: 12 registers for the input and for saving the result (6 registers for the 2 merged spinors, 6 to save the result) • FP30, FP31: 2 registers for the others

  8. Step I: merging 4 spinors into 2 spinors • FR18-FR20: spinor 1 R/G/B, overwritten with the merged spinor C R/G/B • FR21-FR23: spinor 2 R/G/B, overwritten with the merged spinor D R/G/B • FR24-FR26: spinor 3 R/G/B • FR27-FR29: spinor 4 R/G/B • The merge computes v = x + i*y, i.e. Re(v) = Re(x) - Im(y), Im(v) = Im(x) + Re(y): double unit[2] = {1,1}; LFPDX 31,unit … LFPDX 18,spinor1_R … LFPDX 27,spinor4_R … FXCXNPMA 18,31,27,18

  9. FXCXNPMA instruction • The double FPU has a primary register file FR0-FR31 and a secondary register file FR32-FR63; each cross FMA drives a MUL and an ADD on both halves in parallel • FXCXNPMA 18,31,27,18 computes FR18 = FR18 - FR63*FR59 on the primary side and FR50 = FR50 + FR63*FR27 on the secondary side (FR59 and FR50 being the secondary halves of FR27 and FR18), so with the constant {1,1} loaded into FR31 it realizes v = x + i*y: double unit[2] = {1,1}; LFPDX 31,unit … LFPDX 18,spinor1_R … LFPDX 27,spinor4_R … FXCXNPMA 18,31,27,18
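Treating a primary/secondary register pair as one complex number, the cross FMA has a compact scalar model (the function names below are mine, not IBM's):

```c
#include <complex.h>
#include <assert.h>

/* Scalar model of FXCXNPMA t,a,b,s: primary t = s - im(a)*im(b),
   secondary t = s + im(a)*re(b); as one complex operation that is
   t = s + i * im(a) * b. */
static double complex fxcxnpma(double complex a, double complex b,
                               double complex s)
{
    return s + I * cimag(a) * b;
}

/* With the constant {1,1} as a, the cross FMA realizes v = x + i*y,
   which is how step I merges two spin components in one instruction. */
static double complex merge_pair(double complex x, double complex y)
{
    const double complex unit = 1 + 1*I;   /* double unit[2] = {1,1}; */
    return fxcxnpma(unit, y, x);
}
```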

  10. Step II: multiplying the spinor by the 3x3 gauge matrix (for + directions) • x: input spinor, y: output spinor, u: 3x3 gauge matrix • y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2]; y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2]; y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2]; • u[0][0] * x[0], multiplying 2 complex numbers: FXPMUL (y[0],u[0][0],x[0]) gives re(y[0])=re(u[0][0])*re(x[0]), im(y[0])=re(u[0][0])*im(x[0]); then FXCXNPMA (y[0],u[0][0],x[0],y[0]) adds re(y[0])+=-im(u[0][0])*im(x[0]), im(y[0])+=im(u[0][0])*re(x[0]) • + u[0][1] * x[1] + u[0][2] * x[2], using FMA instructions: FXCPMADD (y[0],u[0][1],x[1],y[0]) FXCXNPMA (y[0],u[0][1],x[1],y[0]) FXCPMADD (y[0],u[0][2],x[2],y[0]) FXCXNPMA (y[0],u[0][2],x[2],y[0])
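As a reference, the same step in plain C, with comments marking which double-FPU instruction carries each partial product in the optimized code:

```c
#include <complex.h>
#include <assert.h>

/* Reference version of step II, y = U*x, for one color triplet.
   Each complex product costs two real FMAs on the double FPU. */
static void su3_mul(double complex y[3], const double complex u[3][3],
                    const double complex x[3])
{
    for (int r = 0; r < 3; ++r) {
        double complex acc = creal(u[r][0]) * x[0];  /* FXPMUL   */
        acc += I * cimag(u[r][0]) * x[0];            /* FXCXNPMA */
        for (int c = 1; c < 3; ++c) {
            acc += creal(u[r][c]) * x[c];            /* FXCPMADD */
            acc += I * cimag(u[r][c]) * x[c];        /* FXCXNPMA */
        }
        y[r] = acc;
    }
}
```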

  11. FXPMUL and FXCPMADD instructions • Primary FPU registers FR0-FR31, secondary FPU registers FR32-FR63; each instruction drives the MUL and ADD pipes of both halves in parallel • FXPMUL 5,3,4: FR5 = FR3 * FR4 on the primary side and FR37 = FR3 * FR36 on the secondary side (the primary operand FR3 is replicated across both halves) • FXCPMADD 10,0,1,5: FR10 = FR0 * FR1 + FR5 on the primary side and FR42 = FR0 * FR33 + FR37 on the secondary side

  12. Multiplying the Hermitian gauge matrix (for - directions) • x: input spinor, y: output spinor, ~u: complex conjugate • Multiplying by the Hermitian conjugate matrix is as follows: y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2]; y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2]; y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2]; • ~u[0][0] * x[0], multiplying by a complex conjugate: FXPMUL (y[0],u[0][0],x[0]) gives re(y[0])=re(u[0][0])*re(x[0]), im(y[0])=re(u[0][0])*im(x[0]); then FXCXNSMA (y[0],u[0][0],x[0],y[0]) adds re(y[0])+=im(u[0][0])*im(x[0]), im(y[0])+=-im(u[0][0])*re(x[0]) • + ~u[1][0] * x[1] + ~u[2][0] * x[2], using FMA instructions: FXCPMADD (y[0],u[1][0],x[1],y[0]) FXCXNSMA (y[0],u[1][0],x[1],y[0]) FXCPMADD (y[0],u[2][0],x[2],y[0]) FXCXNSMA (y[0],u[2][0],x[2],y[0])
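The minus-direction step in plain C: each element used is the conjugated transpose entry ~u[c][r], and the sign flip relative to the plus-direction product is exactly the FXCXNPMA → FXCXNSMA swap:

```c
#include <complex.h>
#include <assert.h>

/* Reference version of the minus-direction step, y = (U†)x. */
static void su3_dagger_mul(double complex y[3],
                           const double complex u[3][3],
                           const double complex x[3])
{
    for (int r = 0; r < 3; ++r) {
        double complex acc = 0;
        for (int c = 0; c < 3; ++c) {
            acc += creal(u[c][r]) * x[c];        /* FXPMUL / FXCPMADD */
            acc -= I * cimag(u[c][r]) * x[c];    /* FXCXNSMA          */
        }
        y[r] = acc;
    }
}
```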

  13. Optimization of the instruction pipeline • y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2]; y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2]; y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2]; • With 1 multiplication, only 3 cycles pass before each result is used in the next calculation, and the pipeline stalls. Calculation order: FXPMUL (y[0],u[0][0],x[0]) FXPMUL (y[1],u[1][0],x[0]) FXPMUL (y[2],u[2][0],x[0]) FXCXNPMA (y[0],u[0][0],x[0],y[0]) FXCXNPMA (y[1],u[1][0],x[0],y[1]) FXCXNPMA (y[2],u[2][0],x[0],y[2]) FXCPMADD (y[0],u[0][1],x[1],y[0]) ... • Multiplying 2 spinors together, Yc = u*Xa and Yd = u*Xb, gives 6 cycles before each result is needed, and the pipeline does not stall: FXPMUL (yc[0],u[0][0],xa[0]) FXPMUL (yd[0],u[0][0],xb[0]) FXPMUL (yc[1],u[1][0],xa[0]) FXPMUL (yd[1],u[1][0],xb[0]) FXPMUL (yc[2],u[2][0],xa[0]) FXPMUL (yd[2],u[2][0],xb[0]) FXCXNPMA (yc[0],u[0][0],xa[0],yc[0]) FXCXNPMA (yd[0],u[0][0],xb[0],yd[0]) FXCXNPMA (yc[1],u[1][0],xa[0],yc[1]) FXCXNPMA (yd[1],u[1][0],xb[0],yd[1]) FXCXNPMA (yc[2],u[2][0],xa[0],yc[2]) FXCXNPMA (yd[2],u[2][0],xb[0],yd[2]) FXCPMADD (yc[0],u[0][1],xa[1],yc[0]) ...
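The interleaving idea in C: the two independent products Yc = U·Xa and Yd = U·Xb are advanced alternately, so the result of each FMA is not consumed again until several unrelated instructions have issued. In C the ordering is only suggestive — the real code pins it down with inline assembly:

```c
#include <complex.h>
#include <assert.h>

/* Two independent matrix-vector products advanced in lockstep, so
   consecutive instructions never depend on each other's results. */
static void su3_mul2(double complex yc[3], double complex yd[3],
                     const double complex u[3][3],
                     const double complex xa[3],
                     const double complex xb[3])
{
    for (int r = 0; r < 3; ++r)
        yc[r] = yd[r] = 0;
    for (int c = 0; c < 3; ++c)
        for (int r = 0; r < 3; ++r) {
            yc[r] += u[r][c] * xa[c];   /* stream 1 */
            yd[r] += u[r][c] * xb[c];   /* stream 2, independent */
        }
}
```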

  14. Loading the gauge matrix into registers • For the + direction operators (uxp, uyp, uzp, utp) the calculation order matches the order of the matrix in the array, so the matrix can be loaded sequentially into FP12 to FP17, reusing the first registers for the last row: lfpdux u,12 //u[0][0] lfpdux u,13 //u[0][1] lfpdux u,14 //u[0][2] y=u[0]*x[0] lfpdux u,15 //u[1][0] lfpdux u,16 //u[1][1] lfpdux u,17 //u[1][2] y+=u[1]*x[1] lfpdux u,12 //u[2][0] lfpdux u,13 //u[2][1] lfpdux u,14 //u[2][2] y+=u[2]*x[2] • For the - direction operators (uxm, uym, uzm, utm) the calculation order is transposed with respect to the order of the matrix in the array; to still load the matrix sequentially, 2 additional temporary registers (FP30, FP31) are used: lfpdux u,14 //u[0][0] lfpdux u,15 //u[0][1] lfpdux u,12 //u[0][2] lfpdux u,17 //u[1][0] lfpdux u,16 //u[1][1] lfpdux u,13 //u[1][2] y=u*x[0] lfpdux u,30 //u[2][0] lfpdux u,31 //u[2][1] y+=u*x[1] lfpdux u,14 //u[2][2] y+=u*x[2]

  15. Step III: adding the result to the 4x3 spinor • FR0-FR11 hold the 4x3 spinor (spinors 1-4, R/G/B); FR24-FR29 hold the result halves A R/G/B and B R/G/B • For spinors 1 and 2: v = v + w, i.e. Re(v) = Re(v) + Re(w), Im(v) = Im(v) + Im(w): LFPDX 0,spinor1_R LFPDX 24,A_R FPADD 0,0,24 • For spinors 3 and 4: v = v - i * w, i.e. Re(v) = Re(v) + Im(w), Im(v) = Im(v) - Re(w), using FXCXNSMA to subtract: double unit[2] = {1,1}; LFPDX 31,unit LFPDX 9,spinor4_R LFPDX 24,A_R FXCXNSMA 9,31,24,9
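Step III for one color in plain C. Which halfspinor row pairs with which spin depends on the gamma basis, so the pairing below is an assumption made for illustration:

```c
#include <complex.h>
#include <assert.h>

/* The two halfspinor results a and b are accumulated back into the
   four spin components s[0..3]: spins 1 and 2 with a plain add
   (FPADD), spins 3 and 4 with v = v - i*w (one FXCXNSMA with the
   constant {1,1}). */
static void accum_xplus(double complex s[4],
                        double complex a, double complex b)
{
    s[0] += a;        /* FPADD                 */
    s[1] += b;        /* FPADD                 */
    s[2] += -I * b;   /* FXCXNSMA: v = v - i*w */
    s[3] += -I * a;   /* FXCXNSMA: v = v - i*w */
}
```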

  16. Multiplying  to 4x3 spinor Multiplying  to every u?p, u?m operators This change increases calculation, but does not increase double FPU instructions This allows out of order calculation of each operators

  17. Adding the result to the 4x3 spinor while multiplying by κ • Register usage is the same as in step III: FR0-FR11 hold the 4x3 spinor, FR24-FR29 the result • For spinors 1 and 2, v = v + w (LFPDX 0,spinor1_R LFPDX 24,A_R FPADD 0,0,24) becomes v = v + κ * w, i.e. Re(v) = Re(v) + κ*Re(w), Im(v) = Im(v) + κ*Im(w): double kappa[2] = {κ, κ}; LFPDX 31,kappa LFPDX 0,spinor1_R LFPDX 24,A_R FXCPMADD 0,31,24,0 • For spinors 3 and 4, v = v - i * w (double unit[2] = {1,1}; LFPDX 31,unit LFPDX 9,spinor4_R LFPDX 24,A_R FXCXNSMA 9,31,24,9) becomes v = v - κ * i * w, i.e. Re(v) = Re(v) + κ*Im(w), Im(v) = Im(v) - κ*Re(w): LFPDX 31,kappa LFPDX 9,spinor4_R LFPDX 24,A_R FXCXNSMA 9,31,24,9
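Why this is free in instruction count: v = v + κ·w is still one FXCPMADD once {κ,κ} is preloaded, and v = v - i·w becomes v = v - κ·i·w with the very same FXCXNSMA. A scalar model:

```c
#include <complex.h>
#include <assert.h>

/* Folding kappa into the accumulation, as on the slide. */
static double complex add_kw(double complex v, double complex w, double k)
{
    return v + k * w;        /* FXCPMADD with kappa[2] = {k,k} */
}

static double complex sub_ikw(double complex v, double complex w, double k)
{
    return v - k * I * w;    /* FXCXNSMA with kappa[2] = {k,k} */
}
```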

  18. Part II: • Parallelization of lattice QCD and optimization of communication

  19. Parallelization of lattice QCD and optimization of communication • Decrease communication time as much as we can • Limit data exchange to neighboring nodes on the torus • Shortest paths, and data exchanges that never conflict • Map the 4D lattice onto the torus network • MPI is richer than needed for such limited communication • The overhead of calling an MPI function is too big given these limited conditions • We used the torus packet HW directly • Very small latency to send and receive data • We can send/recv directly from/to the registers of the double FPU • We do not need buffers in memory • We can overlap sending in the 6 directions with local computations • We can hide the communication time

  20. Parallelization of lattice QCD • (Figure: a 2D lattice with X and Y axes, first solved on 1 CPU, then divided among CPUs) • Map onto the physical topology of the network to avoid conflicts in the data exchange

  21. Mapping the lattice onto the torus network of Blue Gene • How to divide the 4D lattice over the 3D torus network: using virtual node mode, the communication between the 2 CPUs in the same compute node becomes the 4th torus dimension • We mapped x, y, z, t of the 4D lattice onto T, X, Y, Z of the virtual 4D torus • The fastest communication is between the 2 CPUs in a compute node, and x of the lattice is the innermost loop of the spinor and gauge arrays, so x is mapped to T • (Figure: the lattice of QCD (x, y, z, t) mapped onto the torus network of Blue Gene (X, Y, Z), with CPU0/CPU1 of each compute node forming the T direction)
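A hedged sketch of this mapping as code. The local extents Lx..Lt are illustrative parameters, not values from the talk:

```c
#include <assert.h>

/* A lattice site (x,y,z,t) belongs to the node with virtual torus
   coordinates (T,X,Y,Z), where T is the CPU index inside the compute
   node. x maps to T because x is the innermost loop and T (the two
   CPUs of one node) is the fastest "link". */
typedef struct { int T, X, Y, Z; } node_coord;

static node_coord map_site(int x, int y, int z, int t,
                           int Lx, int Ly, int Lz, int Lt)
{
    node_coord n = { x / Lx, y / Ly, z / Lz, t / Lt };
    return n;
}
```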

  22. Data exchange through the torus packet network • Sending data: store data to the memory-mapped send FIFO to send it to a neighboring node, with 16-byte parallel stores directly from the double FPU registers — first the packet header (destination/size/etc.), then the 16-byte data quads • The 6 send FIFOs (X+, X-, Y+, ..., Z-) are independent and can transfer data at the same time • The packet size is a multiple of 32 bytes, up to 256 bytes (including the 16-byte header) • Receiving data: load data from the recv FIFOs (X+, X-, Y+, ..., Z-) with 16-byte parallel loads directly into the double FPU registers
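A software mock (not the real memory-mapped interface) of the packet-size rule just stated:

```c
#include <assert.h>

/* A packet is the 16-byte header plus the payload, rounded up to a
   multiple of 32 bytes, at most 256 bytes in total. */
enum { BGL_QUAD = 16, BGL_PKT_MAX = 256 };

/* total packet size in bytes for a given payload,
   or -1 if it does not fit in one packet */
static int packet_size(int payload_bytes)
{
    int total = BGL_QUAD + payload_bytes;   /* add the header        */
    total = (total + 31) & ~31;             /* round up to 32 bytes  */
    return total <= BGL_PKT_MAX ? total : -1;
}
```

For the 2-spinor payload of the next slide (2x3x16 = 96 bytes) this gives 128 bytes, so a single packet is indeed big enough.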

  23. Exchanging 2 spinors between neighboring nodes • Sending 2 spinors to the + direction for uym • After merging 4 spinors into 2 (C R/G/B in FR18-FR20, D R/G/B in FR21-FR23), the data is stored to the send FIFO (Send FIFO X+) as if it were part of memory, sent directly from the registers • The size of 2 spinors is 2x3x16 = 96 bytes, so 1 packet is big enough to send them • The receiver waits until all data has been received in the FIFO buffer (1 KB, Recv FIFO X-), loads it from the FIFO into the registers, multiplies by the gauge matrix, and adds to the 4x3 spinor

  24. Exchanging 2 spinors between neighboring nodes • Sending 2 spinors to the - direction for uyp • Here the sender first merges 4 spinors into 2 (FR18-FR23) and multiplies by the gauge matrix; the result (A R/G/B in FR24-FR26, B R/G/B in FR27-FR29) is stored to the send FIFO (Send FIFO X-) • The receiver waits until all data has been received in the FIFO buffer (1 KB, Recv FIFO X+), loads it from the FIFO, and adds it to the 4x3 spinor

  25. Exchanging 2 spinors between the 2 CPUs • Sending 2 spinors to the + direction for uxm • CPU0 merges 4 spinors into 2 (FR18-FR23) and stores the data to shared memory; CPU1 waits until all data is stored, loads it from shared memory, multiplies by the gauge matrix, and adds to the 4x3 spinor • Shared memory is not a FIFO, so the 2 CPUs are synchronized with a lockbox barrier to make sure all data has been written to shared memory: the barrier is passed only after all data is stored (for safety, it is better to synchronize also before the send)

  26. Overlapping data exchange and computations • The torus packet HW can send in the 6 directions independently • After storing to a FIFO, the data communication is non-blocking, so we can overlap it with computation or with sending in another direction • But the 6 send FIFOs are shared by the 2 CPUs in a compute node, so 2 sets of 3 FIFOs, (X+, Y+, Z+) and (X-, Y-, Z-), are assigned one to each CPU • Loop for CPU0: SEND_X+ SEND_Y+ SEND_Z+, local computations, data exchange between the 2 CPUs, RECV_X- RECV_Y- RECV_Z- • Loop for CPU1: SEND_X- SEND_Y- SEND_Z-, local computations, data exchange between the 2 CPUs, RECV_X+ RECV_Y+ RECV_Z+ • The actual time to transfer data between compute nodes can be hidden if there is enough CPU time between the send and the receive

  27. Special communication API for lattice QCD • Limitation: only for exchanging data between neighboring nodes and between the 2 CPUs in a compute node • What we can do with the API: • API function to prepare the packet header • API macros to send/recv data from/to the FPU registers • Communication between nodes (X, Y, Z) and between the CPUs (T) is handled in the same way • These macros are used with inline assembly, to interleave communication with computation in the instruction pipeline • API function for the internal barrier between the 2 CPUs • API functions to send/recv through a user buffer, used when we do not want to use inline assembly

  28. Comparison of the API for QCD and MPI • Comparison of bandwidth for ping-pong communication (figure): • Between the 2 CPUs in a compute node: 2-3 times as fast as MPI • Between 2 neighboring compute nodes: 10 times as fast as MPI (about 40 MB/sec at 1 packet = 256 bytes)

  29. Sending data using the API macros • #define BGLNET_WORK_REG 30 and #define BGLNET_HEADER_REG 30 tell the API which register is used to load the packet header • BGLNet_Send_WaitReady(BGLNET_X_PLUS,fifo,6); waits until the send FIFO has room, sets the address of the FIFO, and loads the packet header • loop for several times: (calculate something to send) BGLNet_Send_Enqueue_Header(fifo); // sends the packet header BGLNet_Send_Enqueue(fifo,24); FPADD(21, 3, 6); BGLNet_Send_Enqueue(fifo,25); FPSUB(18, 0, 9); BGLNet_Send_Enqueue(fifo,26); FPSUB(19, 1,10); BGLNet_Send_Enqueue(fifo,27); FPSUB(20, 2,11); BGLNet_Send_Enqueue(fifo,28); FPADD(22, 4, 7); BGLNet_Send_Enqueue(fifo,29); FPSUB... BGLNet_Send_Enqueue(fifo,29); FPADD(23, 5, 8); BGLNet_Send_Packet(fifo); // sends an additional 16 bytes end of loop: • BGLNet_Send_Enqueue sends data from a register; computation can be inserted between the macros, just as load/store and computation are interleaved to optimize the instruction pipeline

  30. Receiving data using the API macros • #define BGLNET_WORK_REG 30 and #define BGLNET_HEADER_REG 30 tell the API which register is used to recv the packet header • BGLNet_Recv_WaitReady(BGLNET_X_MINUS,fifo,Nx*8); waits until all data has been received in the FIFO buffer and sets the address of the FIFO • loop for Nx times: BGLNet_Recv_Dequeue_Header(fifo); // receives the packet header FXCSMADD( 0,31,24, 0); BGLNet_Recv_Dequeue(fifo,12); FXCSMADD( 1,31,25, 1); BGLNet_Recv_Dequeue(fifo,13); FXCSMADD( 2,31,26, 2); BGLNet_Recv_Dequeue(fifo,14); FXCSMADD( 3,31,27, 3); BGLNet_Recv_Dequeue(fifo,15); FXCSMADD( 4,31,28, 4); BGLNet_Recv_Dequeue(fifo,16); FXCSMADD( 5,31,29, 5); BGLNet_Recv_Dequeue(fifo,17); FXCPMADD( 6,31,27, 6); BGLNet_Recv_Packet(fifo); // receives an additional 16 bytes end of loop: • BGLNet_Recv_Dequeue receives data into a register

  31. Sending data between the 2 CPUs • BGLNet_Send_WaitReady(BGLNET_T_PLUS,fifo,6); does not wait and does not need a packet header; it sets the address of the shared memory in the pointer "fifo" • (calculate something to send) BGLNet_Send_Enqueue(fifo,0); FXNMSUB(21,24,31,21); BGLNet_Send_Enqueue(fifo + 1,1); FXNMSUB(18,27,31,18); BGLNet_Send_Enqueue(fifo + 2,2); FXNMSUB(22,25,31,22); BGLNet_Send_Enqueue(fifo + 3,3); FXNMSUB(23,26,31,23); BGLNet_Send_Enqueue(fifo + 4,4); FXNMSUB(19,28,31,19); BGLNet_Send_Enqueue(fifo + 5,5); FXNMSUB(20,29,31,20); BGLNet_InternalBarrier(); • BGLNet_Send_Enqueue sends data from a register; shared memory is not a FIFO, so we must update the address (fifo + 1, fifo + 2, ...) to store the next data • BGLNet_InternalBarrier() is the barrier function that makes sure all data has been stored to shared memory; the receiver also calls it before receiving the data

  32. A simpler way to access the torus packet HW • 2 API functions for send and recv can be used without inline assembly • void BGLNet_Send(void* pData,int dir,int size); sends size bytes from the user buffer pData — to the send FIFO for dir = X, Y or Z, or through shared memory for dir = T • This function returns as soon as all data has been copied to the FIFO (non-blocking send), and can send up to 32 spinors • void BGLNet_Recv(void* pData,int dir,int size); receives size bytes into the user buffer pData — from the recv FIFO for dir = X, Y or Z, or from shared memory for dir = T • This function waits until all data can be received

  33. Optimization result of our lattice QCD program • (Figure: sustained performance as a fraction of peak performance, shown for weak scaling and for strong scaling, with other implementations for comparison)

  34. Easier ways to optimize • To get performance: built-in functions for the double FPU instructions (much easier than inline assembly) • To get more performance: overlapping communication and computations using the API for QCD

  35. Summary • Optimization using double FPU instructions • We used inline assembly to optimize the complex arithmetic • Parallelization of lattice QCD • Mapping the 4D lattice onto a virtual 4D torus network to limit communication to neighboring compute nodes • Optimization of communication • We used the torus packet HW directly • We developed an API for QCD to make it easier to use
