
Cell processor implementation of a MILC lattice QCD application

Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb

- Introduction
- Our view of MILC applications
- Introduction to Cell Broadband Engine

- Implementation in Cell/B.E.
- PPE performance and STREAM benchmark
- Profile in CPU and kernels to be ported
- Different approaches

- Performance
- Conclusion

- Our target
- MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm

- Our view of the MILC applications
- A sequence of communication and computation blocks (see the sketch after the figure)

[Figure: original CPU-based implementation. The CPU alternates communication and computation phases: MPI scatter/gather, compute loop 1, MPI scatter/gather for loop 2, compute loop 2, …, compute loop n, MPI scatter/gather for loop n+1]
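The alternating pattern above can be sketched in plain C with MPI. This is only an illustration of the structure, not MILC's actual code: MILC hides the message passing behind its own gather routines, and the names used here (phi, neighbor_buf, NSITES, NBOUND) are invented for the example.

/* Schematic communicate/compute step, as in the figure above (illustrative). */
#include <mpi.h>

#define NSITES 1024   /* local lattice sites (assumed) */
#define NBOUND  128   /* boundary sites exchanged per step (assumed) */

extern double phi[NSITES];          /* a field defined on the local sites */
extern double neighbor_buf[NBOUND]; /* data gathered from a neighboring rank */

void step(int up_rank, int down_rank)
{
    MPI_Status st;

    /* communication block: scatter/gather boundary data for the next loop */
    MPI_Sendrecv(phi, NBOUND, MPI_DOUBLE, down_rank, 0,
                 neighbor_buf, NBOUND, MPI_DOUBLE, up_rank, 0,
                 MPI_COMM_WORLD, &st);

    /* computation block: loop over the local sites using the gathered data */
    for (int i = 0; i < NSITES; i++)
        phi[i] += 0.5 * neighbor_buf[i % NBOUND];   /* placeholder update */
}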

- Cell/B.E. processor
- One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
- 3.2 GHz processor
- 25.6 GB/s processor-to-memory bandwidth
- > 200 GB/s EIB sustained aggregate bandwidth
- Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
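As a quick consistency check, the peak figures follow from the SPE count and clock rate, assuming the standard first-generation SPE issue rates (one 4-wide single-precision FMA per cycle, one 2-wide double-precision FMA every 7 cycles):

\[
\text{SP: } 8 \times 4 \times 2 \times 3.2\,\text{GHz} = 204.8\ \text{GFLOPS},
\qquad
\text{DP: } \frac{8 \times 2 \times 2 \times 3.2\,\text{GHz}}{7} \approx 14.63\ \text{GFLOPS}
\]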

- Introduction
- Our view of MILC applications
- Introduction to Cell Broadband Engine

- Implementation in Cell/B.E.
- PPE performance and STREAM benchmark
- Profile in CPU and kernels to be ported
- Different approaches

- Performance
- Conclusion

- Step 1: try to run it on the PPE
- On the PPE it runs approximately 2-3x slower than a modern CPU
- MILC is bandwidth-bound
- This agrees with what we see with the STREAM benchmark (a minimal triad-style probe is sketched below)
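What the STREAM comparison measures is essentially the rate of a triad-style loop over arrays that do not fit in cache. The snippet below is an illustrative probe of that kind, not the official STREAM code; the array size N is an assumption.

/* Minimal STREAM-triad-style bandwidth probe (illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* 16M doubles per array, assumed to exceed the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];                  /* triad: three arrays streamed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad bandwidth: %.1f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}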

- 10 of these subroutines are responsible for >90% of the overall runtime
- All kernels together account for 98.8%

[Figure: data accesses for one kernel iteration. Lattice site 0 reads its own fields plus data gathered from neighboring sites]

One sample kernel from the udadu_mu_nu() routine:

/* Loop over all sites of the requested parity (even, odd, or both). */
#define FORSOMEPARITY(i,s,choice) \
    for( i = ((choice)==ODD ? even_sites_on_node : 0), \
         s = &(lattice[i]); \
         i < ((choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++, s++ )

FORSOMEPARITY(i,s,parity) {
    /* per-site chain of SU(3) matrix x Wilson-vector multiplies */
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix *)F_PT(s,mat)) );
}

- Kernel code must be SIMDized
- Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
- In each iteration only small data elements are accessed
- lattice site (struct site): 1832 bytes
- su3_matrix: 72 bytes
- wilson_vector: 96 bytes

- Challenge: how to get data into the SPEs as fast as possible?
- Cell/B.E. DMA performs best when the data is aligned to 128 bytes and the size is a multiple of 128 bytes (see the DMA sketch below)
- The data layout in MILC meets neither requirement
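For concreteness, a bare-bones SPE-side transfer under that 128-byte rule looks roughly like the following. The buffer name, the effective-address argument, and SITE_BYTES are invented for the example; the mfc_* calls are the standard MFC intrinsics from spu_mfcio.h.

/* Illustrative SPE-side DMA of one padded lattice-site record into local store. */
#include <spu_mfcio.h>

#define SITE_BYTES 2688   /* padded struct site: 21 * 128 bytes (see below) */

/* local-store buffer: 128-byte aligned, size a multiple of 128 bytes */
static char site_buf[SITE_BYTES] __attribute__((aligned(128)));

void fetch_site(unsigned long long site_ea)   /* effective address in main memory */
{
    unsigned int tag = 0;

    /* start the DMA get: local-store address, main-memory address, size, tag */
    mfc_get(site_buf, site_ea, SITE_BYTES, tag, 0, 0);

    /* block until all transfers with this tag have completed */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}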


[Figure: approach 1. The PPE packs the scattered struct site fields in main memory into contiguous buffers, the SPEs DMA those buffers in, and the results are DMAed back and unpacked by the PPE]

- Good performance in the DMA operations
- Packing and unpacking are expensive on the PPE (a packing sketch follows)
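A rough sketch of what this PPE-side packing amounts to; the type layout mirrors MILC's su3_matrix, but the function and buffer names are invented for the illustration.

/* Illustrative PPE-side packing for approach 1: copy one field out of the
   scattered struct site records into a contiguous, DMA-friendly buffer. */
#include <string.h>

typedef struct { float e[3][3][2]; } su3_matrix;   /* 3x3 complex, 72 bytes */

struct site {
    /* ... many other fields ... */
    su3_matrix link[4];
    /* ... */
};

extern struct site *lattice;   /* array of sites, as in MILC */
extern int sites_on_node;

/* pack link[mu] of every site into one contiguous buffer before the DMA */
void pack_links(su3_matrix *link_buf, int mu)
{
    for (int i = 0; i < sites_on_node; i++)
        memcpy(&link_buf[i], &lattice[i].link[mu], sizeof(su3_matrix));
    /* the SPEs then DMA from link_buf; results are unpacked the same way */
}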

[Figure: approach 2. Elements of struct site are replaced by pointers into contiguous memory regions in main memory, and the SPEs DMA directly from those regions]

- Replace the elements in struct site with pointers
- The pointers point into contiguous memory regions (see the sketch below)
- PPE overhead due to the indirect memory accesses
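A minimal sketch of that layout, assuming single-precision fields; the field names and the allocation scheme are invented for the illustration.

/* Illustrative layout for approach 2: struct site keeps only pointers, and
   each field lives in its own contiguous, 128-byte-aligned block that the
   SPEs can DMA directly. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

typedef struct { float e[3][3][2]; } su3_matrix;     /* 72 bytes */
typedef struct { float d[4][3][2]; } wilson_vector;  /* 96 bytes */

struct site {
    su3_matrix    *link;   /* points into link_store, 4 matrices per site */
    wilson_vector *psi;    /* points into psi_store, 1 vector per site    */
    /* ... other fields, also as pointers ... */
};

void build_lattice(struct site *lattice, int sites_on_node)
{
    su3_matrix    *link_store;
    wilson_vector *psi_store;
    posix_memalign((void **)&link_store, 128, sites_on_node * 4 * sizeof(su3_matrix));
    posix_memalign((void **)&psi_store,  128, sites_on_node * sizeof(wilson_vector));

    for (int i = 0; i < sites_on_node; i++) {
        lattice[i].link = &link_store[4 * i];   /* the extra indirection is PPE overhead */
        lattice[i].psi  = &psi_store[i];
    }
}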

[Figure: approach 3. The elements of struct site are padded in place, so the SPEs can DMA the padded records directly from the original lattice layout in main memory]

- Pad the elements to an appropriate size
- Pad struct site to an appropriate size
- Good bandwidth gained at the cost of padding overhead
- su3_matrix grows from a 3x3 complex to a 4x4 complex matrix
- 72 bytes → 128 bytes
- Bandwidth efficiency lost: 44%

- wilson_vector grows from 4x3 complex to 4x4 complex
- 96 bytes → 128 bytes
- Bandwidth efficiency lost: 25%

- 128-byte accesses achieve different bandwidth depending on the stride
- This is due to the 16 banks in main memory
- Odd-numbered strides always reach peak bandwidth
- We therefore pad struct site to 2688 bytes (21 × 128), an odd multiple of 128 bytes (a sketch of the padded types follows)
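A minimal sketch of the padded types under these choices; the field names and the selection of fields shown inside the site record are invented, only the sizes follow the slides.

/* Illustrative padded layouts for approach 3 (sizes from the slides:
   3x3 complex -> 4x4 complex, whole site record padded to 21 * 128 bytes). */

typedef struct {
    float e[4][4][2];                  /* 4x4 complex; only the 3x3 block is used */
} su3_matrix_padded;                   /* 128 bytes */

typedef struct {
    float d[4][4][2];                  /* 4x4 complex; only the 4x3 block is used */
} wilson_vector_padded;                /* 128 bytes */

struct site_padded {
    su3_matrix_padded    link[4];      /* 4 * 128 bytes */
    wilson_vector_padded psi;          /*     128 bytes */
    /* ... the remaining per-site fields, each padded to a multiple of 128 ... */
    char pad[2688 - 4 * sizeof(su3_matrix_padded) - sizeof(wilson_vector_padded)];
} __attribute__((aligned(128)));       /* sizeof == 2688 = 21 * 128 */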

- Introduction
- Our view of MILC applications
- Introduction to Cell Broadband Engine

- Implementation in Cell/B.E.
- PPE performance and STREAM benchmark
- Profile in CPU and kernels to be ported
- Different approaches

- Performance
- Conclusion

- GFLOPS are low for all kernels
- Bandwidth is around 80% of peak for most kernels
- Kernel speedups over the CPU are between 10x and 20x for most kernels
- The set_memory_to_zero kernel reaches ~40x speedup; su3mat_copy() is >15x

8x8x16x16 lattice

- Single-Cell application performance speedup
- ~8-10x compared to a single Xeon core

- Cell blade application performance speedup
- 1.5-4.1x compared to a 2-socket, 8-core Xeon blade

- Profile on the Xeon
- 98.8% parallel code, 1.2% serial code
- On the Cell the parallel part speeds up while the serial part slows down

- The kernels on the SPEs take 38-67% of the overall runtime on the Cell; the PPE takes the remaining 33-62%
- The PPE is standing in the way of further improvement
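A back-of-the-envelope Amdahl's-law estimate illustrates the point. Assuming, for the sake of the estimate, that the 98.8% parallel part gains roughly 15x on the SPEs while the 1.2% serial part runs about 2.5x slower on the PPE (these two factors are illustrative assumptions consistent with the kernel and PPE numbers above), the overall speedup is capped around the observed range:

\[
S \;\approx\; \frac{1}{\dfrac{0.988}{15} + 0.012 \times 2.5}
\;\approx\; \frac{1}{0.066 + 0.030} \;\approx\; 10
\]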

16x16x16x16 lattice

- For comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
- More data are needed for Cell blades connected through InfiniBand

- The PPE alone is slower than a Xeon core
- The PPE + 1 SPE is ~2x faster than a Xeon core
- A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade

- We achieved reasonably good performance
- 4.5-5.0 GFLOPS on one Cell processor for the whole application

- We maintained the MPI framework
- Without the assumption that the code runs on a single Cell processor, certain optimizations cannot be done, e.g. loop fusion (sketched below)
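To make the loop-fusion point concrete: fusing two adjacent site loops saves one full pass over the lattice data, but it is only legal when no inter-node gather has to happen between them. The kernel names below are invented for the illustration.

/* Illustrative loop fusion: the fused form streams each site's data from
   memory once instead of twice. */
extern int sites_on_node;
extern void kernel_a(int site);   /* invented: first per-site update  */
extern void kernel_b(int site);   /* invented: second per-site update */

void unfused(void)
{
    for (int i = 0; i < sites_on_node; i++) kernel_a(i);
    /* under MPI, an inter-node gather may be required here,
       which is exactly what prevents fusing the two loops */
    for (int i = 0; i < sites_on_node; i++) kernel_b(i);
}

void fused(void)   /* possible only when no communication is needed in between */
{
    for (int i = 0; i < sites_on_node; i++) {
        kernel_a(i);
        kernel_b(i);
    }
}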

- The current site-centric data layout forces us to take the padding approach
- 25-44% of bandwidth efficiency lost
- Fix: a field-centric data layout is desired (sketched below)
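A field-centric (structure-of-arrays) layout would store each field contiguously across sites, so blocks of consecutive sites could be DMAed without padding or per-site pointer chasing. The sketch below contrasts the two layouts; the names are invented.

/* Site-centric (current) vs. field-centric (desired) layout, schematically. */

typedef struct { float e[3][3][2]; } su3_matrix;     /* 72 bytes */
typedef struct { float d[4][3][2]; } wilson_vector;  /* 96 bytes */

/* site-centric: one record per site, different fields interleaved in memory */
struct site {
    su3_matrix    link[4];
    wilson_vector psi;
    /* ... */
};

/* field-centric: one contiguous array per field, indexed by site, so a run
   of consecutive sites is already a dense, DMA-friendly block */
struct fields {
    su3_matrix    *link;   /* link[4*site + mu] */
    wilson_vector *psi;    /* psi[site]         */
};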

- The PPE slows down the serial part, which is a problem for further improvement
- Fix: IBM putting a full-featured Power core into the Cell/B.E.

- The PPE may also impose problems in scaling to multiple Cell blades
- Tests with Cell blades connected over InfiniBand are needed