
Cell processor implementation of a MILC lattice QCD application

Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb



Presentation outline

  • Introduction

    • Our view of MILC applications

    • Introduction to Cell Broadband Engine

  • Implementation in Cell/B.E.

    • PPE performance and STREAM benchmark

    • Profile in CPU and kernels to be ported

    • Different approaches

  • Performance

  • Conclusion



Introduction


  • Our target

    • MIMD Lattice Computation (MILC) Collaboration code: dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm

  • Our view of the MILC applications

    • A sequence of communication and computation blocks (a short sketch follows below)

[Figure: Original CPU-based implementation, an alternating chain of MPI scatter/gather and compute blocks: MPI scatter/gather, compute loop 1, MPI scatter/gather for loop 2, compute loop 2, MPI scatter/gather for loop 3, ..., compute loop n, MPI scatter/gather for loop n+1]
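
A minimal sketch of this structure in C-like form (start_gather/wait_gather and compute_loop are illustrative placeholders, not the exact MILC routine names):

/* Each stage gathers off-node neighbor data with MPI, then sweeps the
   local lattice sites; the stages repeat for every compute loop. */
for (int k = 0; k < n_loops; k++) {
    start_gather(k);                        /* MPI scatter/gather for loop k */
    wait_gather(k);                         /* wait for neighbor data to arrive */
    for (int i = 0; i < sites_on_node; i++)
        compute_loop(k, i);                 /* compute loop k at lattice site i */
}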



Introduction

  • Cell/B.E. processor

    • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store

    • 3.2 GHz processor

    • 25.6 GB/s processor-to-memory bandwidth

    • > 200 GB/s EIB sustained aggregate bandwidth

    • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
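
For reference, these peak numbers are consistent with simple arithmetic (ours, not from the slide): each SPE can issue one 4-wide single-precision fused multiply-add per cycle, so 8 SPEs x 4 lanes x 2 flops x 3.2 GHz = 204.8 GFLOPS in single precision; double-precision instructions on the original Cell/B.E. are 2-wide and stall for roughly 7 cycles each, giving about 14.6 GFLOPS (8 x 2 x 2 x 3.2 GHz / 7).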



Presentation outline

  • Introduction

    • Our view of MILC applications

    • Introduction to Cell Broadband Engine

  • Implementation in Cell/B.E.

    • PPE performance and STREAM benchmark

    • Profile in CPU and kernels to be ported

    • Different approaches

  • Performance

  • Conclusion



Performance in PPE

  • Step 1: try to run the code on the PPE

  • On the PPE it runs approximately 2-3x slower than on a modern CPU

  • MILC is bandwidth-bound

  • This agrees with what we see with the STREAM benchmark (a minimal triad sketch follows below)
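
A minimal, self-contained STREAM-style triad that could be used to reproduce this kind of bandwidth comparison on the PPE and on a Xeon core (array size and timing method are our choices, not from the slide):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)               /* large enough to defeat the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two loads and one store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);
    printf("triad bandwidth: %.2f GB/s\n", bytes / sec / 1e9);

    free(a); free(b); free(c);
    return 0;
}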



Execution profile and kernels to be ported

  • Ten of these subroutines are responsible for >90% of the overall runtime

  • All kernels together account for 98.8%


Kernel memory access pattern

[Figure: data access pattern for one lattice site (lattice site 0), showing accesses to the site's own data and to data gathered from neighboring sites]

/* Loop over all lattice sites of the given parity: i indexes the site,
   s points at the corresponding struct site in the lattice array. */
#define FORSOMEPARITY(i,s,choice) \
    for( i = ((choice) == ODD ? even_sites_on_node : 0), s = &(lattice[i]); \
         i < ((choice) == EVEN ? even_sites_on_node : sites_on_node); \
         i++, s++ )

FORSOMEPARITY(i,s,parity) {
    /* Each call works on small objects: 72-byte su3_matrix links,
       gathered neighbor data (gen_pt), and 96-byte wilson_vectors. */
    mult_adj_mat_wilson_vec( &(s->link[nu]), (wilson_vector *)F_PT(s,rsrc), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, (wilson_vector *)F_PT(s,lsrc), (su3_matrix *)F_PT(s,mat) );
}

  • Kernel code must be SIMDized

  • Performance is determined by how fast the data can be DMAed in and out, not by the SIMDized code

  • In each iteration, only small data elements are accessed

    • Lattice site (struct site): 1832 bytes

    • su3_matrix: 72 bytes

    • wilson_vector: 96 bytes

  • Challenge: how to get data into the SPUs as fast as possible?

    • Cell/B.E. has the best DMA performance when the data is aligned to 128 bytes and its size is a multiple of 128 bytes (a minimal DMA sketch follows below)

    • The data layout in MILC meets neither requirement

One sample kernel from the udadu_mu_nu() routine
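
For reference, a minimal sketch of an SPE-side DMA transfer using the MFC intrinsics; the buffer size, tag number, and function name are illustrative, not taken from the MILC port:

#include <spu_mfcio.h>

/* Local-store buffer aligned to 128 bytes and sized as a multiple of
   128 bytes, the case in which the Cell/B.E. DMA engine performs best. */
static volatile char buf[1024] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)     /* ea: effective address in main memory */
{
    const unsigned int tag = 3;             /* any DMA tag id in 0..31 */
    mfc_get((void *)buf, ea, sizeof(buf), tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();              /* block until the transfer completes */
}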


Approach I: packing and unpacking

[Diagram: struct site data in main memory is packed by the PPE into contiguous buffers, transferred to the SPEs with DMA operations, and unpacked back into struct site after the results return]

  • Good DMA performance

  • Packing and unpacking are expensive on the PPE (a minimal packing sketch follows below)
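
A minimal sketch of the packing step, assuming a kernel that needs the four links and one wilson_vector per site; the packed_site layout and pack_sites() routine are illustrative, while su3_matrix, wilson_vector, site, lattice, F_PT and field_offset are the MILC definitions used elsewhere in this code:

#include <string.h>

/* 4 su3_matrix links (288 bytes) plus 1 wilson_vector (96 bytes) is
   384 bytes, already a multiple of 128, so each packed site can be
   DMAed efficiently. */
typedef struct {
    su3_matrix    link[4];
    wilson_vector vec;
} __attribute__((aligned(128))) packed_site;

/* PPE side: gather fields from the site-centric lattice into one
   contiguous buffer before the SPEs DMA it in; a matching unpack step
   scatters the results back after the SPEs finish. */
void pack_sites(packed_site *buf, int first, int n, field_offset vec_field)
{
    for (int i = 0; i < n; i++) {
        site *s = &lattice[first + i];
        memcpy(buf[i].link, s->link, 4 * sizeof(su3_matrix));
        memcpy(&buf[i].vec, F_PT(s, vec_field), sizeof(wilson_vector));
    }
}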



Approach II: indirect memory access

[Diagram: elements of the original lattice are replaced by pointers into contiguous memory regions in main memory (the modified lattice), which the SPEs read and write directly with DMA operations]

  • Replace elements in struct site with pointers

  • The pointers point to contiguous memory regions (see the sketch below)

  • PPE overhead due to the indirect memory access
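
A minimal sketch of the modified layout, with just two fields shown; the real struct site has many more members:

/* Original, site-centric layout: all fields of one site are adjacent. */
typedef struct {
    su3_matrix    link[4];
    wilson_vector vec;
    /* ... many more fields ... */
} site_orig;

/* Modified layout: the site keeps only pointers, and the data for each
   field lives in its own contiguous, 128-byte-aligned array that the
   SPEs can stream with large DMA transfers. */
typedef struct {
    su3_matrix    *link;       /* points into link_storage[4 * i] */
    wilson_vector *vec;        /* points into vec_storage[i]      */
} site_indirect;

su3_matrix    *link_storage;   /* sites_on_node * 4 matrices in one block */
wilson_vector *vec_storage;    /* sites_on_node vectors in one block      */

/* The extra pointer dereference on every access is the PPE overhead
   mentioned above. */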


Approach III: padding and small memory DMAs

[Diagram: the original lattice is padded in main memory so that each element starts on a 128-byte boundary; the SPEs then fetch individual elements with small DMA operations]

  • Pad the elements to an appropriate size

  • Pad struct site to an appropriate size

  • Gained good bandwidth performance at the cost of padding overhead

  • su3_matrix padded from a 3x3 complex to a 4x4 complex matrix (see the sketch below)

    • 72 bytes → 128 bytes

    • Bandwidth efficiency lost: 44%

  • wilson_vector padded from 4x3 complex to 4x4 complex

    • 96 bytes → 128 bytes

    • Bandwidth efficiency lost: 23%
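
A sketch of the padded element types, assuming single-precision complex numbers; the type names are illustrative and the padding follows the byte counts above:

typedef struct { float real, imag; } complex_f;

/* su3_matrix padded from 3x3 to 4x4 complex: 72 -> 128 bytes, so every
   matrix starts on a 128-byte boundary. Only e[0..2][0..2] hold data. */
typedef struct {
    complex_f e[4][4];
} __attribute__((aligned(128))) su3_matrix_padded;

/* wilson_vector padded from 4x3 to 4x4 complex: 96 -> 128 bytes.
   Only d[0..3][0..2] hold data. */
typedef struct {
    complex_f d[4][4];
} __attribute__((aligned(128))) wilson_vector_padded;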


Struct site padding

  • 128-byte strided accesses achieve different bandwidth depending on the stride size

  • This is due to the 16 banks in main memory

  • Strides that are an odd multiple of 128 bytes always reach peak bandwidth

  • We chose to pad struct site to 2688 (21*128) bytes (see the sketch below)
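
A sketch of how the chosen padding can be expressed, assuming the MILC struct site definition; the wrapper type and compile-time size check are our additions:

/* Pad struct site up to 2688 bytes = 21 * 128. An odd multiple of 128
   spreads consecutive sites across the 16 memory banks, so strided DMA
   accesses reach peak bandwidth. */
#define SITE_PADDED_SIZE 2688

typedef struct {
    site s;                                        /* original MILC site */
    char pad[SITE_PADDED_SIZE - sizeof(site)];     /* fails to compile if site outgrows the pad */
} __attribute__((aligned(128))) site_padded;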



Presentation outline

  • Introduction

    • Our view of MILC applications

    • Introduction to Cell Broadband Engine

  • Implementation in Cell/B.E.

    • PPE performance and STREAM benchmark

    • Profile in CPU and kernels to be ported

    • Different approaches

  • Performance

  • Conclusion



Kernel performance

  • GFLOPS are low for all kernels

  • Bandwidth is around 80% of peak for most kernels

  • Kernel speedups compared to the CPU are between 10x and 20x for most kernels

  • The set_memory_to_zero kernel achieves ~40x speedup, and su3mat_copy() more than 15x



Application performance

[Chart: results for the 8x8x16x16 lattice]

  • Single-Cell application performance speedup

    • ~8-10x, compared to a single Xeon core

  • Cell blade application performance speedup

    • 1.5-4.1x, compared to a 2-socket, 8-core Xeon blade

  • Profile on the Xeon

    • 98.8% parallel code (sped up on the Cell), 1.2% serial code (slowed down on the Cell)

    • 67-38% kernel SPU time, 33-62% PPU time of the overall runtime on the Cell

      The PPE is standing in the way of further improvement

[Chart: results for the 16x16x16x16 lattice]



Application performance on two blades

  • For comparison, we ran the application on two Intel Xeon blades and on two Cell/B.E. blades connected through Gigabit Ethernet

  • More data are needed for Cell blades connected through InfiniBand



Application performance: a fair comparison

  • The PPE is slower than the Xeon

  • The PPE + 1 SPE is ~2x faster than the Xeon

  • A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade



Conclusion

  • We achieved reasonably good performance

    • 4.5-5.0 GFLOPS on one Cell processor for the whole application

  • We maintained the MPI framework

    • Without the assumption that the code runs on a single Cell processor, certain optimizations, e.g. loop fusion, cannot be done

  • The current site-centric data layout forces us to take the padding approach

    • 23-44% of bandwidth efficiency is lost

    • Fix: a field-centric data layout is desired

  • The PPE slows down the serial part, which is a problem for further improvement

    • Fix: IBM putting a full-featured Power core in Cell/B.E.

  • The PPE may also cause problems when scaling to multiple Cell blades

    • A test of the PPE over InfiniBand is needed

