
Mapping the FDTD Application to Many-Core Chip Architectures

Computer Architecture and Parallel Systems Laboratory, Electrical and Computer Engineering Department, University of Delaware. Daniel Orozco and Guang Gao.

Presentation Transcript


  1. Computer Architecture and Parallel Systems Laboratory, Electrical and Computer Engineering Department, University of Delaware. Daniel Orozco and Guang Gao. Mapping the FDTD Application to Many-Core Chip Architectures.

  2. Outline

  3. What is FDTD? FDTD (Finite Difference Time Domain) simulates the propagation of electromagnetic waves through materials. The method goes from a scientific formulation, through discretization, to iteration; a typical set of discretized update equations is sketched below.
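
As an illustration (the slides show the equations only in a figure), a standard one-dimensional FDTD discretization updates the electric field E and the magnetic field H in a leapfrog fashion; the coefficients below fold in the material constants and the step sizes and are not taken from the slides:

```latex
% Standard 1D FDTD leapfrog updates; illustrative, not copied from the slides.
\begin{align*}
H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr) &= H_y^{\,n-1/2}\bigl(i+\tfrac12\bigr)
   + \frac{\Delta t}{\mu\,\Delta x}\Bigl[E_x^{\,n}(i+1) - E_x^{\,n}(i)\Bigr] \\
E_x^{\,n+1}(i) &= E_x^{\,n}(i)
   + \frac{\Delta t}{\varepsilon\,\Delta x}\Bigl[H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr)
   - H_y^{\,n+1/2}\bigl(i-\tfrac12\bigr)\Bigr]
\end{align*}
```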

  4. A Simple FDTD Computation. A minimal one-dimensional version of the kernel is sketched below.
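
A minimal sketch of an untiled 1D FDTD time-stepping loop, assuming a hard source at the left boundary and illustrative constants (this is not the exact code from the slides):

```c
/*
 * Minimal untiled 1D FDTD kernel (illustrative sketch). ex[] holds the
 * electric field, hy[] the magnetic field; CE and CH fold the material
 * constants and the dx/dt step sizes.
 */
#include <stdio.h>

#define N     1000    /* number of grid cells   */
#define STEPS 500     /* number of time steps   */
#define CE    0.5     /* illustrative constants */
#define CH    0.5

static double ex[N], hy[N];

int main(void)
{
    for (int t = 0; t < STEPS; t++) {
        /* update H from the current E field */
        for (int i = 0; i < N - 1; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);

        /* update E from the just-updated H field */
        for (int i = 1; i < N - 1; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);

        ex[0] = 1.0;    /* simple hard source at the left boundary */
    }
    printf("ex[N/2] = %f\n", ex[N / 2]);
    return 0;
}
```

Every time step sweeps both arrays through memory once, which is why off-chip bandwidth becomes the bottleneck on a many-core chip.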

  5. The Memory Wall and Many-Core Architectures

  6. What is the point of this presentation? What can be done about the limited off-chip memory bandwidth? Use the on-chip memory!

  7. Background: What are Data Dependencies? Data dependencies show which values are needed to calculate a particular value; drawing them all gives a Data Dependency Graph (DDG). DDGs are useful for checking whether code transformations are valid: if the DDG shows that E(1,1) needs E(0,2), then a transformation that computes E(1,1) before E(0,2) is not valid.
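
For example, under the update equations sketched earlier (the slides' own indexing convention may differ), the dependencies of one electric-field value are:

```latex
E_x^{\,n+1}(i)\ \text{depends on}\
\Bigl\{\,E_x^{\,n}(i),\; H_y^{\,n+1/2}\bigl(i-\tfrac12\bigr),\;
H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr)\Bigr\}
```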

  8. Stencil Computations. Examples include image processing and the solution of partial differential equations. What do they have in common? They read a neighborhood of values, create a new value, and overwrite the old one, so a lot of memory bandwidth is required. What do their data dependency graphs look like? A generic example is sketched below.
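
As a concrete, hypothetical stencil outside of FDTD, a 3x3 image blur shows the same read-a-neighborhood, create-new, overwrite pattern; each output element reads nine inputs, which is where a figure like the "9 loads per element" on the next slide comes from:

```c
/*
 * Illustrative 3x3 averaging stencil (e.g., an image blur); hypothetical
 * example, not taken from the slides. Each output element reads 9 inputs.
 */
#define IMG_W 512
#define IMG_H 512

void blur3x3(const float in[IMG_H][IMG_W], float out[IMG_H][IMG_W])
{
    for (int y = 1; y < IMG_H - 1; y++)
        for (int x = 1; x < IMG_W - 1; x++) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; dy++)      /* 3x3 neighborhood */
                for (int dx = -1; dx <= 1; dx++)
                    s += in[y + dy][x + dx];
            out[y][x] = s / 9.0f;                 /* create new, overwrite */
        }
}
```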

  9. Tiling. Tiling is the process of calculating only a part of the problem at a time to reduce the pressure on memory. No tiling: 9 memory loads per element computed. With tiling: 1.44 memory loads per element computed.
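
The slide does not state the tile size behind the 1.44 figure; one set of assumptions that reproduces it exactly is a 3x3 stencil with 10x10 on-chip tiles and a one-cell halo:

```latex
\text{no tiling: } \frac{\text{loads}}{\text{element}} = 9,
\qquad
\text{tiling: } \frac{(T+2)^2}{T^2}\bigg|_{T=10} = \frac{144}{100} = 1.44
```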

  10. Tiling and Parallel Execution. Tiling in a one-dimensional algorithm: rows of the iteration space represent successive loads from and stores to memory, and tiles cannot span more than one row because of the mutual data dependences between rows (the invalid tiles in the figure).

  11. Time Skewing. After skewing, the DDG is redrawn so that tiles can extend across several time steps in the vertical direction. Tiles are parallel AND bigger. This kind of parallelism is called wavefront parallelism, and it is harder to program than regular tiles; a sketch of a skewed tile loop appears below.
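
A sketch of time skewing for a generic 1D three-point stencil (hypothetical code, not from the slides): each tile is a parallelogram in (time, space) whose window shifts by one cell per time step, and tiles inside a band must run left to right.

```c
/*
 * Time-skewed tiling sketch for a generic 1D three-point stencil
 * (hypothetical code). Each tile's spatial window shifts left by one cell
 * per time step: that shift is the skew. Within a band the tiles depend on
 * their left neighbour and must run in order; boundary cells 0 and N-1 are
 * assumed to stay fixed at zero.
 */
#define N    4096   /* grid size                */
#define T    64     /* time steps in this band  */
#define TILE 256    /* spatial width of a tile  */

static double a[N], b[N];

static int max_i(int x, int y) { return x > y ? x : y; }
static int min_i(int x, int y) { return x < y ? x : y; }

void skewed_band(void)
{
    for (int ii = 0; ii < N + T; ii += TILE) {          /* tiles, left to right    */
        double *cur = a, *next = b;                     /* same start in each tile */
        for (int t = 0; t < T; t++) {                   /* time steps inside tile  */
            int lo = max_i(1, ii - t);                  /* skewed window bounds    */
            int hi = min_i(N - 2, ii + TILE - 1 - t);
            for (int i = lo; i <= hi; i++)
                next[i] = 0.25 * (cur[i - 1] + 2.0 * cur[i] + cur[i + 1]);
            double *tmp = cur; cur = next; next = tmp;  /* advance one time step   */
        }
    }
}
```

When both space and time are tiled, the independent tiles line up along diagonals of the tile grid; scheduling those diagonals is the wavefront parallelism the slide refers to.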

  12. Other Parallel Tiling Approaches: Overlapped Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, but there are redundant computations: only 50% of the computations are used (the lost computations are not shown). A sketch appears below.
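
A sketch of overlapped tiling for the same generic 1D stencil (hypothetical code): each tile loads a halo of T extra cells per side and recomputes it privately, so tiles are fully independent; a large fraction of the halo work is thrown away, which is the redundancy the slide quantifies at about 50%.

```c
/*
 * Overlapped-tiling sketch for a 1D three-point stencil (hypothetical
 * code). Each tile loads its cells plus a halo of T cells on each side,
 * advances T time steps privately, and keeps only the central TILE cells;
 * the halo work is redundant across neighbouring tiles.
 */
#include <string.h>

#define T     64                    /* time steps advanced per tile      */
#define TILE  128                   /* useful output cells per tile      */
#define LOCAL (TILE + 2 * T)        /* tile plus halos held on chip      */

void overlapped_tile(const double *in, double *out, int s /* first output cell */)
{
    double buf0[LOCAL], buf1[LOCAL];
    double *cur = buf0, *next = buf1;

    /* load the tile plus a halo of T cells per side (bounds assumed valid) */
    memcpy(cur, &in[s - T], LOCAL * sizeof(double));

    for (int t = 0; t < T; t++) {
        /* the region that is still valid shrinks by one cell per step */
        for (int i = t + 1; i < LOCAL - 1 - t; i++)
            next[i] = 0.25 * (cur[i - 1] + 2.0 * cur[i] + cur[i + 1]);
        double *tmp = cur; cur = next; next = tmp;
    }

    /* only the central TILE cells are correct; the rest is discarded */
    memcpy(&out[s], &cur[T], TILE * sizeof(double));
}
```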

  13. Other Parallel Tiling Approaches: Split Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel and there are no lost computations. This is the state of the art.

  14. Our Contribution: Diamond Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, there are no lost computations, and reuse is maximized.

  15. Is there a Trick? (Figure: start and end of a tile.) Well, the tile borders run across time iterations, and we do have to load and store TWO arrays (E and H) to meet the dependencies. But it is all for a good cause. A sketch of one diamond's iteration space follows.
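
A sketch of the iteration space of one diamond tile for the 1D FDTD kernel (hypothetical code; the center c, half-width W, and coefficients are illustrative). The expanding half reads a few values just outside the tile, which are assumed to have already been produced and stored by the neighbouring diamonds of the previous round; that inter-diamond exchange of the two arrays is the bookkeeping this slide calls the trick, and it is omitted here.

```c
/*
 * Iteration space of one diamond tile for 1D FDTD (sketch only).
 * ex[] and hy[] are the E and H arrays; c is the diamond's center cell and
 * W its half-width (hypothetical parameters). Reads that fall just outside
 * the current extent are halo values assumed to come from the neighbouring
 * diamonds of the previous round; array bounds are assumed valid.
 */
#define CE 0.5   /* illustrative update coefficients */
#define CH 0.5

void diamond_tile(double *ex, double *hy, int c, int W)
{
    /* expanding (lower) half: the tile grows by one cell per side each step */
    for (int k = 0; k < W; k++) {
        for (int i = c - k - 1; i <= c + k; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);   /* reads ex just outside the edge */
        for (int i = c - k - 1; i <= c + k + 1; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);   /* reads hy just outside the edge */
    }
    /* contracting (upper) half: everything needed is already inside the tile */
    for (int k = 0; k < W; k++) {
        for (int i = c - W + k; i <= c + W - 1 - k; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);
        for (int i = c - W + k + 1; i <= c + W - 1 - k; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);
    }
}
```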

  16. We also tried: Triangle Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, there are no lost computations, and the programming is very simple.

  17. We also tried: Parametric Tiling. (Figure: logical view.) Tiles are fully parallel, there are no lost computations, and the approach is useful for understanding the problem.

  18. Reuse. Reuse is the key concept for on-chip memory: Reuse = (number of elements computed) / (number of memory operations). Why is reuse important? (Figure: 20 cores like this need a connection like this; the configurations shown correspond to Reuse = 40 and Reuse = 5.)
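
One way to make the connection between reuse and bandwidth explicit (illustrative reasoning; the symbols C and B are not from the slides): if the cores compute C elements per second and the reuse is R, the chip needs roughly

```latex
\text{Reuse } R = \frac{\text{elements computed}}{\text{off-chip memory operations}}
\quad\Longrightarrow\quad
B_{\text{off-chip}} \approx \frac{C}{R} \times (\text{bytes per memory operation})
```

so raising R from 5 to 40 cuts the off-chip bandwidth needed to keep the same 20 cores busy by a factor of 8.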

  19. How good are Tiles at Reuse? (Figure: reuse comparison across No Tiling, Simple Tiling, Overlapped Tiling, Triangle Tiling, Split Tiling, Skewed Tiling, Parametric Tiling with p = 0.5, and Diamond Tiling, which was developed at CAPSL; some schemes are marked as not embarrassingly parallel.) The fine print: values are for a tile size of 100; reuse values change with tile size; results apply to a one-dimensional stencil computation with dependencies similar to those of the examples.

  20. But, Does it Really Work? (Figure: simulated speedup for Diamond tiles of size 64 and 16, Triangle tiles of size 64 and 16, and No Tiling.) The fine print: simulated speedup results for 1D FDTD running on Cyclops-64 using the FAST simulator; the problem size varies for each test and was selected as big as possible; only the computation time was measured; problem data is located in DRAM; tiling was done manually; compiled with GCC 3.4 at -O3.

  21. Other Considerations. Reuse = (number of elements computed) / (number of memory operations) = Area / Perimeter = O(N²) / O(N), so the reuse is O(N). The best tile is the BIGGEST tile, and if two tiles have the same width, the one with the MOST AREA has the best reuse. (Figure: Diamond tile of size N and Parametric tile of size N, with low-reuse and high-reuse cases marked.)

  22. So, Lead Us!
  • Reuse lowers the required bandwidth.
  • Bandwidth is the limiting factor for FDTD.
  • Compute several TIMESTEPS at the same time, and get better performance!

  23. Future Work: Multidimensional Diamonds? How are we going to partition THAT?

  24. Future Work: Dataflow Diamonds. Waiting for the slowest tile is wasteful, and then all tiles compete for bandwidth at the same time. Dataflow scheduling would solve that, but the implementation is still a research topic.

  25. Multiple Diamond Hierarchies. Diamonds work, and they use little bandwidth, but we still send the data back to memory after each diamond. Since we have a strong on-chip bus, maybe we can work with a Super Diamond!

  26. Questions?
