
Mapping the FDTD Application to Many-Core Chip Architectures

Computer Architecture and Parallel Systems Laboratory, Electrical and Computer Engineering Department, University of Delaware. Daniel Orozco and Guang Gao.

Presentation Transcript


  1. Computer Architecture and Parallel Systems Laboratory, Electrical and Computer Engineering Department, University of Delaware. Daniel Orozco and Guang Gao. Mapping the FDTD Application to Many-Core Chip Architectures.

  2. Outline

  3. What is FDTD? FDTD (Finite Difference Time Domain) simulates the propagation of electromagnetic waves through materials. The method goes from a scientific formulation, through discretization, to iteration; a typical set of discretized update equations is sketched below.
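
As an illustration (the slides show the equations only in a figure), a standard one-dimensional FDTD discretization updates the electric field E and the magnetic field H in a leapfrog fashion; the coefficients below fold in the material constants and the step sizes and are not taken from the slides:

```latex
% Standard 1D FDTD leapfrog updates; illustrative, not copied from the slides.
\begin{align*}
H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr) &= H_y^{\,n-1/2}\bigl(i+\tfrac12\bigr)
   + \frac{\Delta t}{\mu\,\Delta x}\Bigl[E_x^{\,n}(i+1) - E_x^{\,n}(i)\Bigr] \\
E_x^{\,n+1}(i) &= E_x^{\,n}(i)
   + \frac{\Delta t}{\varepsilon\,\Delta x}\Bigl[H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr)
   - H_y^{\,n+1/2}\bigl(i-\tfrac12\bigr)\Bigr]
\end{align*}
```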

  4. A Simple FDTD Computation. A minimal one-dimensional version of the kernel is sketched below.
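
A minimal sketch of an untiled 1D FDTD time-stepping loop, assuming a hard source at the left boundary and illustrative constants (this is not the exact code from the slides):

```c
/*
 * Minimal untiled 1D FDTD kernel (illustrative sketch). ex[] holds the
 * electric field, hy[] the magnetic field; CE and CH fold the material
 * constants and the dx/dt step sizes.
 */
#include <stdio.h>

#define N     1000    /* number of grid cells   */
#define STEPS 500     /* number of time steps   */
#define CE    0.5     /* illustrative constants */
#define CH    0.5

static double ex[N], hy[N];

int main(void)
{
    for (int t = 0; t < STEPS; t++) {
        /* update H from the current E field */
        for (int i = 0; i < N - 1; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);

        /* update E from the just-updated H field */
        for (int i = 1; i < N - 1; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);

        ex[0] = 1.0;    /* simple hard source at the left boundary */
    }
    printf("ex[N/2] = %f\n", ex[N / 2]);
    return 0;
}
```

Every time step sweeps both arrays through memory once, which is why off-chip bandwidth becomes the bottleneck on a many-core chip.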

  5. The Memory Wall and Many-Core Architectures

  6. What is the point of this presentation? What can be done about the limited off-chip memory bandwidth? Use the on-chip memory!

  7. Background: What are Data Dependencies? Data dependencies show which values are needed to calculate a particular value; drawing them all gives a Data Dependency Graph (DDG). DDGs are useful for checking whether code transformations are valid: if the DDG shows that E(1,1) needs E(0,2), then a transformation that computes E(1,1) before E(0,2) is not valid.
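
For example, under the update equations sketched earlier (the slides' own indexing convention may differ), the dependencies of one electric-field value are:

```latex
E_x^{\,n+1}(i)\ \text{depends on}\
\Bigl\{\,E_x^{\,n}(i),\; H_y^{\,n+1/2}\bigl(i-\tfrac12\bigr),\;
H_y^{\,n+1/2}\bigl(i+\tfrac12\bigr)\Bigr\}
```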

  8. Stencil Computations. Examples include image processing and the solution of partial differential equations. What do they have in common? They read a neighborhood of values, create a new value, and overwrite the old one, so a lot of memory bandwidth is required. What do their data dependency graphs look like? A generic example is sketched below.
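
As a concrete, hypothetical stencil outside of FDTD, a 3x3 image blur shows the same read-a-neighborhood, create-new, overwrite pattern; each output element reads nine inputs, which is where a figure like the "9 loads per element" on the next slide comes from:

```c
/*
 * Illustrative 3x3 averaging stencil (e.g., an image blur); hypothetical
 * example, not taken from the slides. Each output element reads 9 inputs.
 */
#define IMG_W 512
#define IMG_H 512

void blur3x3(const float in[IMG_H][IMG_W], float out[IMG_H][IMG_W])
{
    for (int y = 1; y < IMG_H - 1; y++)
        for (int x = 1; x < IMG_W - 1; x++) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; dy++)      /* 3x3 neighborhood */
                for (int dx = -1; dx <= 1; dx++)
                    s += in[y + dy][x + dx];
            out[y][x] = s / 9.0f;                 /* create new, overwrite */
        }
}
```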

  9. Tiling. Tiling is the process of calculating only a part of the problem at a time to reduce the pressure on memory. No tiling: 9 memory loads per element computed. With tiling: 1.44 memory loads per element computed.
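
The slide does not state the tile size behind the 1.44 figure; one set of assumptions that reproduces it exactly is a 3x3 stencil with 10x10 on-chip tiles and a one-cell halo:

```latex
\text{no tiling: } \frac{\text{loads}}{\text{element}} = 9,
\qquad
\text{tiling: } \frac{(T+2)^2}{T^2}\bigg|_{T=10} = \frac{144}{100} = 1.44
```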

  10. Tiling and Parallel Execution. Tiling in a one-dimensional algorithm: rows of the iteration space represent successive loads from and stores to memory, and tiles cannot span more than one row because of the mutual data dependences between rows (the invalid tiles in the figure).

  11. Time Skewing. After skewing, the DDG is redrawn so that tiles can extend across several time steps in the vertical direction. Tiles are parallel AND bigger. This kind of parallelism is called wavefront parallelism, and it is harder to program than regular tiles; a sketch of a skewed tile loop appears below.
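
A sketch of time skewing for a generic 1D three-point stencil (hypothetical code, not from the slides): each tile is a parallelogram in (time, space) whose window shifts by one cell per time step, and tiles inside a band must run left to right.

```c
/*
 * Time-skewed tiling sketch for a generic 1D three-point stencil
 * (hypothetical code). Each tile's spatial window shifts left by one cell
 * per time step: that shift is the skew. Within a band the tiles depend on
 * their left neighbour and must run in order; boundary cells 0 and N-1 are
 * assumed to stay fixed at zero.
 */
#define N    4096   /* grid size                */
#define T    64     /* time steps in this band  */
#define TILE 256    /* spatial width of a tile  */

static double a[N], b[N];

static int max_i(int x, int y) { return x > y ? x : y; }
static int min_i(int x, int y) { return x < y ? x : y; }

void skewed_band(void)
{
    for (int ii = 0; ii < N + T; ii += TILE) {          /* tiles, left to right    */
        double *cur = a, *next = b;                     /* same start in each tile */
        for (int t = 0; t < T; t++) {                   /* time steps inside tile  */
            int lo = max_i(1, ii - t);                  /* skewed window bounds    */
            int hi = min_i(N - 2, ii + TILE - 1 - t);
            for (int i = lo; i <= hi; i++)
                next[i] = 0.25 * (cur[i - 1] + 2.0 * cur[i] + cur[i + 1]);
            double *tmp = cur; cur = next; next = tmp;  /* advance one time step   */
        }
    }
}
```

When both space and time are tiled, the independent tiles line up along diagonals of the tile grid; scheduling those diagonals is the wavefront parallelism the slide refers to.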

  12. Other Parallel Tiling Approaches: Overlapped Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, but there are redundant computations: only 50% of the computations are used (the lost computations are not shown). A sketch appears below.
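
A sketch of overlapped tiling for the same generic 1D stencil (hypothetical code): each tile loads a halo of T extra cells per side and recomputes it privately, so tiles are fully independent; a large fraction of the halo work is thrown away, which is the redundancy the slide quantifies at about 50%.

```c
/*
 * Overlapped-tiling sketch for a 1D three-point stencil (hypothetical
 * code). Each tile loads its cells plus a halo of T cells on each side,
 * advances T time steps privately, and keeps only the central TILE cells;
 * the halo work is redundant across neighbouring tiles.
 */
#include <string.h>

#define T     64                    /* time steps advanced per tile      */
#define TILE  128                   /* useful output cells per tile      */
#define LOCAL (TILE + 2 * T)        /* tile plus halos held on chip      */

void overlapped_tile(const double *in, double *out, int s /* first output cell */)
{
    double buf0[LOCAL], buf1[LOCAL];
    double *cur = buf0, *next = buf1;

    /* load the tile plus a halo of T cells per side (bounds assumed valid) */
    memcpy(cur, &in[s - T], LOCAL * sizeof(double));

    for (int t = 0; t < T; t++) {
        /* the region that is still valid shrinks by one cell per step */
        for (int i = t + 1; i < LOCAL - 1 - t; i++)
            next[i] = 0.25 * (cur[i - 1] + 2.0 * cur[i] + cur[i + 1]);
        double *tmp = cur; cur = next; next = tmp;
    }

    /* only the central TILE cells are correct; the rest is discarded */
    memcpy(&out[s], &cur[T], TILE * sizeof(double));
}
```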

  13. Other Parallel Tiling Approaches: Split Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel and there are no lost computations. This is the state of the art.

  14. Our Contribution: Diamond Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, there are no lost computations, and reuse is maximized.

  15. Is there a Trick? (Figure: start and end of a tile.) Well, the tile borders run across time iterations, and we do have to load and store TWO arrays (E and H) to meet the dependencies. But it is all for a good cause. A sketch of one diamond's iteration space follows.
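
A sketch of the iteration space of one diamond tile for the 1D FDTD kernel (hypothetical code; the center c, half-width W, and coefficients are illustrative). The expanding half reads a few values just outside the tile, which are assumed to have already been produced and stored by the neighbouring diamonds of the previous round; that inter-diamond exchange of the two arrays is the bookkeeping this slide calls the trick, and it is omitted here.

```c
/*
 * Iteration space of one diamond tile for 1D FDTD (sketch only).
 * ex[] and hy[] are the E and H arrays; c is the diamond's center cell and
 * W its half-width (hypothetical parameters). Reads that fall just outside
 * the current extent are halo values assumed to come from the neighbouring
 * diamonds of the previous round; array bounds are assumed valid.
 */
#define CE 0.5   /* illustrative update coefficients */
#define CH 0.5

void diamond_tile(double *ex, double *hy, int c, int W)
{
    /* expanding (lower) half: the tile grows by one cell per side each step */
    for (int k = 0; k < W; k++) {
        for (int i = c - k - 1; i <= c + k; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);   /* reads ex just outside the edge */
        for (int i = c - k - 1; i <= c + k + 1; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);   /* reads hy just outside the edge */
    }
    /* contracting (upper) half: everything needed is already inside the tile */
    for (int k = 0; k < W; k++) {
        for (int i = c - W + k; i <= c + W - 1 - k; i++)
            hy[i] += CH * (ex[i + 1] - ex[i]);
        for (int i = c - W + k + 1; i <= c + W - 1 - k; i++)
            ex[i] += CE * (hy[i] - hy[i - 1]);
    }
}
```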

  16. We also tried: Triangle Tiling. (Figure: tile shape and logical view.) Tiles are fully parallel, there are no lost computations, and the programming is very simple.

  17. We also tried: Parametric Tiling. (Figure: logical view.) Tiles are fully parallel, there are no lost computations, and the approach is useful for understanding the problem.

  18. Reuse. Reuse is the key concept for on-chip memory: Reuse = (number of elements computed) / (number of memory operations). Why is reuse important? (Figure: 20 cores like this need a connection like this; the configurations shown correspond to Reuse = 40 and Reuse = 5.)
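
One way to make the connection between reuse and bandwidth explicit (illustrative reasoning; the symbols C and B are not from the slides): if the cores compute C elements per second and the reuse is R, the chip needs roughly

```latex
\text{Reuse } R = \frac{\text{elements computed}}{\text{off-chip memory operations}}
\quad\Longrightarrow\quad
B_{\text{off-chip}} \approx \frac{C}{R} \times (\text{bytes per memory operation})
```

so raising R from 5 to 40 cuts the off-chip bandwidth needed to keep the same 20 cores busy by a factor of 8.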

  19. How good are Tiles at Reuse? (Figure: reuse comparison across No Tiling, Simple Tiling, Overlapped Tiling, Triangle Tiling, Split Tiling, Skewed Tiling, Parametric Tiling with p = 0.5, and Diamond Tiling, which was developed at CAPSL; some schemes are marked as not embarrassingly parallel.) The fine print: values are for a tile size of 100; reuse values change with tile size; results apply to a one-dimensional stencil computation with dependencies similar to those of the examples.

  20. But, Does it Really Work? (Figure: simulated speedup for Diamond tiles of size 64 and 16, Triangle tiles of size 64 and 16, and No Tiling.) The fine print: simulated speedup results for 1D FDTD running on Cyclops-64 using the FAST simulator; the problem size varies for each test and was selected as big as possible; only the computation time was measured; problem data is located in DRAM; tiling was done manually; compiled with GCC 3.4 at -O3.

  21. Other Considerations. Reuse = (number of elements computed) / (number of memory operations) = Area / Perimeter = O(N²) / O(N), so the reuse is O(N). The best tile is the BIGGEST tile, and if two tiles have the same width, the one with the MOST AREA has the best reuse. (Figure: Diamond tile of size N and Parametric tile of size N, with low-reuse and high-reuse cases marked.)

  22. So, Lead Us!
  • Reuse lowers the required bandwidth.
  • Bandwidth is the limiting factor for FDTD.
  • Compute several TIMESTEPS at the same time, and get better performance!

  23. Future Work: Multidimensional Diamonds? How are we going to partition THAT?

  24. Future Work: Dataflow Diamonds. Waiting for the slowest tile is wasteful, and then all tiles compete for bandwidth at the same time. Dataflow scheduling would solve that, but the implementation is still a research topic.

  25. Multiple Diamond Hierarchies. Diamonds work, and they use little bandwidth, but we still send the data back to memory after each diamond. Since we have a strong on-chip bus, maybe we can work with a Super Diamond!

  26. Questions?
