RTM at Petascale and Beyond
Presentation Transcript

  1. RTM at Petascale and Beyond • Michael Perrone, IBM Master Inventor, Computational Sciences Center, IBM Research

  2. RTM (Reverse Time Migration) Seismic Imaging on BG/Q • RTM is a widely used imaging technique for oil and gas exploration, particularly beneath subsalt formations • Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico • Imaging subsalt regions of the Earth is extremely challenging • Industry anticipates exascale need by 2020

  3. Bottom Line: Seismic Imaging • We can make RTM 10 to 100 times faster • How? Abandon embarrassingly parallel RTM; use domain-partitioned, multisource RTM • System requirements: high communication bandwidth, low communication latency, lots of memory • The approach extends equally well to FWI (Full Waveform Inversion)

  4. Take Home Messages • Embarrassingly parallel is not always the best approach • It is crucial to know where bottlenecks exist • Algorithmic changes can dramatically improve performance

  5. Compute performance on new hardware • Kernel performance improvement [chart: runtime of old hardware vs. new hardware 1 and new hardware 2]

  6. Compute performance on new hardware • Need to track end-to-end performance [chart: runtime of old hardware vs. new hardware 1 and new hardware 2, split into compute time and disk IO]

  7. Bottlenecks: Memory IO • GPU: 0.1 B/F (100 GB/s, 1 TF/s) • BG/P: 1.0 B/F (13.6 GB/s, 13.6 GF/s) • BG/Q: 0.2 B/F (43 GB/s, 204.8 GF/s) • BG/Q L2: 1.5 B/F (>300 GB/s, 204.8 GF/s)
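The bytes-per-flop (B/F) figures above are just each platform's memory bandwidth divided by its peak compute rate. A quick sketch, using the numbers quoted on the slide:

```python
# Bytes-per-flop (B/F) = memory bandwidth / peak compute rate.
# Figures are the ones quoted on the slide; a low B/F means a
# bandwidth-bound stencil kernel like RTM starves for data.
platforms = {
    # name:     (bandwidth GB/s, peak GF/s)
    "GPU":      (100.0, 1000.0),
    "BG/P":     (13.6, 13.6),
    "BG/Q":     (43.0, 204.8),
    "BG/Q L2":  (300.0, 204.8),
}

for name, (bw_gbs, peak_gfs) in platforms.items():
    bf = bw_gbs / peak_gfs
    print(f"{name:8s} {bf:.2f} B/F")
```

Running this reproduces the slide's ratios to the quoted precision (the BG/Q L2 value uses the >300 GB/s lower bound).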

  8. GPUs for Seismic Imaging? • x86/GPU [old results, 2x now]: 17B stencils/second (NVIDIA / INRIA collaboration; velocity model 560x560x905; 22,760 iterations) • BlueGene/P: 40B stencils/second on a model of comparable size/complexity, with only partial optimization (MPI not overlapped; kernel optimization ongoing) • BlueGene/Q will be even faster. Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation, 2009. HPCS'09.

  9. Reverse Time Migration (RTM) [diagram: marine acquisition geometry for one shot; a ship records source data and receiver data, with ~1 km and ~5 km scales shown]

  10. RTM - Reverse Time Migration • Use the 3D wave equation to model sound propagation in the Earth • Forward (Source): propagate the source wavefield forward in time • Reverse (Receiver): propagate the recorded receiver data backward in time • Imaging condition: correlate the forward and reverse wavefields at each point [equations shown on slide]

  11. Implementing the Wave Equation • Finite difference in time (e.g., second-order centered difference) • Finite difference in space (centered stencil) • Plus absorbing boundary conditions, interpolation, compression, etc.
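To make the finite-difference scheme concrete, here is a minimal 1D sketch of centered second differences in space and time for the acoustic wave equation p_tt = v^2 p_xx. Grid size, time step, velocity, and the impulsive source are illustrative choices, not values from the talk, and the production kernels would be 3D with absorbing boundaries:

```python
import numpy as np

# Minimal 1D second-order finite-difference time stepping of the
# acoustic wave equation p_tt = v^2 * p_xx (illustrative values).
nx, nt = 200, 300
dx, dt = 10.0, 1e-3        # grid spacing (m), time step (s)
v = np.full(nx, 1500.0)    # constant velocity model, 1.5 km/s
# CFL number v*dt/dx = 0.15 < 1, so the scheme is stable.

p_prev = np.zeros(nx)      # wavefield at t - dt
p_curr = np.zeros(nx)      # wavefield at t
p_curr[nx // 2] = 1.0      # impulsive source in the middle

for _ in range(nt):
    lap = np.zeros(nx)
    # centered second difference in space (interior points only)
    lap[1:-1] = (p_curr[2:] - 2 * p_curr[1:-1] + p_curr[:-2]) / dx**2
    # centered second difference in time, solved for p at t + dt
    p_next = 2 * p_curr - p_prev + (v * dt) ** 2 * lap
    p_prev, p_curr = p_next_prev = p_curr, p_next
```

The boundary points are simply held at zero here; real RTM codes replace that with absorbing boundary conditions, as the slide notes.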

  12. RTM Algorithm (for each shot) • Load data: velocity model v(x,y,z), source & receiver data • Forward propagation: calculate P(x,y,z,t); every N timesteps, compress P(x,y,z,t) and write it to disk/memory • Backward propagation: calculate P(x,y,z,t); every N timesteps, read P(x,y,z,t) from disk/memory, decompress it, and calculate a partial sum of the image I(x,y,z) • Merge I(x,y,z) with the global image [diagram: timeline of forward snapshots F(kN), reverse steps R(kN), and image updates I(kN) at t = N, 2N, 3N, ..., kN]
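The snapshot-every-N-steps structure above can be sketched as follows. This is a schematic only: `propagate_forward` and `propagate_backward` are hypothetical stand-ins for the finite-difference kernels, compression is omitted, and the "disk" is an in-memory dict:

```python
import numpy as np

# Sketch of the per-shot RTM loop from the slide: store the forward
# wavefield every N steps, then correlate with the backward-propagated
# receiver field during the reverse pass.
def rtm_shot(nt, shape, N, propagate_forward, propagate_backward):
    snapshots = {}                      # stands in for disk/memory storage
    p = np.zeros(shape)
    for t in range(nt):                 # forward propagation
        p = propagate_forward(p, t)
        if t % N == 0:
            snapshots[t] = p.copy()     # compress + write in real code

    image = np.zeros(shape)
    r = np.zeros(shape)
    for t in range(nt - 1, -1, -1):     # backward propagation
        r = propagate_backward(r, t)
        if t % N == 0:
            # imaging condition: correlate forward and reverse fields
            image += snapshots[t] * r
    return image                        # merged into the global image later
```

With trivial propagators (e.g. `lambda p, t: p + 1`) the control flow can be checked by hand, which is how the structure, not the physics, is being illustrated.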

  13. Embarrassingly Parallel RTM • Process shots in parallel, one per slave node • A subset of the model goes to each slave for each shot (~100k+ shots) • Scratch disk is the bottleneck [diagram: master node distributes data from the disk archive to slave nodes, each with its own scratch disk]

  14. Domain-Partitioned Multisource RTM • Process all data at once with domain decomposition • Shots are merged and the model is partitioned across nodes • Small partitions mean the forward wavefield can be stored locally: no disks [diagram: master node distributes merged data from the disk archive to slave nodes; no scratch disks]

  15. Multisource RTM • Linear superposition principle, so N sources can be merged • The finite receiver array acts as a nonlinear filter on the data • Nonlinearity leads to “crosstalk” noise, which needs to be minimized • Accelerates RTM by a factor of N [diagram: full velocity model with receiver data, source positions, and the velocity subset used per shot]
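The superposition argument is just linearity of the wave operator: propagating the sum of N sources equals the sum of the N individually propagated wavefields, so one pass replaces N. A toy demonstration, with a random matrix standing in for one linear propagation step (the matrix and sizes are illustrative, not from the talk):

```python
import numpy as np

# The wave equation is linear, so propagating N merged sources in one
# pass equals the superposition of the N individually propagated fields.
# A random linear operator A stands in for one propagation step.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))          # stand-in linear propagator
sources = rng.standard_normal((4, 8))    # 4 individual source vectors

individually = sum(A @ s for s in sources)  # N separate propagations
merged = A @ sources.sum(axis=0)            # 1 propagation of the merged shot

assert np.allclose(individually, merged)
```

The crosstalk problem the slide mentions arises downstream of this identity: the imaging condition is quadratic in the wavefields, so cross terms between merged shots survive as noise even though propagation itself is linear.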

  16. 3D RTM Scaling (Partial optimization) • 512x512x512 & 1024x1024x1024 models • Scaling improves for larger models

  17. GPU Scaling is Comparatively Poor (Tsubame supercomputer, Japan) • GPUs achieve only 10% of peak performance (100x increase for 1000 nodes) Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November, 2010.

  18. Physical survey size mapped to BG/Q L2 cache • Isotropic RTM with minimum V = 1.5 km/s • 10 points per wavelength (5 would reduce the numbers below by 8x) • Mapping the entire survey volume, not a subset (enables multisource) [chart: # of racks vs. max imaging frequency for survey volumes of (512)^3, (4096)^3, and (16384)^3 m^3]
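The grid-size arithmetic behind this slide follows from the stated parameters: the shortest wavelength is v_min / f_max, the required grid spacing is that wavelength divided by the points-per-wavelength count, and the survey volume divided by the spacing cubed gives the grid-point total. A sketch using the slide's v_min = 1.5 km/s and 10 points per wavelength (the 4096 m / 30 Hz example is mine, chosen for illustration):

```python
# Grid-size arithmetic from the slide's parameters: grid spacing
# dx = v_min / (ppw * f_max), so total points = (side / dx)^3.
v_min = 1500.0   # minimum velocity, m/s (from the slide)
ppw = 10         # points per wavelength (from the slide)

def grid_points(survey_side_m, f_max_hz):
    dx = v_min / (ppw * f_max_hz)   # required grid spacing (m)
    n = survey_side_m / dx          # points per dimension
    return n ** 3                   # total grid points for the cube

# e.g. a 4096 m cube at 30 Hz needs dx = 5 m, i.e. 819.2^3 ~ 5.5e8 points.
# Halving ppw to 5 doubles dx and cuts the count by 8x, as the slide notes.
print(f"{grid_points(4096.0, 30.0):.3g}")
```

Since the point count scales as f_max^3, pushing the maximum imaging frequency up is what drives the rack counts on the slide's chart.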

  19. Snapshot Data Easily Fits in Memory (No disk required) • # of uncompressed snapshots that can be stored for various model sizes and node counts [table shown on slide] • 4x more capacity for BG/Q

  20. Comparison • Embarrassingly parallel RTM: coarse-grain communication, coarse-grain synchronization, disk IO bottleneck • Partitioned RTM: fine-grain communication, fine-grain synchronization, no scratch disk • Fine-grain work demands low latency and high bandwidth: Blue Gene

  21. Conclusion: RTM can be dramatically accelerated • Algorithmic: • Adopt partitioned, multisource RTM • Abandon embarrassingly parallel implementations • Hardware: • Increase communication bandwidth • Decrease communication latency • Reduce node nondeterminism • Advantages • Can process larger models - scales well • Avoids scratch disk IO bottleneck • Improves RAS & MTBF: No disk means no moving parts • Disadvantages • Must handle shot “crosstalk” noise • Methods exist - research continuing…