##### RTM at Petascale and Beyond


Michael Perrone, IBM Master Inventor
Computational Sciences Center, IBM Research

**RTM (Reverse Time Migration) Seismic Imaging on BG/Q**

- RTM is a widely used imaging technique for oil and gas exploration, particularly under subsalts
- Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico
- Imaging subsalt regions of the Earth is extremely challenging
- Industry anticipates exascale need by 2020

**Bottom Line: Seismic Imaging**

- We can make RTM 10 to 100 times faster
- How?
  - Abandon embarrassingly parallel RTM
  - Use domain-partitioned, multisource RTM
- System requirements
  - High communication bandwidth
  - Low communication latency
  - Lots of memory
- Can be extended equally well to FWI (full waveform inversion)

**Take Home Messages**

- Embarrassingly parallel is not always the best approach
- It is crucial to know where bottlenecks exist
- Algorithmic changes can dramatically improve performance

**Compute performance on new hardware**

- Kernel performance improvement

[Chart: kernel run time on old hardware vs. new hardware 1 and new hardware 2]

**Compute performance on new hardware**

- Need to track end-to-end performance

[Chart: run time, including disk IO, on old hardware vs. new hardware 1 and new hardware 2]

**Bottlenecks: Memory IO**

| System  | Bytes/Flop | Memory BW  | Peak Compute |
| ------- | ---------- | ---------- | ------------ |
| GPU     | 0.1 B/F    | 100 GB/s   | 1 TF/s       |
| BG/P    | 1.0 B/F    | 13.6 GB/s  | 13.6 GF/s    |
| BG/Q    | 0.2 B/F    | 43 GB/s    | 204.8 GF/s   |
| BG/Q L2 | 1.5 B/F    | > 300 GB/s | 204.8 GF/s   |

**GPUs for Seismic Imaging?**

- x86/GPU [old results, 2x now]
  - 17B stencils/second
  - nVidia / INRIA collaboration
  - Velocity model: 560x560x905
  - Iterations: 22760
- BlueGene/P
  - 40B stencils/second
  - Comparable model size/complexity
  - Partial optimization
    - MPI not overlapped
    - Kernel optimization ongoing
- BlueGene/Q will be even faster

Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation, 2009.
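The bytes-per-flop figures on the "Bottlenecks: Memory IO" slide follow directly from each system's memory bandwidth divided by its peak compute rate. A quick sketch of that arithmetic, using the numbers from the slide (values rounded as on the slide; the L2 bandwidth is quoted only as "> 300 GB/s"):

```python
# Bytes-per-flop balance for the systems on the "Bottlenecks: Memory IO" slide.
# A streaming stencil kernel that needs more bytes per flop than the hardware
# supplies is memory-bound, which is why the BG/Q L2 figure matters for RTM.
systems = {
    # name: (memory bandwidth in GB/s, peak compute in GF/s)
    "GPU":     (100.0, 1000.0),
    "BG/P":    (13.6,  13.6),
    "BG/Q":    (43.0,  204.8),
    "BG/Q L2": (300.0, 204.8),  # slide quotes "> 300 GB/s", so this is a floor
}

for name, (bw_gbs, peak_gfs) in systems.items():
    bytes_per_flop = bw_gbs / peak_gfs
    print(f"{name:8s} {bytes_per_flop:.2f} B/F")
```

The computed ratios (0.10, 1.00, 0.21, 1.46) match the slide's rounded values of 0.1, 1.0, 0.2, and 1.5 B/F.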
HPCS'09.

**Reverse Time Migration (RTM)**

[Diagram: marine seismic acquisition — a ship tows a receiver array (~1 km deep, ~5 km long); source data and receiver data are recorded for each shot]

**RTM - Reverse Time Migration**

- Use the 3D wave equation to model sound propagation in the Earth
- Forward propagation from the source; reverse propagation from the receivers
- Imaging condition combines the two wavefields

**Implementing the Wave Equation**

- Finite difference in time
- Finite difference in space
- Absorbing boundary conditions, interpolation, compression, etc.

**RTM Algorithm (for each shot)**

[Diagram: timeline of forward F(t), reverse R(t), and imaging I(t) steps at t = N, 2N, 3N, ..., kN]

- Load data
  - Velocity model v(x,y,z)
  - Source & receiver data
- Forward propagation
  - Calculate P(x,y,z,t)
  - Every N timesteps
    - Compress P(x,y,z,t)
    - Write P(x,y,z,t) to disk/memory
- Backward propagation
  - Calculate P(x,y,z,t)
  - Every N timesteps
    - Read P(x,y,z,t) from disk/memory
    - Decompress P(x,y,z,t)
    - Calculate partial sum of I(x,y,z)
- Merge I(x,y,z) with global image

**Embarrassingly Parallel RTM**

[Diagram: master node sends a subset of the model for each shot (~100k+ shots) from the data archive to slave nodes, each with its own scratch disk]

- Process shots in parallel, one per slave node
- Scratch disk bottleneck

**Domain-Partitioned Multisource RTM**

[Diagram: master node distributes the partitioned model from the data archive across slave nodes; no local disks]

- Process all data at once with domain decomposition
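The finite-difference scheme outlined in "Implementing the Wave Equation" — second order in time, a Laplacian stencil in space — can be illustrated in a few lines. This is a minimal NumPy sketch, not the deck's actual kernel: the grid size, the simple 7-point spatial stencil, the impulse source, and the omission of absorbing boundaries are all illustrative choices.

```python
import numpy as np

def step_wave(p_prev, p_curr, vel, dt, dx):
    """One explicit time step of the 3D acoustic wave equation:
    p_next = 2*p_curr - p_prev + (vel*dt/dx)^2 * laplacian(p_curr)."""
    # 7-point Laplacian stencil on the interior points (undivided differences)
    lap = -6.0 * p_curr[1:-1, 1:-1, 1:-1]
    lap += p_curr[:-2, 1:-1, 1:-1] + p_curr[2:, 1:-1, 1:-1]
    lap += p_curr[1:-1, :-2, 1:-1] + p_curr[1:-1, 2:, 1:-1]
    lap += p_curr[1:-1, 1:-1, :-2] + p_curr[1:-1, 1:-1, 2:]
    p_next = np.copy(p_curr)  # boundaries left untouched: no absorbing BC here
    c2 = (vel[1:-1, 1:-1, 1:-1] * dt / dx) ** 2
    p_next[1:-1, 1:-1, 1:-1] = (2.0 * p_curr[1:-1, 1:-1, 1:-1]
                                - p_prev[1:-1, 1:-1, 1:-1] + c2 * lap)
    return p_next

# Tiny demo: constant-velocity cube with an impulse at the centre.
n, dx, dt = 32, 10.0, 1e-3                  # 10 m cells, 1 ms steps (CFL ~0.15)
vel = np.full((n, n, n), 1500.0)            # 1.5 km/s, the slide's minimum V
p_prev = np.zeros((n, n, n))
p_curr = np.zeros((n, n, n))
p_curr[n // 2, n // 2, n // 2] = 1.0        # impulse source
for _ in range(10):
    p_prev, p_curr = p_curr, step_wave(p_prev, p_curr, vel, dt, dx)
```

A production RTM kernel adds higher-order stencils, absorbing boundary conditions, and the compression/IO steps listed in the algorithm above.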
- Small partitions mean the forward wavefield can be stored locally: no disks
- Shots merged and model partitioned

**Multisource RTM**

[Diagram: full velocity model vs. the velocity subset seen by one shot, with source and receiver data]

- Linear superposition principle, so N sources can be merged
- The finite receiver array acts as a nonlinear filter on the data
- Nonlinearity leads to "crosstalk" noise, which needs to be minimized
- Accelerates RTM by a factor of N

**3D RTM Scaling (Partial optimization)**

- 512x512x512 & 1024x1024x1024 models
- Scaling improves for larger models

**GPU Scaling is Comparatively Poor (Tsubame supercomputer, Japan)**

- GPUs achieve only 10% of peak performance (100x increase for 1000 nodes)

Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November, 2010.

**Physical survey size mapped to BG/Q L2 cache**

[Chart: number of racks vs. maximum imaging frequency for (512)^3 m^3, (4096)^3 m^3, and (16384)^3 m^3 survey volumes]

- Isotropic RTM with minimum V = 1.5 km/s
- 10 points per wavelength (5 would reduce the numbers by 8x)
- Mapping the entire survey volume, not a subset (enables multisource)

**Snapshot Data Easily Fits in Memory (No disk required)**

[Table: number of uncompressed snapshots that can be stored for various model sizes and numbers of nodes]
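The superposition principle behind multisource RTM can be checked directly: the wave equation is linear in the pressure field, so propagating the sum of N sources yields the sum of the individually propagated wavefields. A minimal 1D sketch (the grid size, Courant number, and source positions are arbitrary illustrative choices):

```python
import numpy as np

def propagate(src, nsteps, c=0.3):
    """Propagate a 1D initial pressure impulse with a simple explicit
    finite-difference scheme; c is the Courant number (v*dt/dx)."""
    p_prev, p_curr = np.zeros_like(src, dtype=float), src.astype(float)
    for _ in range(nsteps):
        lap = np.zeros_like(p_curr)
        lap[1:-1] = p_curr[:-2] - 2.0 * p_curr[1:-1] + p_curr[2:]
        p_prev, p_curr = p_curr, 2.0 * p_curr - p_prev + c**2 * lap
    return p_curr

n = 200
src_a = np.zeros(n); src_a[50] = 1.0    # shot A
src_b = np.zeros(n); src_b[150] = 1.0   # shot B

separate = propagate(src_a, 40) + propagate(src_b, 40)   # two runs
merged = propagate(src_a + src_b, 40)                    # one multisource run

# Linearity: one merged run reproduces the sum of the individual runs.
assert np.allclose(separate, merged)
```

The "crosstalk" noted on the slide arises later, in the nonlinear imaging condition applied to the finite receiver array, not in the propagation itself.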
- 4x more capacity for BG/Q

**Comparison**

- Embarrassingly parallel RTM
  - Coarse-grain communication
  - Coarse-grain synchronization
  - Disk IO bottleneck
- Partitioned RTM
  - Fine-grain communication
  - Fine-grain synchronization
  - No scratch disk
  - Needs low latency and high bandwidth: Blue Gene

**Conclusion: RTM Can Be Dramatically Accelerated**

- Algorithmic
  - Adopt partitioned, multisource RTM
  - Abandon embarrassingly parallel implementations
- Hardware
  - Increase communication bandwidth
  - Decrease communication latency
  - Reduce node nondeterminism
- Advantages
  - Can process larger models; scales well
  - Avoids the scratch disk IO bottleneck
  - Improves RAS & MTBF: no disk means no moving parts
- Disadvantages
  - Must handle shot "crosstalk" noise
  - Methods exist; research continuing
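The snapshot-capacity claim above reduces to back-of-envelope arithmetic: aggregate node memory divided by the size of one uncompressed wavefield snapshot. A sketch assuming 4-byte floats and BG/Q's 16 GB per node with 1024 nodes per rack (the per-node figure ignores memory reserved for the OS and the application itself, so real capacity is somewhat lower):

```python
def snapshots_in_memory(nx, ny, nz, nodes, gib_per_node=16, bytes_per_value=4):
    """Number of uncompressed wavefield snapshots that fit in aggregate RAM."""
    snapshot_bytes = nx * ny * nz * bytes_per_value     # one P(x,y,z) snapshot
    total_bytes = nodes * gib_per_node * 2**30          # aggregate node memory
    return total_bytes // snapshot_bytes

# e.g. a 1024^3 model distributed over one 1024-node BG/Q rack
print(snapshots_in_memory(1024, 1024, 1024, nodes=1024))  # -> 4096
```

With thousands of snapshots fitting in memory, the forward wavefield never touches disk, which is the basis of the "no moving parts" RAS/MTBF advantage in the conclusion.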