RTM at Petascale and Beyond Michael Perrone IBM Master Inventor Computational Sciences Center, IBM Research
RTM (Reverse Time Migration) Seismic Imaging on BGQ • RTM is a widely used imaging technique for oil and gas exploration, particularly in subsalt regions • Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico • Imaging subsalt regions of the Earth is extremely challenging • Industry anticipates exascale need by 2020
Bottom Line: Seismic Imaging • We can make RTM 10 to 100 times faster • How? • Abandon embarrassingly parallel RTM • Use domain-partitioned, multisource RTM • System requirements • High communication BW • Low communication latency • Lots of memory • Can be extended equally well to FWI (full waveform inversion)
Take Home Messages • Embarrassingly parallel is not always the best approach • It is crucial to know where bottlenecks exist • Algorithmic changes can dramatically improve performance
Compute performance on new hardware • Kernel performance improvement [Figure: run time for old hardware, new hardware 1, and new hardware 2]
Compute performance on new hardware • Need to track end-to-end performance [Figure: run time, including disk IO, for old hardware, new hardware 1, and new hardware 2]
Bottlenecks: Memory IO • GPU: 0.1 B/F • 100 GB/s • 1 TF/s • BG/P: 1.0 B/F • 13.6 GB/s • 13.6 GF/s • BG/Q: 0.2 B/F • 43 GB/s • 204.8 GF/s • BG/Q L2: 1.5 B/F • > 300 GB/s • 204.8 GF/s
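The B/F (bytes per flop) figures on this slide are memory bandwidth divided by peak compute rate. A minimal sketch that reproduces the ratios; the function and dictionary names are illustrative, and the "> 300 GB/s" L2 figure is taken as exactly 300 GB/s for the arithmetic:

```python
def bytes_per_flop(bandwidth_gb_s, peak_gflops):
    """Memory bandwidth (GB/s) divided by peak compute rate (GF/s)."""
    return bandwidth_gb_s / peak_gflops

# (bandwidth in GB/s, peak in GF/s), values copied from the slide
systems = {
    "GPU": (100.0, 1000.0),
    "BG/P": (13.6, 13.6),
    "BG/Q": (43.0, 204.8),
    "BG/Q L2": (300.0, 204.8),  # slide says "> 300 GB/s"; 300 assumed here
}

ratios = {name: bytes_per_flop(bw, gf) for name, (bw, gf) in systems.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} B/F")
```

A stencil kernel that needs more bytes per flop than the hardware supplies is limited by memory bandwidth rather than by peak compute.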
GPUs for Seismic Imaging? • x86/GPU [old results, 2x now] • 17B Stencils / Second • nVidia / INRIA collaboration • Velocity model: 560x560x905 • Iterations: 22760 • BlueGene/P • 40B Stencils / Second • Comparable model size/complexity • Partial optimization • MPI not overlapped • Kernel optimization ongoing • BlueGene/Q will be even faster Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation (HPCS'09).
Reverse Time Migration (RTM) [Figure: one shot, with labels Ship, Source Data, and Receiver Data, and scale marks of ~1 km and ~5 km]
RTM - Reverse Time Migration • Use 3D wave equation to model sound in Earth • Forward (Source): • Reverse (Receiver): • Imaging Condition
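The equations for this slide did not survive in the text. As a hedged reference, the standard constant-density acoustic form is sketched below. The symbols $P_S$, $P_R$, $S$, and $R$ are notation chosen here and may differ from the original slide; $v(x,y,z)$ is the velocity model.

```latex
% Forward (source) wavefield, driven by the source term S
\left[ \frac{1}{v^2(x,y,z)} \frac{\partial^2}{\partial t^2} - \nabla^2 \right] P_S(x,y,z,t) = S(x,y,z,t)

% Reverse (receiver) wavefield, driven by the recorded data R injected in reverse time
\left[ \frac{1}{v^2(x,y,z)} \frac{\partial^2}{\partial t^2} - \nabla^2 \right] P_R(x,y,z,t) = R(x,y,z,t)

% Zero-lag cross-correlation imaging condition
I(x,y,z) = \sum_{t} P_S(x,y,z,t) \, P_R(x,y,z,t)
```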
Implementing the Wave Equation • Finite difference in time: • Finite difference in space: • Absorbing boundary conditions, interpolation, compression, etc.
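The finite-difference formulas for this slide did not survive in the text. A minimal sketch of one explicit timestep, assuming second-order differences in both time and space on a 1D grid with fixed ends; production RTM kernels use higher-order 3D stencils and absorbing boundaries, and every name below is illustrative:

```python
import numpy as np

def step(p_prev, p_curr, v, dt, dx):
    """Advance the 1D acoustic wavefield by one timestep (leapfrog scheme)."""
    lap = np.zeros_like(p_curr)
    # Second-order central difference in space
    lap[1:-1] = (p_curr[2:] - 2.0 * p_curr[1:-1] + p_curr[:-2]) / dx**2
    # Second-order central difference in time, solved for the next wavefield
    p_next = 2.0 * p_curr - p_prev + (v * dt) ** 2 * lap
    p_next[0] = p_next[-1] = 0.0  # fixed ends; no absorbing boundary in this sketch
    return p_next

nx, dx, dt = 201, 10.0, 0.001   # grid size, spacing (m), timestep (s): assumed values
v = np.full(nx, 1500.0)         # constant velocity (m/s), so v*dt/dx = 0.15
p_prev = np.zeros(nx)
p_curr = np.zeros(nx)
p_curr[nx // 2] = 1.0           # initial disturbance at the grid center
for _ in range(50):
    p_prev, p_curr = p_curr, step(p_prev, p_curr, v, dt, dx)
```

The ratio v*dt/dx must stay below the stability limit of the scheme (1 for this 1D stencil), which is why the timestep shrinks as the grid is refined.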
RTM Algorithm (for each shot) • Load data • Velocity model v(x,y,z) • Source & Receiver data • Forward propagation • Calculate P(x,y,z,t) • Every N timesteps • Compress P(x,y,z,t) • Write P(x,y,z,t) to disk/memory • Backward propagation • Calculate P(x,y,z,t) • Every N timesteps • Read P(x,y,z,t) from disk/memory • Decompress P(x,y,z,t) • Calculate partial sum of I(x,y,z) • Merge I(x,y,z) with global image [Figure: at t = N, 2N, 3N, ..., kN, the forward wavefield F(t) and reverse wavefield R(t) are combined into the image contribution I(t)]
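The per-shot loop can be sketched as follows. This is a toy 1D version under stated assumptions: snapshots are kept uncompressed in a Python dict instead of being compressed and written to disk, the receiver trace is a synthetic stand-in, and all function and variable names are hypothetical.

```python
import numpy as np

def step(p_prev, p_curr, v, dt, dx):
    """One second-order finite-difference timestep; grid ends stay fixed."""
    lap = np.zeros_like(p_curr)
    lap[1:-1] = (p_curr[2:] - 2.0 * p_curr[1:-1] + p_curr[:-2]) / dx**2
    return 2.0 * p_curr - p_prev + (v * dt) ** 2 * lap

def rtm_one_shot(v, src, rec_data, src_ix, rec_ix, dt, dx, N):
    nt, nx = len(src), len(v)
    # Forward propagation: store the source wavefield every N timesteps
    snapshots = {}
    p_prev, p_curr = np.zeros(nx), np.zeros(nx)
    for t in range(nt):
        p_next = step(p_prev, p_curr, v, dt, dx)
        p_next[src_ix] += src[t]
        p_prev, p_curr = p_curr, p_next
        if t % N == 0:
            snapshots[t] = p_curr.copy()   # compression step omitted here
    # Backward propagation: inject receiver data in reverse time order
    image = np.zeros(nx)
    q_prev, q_curr = np.zeros(nx), np.zeros(nx)
    for t in reversed(range(nt)):
        q_next = step(q_prev, q_curr, v, dt, dx)
        q_next[rec_ix] += rec_data[t]
        q_prev, q_curr = q_curr, q_next
        if t % N == 0:
            image += snapshots[t] * q_curr  # partial sum of I
    return image

nx, dx, dt, nt, N = 101, 10.0, 0.001, 400, 10       # assumed toy parameters
v = np.full(nx, 1500.0)
t_axis = np.arange(nt)
src = np.exp(-((t_axis - 40) / 10.0) ** 2)           # hypothetical source pulse
rec_data = 0.5 * np.exp(-((t_axis - 140) / 10.0) ** 2)  # stand-in for a recorded trace
image = rtm_one_shot(v, src, rec_data, src_ix=20, rec_ix=80, dt=dt, dx=dx, N=N)
```

In a full implementation the returned image would be merged into the global image, and the snapshot store is the piece that moves from scratch disk to local memory when the model is partitioned.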
Embarrassingly Parallel RTM • Process shots in parallel, one per slave node • Subset of model for each shot (~100k+ shots) • Scratch disk bottleneck [Diagram elements: Data Archive (Disk), Model, Master Node, Slave Nodes, Disks]
Domain-Partitioned Multisource RTM • Process all data at once with domain decomposition • Shots merged and model partitioned • Small partitions mean forward wave can be stored locally: No disks [Diagram elements: Data Archive (Disk), Model, Master Node, Slave Nodes, Disks]
Multisource RTM • Linear superposition principle • So N sources can be merged • Accelerate by factor of N • Finite receiver array acts as nonlinear filter on data • Nonlinearity leads to “crosstalk” noise which needs to be minimized [Figure labels: Full Velocity Model, Velocity Subset, Source, Receiver data]
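The superposition argument can be checked numerically: because the wave equation is linear, propagating two sources together gives the same wavefield as propagating each alone and adding the results. A minimal 1D sketch with assumed toy parameters and hypothetical names; it illustrates only the linear part, not the crosstalk introduced by the finite receiver array:

```python
import numpy as np

def propagate(v, sources, dt, dx, nt):
    """Propagate a 1D wavefield; `sources` maps grid index -> source time series."""
    nx = len(v)
    p_prev, p_curr = np.zeros(nx), np.zeros(nx)
    for t in range(nt):
        lap = np.zeros(nx)
        lap[1:-1] = (p_curr[2:] - 2.0 * p_curr[1:-1] + p_curr[:-2]) / dx**2
        p_next = 2.0 * p_curr - p_prev + (v * dt) ** 2 * lap
        for ix, wavelet in sources.items():
            p_next[ix] += wavelet[t]
        p_prev, p_curr = p_curr, p_next
    return p_curr

nx, dx, dt, nt = 201, 10.0, 0.001, 300
v = np.linspace(1500.0, 3000.0, nx)                  # velocity increasing across the grid
t_axis = np.arange(nt)
w_a = np.exp(-((t_axis - 30) / 8.0) ** 2)            # source A pulse
w_b = -0.7 * np.exp(-((t_axis - 60) / 8.0) ** 2)     # source B pulse

p_a = propagate(v, {50: w_a}, dt, dx, nt)            # shot A alone
p_b = propagate(v, {150: w_b}, dt, dx, nt)           # shot B alone
p_ab = propagate(v, {50: w_a, 150: w_b}, dt, dx, nt) # merged shot

print(np.allclose(p_ab, p_a + p_b))
```

One merged propagation replaces two separate ones here; with N merged sources the saving in propagation work is the factor of N quoted on the slide.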
3D RTM Scaling (Partial optimization) • 512x512x512 & 1024x1024x1024 models • Scaling improves for larger models
GPU Scaling is Comparatively Poor • Tsubame supercomputer, Japan • GPUs achieve only 10% of peak performance (100x increase for 1000 nodes) Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November 2010.
Physical survey size mapped to BG/Q L2 cache • Isotropic RTM with minimum V = 1.5 km/s • 10 points per wavelength (5 would reduce number below by 8x) • Mapping entire survey volume, not a subset (enables multisource) [Table: # of racks and max imaging frequency for survey volumes of (512)^3 m^3, (4096)^3 m^3, and (16384)^3 m^3; cell values not preserved]
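The relationship between imaging frequency, points per wavelength, and grid size can be sketched as below. The shortest wavelength is v_min / f_max, so the grid spacing is that wavelength divided by the points per wavelength. The 30 Hz frequency is a hypothetical example value (the table's entries are not available), and the function name is illustrative:

```python
def grid_points(side_m, v_min_m_s, f_max_hz, points_per_wavelength):
    """Number of grid points in a cubic survey volume of side `side_m`."""
    min_wavelength = v_min_m_s / f_max_hz            # shortest wavelength to resolve (m)
    dx = min_wavelength / points_per_wavelength      # grid spacing (m)
    return (side_m / dx) ** 3

v_min = 1500.0   # minimum velocity from the slide, in m/s
f_max = 30.0     # hypothetical maximum imaging frequency, in Hz

n10 = grid_points(4096.0, v_min, f_max, 10)
n5 = grid_points(4096.0, v_min, f_max, 5)
print(f"10 points/wavelength: {n10:.3e} grid points")
print(f" 5 points/wavelength: {n5:.3e} grid points")
print(f"ratio: {n10 / n5:.1f}")
```

Halving the points per wavelength doubles the spacing in each of three dimensions, which is the 8x reduction the slide mentions.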
Snapshot Data Easily Fits in Memory (No disk required) • # of uncompressed snapshots that can be stored for various model sizes and number of nodes. • 4x more capacity for BGQ
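A rough way to estimate the snapshot capacity is to divide aggregate node memory by the size of one uncompressed snapshot. The sketch below assumes 4-byte values, 4 GB per node for a BG/P-like system, and 16 GB per node for a BG/Q-like system; these memory sizes are assumptions chosen to match the 4x ratio on the slide, and the estimate ignores memory needed for the velocity model, working wavefields, and the OS:

```python
def snapshots_in_memory(n_nodes, gb_per_node, nx, ny, nz, bytes_per_value=4):
    """Upper bound on uncompressed snapshots that fit in aggregate memory."""
    total_bytes = n_nodes * gb_per_node * 1024**3
    snapshot_bytes = nx * ny * nz * bytes_per_value
    return total_bytes // snapshot_bytes

nodes = 1024  # example node count
for label, gb in (("BG/P-like (4 GB/node, assumed)", 4),
                  ("BG/Q-like (16 GB/node, assumed)", 16)):
    for n in (512, 1024):
        count = snapshots_in_memory(nodes, gb, n, n, n)
        print(f"{label}, {n}^3 model, {nodes} nodes: {count} snapshots")
```

Under these assumptions a 1024^3 model occupies 4 GiB per snapshot, so 1024 nodes hold on the order of a thousand snapshots or more without touching disk.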
Comparison • Embarrassingly parallel RTM • Coarse-grain communication • Coarse-grain synchronization • Disk IO Bottleneck • Partitioned RTM • Fine-grain communication • Fine-grain synchronization • No scratch disk • Low latency, high bandwidth: Blue Gene
Conclusion: RTM can be dramatically accelerated • Algorithmic: • Adopt partitioned, multisource RTM • Abandon embarrassingly parallel implementations • Hardware: • Increase communication bandwidth • Decrease communication latency • Reduce node nondeterminism • Advantages • Can process larger models - scales well • Avoids scratch disk IO bottleneck • Improves RAS & MTBF: No disk means no moving parts • Disadvantages • Must handle shot “crosstalk” noise • Methods exist - research continuing…