RTM at Petascale and Beyond

Michael Perrone, IBM Master Inventor, Computational Sciences Center, IBM Research

RTM (Reverse Time Migration) Seismic Imaging on BG/Q • RTM is a widely used imaging technique for oil and gas exploration, particularly in subsalt regions • Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico • Imaging subsalt regions of the Earth is extremely challenging • Industry anticipates a need for exascale computing by 2020

Bottom Line: Seismic Imaging
• We can make RTM 10 to 100 times faster
• How?
  • Abandon embarrassingly parallel RTM
  • Use domain-partitioned, multisource RTM
• System requirements
  • High communication bandwidth
  • Low communication latency
  • Lots of memory
• Can be extended equally well to FWI (full waveform inversion)

Take Home Messages • Embarrassingly parallel is not always the best approach • It is crucial to know where bottlenecks exist • Algorithmic changes can dramatically improve performance

Compute performance on new hardware • Kernel performance improvement [Figure: run time on old hardware, new hardware 1, and new hardware 2]

Compute performance on new hardware • Need to track end-to-end performance [Figure: run time, including disk IO, on old hardware, new hardware 1, and new hardware 2]

Bottlenecks: Memory IO
System     Bytes/flop (B/F)   Bandwidth     Peak compute
GPU        0.1                100 GB/s      1 TF/s
BG/P       1.0                13.6 GB/s     13.6 GF/s
BG/Q       0.2                43 GB/s       204.8 GF/s
BG/Q L2    1.5                > 300 GB/s    204.8 GF/s
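The B/F column is simply bandwidth divided by peak compute. A minimal sketch that recomputes it from the slide's own numbers (the dictionary layout and function name are illustrative, and 300 GB/s is used as a lower bound for the "> 300 GB/s" L2 figure):

```python
# Bandwidth (GB/s) and peak compute (GF/s) as quoted on the slide.
systems = {
    "GPU": (100.0, 1000.0),
    "BG/P": (13.6, 13.6),
    "BG/Q": (43.0, 204.8),
    "BG/Q L2": (300.0, 204.8),  # slide says "> 300 GB/s"; 300 is a lower bound
}


def bytes_per_flop(bandwidth_gb_s, peak_gflop_s):
    """Bytes that can be moved per floating-point operation at peak rates."""
    return bandwidth_gb_s / peak_gflop_s


ratios = {name: bytes_per_flop(bw, peak) for name, (bw, peak) in systems.items()}
for name, ratio in ratios.items():
    print(f"{name:8s} {ratio:.2f} B/F")
```

A stencil kernel that needs more bytes per flop than the system can deliver is limited by memory bandwidth rather than by compute, which is why the ratio matters more here than the peak flop rate.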

GPUs for Seismic Imaging?
• x86/GPU [old results, 2x now]
  • 17B stencils/second
  • nVidia / INRIA collaboration
  • Velocity model: 560x560x905
  • Iterations: 22760
• BlueGene/P
  • 40B stencils/second
  • Comparable model size/complexity
  • Partial optimization
  • MPI not overlapped
  • Kernel optimization on-going
• BlueGene/Q will be even faster
Reference: Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation, 2009 (HPCS'09).

Reverse Time Migration (RTM)
[Figure: acquisition geometry for 1 shot, showing the ship, the source data, and the receiver data, with scale labels of ~1 km and ~5 km]

RTM - Reverse Time Migration • Use 3D wave equation to model sound in Earth • Forward (Source): Reverse (Receiver): • Imaging Condition
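The equations on this slide were images and are not present in the transcript. As a hedged reconstruction, assuming the constant-density isotropic acoustic wave equation and a zero-lag cross-correlation imaging condition (both common choices, but not confirmed by the slide), the three pieces could be written as follows. The symbols P_S, P_R, S, and R are assumptions introduced here.

```latex
% Forward (source) wavefield, driven by the source wavelet S:
\left[ \frac{1}{v^2(x,y,z)} \frac{\partial^2}{\partial t^2} - \nabla^2 \right] P_S(x,y,z,t) = S(x,y,z,t)

% Reverse (receiver) wavefield, driven by the recorded data R and propagated backward in time:
\left[ \frac{1}{v^2(x,y,z)} \frac{\partial^2}{\partial t^2} - \nabla^2 \right] P_R(x,y,z,t) = R(x,y,z,t)

% Imaging condition: zero-lag cross-correlation of the two wavefields:
I(x,y,z) = \sum_{t} P_S(x,y,z,t) \, P_R(x,y,z,t)
```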

Implementing the Wave Equation • Finite difference in time: • Finite difference in space: • Absorbing boundary conditions, interpolation, compression, etc.
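The finite-difference formulas themselves are missing from the transcript. The sketch below assumes the simplest explicit scheme, second order in both time and space (a 7-point stencil); production RTM codes typically use higher-order stencils, and the slide does not say which order was used. Absorbing boundaries, interpolation, and compression are left out, and all names and grid values are illustrative.

```python
def laplacian_3d(p, n, h):
    """7-point (second-order) Laplacian on an n*n*n grid stored as a flat list.

    Boundary cells are left at zero; the absorbing boundaries mentioned on
    the slide are not modelled here.
    """
    lap = [0.0] * (n * n * n)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                c = (i * n + j) * n + k
                lap[c] = (p[c - n * n] + p[c + n * n]
                          + p[c - n] + p[c + n]
                          + p[c - 1] + p[c + 1]
                          - 6.0 * p[c]) / (h * h)
    return lap


def step(p_prev, p_curr, v, dt, h, n):
    """Leapfrog update: P(t+dt) = 2 P(t) - P(t-dt) + (v dt)^2 * Laplacian(P(t))."""
    lap = laplacian_3d(p_curr, n, h)
    return [2.0 * pc - pp + (vc * dt) ** 2 * l
            for pp, pc, vc, l in zip(p_prev, p_curr, v, lap)]


# Hypothetical toy setup: 9^3 grid, 10 m spacing, 1 ms step, constant 1500 m/s.
n, h, dt = 9, 10.0, 0.001
v = [1500.0] * (n ** 3)
center = (4 * n + 4) * n + 4
p_prev = [0.0] * (n ** 3)
p_curr = [0.0] * (n ** 3)
p_curr[center] = 1.0  # unit impulse at the grid centre
p_next = step(p_prev, p_curr, v, dt, h, n)
print(p_next[center], p_next[center + 1])
```

Each output cell reads its six neighbours and itself, so the kernel does little arithmetic per byte loaded; this is the access pattern behind the "stencils per second" figures quoted earlier.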

RTM Algorithm (for each shot)
• Load data
  • Velocity model v(x,y,z)
  • Source & receiver data
• Forward propagation
  • Calculate P(x,y,z,t)
  • Every N timesteps
    • Compress P(x,y,z,t)
    • Write P(x,y,z,t) to disk/memory
• Backward propagation
  • Calculate P(x,y,z,t)
  • Every N timesteps
    • Read P(x,y,z,t) from disk/memory
    • Decompress P(x,y,z,t)
    • Calculate partial sum of I(x,y,z)
• Merge I(x,y,z) with global image
[Figure: at t = N, 2N, 3N, ..., kN the forward wavefield F(t) and the reverse wavefield R(t) are combined into the image contribution I(t)]
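The per-shot loop can be sketched as a toy in one dimension. This is an illustration under stated assumptions, not the presenter's implementation: it uses a second-order finite-difference step, keeps snapshots uncompressed in memory, has no absorbing boundaries, and the receiver trace is a hypothetical placeholder rather than recorded data. All function and variable names are illustrative.

```python
import math


def step(p_prev, p_curr, v, dt, h):
    """One second-order finite-difference timestep of the 1D wave equation."""
    n = len(p_curr)
    p_next = [0.0] * n
    for i in range(1, n - 1):
        lap = (p_curr[i - 1] - 2.0 * p_curr[i] + p_curr[i + 1]) / (h * h)
        p_next[i] = 2.0 * p_curr[i] - p_prev[i] + (v[i] * dt) ** 2 * lap
    return p_next


def rtm_single_shot(v, source, receiver_trace, src_ix, rec_ix, dt, h, snap_every):
    """Toy RTM for one shot; returns (image, number of stored snapshots)."""
    n, nt = len(v), len(source)

    # Forward propagation: keep a snapshot of P every snap_every timesteps.
    snapshots = {}
    p_prev, p_curr = [0.0] * n, [0.0] * n
    for t in range(1, nt + 1):
        p_next = step(p_prev, p_curr, v, dt, h)
        p_next[src_ix] += source[t - 1] * dt * dt
        p_prev, p_curr = p_curr, p_next
        if t % snap_every == 0:
            snapshots[t] = list(p_curr)  # stored in memory, uncompressed

    # Backward propagation: inject receiver data in reverse time order and
    # accumulate the partial image sum at the snapshot times.
    image = [0.0] * n
    r_prev, r_curr = [0.0] * n, [0.0] * n
    for t in range(nt, 0, -1):
        r_next = step(r_prev, r_curr, v, dt, h)
        r_next[rec_ix] += receiver_trace[t - 1] * dt * dt
        r_prev, r_curr = r_curr, r_next
        if t in snapshots:
            fwd = snapshots[t]
            for i in range(n):
                image[i] += fwd[i] * r_curr[i]
    return image, len(snapshots)


def ricker(t, f0=15.0, t0=0.1):
    """Ricker wavelet with peak frequency f0 (Hz), centred at t0 (s)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


# Hypothetical toy setup: 101 cells, 10 m spacing, 400 steps of 1 ms.
n, nt, dt, h = 101, 400, 0.001, 10.0
v = [1500.0] * n
source = [ricker(k * dt) for k in range(nt)]
receiver_trace = [ricker(k * dt, t0=0.25) for k in range(nt)]  # placeholder data
image, n_snaps = rtm_single_shot(v, source, receiver_trace, 20, 60, dt, h, 10)
print(n_snaps, max(abs(x) for x in image))
```

The snapshot dictionary is the part that grows with model size; whether it lives on scratch disk or in memory is the distinction the rest of the deck turns on.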

Embarrassingly Parallel RTM
• Process shots in parallel, one per slave node
• Subset of model for each shot (~100k+ shots)
• Scratch disk bottleneck
[Figure: a master node with the model and the data archive (disk) feeding multiple slave nodes, each with its own disk]

Domain-Partitioned Multisource RTM
• Process all data at once with domain decomposition
• Shots merged and model partitioned
• Small partitions mean forward wave can be stored locally: no disks
[Figure: a master node with the model and the data archive (disk) feeding multiple slave nodes and their disks]
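Domain decomposition works because a finite-difference stencil only needs a thin layer of neighbouring cells from adjacent partitions. The following sketch, a 1D toy with one ghost cell per side and plain list copies standing in for MPI messages, checks that a partitioned run reproduces the single-domain result. The function names and parameters are illustrative, and `c2` denotes the assumed constant (v dt / h)^2.

```python
import math


def step(p_prev, p_curr, c2):
    """One leapfrog step on a padded 1D array; cells 0 and -1 are ghost cells."""
    p_next = [0.0] * len(p_curr)
    for i in range(1, len(p_curr) - 1):
        lap = p_curr[i - 1] - 2.0 * p_curr[i] + p_curr[i + 1]
        p_next[i] = 2.0 * p_curr[i] - p_prev[i] + c2 * lap
    return p_next


def exchange_halos(parts):
    """Copy each partition's edge cells into its neighbours' ghost cells."""
    for left, right in zip(parts, parts[1:]):
        left[-1] = right[1]
        right[0] = left[-2]


def run_partitioned(p0, n_parts, n_steps, c2):
    """Propagate with the grid split into n_parts pieces (len(p0) must divide evenly)."""
    m = len(p0) // n_parts
    curr = [[0.0] + p0[k * m:(k + 1) * m] + [0.0] for k in range(n_parts)]
    prev = [list(c) for c in curr]
    for _ in range(n_steps):
        exchange_halos(curr)  # the only inter-partition communication
        nxt = [step(pp, pc, c2) for pp, pc in zip(prev, curr)]
        prev, curr = curr, nxt
    return [x for c in curr for x in c[1:-1]]


def run_single(p0, n_steps, c2):
    """Reference run on the undivided grid."""
    curr = [0.0] + list(p0) + [0.0]
    prev = list(curr)
    for _ in range(n_steps):
        curr, prev = step(prev, curr, c2), curr
    return curr[1:-1]


# Hypothetical toy setup: 64 cells in 4 partitions, Gaussian pulse on a boundary.
n, n_parts, n_steps, c2 = 64, 4, 50, 0.25
p0 = [math.exp(-((i - 32) / 4.0) ** 2) for i in range(n)]
single = run_single(p0, n_steps, c2)
partitioned = run_partitioned(p0, n_parts, n_steps, c2)
max_diff = max(abs(a - b) for a, b in zip(single, partitioned))
print(max_diff)
```

The exchange happens once per timestep and moves very little data, so its cost is dominated by latency, which is consistent with the deck's emphasis on low-latency, high-bandwidth interconnects.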

Multisource RTM
• Linear superposition principle
  • So N sources can be merged
• Finite receiver array acts as nonlinear filter on data
• Nonlinearity leads to “crosstalk” noise which needs to be minimized
• Accelerate by factor of N
[Figure: the full velocity model and a velocity subset, with the source and the receiver data]
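The superposition claim can be checked numerically: because the wave equation is linear, one propagation with two sources active gives the same wavefield as the sum of two separate propagations. The sketch below is a 1D toy with illustrative names and parameters; it demonstrates only the forward-propagation step and does not model the receiver-side crosstalk.

```python
import math


def propagate(sources, n, n_steps, c2):
    """Propagate a 1D wavefield driven by sources {cell index: wavelet samples}."""
    prev, curr = [0.0] * n, [0.0] * n
    for t in range(n_steps):
        nxt = [0.0] * n
        for i in range(1, n - 1):
            lap = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
            nxt[i] = 2.0 * curr[i] - prev[i] + c2 * lap
        for ix, wavelet in sources.items():
            nxt[ix] += wavelet[t]  # inject each active source
        prev, curr = curr, nxt
    return curr


# Hypothetical toy setup: two shots at different positions and delays.
n, n_steps, c2 = 100, 80, 0.25
wavelet_a = [math.exp(-((t - 10) / 3.0) ** 2) for t in range(n_steps)]
wavelet_b = [math.exp(-((t - 25) / 3.0) ** 2) for t in range(n_steps)]

field_a = propagate({20: wavelet_a}, n, n_steps, c2)
field_b = propagate({70: wavelet_b}, n, n_steps, c2)
field_merged = propagate({20: wavelet_a, 70: wavelet_b}, n, n_steps, c2)

max_err = max(abs(m - (a + b)) for m, a, b in zip(field_merged, field_a, field_b))
print(max_err)
```

The merged run costs one propagation instead of two, which is where the factor-of-N acceleration for N merged sources comes from.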

3D RTM Scaling (Partial optimization) • 512x512x512 & 1024x1024x1024 models • Scaling improves for larger models

GPU Scaling is Comparatively Poor
• Tsubame supercomputer, Japan
• GPUs achieve only 10% of peak performance (100x increase for 1000 nodes)
Reference: Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November 2010.

Physical survey size mapped to BG/Q L2 cache
• Isotropic RTM with minimum V = 1.5 km/s
• 10 points per wavelength (5 would reduce number below by 8x)
• Mapping entire survey volume, not a subset (enables multisource)
[Table: number of racks and maximum imaging frequency for survey volumes of (512)^3 m^3, (4096)^3 m^3, and (16384)^3 m^3]
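The 8x remark follows from how grid spacing is tied to the shortest wavelength: halving the points per wavelength doubles the spacing in each of three dimensions. A small sketch of that arithmetic, where the 30 Hz maximum frequency is a hypothetical value chosen for illustration and the function names are not from the slides:

```python
def grid_spacing_m(v_min_m_s, f_max_hz, points_per_wavelength):
    """Spacing that samples the shortest wavelength (v_min / f_max) with the requested points."""
    return v_min_m_s / (f_max_hz * points_per_wavelength)


def grid_points(side_m, v_min_m_s, f_max_hz, points_per_wavelength):
    """Approximate number of grid points in a cubic survey volume of the given side length."""
    per_side = side_m / grid_spacing_m(v_min_m_s, f_max_hz, points_per_wavelength)
    return per_side ** 3


v_min = 1500.0   # m/s, minimum velocity from the slide
f_max = 30.0     # Hz, hypothetical maximum imaging frequency
side = 4096.0    # m, one of the survey sizes on the slide

dx10 = grid_spacing_m(v_min, f_max, 10)
n10 = grid_points(side, v_min, f_max, 10)
n5 = grid_points(side, v_min, f_max, 5)
ratio = n10 / n5
print(dx10, n10, ratio)
```

Because the point count scales with the cube of the maximum frequency as well, doubling the imaging frequency at a fixed survey size also costs 8x the grid points.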

Snapshot Data Easily Fits in Memory (No disk required) • Number of uncompressed snapshots that can be stored for various model sizes and numbers of nodes [Table: snapshot counts by model size and number of nodes] • 4x more capacity for BG/Q
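The snapshot count is aggregate memory divided by the size of one wavefield. A rough sketch, assuming 4-byte single-precision values, cubic models, and hypothetical per-node memory sizes of 4 GiB and 16 GiB chosen to illustrate the 4x ratio (the slide does not state them); memory used by the working wavefields and the velocity model is ignored:

```python
GIB = 2 ** 30


def snapshots_that_fit(n_side, n_nodes, bytes_per_node, bytes_per_value=4):
    """Whole uncompressed snapshots of an n_side^3 model that fit in aggregate memory."""
    snapshot_bytes = n_side ** 3 * bytes_per_value
    return (n_nodes * bytes_per_node) // snapshot_bytes


# Hypothetical configurations: 1024 nodes with 4 GiB or 16 GiB each.
small_nodes = snapshots_that_fit(1024, 1024, 4 * GIB)
large_nodes = snapshots_that_fit(1024, 1024, 16 * GIB)
smaller_model = snapshots_that_fit(512, 1024, 4 * GIB)
print(small_nodes, large_nodes, smaller_model)
```

With domain partitioning each node holds only its own slice of every snapshot, so the aggregate figure is the relevant one.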

Comparison
• Embarrassingly parallel RTM
  • Coarse-grain communication
  • Coarse-grain synchronization
  • Disk IO bottleneck
• Partitioned RTM
  • Fine-grain communication
  • Fine-grain synchronization
  • No scratch disk
  • Needs low latency and high bandwidth: Blue Gene

Conclusion: RTM can be dramatically accelerated
• Algorithmic:
  • Adopt partitioned, multisource RTM
  • Abandon embarrassingly parallel implementations
• Hardware:
  • Increase communication bandwidth
  • Decrease communication latency
  • Reduce node nondeterminism
• Advantages
  • Can process larger models - scales well
  • Avoids scratch disk IO bottleneck
  • Improves RAS & MTBF: No disk means no moving parts
• Disadvantages
  • Must handle shot “crosstalk” noise
  • Methods exist - research continuing…
