CosmoFlow: using Deep Learning to learn the Universe at scale. Debbie Bard, Data Science Engagement Group Lead, NERSC.
 
                
Amrita Mathuriya (Intel), Deborah Bard (NERSC), Peter Mendygral (Cray), Lawrence Meadows (Intel), James Arnemann (UC Berkeley), Lei Shao (Intel), Siyu He (LBNL), Tuomas Karna (Intel), Daina Moise (Cray), John Pennycook (Intel), Kristyn Maschoff (Cray), Jason Sewall (Intel), Nalini Kumar (Intel), Shirley Ho (LBNL), Mike Ringenburg (Cray), Prabhat (NERSC), Victor Lee (Intel) CosmoFlow: a collaboration between LBNL, Intel and Cray
Cosmology What is the universe made of? What are the physical laws that govern the universe? What is the expansion history of the universe?
What does the universe look like? The 2dF Galaxy Survey mapped galaxy locations on the sky in 3D. [Figure annotation: "Further away"]
Dark Matter in the universe • Matter is not distributed randomly in the universe • Structure of matter is the result of the interplay of: • Initial expansion after the Big Bang • Gravitational pull of matter • Accelerated expansion due to Dark Energy • The distribution of matter tells us about the physics of the universe. Image credit: http://astronomy.swin.edu.au/~gmackie/BigBang/universe.htm
Cosmology • Cosmology is not really an experimental science • Run simulations of theoretical universes and compare them to measurements from telescope observations. • Generally compare reduced statistics, e.g. the power spectrum. • This loses a lot of the information contained in the full structure of matter in the universe. Image credit: N. MCCURDY (UC-HIPACC) / R. KAEHLER, R. WECHSLER (STANFORD U.) / M. BUSHA (U. ZURICH) / SDSS
Machine Learning the universe • Ravanbakhsh et al. developed a network to predict two cosmological parameters based on the 3D distribution of dark matter • Parameter estimation improved by a factor of ~3 compared to traditional methods • Computational issues in taking this further: training takes too long to scale to more data/parameters Ravanbakhsh, Oliva, Fromenteau, Price, Ho, Schneider, Poczos; https://arxiv.org/abs/1711.02033
CosmoFlow • Based on pioneering work by CMU team, scale up: • The dataset • The problem size • The number of parameters predicted • Optimise performance of TensorFlow for 3D volumes on KNL • Run fast, effective multi-node training on all of Cori • Predict 3 cosmological parameters from the dark matter distribution
Training data • Ran fast N-body simulations using MUSIC (to generate initial conditions) and pycola (to simulate dark matter) • (512 h⁻¹ Mpc)³ simulation volumes containing 512³ dark matter particles • Evolved using pycola from random density fluctuations provided by MUSIC to a redshift of zero (i.e. today). • Produced 12,632 simulations, varying 3 cosmological parameters • 𝛺m: proportion of matter in the universe (assuming 𝛺m + 𝛺DE = 1). • 𝜎8: amplitude of mass fluctuations in the universe at 8 Mpc/h scale. • Ns: scalar spectral index of spatial curvature of spacetime. • Use evenly-spaced random sampling of these parameters in ranges based on best experimental measurements.
Training data • Each simulation cube is histogrammed into a 256³-voxel 3D histogram of dark matter particle counts. • Split into 8 sub-volumes -> 101,056 data samples of 128³ voxels • Convert to TFRecord format • Total amount of data produced during simulation production: ~100TB • Training data used: ~1.4TB [Figure: example 128³-voxel training sample showing 3D histogram of dark matter distribution]
CosmoFlow Network Design • Based on design in Ravanbakhsh et al. • 7 convolution layers followed by average-pooling layers with stride of (2,2,2) to reduce the spatial dimensions of the inputs. • 3 fully-connected layers • All layers use leaky ReLU as activation function
Changes in the network We made a number of changes to the network to adapt to the problem size, and to optimise performance • Use 128³-voxel volume • Additional convolution layer + average-pooling layer • Retain overall topology while keeping the number of network parameters manageable with increased problem size • Predict 3 cosmological parameters • Increase number of output neurons to 3 • Performance optimisations • Increase number of output channels for all convolutional layers to a multiple of 16 for efficient vectorisation • Remove batch-norm layers for efficient scaling and compute performance • Use one batch size for all experiments; observed no degradation in accuracy.
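The histogram-and-split step on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the production pipeline: the helper names `histogram_particles` and `split_into_octants` are hypothetical, the particle positions are uniform random stand-ins rather than pycola output, and the grid is shrunk to 16³ voxels so the example runs quickly.

```python
import numpy as np

def histogram_particles(positions, box_size, n_voxels):
    """Count particles per voxel on a regular n_voxels^3 grid."""
    edges = np.linspace(0.0, box_size, n_voxels + 1)
    counts, _ = np.histogramdd(positions, bins=(edges, edges, edges))
    return counts

def split_into_octants(cube):
    """Split a cubic volume into 8 sub-volumes with half the side length."""
    half = cube.shape[0] // 2
    return [
        cube[i:i + half, j:j + half, k:k + half]
        for i in (0, half)
        for j in (0, half)
        for k in (0, half)
    ]

# Stand-in sizes (assumption): the slides use 512^3 particles,
# a 256^3 histogram and 128^3 sub-volumes.
rng = np.random.default_rng(0)
box_size = 512.0
positions = rng.uniform(0.0, box_size, size=(10_000, 3))

cube = histogram_particles(positions, box_size, n_voxels=16)
samples = split_into_octants(cube)

# Sample count quoted on the slide: 12,632 simulations x 8 sub-volumes.
n_samples = 12_632 * 8
```

Each sub-volume keeps the raw particle counts, so the eight samples together contain exactly the particles of the parent cube.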
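The "multiple of 16" rule for output channels amounts to rounding each channel count up. A minimal sketch follows; the function name and the example channel counts are hypothetical and are not the actual CosmoFlow layer sizes. A plausible reason for 16 is that a 512-bit vector register holds 16 single-precision values.

```python
def round_up_to_multiple(channels, multiple=16):
    """Smallest multiple of `multiple` that is >= channels."""
    return ((channels + multiple - 1) // multiple) * multiple

# Hypothetical channel counts, for illustration only.
original_channels = [2, 12, 64, 100, 128]
padded_channels = [round_up_to_multiple(c) for c in original_channels]
print(padded_channels)
```

Counts that are already a multiple of 16 are left unchanged, so the padding only costs extra parameters in layers that would otherwise vectorise poorly.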
Single-node optimisations • Use Intel VTune to identify hotspots, including 3D convolution and pooling operators. • Optimise these operators within MKL-DNN library • Optimise 3D convolution operators by: • vectorising inner loops, • applying cache- and register-blocking, • threading. • Other hotspots include element-wise operators in TensorFlow • Optimise with loop-level parallelism with OpenMP
Single-node performance • The majority of floating-point operations occur in the forward and backward convolution layers. • Larger convolutions achieve >1 TFlop/s. • Smaller convolutions are slower but less significant. • We achieve 535 Gflop/s performance on a single KNL node, including the overhead of I/O and the CPE ML Plugin. Single-node workload breakdown: 1 I/O and OpenMP master thread, 63 OpenMP threads, 4 MPI plugin threads.
Multi-node optimisations • Data-parallel approach due to volume of data • Fully synchronous stochastic gradient descent • Every MPI rank is a worker computing gradients • Include gradient aggregation after local gradient calculation, before model update • Uses pool of helper threads for communication, organised into teams • Each thread in team progresses portion of gradient aggregation independently with infrequent synchronisation
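The synchronous data-parallel scheme on this slide can be illustrated with a small single-process simulation. This is a sketch under assumptions, not the CPE ML Plugin: the workers are simulated in a loop, `allreduce_mean` is a hypothetical stand-in for an MPI allreduce, and a toy linear-regression model replaces the network.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)^2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)

def allreduce_mean(grads):
    """Stand-in for an MPI allreduce: every worker gets the same average."""
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
n_workers, local_batch, n_features = 4, 8, 3
true_w = np.array([1.0, -2.0, 0.5])
x = rng.normal(size=(n_workers * local_batch, n_features))
y = x @ true_w

# Data parallelism: each worker holds one shard of the global batch.
x_shards = np.split(x, n_workers)
y_shards = np.split(y, n_workers)

w = np.zeros(n_features)  # identical model replica on every worker
lr = 0.5
for step in range(100):
    # 1. Every worker computes a gradient on its local data.
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # 2. Gradients are aggregated before the model update.
    g = allreduce_mean(grads)
    # 3. Every worker applies the same update, so replicas stay in sync.
    w = w - lr * g
```

With equal-sized shards, the averaged gradient is identical to the gradient over the whole global batch, which is why the global batch size grows with the number of workers.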
Multi-node optimisations • Use Cray CPE ML Plugin • MPI-based, framework-independent plugin for parallelizing DL networks • No modification to TensorFlow; add calls in the Python training script • Skip parameter servers, update worker nodes directly for efficient training • Avoid “straggler effect” with non-blocking MPI communication to hide node imbalances
Scaling performance • Achieve 77% scaling efficiency at 8192 nodes • Measure walltime per epoch (throughput) • Captures end-to-end capability, including communication, I/O, interconnect, single-node performance… Note: Batch size per node is constant (= 1), so global batch size is # nodes.
Scaling: the importance of IO • Poor scaling on Lustre beyond 512 nodes • 58% at 1024 nodes. • Tests with dummy data (i.e. not read from FS) showed cause was I/O • Use DataWarp (“Burst Buffer”) to achieve 77% scaling efficiency at 8192 nodes • Attribute to: • Higher available read bandwidth from DataWarp • SSDs more suited to random, small read pattern • Less heavily-utilised resources
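A scaling efficiency such as the 58% and 77% figures quoted here can be computed from time per epoch. The sketch below assumes efficiency is measured against ideal linear speed-up from a reference run; the baseline actually used in the talk is not stated, and the timings are hypothetical.

```python
def scaling_efficiency(ref_nodes, ref_epoch_time, nodes, epoch_time):
    """Parallel efficiency from time per epoch, relative to a reference run.

    Ideal scaling means epoch time shrinks in proportion to node count.
    """
    ideal_time = ref_epoch_time * ref_nodes / nodes
    return ideal_time / epoch_time

# Hypothetical timings in seconds per epoch, not measurements from the talk.
eff = scaling_efficiency(ref_nodes=1, ref_epoch_time=1000.0,
                         nodes=8, epoch_time=160.0)
print(f"{eff:.1%}")
```

Because the metric uses end-to-end epoch time, slow I/O lowers the efficiency in the same way as slow communication would.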
Convergence • Comparing the loss functions for 2048-node and 8192-node training shows better convergence in fewer epochs on fewer nodes. • The 8192-node run had not fully converged, even after double the number of training epochs. • Considerable effort went into tuning optimisation parameters for large-scale runs; more work will improve convergence behaviour. • Large-scale runs have a large effective batch size, which makes convergence hard. • Use a polynomial learning rate decay: larger learning rates early in training, slowing later to aid convergence to a local minimum.
Parameter estimation • Parameter estimation comparable to best experimental uncertainty for 𝛺m and 𝜎8, almost 5x smaller for Ns. • We obtain relative errors of (0.0022, 0.0094, 0.0096) for (𝛺m, 𝜎8, Ns) with the 2048-node run. • The 8192-node run was still learning and had not yet converged. • Estimates within a factor of 2 of best experimental results • Better accuracy is possible with hyper-parameter/optimiser tuning
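A polynomial learning rate decay of the kind mentioned on this slide can be written as a small function. The form below is the common one (interpolate from an initial to a final rate over a fixed number of steps); the specific rates and step count are hypothetical, not the values used in the training runs.

```python
def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Learning rate that decays from initial_lr to end_lr over decay_steps."""
    step = min(step, decay_steps)  # hold at end_lr once decay is finished
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr

# Hypothetical schedule, for illustration only.
schedule = [polynomial_decay(s, initial_lr=0.002, end_lr=0.0001, decay_steps=1000)
            for s in (0, 500, 1000, 2000)]
```

With `power=1` the decay is linear: the rate starts high to make fast progress and shrinks steadily so that later updates settle into a minimum.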
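One common way to compute a per-parameter relative error is to average |predicted − true| / |true| over the test samples. The sketch below assumes that definition, which may differ in detail from the one behind the numbers on this slide, and uses made-up values purely to show the calculation.

```python
import numpy as np

def mean_relative_error(predicted, true):
    """Mean of |predicted - true| / |true| per parameter (column)."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.mean(np.abs(predicted - true) / np.abs(true), axis=0)

# Made-up values, one row per test sample, columns ordered as
# (Omega_m, sigma_8, N_s). These are not results from the talk.
true = np.array([[0.30, 0.80, 0.96],
                 [0.32, 0.82, 0.95]])
predicted = true * np.array([1.01, 0.98, 1.00])

errors = mean_relative_error(predicted, true)
```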
Parameter estimation • Experimental observations using dark matter distribution alone generally do not measure Ns accurately. • This work demonstrates it is possible using machine learning techniques. Dark Energy Survey, https://arxiv.org/abs/1708.01530
Conclusions • CosmoFlow is a highly scalable deep learning application built on top of the TensorFlow framework • We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. • To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully-synchronous training. • We predict the cosmological parameters 𝛺m, 𝜎8 and Ns with unprecedented accuracy.
Conclusions • Paper accepted to SC18 • Soon to appear on arXiv • Very successful collaboration between NERSC, Intel and Cray experts • Future plans: extending to 4 cosmological parameters, adding time dimension (i.e. redshift snapshots)...
Comparison to Planck CMB-only results Planck collaboration 2016, https://arxiv.org/abs/1502.01589
Optimiser details For all training runs, we use Adam as our base optimizer with 𝛽1 = 0.9, 𝛽2 = 0.999, 𝜀 = 10e-8. We combine Adam with the Layer-wise Adaptive Rate Control (LARC) technique and a polynomial (power=1) learning rate decay schedule. LARC is a variant of Layer-wise Adaptive Rate Scaling (LARS) [30]. LARS/LARC adjust the magnitude of the update with respect to the weight norm for each layer for better control of training speed and stability. On top of LARS, LARC includes a clip operation for each layer such that the effective learning rate will not exceed the nominal learning rate for Adam.
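The clip operation described here can be sketched as a per-layer rate computation. This is a simplified illustration of the idea, not the optimiser used in the training runs: the function name, the `trust_coefficient` value and the `eps` guard are assumptions, and the Adam update itself is omitted.

```python
import numpy as np

def larc_effective_lr(weights, grads, nominal_lr,
                      trust_coefficient=0.002, eps=1e-8):
    """Per-layer learning rate, clipped so it never exceeds nominal_lr."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return nominal_lr  # no norm information: fall back to nominal rate
    # LARS-style local rate: scale by weight norm over gradient norm.
    local_lr = trust_coefficient * w_norm / (g_norm + eps)
    # LARC clip: the effective rate is capped at the nominal rate.
    return min(local_lr, nominal_lr)

layer_weights = np.array([0.6, 0.8])                  # norm 1.0
large_grads = np.array([6.0, 8.0])                    # norm 10.0
small_grads = np.array([0.06, 0.08])                  # norm 0.1

lr_large = larc_effective_lr(layer_weights, large_grads, nominal_lr=0.001)
lr_small = larc_effective_lr(layer_weights, small_grads, nominal_lr=0.001)
```

A layer with large gradients relative to its weights gets a reduced rate, while a layer with small gradients is held at the nominal rate instead of being scaled up.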