# Stochastic Training is Not Necessary for Generalization

```bibtex
@article{Geiping2021StochasticTI,
  title   = {Stochastic Training is Not Necessary for Generalization},
  author  = {Jonas Geiping and Micah Goldblum and Phillip E. Pope and Michael Moeller and Tom Goldstein},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2109.14119}
}
```

It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be…
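As a rough illustration of the training setup the abstract describes, the sketch below shows full-batch gradient descent augmented by an explicit regularizer in place of the usual mini-batch recipe. The choice of regularizer (a penalty on the squared gradient norm), the gradient clipping, and all hyperparameter values are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch (assumptions, not the authors' exact recipe): full-batch
# gradient descent with an explicit gradient-norm penalty standing in for
# the implicit regularization usually attributed to mini-batch SGD noise.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 32)                     # toy "full batch" of 512 examples
y = torch.randint(0, 10, (512,))
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
alpha = 1e-3                                 # regularizer strength (assumed value)

for step in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)   # loss over the entire dataset
    # Explicit regularizer: squared norm of the full-batch gradient.
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    penalty = sum(g.pow(2).sum() for g in grads)
    (loss + alpha * penalty).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping keeps full-batch GD stable
    opt.step()
```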

#### 5 Citations

On the Implicit Biases of Architecture & Gradient Descent

- Computer Science
- ArXiv
- 2021

Based on a careful study of the behaviour of infinite-width networks trained by Bayesian inference and finite-width networks trained by gradient descent, it is found that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin.

The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2021

The Equilibrium Hypothesis is introduced and empirically validated: the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.

Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks

- Computer Science, Physics
- 2021

A theoretical framework is developed to study the geometry of learning dynamics in neural networks, and explicit symmetry breaking is revealed as a key mechanism behind the efficiency and stability of modern neural networks.

Depth Without the Magic: Inductive Bias of Natural Gradient Descent

- Mathematics, Computer Science
- 2021

In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories, giving rise to a surprising range of meaningful inductive biases: identifying…

Subspace Adversarial Training

- Computer Science
- 2021

Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust. However, a serious problem of catastrophic overfitting exists, i.e., the robust…

#### References

Showing 1-10 of 74 references

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

- Computer Science, Mathematics
- ICLR
- 2017

This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.

On the Generalization Benefit of Noise in Stochastic Gradient Descent

- Computer Science, Mathematics
- ICML
- 2020

This paper performs carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set.

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

- Mathematics, Computer Science
- ArXiv
- 2018

It is proved that SGD tends to converge to flatter minima in the asymptotic regime (although it may take exponential time to converge) regardless of the batch size, and that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse.

Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks

- Mathematics, Computer Science
- 2018 Information Theory and Applications Workshop (ITA)
- 2018

It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

- Computer Science, Mathematics
- ICML
- 2018

The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero; it is still unclear why these interpolated solutions perform well on test data.

On the Origin of Implicit Regularization in Stochastic Gradient Descent

- Computer Science, Mathematics
- ICLR
- 2021

It is proved that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss.
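For context, the modified loss this summary refers to is, in that line of work, the original loss plus a learning-rate-scaled penalty on the mini-batch gradient norms. The exact form below is reproduced from memory and should be checked against the paper; treat the constants as an assumption.

```latex
% Assumed form of the modified loss followed by the mean SGD iterate
% (learning rate \epsilon, m mini-batch losses \hat{C}_k per epoch):
\tilde{C}(\omega) \;=\; C(\omega) \;+\; \frac{\epsilon}{4m}\sum_{k=1}^{m}\bigl\lVert \nabla \hat{C}_k(\omega) \bigr\rVert^{2}
```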

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

- Mathematics, Computer Science
- ICLR
- 2018

It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.
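A quantitative takeaway often quoted from that paper, stated here from memory and therefore as an assumption to verify against the original: the scale of SGD noise, and hence the optimal batch size, grows with the learning rate and the training set size.

```latex
% Assumed scaling relations (learning rate \epsilon, training set size N, batch size B):
g \;=\; \epsilon\left(\frac{N}{B} - 1\right) \;\approx\; \frac{\epsilon N}{B} \quad (B \ll N),
\qquad B_{\mathrm{opt}} \;\propto\; \epsilon N .
```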

Sharp Minima Can Generalize For Deep Nets

- Computer Science, Mathematics
- ICML
- 2017

It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.
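The rescaling symmetry this summary alludes to is easy to see concretely: for ReLU networks, multiplying one layer's weights by a positive factor and dividing the next layer's by the same factor leaves the computed function unchanged while drastically changing weight scales, so flatness measures that are not invariant to this reparametrization can be manipulated at will. A small sketch of that invariance:

```python
# ReLU rescaling symmetry: relu(alpha * z) = alpha * relu(z) for alpha > 0,
# so scaling layer 1 up and layer 2 down leaves the network function unchanged
# even though curvature-based "sharpness" of the weights changes dramatically.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((4, 16))
x = rng.standard_normal(8)
relu = lambda z: np.maximum(z, 0.0)

alpha = 100.0
out_original = W2 @ relu(W1 @ x)
out_rescaled = (W2 / alpha) @ relu((alpha * W1) @ x)
print(np.allclose(out_original, out_rescaled))   # True: same function, different parameters
```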

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

- Mathematics, Computer Science
- NIPS
- 2017

This work proposes a "random walk on random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.
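My reading of Ghost Batch Normalization, sketched below under that assumption: within each large batch, batch-norm statistics are computed over smaller "ghost" sub-batches, recovering small-batch normalization noise without changing the number of parameter updates. The module and argument names here are mine, not from the paper.

```python
# Hypothetical sketch of Ghost Batch Normalization: split a large batch into
# ghost batches and normalize each with its own statistics.
import torch
import torch.nn as nn

class GhostBatchNorm1d(nn.Module):
    def __init__(self, num_features, ghost_batch_size=32):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        if not self.training or x.shape[0] <= self.ghost_batch_size:
            return self.bn(x)
        # Normalize each ghost batch separately, then re-concatenate.
        chunks = x.split(self.ghost_batch_size, dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

# Usage: drop-in replacement for nn.BatchNorm1d when training with large batches.
layer = GhostBatchNorm1d(64, ghost_batch_size=32)
out = layer(torch.randn(256, 64))   # statistics computed over 8 ghost batches of 32
```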

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

- Computer Science, Mathematics
- ICLR
- 2020

The empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning, and the optimizer enables the use of very large batch sizes of 32868 without any degradation of performance.