Stochastic Gradient Descent (SGD)

Presentation Transcript


  1. Stochastic Gradient Descent (SGD)

  2. Today’s Class
     • Stochastic Gradient Descent (SGD)
     • SGD Recap
     • Regression vs Classification
     • Generalization / Overfitting / Underfitting
     • Regularization
     • Momentum Updates / ADAM Updates

  3. Our function L(w)

  4. Our function L(w). An easy way to find a minimum (or maximum): find where the derivative is zero, i.e. solve dL/dw = 0 for w. For example, if L(w) = (w - 3)^2, then dL/dw = 2(w - 3), which is zero at w = 3.

  5. Our function L(w). But solving dL/dw = 0 is not easy for complex functions. (Figure: a complicated loss curve L.)

  6. Our function L(w). Or even for simpler-looking functions, setting the derivative to zero may have no closed-form solution. How do you find x then?

  7. Gradient Descent (GD) (idea)
     1. Start with a random value of w (e.g. w = 12).
     2. Compute the gradient (derivative) of L(w) at that point (e.g. dL/dw = 6 at w = 12).
     3. Recompute w as: w = w - lambda * (dL/dw).

  8. Gradient Descent (GD) (idea). Repeat steps 2 and 3: compute the gradient at the current point and update w = w - lambda * (dL/dw). After one update, w = 10.

  9. Gradient Descent (GD) (idea). Repeating the update keeps moving w downhill: after another step, w = 8.
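
To make the update rule concrete, here is a tiny runnable sketch. The slides never show the actual L(w), so this assumes L(w) = w^2 / 4 (which happens to give dL/dw = w/2 = 6 at the starting point w = 12) and a step size lambda = 1/3; both choices are illustrative only.

```python
# Minimal sketch of the GD idea on slides 7-9.
# Assumed (not from the slides): L(w) = w**2 / 4 and lambda = 1/3.

def L(w):
    return w ** 2 / 4

def dL_dw(w):
    return w / 2

w = 12.0       # 1. start from some value of w
lam = 1 / 3    # step size (lambda)

for step in range(5):
    grad = dL_dw(w)       # 2. gradient at the current w
    w = w - lam * grad    # 3. step against the gradient
    print(f"step {step + 1}: w = {w:.2f}, L(w) = {L(w):.2f}")
# w moves 12 -> 10 -> 8.33 -> ... toward the minimum at w = 0.
```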

  10. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          Compute dL/dw and dL/db over the whole training set
          Update w: w = w - lambda * (dL/dw)
          Update b: b = b - lambda * (dL/db)
          Print L(w, b)   // useful to see if the loss is becoming smaller or not
      end

  11. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          Compute dL/dw and dL/db over the whole training set   // expensive: every example is used for every single update
          Update w: w = w - lambda * (dL/dw)
          Update b: b = b - lambda * (dL/db)
          Print L(w, b)   // useful to see if the loss is becoming smaller or not
      end
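
A runnable sketch of this loop. The model and data are assumptions made just to have something concrete: a 1-input linear model y_hat = w*x + b with a squared-error loss (the model the deck introduces a few slides later) on synthetic data; the mean is used instead of the slide's sum, which only rescales lambda.

```python
import numpy as np

# Sketch of the full-batch GD loop on slides 10-11 (illustrative setup).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)                 # synthetic inputs
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # noisy targets

w, b = rng.normal(), rng.normal()                # initialize w and b randomly
lam = 0.1                                        # learning rate (lambda)
num_epochs = 100

for e in range(num_epochs):
    y_hat = w * x + b
    # Gradients over the WHOLE training set: every example is touched
    # for every single update; this is the expensive part.
    dL_dw = np.mean(2 * (y_hat - y) * x)
    dL_db = np.mean(2 * (y_hat - y))
    w = w - lam * dL_dw
    b = b - lam * dL_db
    if e % 10 == 0:
        print(e, np.mean((y_hat - y) ** 2))      # the loss should shrink
```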

  12. (mini-batch) Stochastic Gradient Descent (SGD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db on the mini-batch B only
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  13. (mini-batch) Stochastic Gradient Descent (SGD)
      The same loop with batch size |B| = 1: each gradient is computed from a single training example.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db for the single example in B
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
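
A sketch of the mini-batch loop under the same illustrative assumptions (linear model, squared-error loss, synthetic data). Setting batch_size = 1 gives the |B| = 1 case of slide 13.

```python
import numpy as np

# Sketch of mini-batch SGD (slides 12-13), illustrative setup as before.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
lam, num_epochs, batch_size = 0.1, 20, 10        # batch_size = 1 gives |B| = 1

for e in range(num_epochs):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]      # indices of the current mini-batch
        y_hat = w * x[B] + b
        # Gradients on the mini-batch only: a cheap approximation
        # of the full-dataset gradient.
        dL_dw = np.mean(2 * (y_hat - y[B]) * x[B])
        dL_db = np.mean(2 * (y_hat - y[B]))
        w, b = w - lam * dL_dw, b - lam * dL_db
    print(e, np.mean((w * x + b - y) ** 2))      # full-data loss, to watch it shrink
```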

  14. Regression vs Classification
      • Regression
        - Labels are continuous variables, e.g. distance.
        - Losses: distance-based losses, e.g. sum of distances to the true values.
        - Evaluation: mean distances, correlation coefficients, etc.
      • Classification
        - Labels are discrete variables (1 out of K categories).
        - Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
        - Evaluation: classification accuracy, etc.
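
A small numeric illustration of the two settings, with made-up numbers rather than the slide's exact definitions:

```python
import numpy as np

# Regression: a distance-based loss between continuous predictions and targets.
y_true = np.array([1.2, 0.7, 3.4])
y_pred = np.array([1.0, 1.1, 3.0])
mse = np.mean((y_pred - y_true) ** 2)                # mean squared error

# Classification: cross-entropy between predicted class probabilities
# and the true category (1 out of K = 3 here).
probs = np.array([[0.7, 0.2, 0.1],                   # predicted distribution per example
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])                            # true class indices
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

accuracy = np.mean(probs.argmax(axis=1) == labels)   # a typical evaluation metric
print(mse, cross_entropy, accuracy)
```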

  15. Linear Regression – 1 output, 1 input

  16. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b.

  17. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b, a line with slope w and intercept b fit to the data points (x_i, y_i).

  18. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b. Loss: L(w, b) = sum_i (w*x_i + b - y_i)^2, the sum of squared errors over the training examples.

  19. Quadratic Regression. Model: y_hat = w_2*x^2 + w_1*x + b. Loss: L(w, b) = sum_i (y_hat_i - y_i)^2.

  20. n-polynomial Regression. Model: y_hat = w_n*x^n + ... + w_2*x^2 + w_1*x + b. Loss: L(w, b) = sum_i (y_hat_i - y_i)^2.
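
A short sketch covering slides 15-20 in one place: a degree-n polynomial model with a squared-error loss, where n = 1 recovers the linear model and n = 2 the quadratic one. The function names and data are illustrative.

```python
import numpy as np

def predict(x, weights, b):
    """y_hat = w_1*x + w_2*x**2 + ... + w_n*x**n + b."""
    return sum(w_k * x ** (k + 1) for k, w_k in enumerate(weights)) + b

def loss(x, y, weights, b):
    """Sum of squared errors between predictions y_hat and targets y."""
    return np.sum((predict(x, weights, b) - y) ** 2)

x = np.array([0.0, 0.5, 1.0])
y = np.array([0.1, 0.8, 2.1])
print(loss(x, y, weights=[1.5], b=0.1))        # linear model (n = 1)
print(loss(x, y, weights=[1.5, 0.4], b=0.1))   # quadratic model (n = 2)
```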

  21. Overfitting. With a polynomial of degree 9 the training error can be driven to zero, but the fitted curve has high variance and generalizes poorly (overfitting). A linear model is too rigid: its error stays high because the bias is high (underfitting). A cubic model keeps the error low and fits the data well. Credit: C. Bishop. Pattern Recognition and Machine Learning.
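
To reproduce the flavor of Bishop's figure numerically: assuming 10 noisy samples of a sine curve (as in the book's example), fit polynomials of degree 1, 3, and 9 and compare the training error with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # 10 noisy training points
x_test = np.linspace(0, 1, 100)                          # fresh points from the true curve
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
# degree 1: both errors high (underfitting, high bias)
# degree 9: training error ~0, test error large (overfitting, high variance)
```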

  22. Regularization
      • Large weights lead to large variance, i.e. the model fits the training data too strongly.
      • Solution: minimize the loss but also try to keep the weight values small, i.e. minimize L(w, b) + alpha * R(w), where R(w) penalizes large weights and alpha controls the strength of the penalty.
      Credit: C. Bishop. Pattern Recognition and Mach. Learning.

  23. Regularization
      • Large weights lead to large variance, i.e. the model fits the training data too strongly.
      • Solution: minimize the loss but also try to keep the weight values small, i.e. minimize L(w, b) + alpha * R(w).
      • The added term R(w) is the regularizer, e.g. the L2 regularizer R(w) = ||w||^2 = sum_j w_j^2.
      Credit: C. Bishop. Pattern Recognition and Mach. Learning.

  24. SGD with Regularization (L-2)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db on the mini-batch B
              Update w: w = w - lambda * (dL_B/dw + 2*alpha*w)   // the extra term is the gradient of the L2 regularizer alpha*||w||^2
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
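
A sketch of the regularized update under the same illustrative setup. Here alpha is the regularization strength, and the bias b is left unregularized, which is a common convention the slide may or may not follow.

```python
import numpy as np

# Sketch of SGD with an L2 regularizer (slide 24), illustrative setup as before.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
lam, alpha, num_epochs, batch_size = 0.1, 0.01, 20, 10

for e in range(num_epochs):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]
        y_hat = w * x[B] + b
        # Gradient of (batch loss + alpha * w**2): the extra 2*alpha*w term
        # pulls the weight toward zero ("weight decay").
        dL_dw = np.mean(2 * (y_hat - y[B]) * x[B]) + 2 * alpha * w
        dL_db = np.mean(2 * (y_hat - y[B]))          # bias left unregularized
        w, b = w - lam * dL_dw, b - lam * dL_db
    print(e, np.mean((w * x + b - y) ** 2) + alpha * w ** 2)   # regularized objective
```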

  25. Revisiting Another Problem with SGD
      The mini-batch gradients dL_B/dw and dL_B/db are only approximations to the true gradient of the loss over the full training set.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  26. Revisiting Another Problem with SGD
      Because each update follows a noisy, batch-only gradient, it could lead to "un-learning" what has been learned in some previous steps of training. (Same loop as above: compute dL_B/dw and dL_B/db on the mini-batch, update w and b, print the loss.)
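
A quick numerical check of this point under the same illustrative setup: the gradient for w computed on a few small random mini-batches scatters around the gradient over the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)
w, b = 0.0, 0.0                                    # some current parameter values

full_grad = np.mean(2 * (w * x + b - y) * x)       # gradient over all 100 examples
for trial in range(3):
    B = rng.choice(len(x), size=5, replace=False)  # a small random mini-batch
    batch_grad = np.mean(2 * (w * x[B] + b - y[B]) * x[B])
    print(f"full: {full_grad:.3f}   batch: {batch_grad:.3f}")
# The batch estimates vary around the full gradient; an unlucky batch can push
# the parameters in a direction that partly undoes earlier progress.
```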

  27. Solution: Momentum Updates
      Keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  28. Solution: Momentum Updates
      Initialize w and b randomly; initialize a global accumulator v = 0 (kept across all batches and epochs)
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw
              Compute v = beta * v + (1 - beta) * (dL_B/dw)   // a weighted average of the previous accumulator and the current gradient
              Update w: w = w - lambda * v   (b is updated analogously with its own accumulator)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
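
A sketch of the momentum loop under the same illustrative setup. The slides only say to keep a weighted average of past and current gradients, so the particular form v = beta*v + (1 - beta)*g with beta = 0.9 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
v_w, v_b = 0.0, 0.0                                # global accumulators, initialized once
lam, beta, num_epochs, batch_size = 0.1, 0.9, 20, 10

for e in range(num_epochs):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]
        y_hat = w * x[B] + b
        g_w = np.mean(2 * (y_hat - y[B]) * x[B])   # noisy mini-batch gradients
        g_b = np.mean(2 * (y_hat - y[B]))
        v_w = beta * v_w + (1 - beta) * g_w        # smooth with the accumulator
        v_b = beta * v_b + (1 - beta) * g_b
        w, b = w - lam * v_w, b - lam * v_b        # step along the smoothed gradient
    print(e, np.mean((w * x + b - y) ** 2))        # full-data loss per epoch
```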

  29. More on Momentum https://distill.pub/2017/momentum/

  30. Questions?
