Stochastic Gradient Descent (SGD)

Presentation Transcript


  1. Stochastic Gradient Descent (SGD)

  2. Today’s Class
     • Stochastic Gradient Descent (SGD)
     • SGD Recap
     • Regression vs Classification
     • Generalization / Overfitting / Underfitting
     • Regularization
     • Momentum Updates / ADAM Updates

  3. Our function L(w)

  4. Our function L(w). An easy way to find a minimum (or maximum): find where the derivative is zero, i.e. solve dL/dw = 0 for w. For example, if L(w) = (w - 3)^2, then dL/dw = 2(w - 3), which is zero at w = 3.

  5. Our function L(w). But solving dL/dw = 0 is not easy for complex functions. (Figure: a complicated loss curve L.)

  6. Our function L(w). Or even for simpler-looking functions, setting the derivative to zero may have no closed-form solution. How do you find x then?

  7. Gradient Descent (GD) (idea)
     1. Start with a random value of w (e.g. w = 12).
     2. Compute the gradient (derivative) of L(w) at that point (e.g. dL/dw = 6 at w = 12).
     3. Recompute w as: w = w - lambda * (dL/dw).

  8. Gradient Descent (GD) (idea). Repeat steps 2 and 3: compute the gradient at the current point and update w = w - lambda * (dL/dw). After one update, w = 10.

  9. Gradient Descent (GD) (idea). Repeating the update keeps moving w downhill: after another step, w = 8.
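
To make the update rule concrete, here is a tiny runnable sketch. The slides never show the actual L(w), so this assumes L(w) = w^2 / 4 (which happens to give dL/dw = w/2 = 6 at the starting point w = 12) and a step size lambda = 1/3; both choices are illustrative only.

```python
# Minimal sketch of the GD idea on slides 7-9.
# Assumed (not from the slides): L(w) = w**2 / 4 and lambda = 1/3.

def L(w):
    return w ** 2 / 4

def dL_dw(w):
    return w / 2

w = 12.0       # 1. start from some value of w
lam = 1 / 3    # step size (lambda)

for step in range(5):
    grad = dL_dw(w)       # 2. gradient at the current w
    w = w - lam * grad    # 3. step against the gradient
    print(f"step {step + 1}: w = {w:.2f}, L(w) = {L(w):.2f}")
# w moves 12 -> 10 -> 8.33 -> ... toward the minimum at w = 0.
```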

  10. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          Compute dL/dw and dL/db over the whole training set
          Update w: w = w - lambda * (dL/dw)
          Update b: b = b - lambda * (dL/db)
          Print L(w, b)   // useful to see if the loss is becoming smaller or not
      end

  11. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          Compute dL/dw and dL/db over the whole training set   // expensive: every example is used for every single update
          Update w: w = w - lambda * (dL/dw)
          Update b: b = b - lambda * (dL/db)
          Print L(w, b)   // useful to see if the loss is becoming smaller or not
      end
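
A runnable sketch of this loop. The model and data are assumptions made just to have something concrete: a 1-input linear model y_hat = w*x + b with a squared-error loss (the model the deck introduces a few slides later) on synthetic data; the mean is used instead of the slide's sum, which only rescales lambda.

```python
import numpy as np

# Sketch of the full-batch GD loop on slides 10-11 (illustrative setup).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)                 # synthetic inputs
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # noisy targets

w, b = rng.normal(), rng.normal()                # initialize w and b randomly
lam = 0.1                                        # learning rate (lambda)
num_epochs = 100

for e in range(num_epochs):
    y_hat = w * x + b
    # Gradients over the WHOLE training set: every example is touched
    # for every single update; this is the expensive part.
    dL_dw = np.mean(2 * (y_hat - y) * x)
    dL_db = np.mean(2 * (y_hat - y))
    w = w - lam * dL_dw
    b = b - lam * dL_db
    if e % 10 == 0:
        print(e, np.mean((y_hat - y) ** 2))      # the loss should shrink
```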

  12. (mini-batch) Stochastic Gradient Descent (SGD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db on the mini-batch B only
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  13. (mini-batch) Stochastic Gradient Descent (SGD)
      The same loop with batch size |B| = 1: each gradient is computed from a single training example.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db for the single example in B
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
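
A sketch of the mini-batch loop under the same illustrative assumptions (linear model, squared-error loss, synthetic data). Setting batch_size = 1 gives the |B| = 1 case of slide 13.

```python
import numpy as np

# Sketch of mini-batch SGD (slides 12-13), illustrative setup as before.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
lam, num_epochs, batch_size = 0.1, 20, 10        # batch_size = 1 gives |B| = 1

for e in range(num_epochs):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]      # indices of the current mini-batch
        y_hat = w * x[B] + b
        # Gradients on the mini-batch only: a cheap approximation
        # of the full-dataset gradient.
        dL_dw = np.mean(2 * (y_hat - y[B]) * x[B])
        dL_db = np.mean(2 * (y_hat - y[B]))
        w, b = w - lam * dL_dw, b - lam * dL_db
    print(e, np.mean((w * x + b - y) ** 2))      # full-data loss, to watch it shrink
```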

  14. Regression vs Classification
      • Regression
        - Labels are continuous variables, e.g. distance.
        - Losses: distance-based losses, e.g. sum of distances to the true values.
        - Evaluation: mean distances, correlation coefficients, etc.
      • Classification
        - Labels are discrete variables (1 out of K categories).
        - Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
        - Evaluation: classification accuracy, etc.
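
A small numeric illustration of the two settings, with made-up numbers rather than the slide's exact definitions:

```python
import numpy as np

# Regression: a distance-based loss between continuous predictions and targets.
y_true = np.array([1.2, 0.7, 3.4])
y_pred = np.array([1.0, 1.1, 3.0])
mse = np.mean((y_pred - y_true) ** 2)                # mean squared error

# Classification: cross-entropy between predicted class probabilities
# and the true category (1 out of K = 3 here).
probs = np.array([[0.7, 0.2, 0.1],                   # predicted distribution per example
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])                            # true class indices
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

accuracy = np.mean(probs.argmax(axis=1) == labels)   # a typical evaluation metric
print(mse, cross_entropy, accuracy)
```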

  15. Linear Regression – 1 output, 1 input

  16. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b.

  17. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b, a line with slope w and intercept b fit to the data points (x_i, y_i).

  18. Linear Regression – 1 output, 1 input. Model: y_hat = w*x + b. Loss: L(w, b) = sum_i (w*x_i + b - y_i)^2, the sum of squared errors over the training examples.

  19. Quadratic Regression. Model: y_hat = w_2*x^2 + w_1*x + b. Loss: L(w, b) = sum_i (y_hat_i - y_i)^2.

  20. n-polynomial Regression. Model: y_hat = w_n*x^n + ... + w_2*x^2 + w_1*x + b. Loss: L(w, b) = sum_i (y_hat_i - y_i)^2.
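
A short sketch covering slides 15-20 in one place: a degree-n polynomial model with a squared-error loss, where n = 1 recovers the linear model and n = 2 the quadratic one. The function names and data are illustrative.

```python
import numpy as np

def predict(x, weights, b):
    """y_hat = w_1*x + w_2*x**2 + ... + w_n*x**n + b."""
    return sum(w_k * x ** (k + 1) for k, w_k in enumerate(weights)) + b

def loss(x, y, weights, b):
    """Sum of squared errors between predictions y_hat and targets y."""
    return np.sum((predict(x, weights, b) - y) ** 2)

x = np.array([0.0, 0.5, 1.0])
y = np.array([0.1, 0.8, 2.1])
print(loss(x, y, weights=[1.5], b=0.1))        # linear model (n = 1)
print(loss(x, y, weights=[1.5, 0.4], b=0.1))   # quadratic model (n = 2)
```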

  21. Overfitting. With a polynomial of degree 9 the training error can be driven to zero, but the fitted curve has high variance and generalizes poorly (overfitting). A linear model is too rigid: its error stays high because the bias is high (underfitting). A cubic model keeps the error low and fits the data well. Credit: C. Bishop. Pattern Recognition and Machine Learning.
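
To reproduce the flavor of Bishop's figure numerically: assuming 10 noisy samples of a sine curve (as in the book's example), fit polynomials of degree 1, 3, and 9 and compare the training error with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # 10 noisy training points
x_test = np.linspace(0, 1, 100)                          # fresh points from the true curve
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
# degree 1: both errors high (underfitting, high bias)
# degree 9: training error ~0, test error large (overfitting, high variance)
```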

  22. Regularization
      • Large weights lead to large variance, i.e. the model fits the training data too strongly.
      • Solution: minimize the loss but also try to keep the weight values small, i.e. minimize L(w, b) + alpha * R(w), where R(w) penalizes large weights and alpha controls the strength of the penalty.
      Credit: C. Bishop. Pattern Recognition and Mach. Learning.

  23. Regularization
      • Large weights lead to large variance, i.e. the model fits the training data too strongly.
      • Solution: minimize the loss but also try to keep the weight values small, i.e. minimize L(w, b) + alpha * R(w).
      • The added term R(w) is the regularizer, e.g. the L2 regularizer R(w) = ||w||^2 = sum_j w_j^2.
      Credit: C. Bishop. Pattern Recognition and Mach. Learning.

  24. SGD with Regularization (L-2)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db on the mini-batch B
              Update w: w = w - lambda * (dL_B/dw + 2*alpha*w)   // the extra term is the gradient of the L2 regularizer alpha*||w||^2
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
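
A sketch of the regularized update under the same illustrative setup. Here alpha is the regularization strength, and the bias b is left unregularized, which is a common convention the slide may or may not follow.

```python
import numpy as np

# Sketch of SGD with an L2 regularizer (slide 24), illustrative setup as before.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
lam, alpha, num_epochs, batch_size = 0.1, 0.01, 20, 10

for e in range(num_epochs):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]
        y_hat = w * x[B] + b
        # Gradient of (batch loss + alpha * w**2): the extra 2*alpha*w term
        # pulls the weight toward zero ("weight decay").
        dL_dw = np.mean(2 * (y_hat - y[B]) * x[B]) + 2 * alpha * w
        dL_db = np.mean(2 * (y_hat - y[B]))          # bias left unregularized
        w, b = w - lam * dL_dw, b - lam * dL_db
    print(e, np.mean((w * x + b - y) ** 2) + alpha * w ** 2)   # regularized objective
```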

  25. Revisiting Another Problem with SGD
      The mini-batch gradients dL_B/dw and dL_B/db are only approximations to the true gradient of the loss over the full training set.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  26. Revisiting Another Problem with SGD
      Because each update follows a noisy, batch-only gradient, it could lead to "un-learning" what has been learned in some previous steps of training. (Same loop as above: compute dL_B/dw and dL_B/db on the mini-batch, update w and b, print the loss.)
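
A quick numerical check of this point under the same illustrative setup: the gradient for w computed on a few small random mini-batches scatters around the gradient over the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)
w, b = 0.0, 0.0                                    # some current parameter values

full_grad = np.mean(2 * (w * x + b - y) * x)       # gradient over all 100 examples
for trial in range(3):
    B = rng.choice(len(x), size=5, replace=False)  # a small random mini-batch
    batch_grad = np.mean(2 * (w * x[B] + b - y[B]) * x[B])
    print(f"full: {full_grad:.3f}   batch: {batch_grad:.3f}")
# The batch estimates vary around the full gradient; an unlucky batch can push
# the parameters in a direction that partly undoes earlier progress.
```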

  27. Solution: Momentum Updates
      Keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
      Initialize w and b randomly
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw and dL_B/db
              Update w: w = w - lambda * (dL_B/dw)
              Update b: b = b - lambda * (dL_B/db)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end

  28. Solution: Momentum Updates
      Initialize w and b randomly; initialize a global accumulator v = 0 (kept across all batches and epochs)
      for e = 0 ... num_epochs do
          for each mini-batch B in 0 ... num_batches do
              Compute dL_B/dw
              Compute v = beta * v + (1 - beta) * (dL_B/dw)   // a weighted average of the previous accumulator and the current gradient
              Update w: w = w - lambda * v   (b is updated analogously with its own accumulator)
              Print L_B(w, b)   // useful to see if the loss is becoming smaller or not
          end
      end
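
A sketch of the momentum loop under the same illustrative setup. The slides only say to keep a weighted average of past and current gradients, so the particular form v = beta*v + (1 - beta)*g with beta = 0.9 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)

w, b = rng.normal(), rng.normal()
v_w, v_b = 0.0, 0.0                                # global accumulators, initialized once
lam, beta, num_epochs, batch_size = 0.1, 0.9, 20, 10

for e in range(num_epochs):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        B = order[start:start + batch_size]
        y_hat = w * x[B] + b
        g_w = np.mean(2 * (y_hat - y[B]) * x[B])   # noisy mini-batch gradients
        g_b = np.mean(2 * (y_hat - y[B]))
        v_w = beta * v_w + (1 - beta) * g_w        # smooth with the accumulator
        v_b = beta * v_b + (1 - beta) * g_b
        w, b = w - lam * v_w, b - lam * v_b        # step along the smoothed gradient
    print(e, np.mean((w * x + b - y) ** 2))        # full-data loss per epoch
```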

  29. More on Momentum https://distill.pub/2017/momentum/

  30. Questions?
