
Softmax Classifier


Presentation Transcript


  1. Softmax Classifier

  2. Today’s Class • Softmax Classifier • Inference / Making Predictions / Test Time • Training a Softmax Classifier • Stochastic Gradient Descent (SGD)

  3. Supervised Learning – Classification. [Figure: training images labeled cat, dog, cat, …, bear, shown alongside unlabeled test images.]

  4. Supervised Learning – Classification. [Figure: the training images only, labeled cat, dog, cat, …, bear.]

  5. Supervised Learning – Classification. Training Data: inputs x_i, targets / labels / ground truth y_i (1 = cat, 2 = dog, 3 = bear), and predictions f(x_i). We need to find a function f that maps x to y for any of them. How do we "learn" the parameters of this function? We choose the ones that make the following quantity small: the total cost of the predictions on the training set, L = sum_i Cost( f(x_i), y_i ).

  6. Supervised Learning – Linear Softmax. Training Data: inputs x_i with targets / labels / ground truth given as class indices: 1, 2, 1, …, 3 (cat, dog, cat, …, bear).

  7. Supervised Learning – Linear Softmax. Training Data: targets / labels / ground truth are one-hot vectors and predictions are probability vectors:
      [1 0 0]  ->  [0.85 0.10 0.05]
      [0 1 0]  ->  [0.20 0.70 0.10]
      [1 0 0]  ->  [0.40 0.45 0.15]
      . . .
      [0 0 1]  ->  [0.40 0.25 0.35]

  8. Supervised Learning – Linear Softmax. For an input x with target [1 0 0], the model computes one score per class, a_i = w_i · x + b_i, and turns the scores into probabilities with the softmax function: p_i = exp(a_i) / sum_j exp(a_j).
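A minimal NumPy sketch of this mapping; the scores and the class ordering here are made-up illustrative values, not numbers from the slides:

    import numpy as np

    def softmax(scores):
        # Subtract the max for numerical stability; does not change the result.
        shifted = scores - np.max(scores)
        exp_scores = np.exp(shifted)
        return exp_scores / exp_scores.sum()

    # Hypothetical linear scores a_i = w_i . x + b_i for classes (cat, dog, bear)
    scores = np.array([2.0, 0.5, -1.0])
    probs = softmax(scores)
    print(probs)           # probabilities that sum to 1
    print(probs.argmax())  # 0 -> the "cat" class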

  9. How do we find a good w and b? For targets like [1 0 0], we need to find the w and b that minimize the following: L(w, b) = - sum over the training examples of log p_label(x), the negative log-probability assigned to the correct class. Why? Driving this loss down pushes the predicted probability of the true class toward 1 on the training data.
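A small sketch of that loss for a single example, assuming the softmax helper above; the probabilities are illustrative, not the slide's numbers:

    import numpy as np

    def cross_entropy(probs, label):
        # Negative log-probability assigned to the correct class.
        return -np.log(probs[label])

    probs = np.array([0.85, 0.10, 0.05])   # model output for one image
    print(cross_entropy(probs, 0))         # correct class 0 (cat): small loss, about 0.16
    print(cross_entropy(probs, 2))         # wrong class 2 (bear): large loss, about 3.0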

  10. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
        Compute: dL/dw and dL/db
        Update w: w = w – lambda * (dL / dw)
        Update b: b = b – lambda * (dL / db)
        Print: L(w, b)   // useful to see if this is becoming smaller or not
      end

  11. Gradient Descent (GD) (idea) 1. Start with a random value of w (e.g. w = 12). 2. Compute the gradient (derivative) of L(w) at point w = 12 (e.g. dL/dw = 6). 3. Recompute w as: w = w – lambda * (dL / dw). [Figure: the loss curve L(w) with the current point marked at w = 12]

  12. Gradient Descent (GD) (idea) 2. Compute the gradient (derivative) of L(w) at the current point. 3. Recompute w as: w = w – lambda * (dL / dw). [Figure: after one update the point has moved to w = 10]

  13. Gradient Descent (GD) (idea) 2. Compute the gradient (derivative) of L(w) at the current point. 3. Recompute w as: w = w – lambda * (dL / dw). [Figure: after another update the point has moved to w = 8]
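A runnable toy version of these three steps. The quadratic loss and the learning rate lambda = 1/3 are assumptions chosen so the first step matches the slide's numbers (w = 12, dL/dw = 6); the lecture's actual curve is not given:

    # Toy 1-D gradient descent on a made-up loss L(w) = (w - 6)^2 / 2.
    def L(w):
        return 0.5 * (w - 6) ** 2

    def dL_dw(w):
        return w - 6

    w = 12.0        # 1. start with a random value of w
    lam = 1.0 / 3   # lambda, the learning rate
    for step in range(10):
        g = dL_dw(w)        # 2. gradient at the current point
        w = w - lam * g     # 3. w = w - lambda * (dL / dw)
        print(step, round(w, 3), round(L(w), 3))
    # w moves 12 -> 10 -> 8.67 -> ... toward the minimum at w = 6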

  14. Our function L(w)

  15. Our function L(w)

  16. Our function L(w)

  17. Gradient Descent (GD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
        Compute: dL/dw and dL/db   // expensive: requires a pass over the whole training set
        Update w: w = w – lambda * (dL / dw)
        Update b: b = b – lambda * (dL / db)
        Print: L(w, b)   // useful to see if this is becoming smaller or not
      end

  18. (mini-batch) Stochastic Gradient Descent (SGD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
        for b = 0 ... num_batches do
          Compute: dL/dw and dL/db on mini-batch b only
          Update w: w = w – lambda * (dL / dw)
          Update b: b = b – lambda * (dL / db)
          Print: L(w, b)   // useful to see if this is becoming smaller or not
        end
      end

  19. [Figure only] Source: Andrew Ng

  20. (mini-batch) Stochastic Gradient Descent (SGD)
      Initialize w and b randomly
      for e = 0 ... num_epochs do
        for b = 0 ... num_batches do
          Compute: dL/dw and dL/db for a mini-batch of size |B| = 1 (a single example)
          Update w: w = w – lambda * (dL / dw)
          Update b: b = b – lambda * (dL / db)
          Print: L(w, b)   // useful to see if this is becoming smaller or not
        end
      end
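A compact NumPy sketch of this training loop for the linear softmax classifier. The synthetic data, batch size, learning rate, and epoch count are all assumptions for illustration, and the gradient dL/da = p - y is the result derived on slides 21-38:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: 300 examples, 10 features, 3 classes (cat, dog, bear).
    N, D, C = 300, 10, 3
    X = rng.normal(size=(N, D))
    y = rng.integers(0, C, size=N)
    X[np.arange(N), y] += 2.0            # make the classes roughly separable

    w = 0.01 * rng.normal(size=(C, D))   # initialize w and b randomly
    b = np.zeros(C)
    lam, batch_size, num_epochs = 0.1, 32, 20

    def softmax(a):
        a = a - a.max(axis=1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=1, keepdims=True)

    for epoch in range(num_epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            probs = softmax(xb @ w.T + b)          # forward pass
            probs[np.arange(len(idx)), yb] -= 1.0  # dL/da = p - y (one-hot target)
            dw = probs.T @ xb / len(idx)           # dL/dw averaged over the batch
            db = probs.mean(axis=0)                # dL/db
            w -= lam * dw                          # update w
            b -= lam * db                          # update b
        loss = -np.log(softmax(X @ w.T + b)[np.arange(N), y]).mean()
        print(epoch, round(loss, 4))               # useful to see if this is becoming smaller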

  21. Computing Analytic Gradients. This is what we have: for one training example, L = -log p_label, where p_i = exp(a_i) / sum_j exp(a_j) and a_i = w_i · x + b_i.

  22. Computing Analytic Gradients. This is what we have: L = -log p_label. Reminder: the prediction p_i = exp(a_i) / sum_j exp(a_j) is itself a function of all the scores a_j, which in turn depend on w and b.

  23. Computing Analytic Gradients. This is what we have: written out in full, L = -log( exp(a_label) / sum_j exp(a_j) ) = -a_label + log( sum_j exp(a_j) ).

  24. Computing Analytic Gradients. This is what we have: the loss L above. This is what we need: dL/dw_i for each class i, and dL/db_i for each class i.

  25. Computing Analytic Gradients. This is what we have: L as a function of the scores a_i. Step 1: Chain Rule of Calculus: dL/dw_i = (dL/da_i) * (da_i/dw_i) and dL/db_i = (dL/da_i) * (da_i/db_i).

  26. Computing Analytic Gradients. This is what we have: the chain-rule factorization above. Step 1: Chain Rule of Calculus. Let's do these first: the two easy pieces, da_i/dw_i and da_i/db_i.

  27.–29. Computing Analytic Gradients. Since a_i = w_i · x + b_i, the score depends linearly on its own parameters: da_i/dw_i = x and da_i/db_i = 1 (and a_i does not depend on w_j or b_j for any other class j).

  30. Computing Analytic Gradients. This is what we have: da_i/dw_i = x and da_i/db_i = 1. Step 1: Chain Rule of Calculus. Now let's do this one (it is the same for both!): dL/da_i.

  31. Computing Analytic Gradients In our cat, dog, bear classification example: i = {0, 1, 2}

  32. Computing Analytic Gradients. In our cat, dog, bear classification example: i = {0, 1, 2}. Let's say: label = 1. We need: dL/da_0, dL/da_1, and dL/da_2.

  33. Computing Analytic Gradients. Differentiating L = -a_label + log( sum_j exp(a_j) ) with respect to a score a_i gives dL/da_i = exp(a_i) / sum_j exp(a_j) - 1 when i = label, and dL/da_i = exp(a_i) / sum_j exp(a_j) otherwise.

  34. Remember this slide? [1 0 0] The term exp(a_i) / sum_j exp(a_j) appearing in the derivative is exactly the softmax prediction p_i from slide 8.

  35.–38. Computing Analytic Gradients. So dL/da_i = p_i - 1 for the correct class and dL/da_i = p_i for every other class. With label = 1 this is the prediction vector minus the one-hot target: dL/da = p - [0 1 0], or in general dL/da = p - y. Plugging back into Step 1: dL/dw_i = (p_i - y_i) * x and dL/db_i = p_i - y_i.
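A short NumPy check of this result: the analytic gradient dL/da = p - y agrees with a finite-difference estimate. The scores and label below are made-up values for the cat, dog, bear example:

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    def loss(a, label):
        return -np.log(softmax(a)[label])

    a = np.array([2.0, 0.5, -1.0])   # hypothetical scores for cat, dog, bear
    label = 1                        # true class: dog

    # Analytic gradient: dL/da = p - y, where y is the one-hot target.
    p = softmax(a)
    y = np.zeros_like(p)
    y[label] = 1.0
    analytic = p - y

    # Numerical gradient for comparison.
    eps = 1e-6
    numeric = np.zeros_like(a)
    for i in range(len(a)):
        d = np.zeros_like(a)
        d[i] = eps
        numeric[i] = (loss(a + d, label) - loss(a - d, label)) / (2 * eps)

    print(analytic)
    print(numeric)   # the two should agree to roughly 6 decimal places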

  39. Supervised Learning – Softmax Classifier • Extract features • Run features through classifier • Get predictions
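A minimal sketch of that test-time pipeline. The feature extractor and the parameters here are stand-ins, not the course's; in practice w and b would come from the training loop above:

    import numpy as np

    def extract_features(image):
        # Stand-in feature extractor: in the course this could be raw pixels
        # or a hand-designed descriptor; here we simply flatten the image.
        return np.asarray(image, dtype=float).ravel()

    def predict(image, w, b, class_names=("cat", "dog", "bear")):
        x = extract_features(image)
        scores = w @ x + b                  # linear scores a_i = w_i . x + b_i
        e = np.exp(scores - scores.max())
        probs = e / e.sum()                 # softmax probabilities
        return class_names[int(probs.argmax())], probs

    # Hypothetical usage with a random "image" and random parameters:
    rng = np.random.default_rng(1)
    image = rng.random((4, 4))
    w, b = rng.normal(size=(3, 16)), np.zeros(3)
    print(predict(image, w, b))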

  40. Overfitting. When the model is linear, the training error is high: underfitting / high bias. When the model is cubic, the training error is low and the fit generalizes well. When the model is a polynomial of degree 9, the training error is zero! but the model overfits the training points: high variance. Credit: C. Bishop. Pattern Recognition and Machine Learning.
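A quick way to see this numerically: fit polynomials of increasing degree to a small noisy dataset and compare training and test error. The sine-curve data, noise level, and seed are assumptions (loosely in the spirit of Bishop's figure), not the slide's data:

    import numpy as np

    rng = np.random.default_rng(0)

    # 10 noisy samples of a sine curve; a clean grid serves as "test" data.
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=10)
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_err, 4), round(test_err, 4))
    # degree 1: both errors high (underfitting); degree 9: training error near 0
    # but test error typically much larger than for degree 3 (overfitting).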

  41. More … • Regularization • Momentum updates • Hinge Loss, Least Squares Loss, Logistic Regression Loss

  42. Assignment 2 – Linear Margin-Classifier. Training Data: targets / labels / ground truth are one-hot vectors and predictions are class scores:
      [1 0 0]  ->  [4.3 -1.3 1.1]
      [0 1 0]  ->  [0.5 5.6 -4.2]
      [1 0 0]  ->  [3.3 3.5 1.1]
      . . .
      [0 0 1]  ->  [1.1 -5.3 -9.4]
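For reference, one common loss for a margin classifier over raw class scores is the multiclass hinge (SVM) loss. This sketch is a generic illustration, not necessarily the exact loss specified in Assignment 2:

    import numpy as np

    def multiclass_hinge_loss(scores, label, margin=1.0):
        # Penalize every wrong class whose score is not at least `margin`
        # below the score of the correct class.
        correct = scores[label]
        margins = np.maximum(0.0, scores - correct + margin)
        margins[label] = 0.0           # do not count the correct class itself
        return margins.sum()

    scores = np.array([4.3, -1.3, 1.1])      # first row of the table above
    print(multiclass_hinge_loss(scores, 0))  # correct class scores well -> loss 0.0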

  43. Supervised Learning – Linear Softmax. (Same model as slide 8: scores a_i = w_i · x + b_i mapped to softmax probabilities, with targets like [1 0 0].)

  44. How do we find a good w and b? For targets like [1 0 0], we need to find the w and b that minimize the loss from slide 9: L(w, b) = - sum over the training examples of log p_label(x). Why?

  45. Questions?
