CS 480/680 : Intro to ML

Presentation Transcript


  1. CS 480/680: Intro to ML Lecture 08: Soft-margin SVM Yao-Liang Yu

  2. Outline • Formulation • Dual • Optimization • Extension Yao-Liang Yu

  3. Hard-margin SVM Primal Dual hard constraint Yao-Liang Yu
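
The primal and dual on this slide are images that do not survive the transcript. For reference, a standard statement (notation assumed: weights w, bias b, labels y_i in {±1}, data x_i in R^d, not read off the slide) is:

```latex
% Hard-margin SVM (standard form)
% Primal: maximize the margin subject to the hard constraints
\min_{w,b}\ \tfrac{1}{2}\|w\|_2^2
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n

% Dual: a quadratic program over the multipliers \alpha
\max_{\alpha \ge 0}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad \sum_{i=1}^n \alpha_i y_i = 0
```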

  4. What if inseparable? Yao-Liang Yu

  5. Soft-margin (Cortes & Vapnik’95) Primal with soft constraints vs. primal with hard constraints. Labels on the slide: the regularizer is proportional to 1/margin, the slack term is the training error, C is a hyper-parameter, and the constraints use the raw prediction (no sign). Yao-Liang Yu
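
A standard form of the soft-margin primal that matches the labels on this slide (a sketch with the usual notation, not taken from the slide image):

```latex
% Soft-margin SVM primal: 1/2 ||w||^2 grows as the margin 1/||w|| shrinks,
% the slacks xi_i record the training error, and C is the trade-off hyper-parameter
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
```

The constraints use the raw prediction f(x_i) = w^T x_i + b rather than its sign; letting C grow without bound forces every ξ_i to 0 and recovers the hard-margin problem.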

  6. Zero-one loss • Find a prediction rule f so that on an unseen random X, our prediction sign(f(X)) has a small chance of differing from the true label Y Yao-Liang Yu
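
In symbols (standard definition, with the convention that f(X) = 0 counts as an error):

```latex
% Zero-one risk of a prediction rule f
L_{01}(f) \;=\; \Pr\big(\operatorname{sign}(f(X)) \ne Y\big) \;=\; \mathbb{E}\,\mathbf{1}\{\, Y f(X) \le 0 \,\}
```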

  7. The hinge loss • upper-bounds the zero-one loss • correctly classified points inside the margin still suffer loss! • other convex surrogates: squared hinge, exponential loss, logistic loss Yao-Liang Yu
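
The losses compared on this slide, written as functions of the margin t = y f(x) (standard definitions): hinge, squared hinge, and exponential upper-bound the zero-one loss, and the logistic loss does so after rescaling by 1/log 2. The hinge loss is positive for 0 < t < 1, which is why points inside the margin still suffer loss even when correctly classified.

```latex
% Margin losses \ell(t), with t = y f(x)
\ell_{01}(t) = \mathbf{1}\{t \le 0\}, \qquad
\ell_{\mathrm{hinge}}(t) = \max(0,\, 1 - t), \qquad
\ell_{\mathrm{sq\text{-}hinge}}(t) = \max(0,\, 1 - t)^2,

\ell_{\exp}(t) = e^{-t}, \qquad
\ell_{\mathrm{logistic}}(t) = \log\big(1 + e^{-t}\big)
```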

  8. Classification-calibration • Want to minimize zero-one loss • End up with minimizing some other loss Theorem (Bartlett, Jordan, McAuliffe’06). Any convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ’(0) < 0. Classification calibration: the minimizer of the surrogate risk has the same sign as 2η(x) − 1, where η(x) = P(Y = 1 | X = x), i.e., it agrees with the Bayes rule. Yao-Liang Yu

  9. Outline • Formulation • Dual • Optimization • Extension Yao-Liang Yu

  10. Important optimization trick: minimize jointly over x and t Yao-Liang Yu
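
The trick is the epigraph reformulation: introduce an auxiliary variable t and minimize jointly over x and t. Applied per example to the hinge loss, the auxiliary variables become the slacks ξ_i of the soft-margin primal.

```latex
% Epigraph trick: push the objective into a constraint
\min_x\ f(x) \;=\; \min_{x,\,t}\ t \quad\text{s.t.}\quad f(x) \le t

% Applied to the hinge loss of example i:
\max\big(0,\ 1 - y_i(w^\top x_i + b)\big) \le \xi_i
\;\Longleftrightarrow\;
\xi_i \ge 0 \ \text{ and } \ \xi_i \ge 1 - y_i(w^\top x_i + b)
```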

  11. Slack for “wrong” prediction Yao-Liang Yu
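
At the optimum each slack is tight against the hinge loss of its example, which is the sense in which ξ_i measures a "wrong" (margin-violating) prediction:

```latex
% Optimal slack = hinge loss of example i; positive exactly when x_i is
% inside the margin or on the wrong side of the hyperplane
\xi_i^\star = \max\big(0,\ 1 - y_i(w^\top x_i + b)\big)
```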

  12. Lagrangian Yao-Liang Yu
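
The Lagrangian itself is an image on the slide; with multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for ξ_i ≥ 0, the standard form is:

```latex
% Lagrangian of the soft-margin primal
L(w, b, \xi, \alpha, \beta)
= \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i
+ \sum_{i=1}^n \alpha_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big)
- \sum_{i=1}^n \beta_i \xi_i
```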

  13. Dual problem • only dot products are needed! Yao-Liang Yu
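
Eliminating w, b, and ξ from the Lagrangian gives the standard soft-margin dual; the data enter only through the inner products x_i^T x_j, which is what makes the kernel trick possible:

```latex
% Soft-margin dual: box constraints replace the nonnegativity of the hard-margin dual
\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^n \alpha_i y_i = 0
```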

  14. The effect of C • The primal variables (w, b) live in Rd × R (plus the slacks in Rn); the dual variable α lives in Rn • What happens as C → 0? • As C → ∞? Yao-Liang Yu

  15. Karush-Kuhn-Tucker conditions • Primal constraints on w, b and ξ: • Dual constraints on α and β: • Complementary slackness • Stationarity Yao-Liang Yu
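
Written out for the soft-margin problem above, the four groups of KKT conditions are:

```latex
% Primal feasibility
y_i(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
% Dual feasibility
\alpha_i \ge 0, \qquad \beta_i \ge 0
% Complementary slackness
\alpha_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big) = 0, \qquad \beta_i \xi_i = 0
% Stationarity
w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i + \beta_i = C
```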

  16. Parsing the equations Yao-Liang Yu
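
Combining stationarity (α_i + β_i = C) with complementary slackness yields the three cases the next two slides rely on:

```latex
\alpha_i = 0 \;\Rightarrow\; \xi_i = 0,\ \ y_i(w^\top x_i + b) \ge 1 \quad\text{(correct, outside the margin)}

0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0,\ \ y_i(w^\top x_i + b) = 1 \quad\text{(exactly on the margin)}

\alpha_i = C \;\Rightarrow\; \xi_i \ge 0,\ \ y_i(w^\top x_i + b) = 1 - \xi_i \quad\text{(margin violation when } \xi_i > 0\text{)}
```

The support vectors are exactly the points with α_i > 0; by stationarity, w is a combination of these points alone.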

  17. Support Vectors Yao-Liang Yu

  18. Recover b • Take any i such that 0 < αi < C • Then xi is on the margin hyperplane: yi(wᵀxi + b) = 1, hence b = yi − wᵀxi • How to recover ξ ? • What if there is no such i ? Yao-Liang Yu

  19. More examples Yao-Liang Yu

  20. Outline • Formulation • Dual • Optimization • Extension Yao-Liang Yu

  21. Gradient Descent • Step size (learning rate): constant if the objective L is smooth, diminishing otherwise • The hinge loss requires a (generalized) gradient • Each full-gradient step costs O(nd)! Yao-Liang Yu
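
A minimal NumPy sketch of this step, assuming the unconstrained hinge form of the primal from the earlier slides (an illustration, not the course's reference code); the O(nd) cost per iteration is the full pass over the data:

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, n_iters=500):
    """Batch (generalized) gradient descent on
       0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)).
       X: (n, d) array, y: (n,) array with entries +/-1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)                 # y_i * f(x_i)
        viol = margins < 1                        # examples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)  # subgradient in w
        grad_b = -C * y[viol].sum()                              # subgradient in b
        w -= lr * grad_w                          # O(nd) per iteration overall
        b -= lr * grad_b
    return w, b
```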

  22. Stochastic Gradient Descent (SGD) • diminishing step size, e.g., 1/sqrt{t} or 1/t • refinements: averaging, momentum, variance-reduction, etc. • sample w/o replacement: cycle or permute in each pass • the full gradient averages over n samples; a single random sample suffices, at O(d) per update Yao-Liang Yu
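
The same objective, but updated from one random example at a time so each step costs O(d) instead of O(nd). A minimal sketch with a diminishing 1/sqrt(t) step size, written for the equivalent averaged objective (λ/2)‖w‖² + (1/n) Σᵢ hinge with λ = 1/(Cn); averaging, momentum, and variance reduction are omitted:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta0=0.1, n_epochs=20, seed=0):
    """SGD on (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i*(w.x_i + b)),
       which matches the primal with C = 1/(lam*n)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # permute in each pass (no replacement)
            t += 1
            lr = eta0 / np.sqrt(t)            # diminishing step size
            if y[i] * (X[i] @ w + b) < 1:     # example i violates the margin
                grad_w, grad_b = lam * w - y[i] * X[i], -y[i]
            else:
                grad_w, grad_b = lam * w, 0.0
            w -= lr * grad_w                  # O(d) per update
            b -= lr * grad_b
    return w, b
```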

  23. The derivative • What about the perceptron loss? • What about the zero-one loss? • All other losses are differentiable Yao-Liang Yu
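
For the hinge loss the "derivative" is really a subderivative, since the loss has a kink at t = 1. By contrast, the zero-one loss has zero derivative almost everywhere, so gradient methods get no signal from it, and the perceptron loss max(0, −t) behaves like the hinge with its kink moved to 0.

```latex
% Subderivative of \ell(t) = \max(0, 1 - t) at t = y_i f(x_i)
\ell'(t) =
\begin{cases}
-1, & t < 1\\
[-1,\, 0], & t = 1 \quad\text{(subdifferential at the kink)}\\
0, & t > 1
\end{cases}
```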

  24. Solving the dual • Can choose a constant step size ηt = η • Faster algorithms exist: e.g., choose a pair of αp and αq and derive a closed-form update • Computing the full dual gradient costs O(n²) Yao-Liang Yu
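
A sketch of projected gradient ascent on the dual, simplified by dropping the bias b so the equality constraint Σᵢ αᵢ yᵢ = 0 disappears and the projection is just clipping α to the box [0, C] (an assumption made for brevity; the pairwise closed-form update mentioned on the slide, as in SMO, is not shown here):

```python
import numpy as np

def svm_dual_projected_gradient(X, y, C=1.0, n_iters=1000):
    """Projected gradient ascent on the bias-free soft-margin dual:
       max_alpha  sum_i alpha_i - 0.5 * alpha^T Q alpha,  0 <= alpha_i <= C,
       where Q_ij = y_i y_j x_i.x_j."""
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                           # O(n^2) memory and compute
    lr = 1.0 / np.linalg.norm(Q, 2)         # constant step size 1/L (L = largest eigenvalue of Q)
    alpha = np.zeros(X.shape[0])
    for _ in range(n_iters):
        grad = 1.0 - Q @ alpha              # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascent step + projection onto the box
    w = (alpha * y) @ X                     # recover the primal weights
    return alpha, w
```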

  25. A little history on optimization • Gradient descent mentioned first in (Cauchy, 1847) • First rigorous convergence proof (Curry, 1944) • SGD proposed and analyzed (Robbins & Monro, 1951) Yao-Liang Yu

  26. Herbert Robbins (1915 – 2001) Yao-Liang Yu

  27. Outline • Formulation • Dual • Optimization • Extension Yao-Liang Yu

  28. Multiclass (Crammer & Singer’01) • Separate the prediction for the correct class from the predictions for the wrong classes by a “safety margin” • Soft-margin version is similar • Many other variants • Calibration theory is more involved Yao-Liang Yu
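
A standard statement of the Crammer-Singer formulation (soft-margin form, one weight vector w_k per class; notation assumed): the score of the correct class must beat the score of every wrong class by the safety margin of 1, up to a shared slack ξ_i, and prediction is argmax_k w_k^T x.

```latex
% Crammer & Singer multiclass SVM (soft-margin form)
\min_{\{w_k\},\,\xi}\ \tfrac{1}{2}\sum_{k} \|w_k\|_2^2 + C\sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad
w_{y_i}^\top x_i - w_k^\top x_i \ge 1 - \xi_i \quad \forall\, k \ne y_i, \qquad \xi_i \ge 0
```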

  29. Regression (Drucker et al.’97) Yao-Liang Yu
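
The formulation on this slide is an image; the standard support vector regression primal with the ε-insensitive loss (Drucker et al.’97) is:

```latex
% Support vector regression: errors within the eps-tube are free,
% errors beyond it are charged linearly through xi_i and xi_i^*
\min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n (\xi_i + \xi_i^*)
\quad\text{s.t.}\quad
y_i - w^\top x_i - b \le \varepsilon + \xi_i,\quad
w^\top x_i + b - y_i \le \varepsilon + \xi_i^*,\quad
\xi_i,\ \xi_i^* \ge 0
```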

  30. Large-scale training (You, Demmel, et al.’17) • Randomly partition training data evenly into p nodes • Train SVM independently on each node • Compute center on each node • For a test sample • Find the nearest center (node / SVM) • Predict using the corresponding node / SVM Yao-Liang Yu
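
A minimal sketch of the scheme these bullets describe, using scikit-learn's LinearSVC as a stand-in per-node solver (the solver choice and the single-machine simulation of the p nodes are assumptions; the paper's distributed implementation is not reproduced here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_partitioned_svm(X, y, p=4, seed=0):
    """Randomly split the data into p 'nodes', train one SVM per node,
       and record each node's center (mean of its examples)."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), p)
    models = [LinearSVC().fit(X[idx], y[idx]) for idx in parts]
    centers = np.stack([X[idx].mean(axis=0) for idx in parts])
    return models, centers

def predict_partitioned_svm(models, centers, X_test):
    """Route each test point to the node with the nearest center and
       predict with that node's SVM."""
    d2 = ((X_test[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)              # index of the closest center per test point
    return np.array([models[k].predict(x[None, :])[0]
                     for k, x in zip(nearest, X_test)])
```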

  31. Questions? Yao-Liang Yu
