CS480/680: Intro to ML — Lecture 08: Soft-margin SVM — Yao-Liang Yu
Outline • Formulation • Dual • Optimization • Extension
Hard-margin SVM
• Primal (hard constraint: every training point must clear the margin):
min_{w,b} (1/2)||w||^2  s.t.  y_i(w^T x_i + b) ≥ 1 for all i
• Dual:
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j  s.t.  α_i ≥ 0,  Σ_i α_i y_i = 0
What if inseparable? Then no (w, b) satisfies all the hard constraints, and the hard-margin primal is infeasible.
Soft-margin (Cortes & Vapnik'95)
• Hard-margin primal (hard constraint):
min_{w,b} (1/2)||w||^2  s.t.  y_i(w^T x_i + b) ≥ 1
• Soft-margin primal (soft constraint):
min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i  s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• ||w|| ∝ 1/margin, so the first term favors a large margin
• Σ_i ξ_i upper-bounds the training error
• C is a hyper-parameter trading margin against training error
• ŷ = w^T x + b is the prediction (no sign taken yet)
Zero-one loss
• Find a prediction rule f so that, on an unseen random X, your prediction sign(f(X)) has a small chance of differing from the true label Y:
min_f  P(sign(f(X)) ≠ Y)
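As a quick numpy illustration (the function name is mine, not from the slides), the empirical version of this risk is just the misclassification rate:

```python
import numpy as np

def zero_one_risk(f_X, y):
    # Empirical estimate of P(sign(f(X)) != Y); f_X holds real-valued predictions.
    return np.mean(np.sign(f_X) != y)
```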
The hinge loss
• Hinge: ℓ(t) = max(0, 1 − t) with t = y·ŷ; it upper-bounds the zero-one loss
• A correctly classified point with 0 < t < 1 still suffers loss (it sits inside the margin)!
• Squared hinge: max(0, 1 − t)^2
• Exponential loss: exp(−t)
• Logistic loss: log(1 + exp(−t))
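A minimal numpy sketch of these losses (function names are my own), checking numerically that the hinge upper-bounds the zero-one loss:

```python
import numpy as np

def zero_one(t):  return (t <= 0).astype(float)      # 1 iff misclassified
def hinge(t):     return np.maximum(0.0, 1.0 - t)
def sq_hinge(t):  return np.maximum(0.0, 1.0 - t) ** 2
def exp_loss(t):  return np.exp(-t)
def logistic(t):  return np.log1p(np.exp(-t))        # log(1 + e^{-t})

t = np.linspace(-2.0, 2.0, 9)                        # t = y * yhat
assert np.all(hinge(t) >= zero_one(t))               # hinge dominates zero-one
```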
Classification-calibration
• We want to minimize the zero-one loss, but end up minimizing some other (surrogate) loss — when is that safe?
Theorem (Bartlett, Jordan & McAuliffe'06). A convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ'(0) < 0.
• Classification calibration: the minimizer of the surrogate risk E[ℓ(Y f(X))] has the same sign as P(Y = 1 | X) − 1/2, i.e., it agrees with the Bayes rule.
Outline • Formulation • Dual • Optimization • Extension
Important optimization trick
• A pointwise maximum inside a minimization can be traded for an extra variable, minimizing jointly over x and t:
min_x max(f(x), g(x))  =  min_{x,t} t  s.t.  f(x) ≤ t,  g(x) ≤ t
• Applied to the hinge terms, min Σ_i max(0, 1 − y_i ŷ_i) becomes min_{w,b,ξ} Σ_i ξ_i s.t. ξ_i ≥ 1 − y_i ŷ_i and ξ_i ≥ 0 — exactly the soft-margin constraints.
Slack for "wrong" prediction
• At the optimum, ξ_i = max(0, 1 − y_i ŷ_i): zero slack for points outside the margin, growing linearly once a point is inside the margin or misclassified.
Lagrangian
L(w, b, ξ; α, β) = (1/2)||w||^2 + C Σ_i ξ_i + Σ_i α_i (1 − ξ_i − y_i(w^T x_i + b)) − Σ_i β_i ξ_i,  with α_i ≥ 0, β_i ≥ 0
Dual problem
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j  s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
• Only the dot products x_i^T x_j are needed!
The effect of C
• The primal searches over (w, b) ∈ R^d × R (plus slacks); the dual searches over α ∈ R^n
• C → 0? The slack penalty vanishes, so the margin term dominates and the data are essentially ignored
• C → ∞? Slack becomes infinitely expensive and we recover the hard-margin SVM
Karush-Kuhn-Tucker conditions
• Primal constraints on w, b and ξ:  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Dual constraints on α and β:  α_i ≥ 0,  β_i ≥ 0
• Complementary slackness:  α_i (1 − ξ_i − y_i(w^T x_i + b)) = 0,  β_i ξ_i = 0
• Stationarity:  w = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0,  α_i + β_i = C
Parsing the equations
• α_i = 0: then β_i = C > 0 forces ξ_i = 0, so x_i is on the correct side of the margin (y_i ŷ_i ≥ 1)
• 0 < α_i < C: then β_i > 0 forces ξ_i = 0, and complementary slackness gives y_i ŷ_i = 1, so x_i lies exactly on the margin
• α_i = C: then β_i = 0, so ξ_i = 1 − y_i ŷ_i may be positive: x_i is inside the margin or misclassified
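Rendered as a small numpy helper (a sketch; the tolerance and names are mine):

```python
import numpy as np

def parse_alphas(alpha, C, tol=1e-8):
    outside   = alpha <= tol                        # correct side, outside the margin
    on_margin = (alpha > tol) & (alpha < C - tol)   # exactly on the margin
    violating = alpha >= C - tol                    # inside the margin or misclassified
    return outside, on_margin, violating
```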
Support Vectors
• The training points with α_i > 0 are the support vectors; by stationarity, w = Σ_i α_i y_i x_i depends on them alone, so removing the other points leaves the solution unchanged.
Recover b
• Take any i such that 0 < α_i < C
• Then x_i is on the margin hyperplane: y_i(w^T x_i + b) = 1, so b = y_i − w^T x_i
• How to recover ξ? With b known, ξ_i = max(0, 1 − y_i(w^T x_i + b))
• What if there is no such i? Then b is only bracketed by inequalities from the bound support vectors, and any value in the bracket is optimal.
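A numpy sketch of this recovery, assuming a dual solution alpha from any QP solver, data X (n×d) and labels y in {−1, +1}; all names are mine. It assumes at least one free support vector exists:

```python
import numpy as np

def recover_from_dual(alpha, X, y, C, tol=1e-8):
    # Stationarity: w = sum_i alpha_i y_i x_i, a combination of support vectors only.
    w = (alpha * y) @ X
    # Any "free" support vector (0 < alpha_i < C) lies exactly on the margin,
    # so y_i (w.x_i + b) = 1 gives b; average over all of them for stability.
    free = (alpha > tol) & (alpha < C - tol)
    b = np.mean(y[free] - X[free] @ w)
    # With b known, the slacks follow: xi_i = max(0, 1 - y_i * yhat_i).
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w, b, xi
```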
More examples
Outline • Formulation • Dual • Optimization • Extension
Gradient Descent
w ← w − η_t g_t, where g_t is a (generalized) gradient of the training objective
• Step size (learning rate) η_t: constant if the objective is smooth; diminishing otherwise
• Each step costs O(nd): the (generalized) gradient sums over all n training points!
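A sketch of (sub)gradient descent on the soft-margin primal in its hinge form, min (1/2)||w||^2 + C Σ_i max(0, 1 − y_i(w^T x_i + b)); function and variable names are my own:

```python
import numpy as np

def svm_gd(X, y, C=1.0, eta=1e-3, steps=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margin = y * (X @ w + b)
        active = margin < 1                        # points still suffering hinge loss
        grad_w = w - C * (y[active] @ X[active])   # subgradient; O(nd) per step
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```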
Stochastic Gradient Descent (SGD)
• The full gradient averages over n samples; a single random sample gives an unbiased estimate and suffices, dropping the per-step cost from O(nd) to O(d)
• Use a diminishing step size, e.g., 1/sqrt(t) or 1/t
• Refinements: averaging, momentum, variance reduction, etc.
• Sampling w/o replacement works well in practice: cycle through the data, permuting in each pass
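The same objective with single-sample updates and a 1/sqrt(t) step size; the C·n factor makes the per-sample subgradient an unbiased estimate of the full one. Again a sketch with assumed names:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta0=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):         # sample w/o replacement: permute each pass
            t += 1
            eta = eta0 / np.sqrt(t)          # diminishing step size
            if y[i] * (X[i] @ w + b) < 1:    # this sample suffers hinge loss
                w -= eta * (w - C * n * y[i] * X[i])   # O(d) per step
                b += eta * C * n * y[i]
            else:
                w -= eta * w                 # only the regularizer contributes
    return w, b
```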
The derivative
• Hinge ℓ(t) = max(0, 1 − t): ℓ'(t) = −1 for t < 1 and 0 for t > 1; at t = 1 it is not differentiable, but any value in [−1, 0] is a valid subgradient
• All the other losses above are differentiable
• What about the perceptron? Its loss max(0, −t) is likewise non-differentiable at a single point (t = 0)
• What about the zero-one loss? Its derivative is 0 wherever it exists, so (sub)gradient methods get no signal from it
Solving the dual
• Projected gradient: ascend the dual objective, then project back onto the constraint set; since the objective is quadratic, a constant step size η_t = η can be chosen
• Each iteration costs O(n²) (a multiplication by the n×n Gram matrix)
• Faster algorithms exist: e.g., SMO-style solvers choose a pair α_p, α_q and derive a closed-form update
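A projected-gradient sketch on the dual. To keep the projection a simple clipping to [0, C], this version drops the bias b and with it the equality constraint Σ_i α_i y_i = 0 — a simplification of mine, as are the names:

```python
import numpy as np

def svm_dual_pg(X, y, C=1.0, steps=1000):
    n = X.shape[0]
    Z = y[:, None] * X
    Q = Z @ Z.T                              # Q_ij = y_i y_j x_i.x_j: dot products only
    eta = 1.0 / np.linalg.norm(Q, 2)         # constant step size is fine: quadratic objective
    alpha = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - Q @ alpha               # gradient of sum(alpha) - 0.5 alpha'Q alpha; O(n^2)
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, project onto [0, C]^n
    return alpha
```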
A little history on optimization
• Gradient descent first mentioned in (Cauchy, 1847)
• First rigorous convergence proof in (Curry, 1944)
• SGD proposed and analyzed by (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline • Formulation • Dual • Optimization • Extension
Multiclass (Crammer & Singer'01)
• One weight vector w_k per class; predict argmax_k w_k^T x
• Separate the prediction for the correct class from the predictions for the wrong classes by a "safety margin":
w_{y_i}^T x_i − w_k^T x_i ≥ 1  for all i and all k ≠ y_i
• Soft-margin version is similar (add slacks)
• Many other variants
• Calibration theory is more involved
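A sketch of the resulting multiclass hinge loss (soft-margin form) given a score matrix S with S[i, k] = w_k^T x_i; shapes and names are assumptions:

```python
import numpy as np

def multiclass_hinge(S, y):
    # S: (n, K) class scores; y: (n,) integer labels in {0, ..., K-1}.
    n = S.shape[0]
    correct = S[np.arange(n), y]             # prediction for the correct class
    S = S.copy()
    S[np.arange(n), y] = -np.inf             # mask the correct class...
    worst = S.max(axis=1)                    # ...to find the strongest wrong class
    return np.maximum(0.0, 1.0 + worst - correct).sum()
```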
Regression (Drucker et al.'97)
• Support vector regression: fit ŷ = w^T x + b, but charge nothing for residuals within ε of the target (the ε-insensitive loss):
min_{w,b} (1/2)||w||^2 + C Σ_i max(0, |y_i − (w^T x_i + b)| − ε)
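The ε-insensitive loss itself, as a one-line numpy sketch (names assumed):

```python
import numpy as np

def eps_insensitive(y, yhat, eps=0.1):
    # Zero inside the eps-tube around the target, linear outside it.
    return np.maximum(0.0, np.abs(y - yhat) - eps)
```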
Large-scale training (You, Demmel, et al.'17)
• Randomly partition the training data evenly across p nodes
• Train an SVM independently on each node
• Compute the center of the data on each node
• For a test sample (see the sketch below):
• Find the nearest center, hence the owning node / SVM
• Predict using that node's SVM
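A toy rendering of this divide-and-conquer scheme, using scikit-learn's SVC for the per-node models; the partitioning and routing code (and all names) are mine, not the paper's implementation:

```python
import numpy as np
from sklearn.svm import SVC

def train_partitioned(X, y, p=4, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    parts = np.array_split(idx, p)                          # random, even partition
    centers = np.stack([X[part].mean(axis=0) for part in parts])
    models = [SVC(kernel="rbf").fit(X[part], y[part]) for part in parts]
    return centers, models

def predict_partitioned(centers, models, X_test):
    # Route each test point to the node whose center is nearest.
    node = np.argmin(np.linalg.norm(X_test[:, None] - centers[None], axis=2), axis=1)
    y_pred = np.empty(len(X_test), dtype=int)
    for k, model in enumerate(models):
        mask = node == k
        if mask.any():
            y_pred[mask] = model.predict(X_test[mask])
    return y_pred
```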
Questions?