CS480/680: Intro to ML — Lecture 08: Soft-margin SVM — Yao-Liang Yu
Outline • Formulation • Dual • Optimization • Extension
Hard-margin SVM
• Primal (hard constraint: every training point must clear the margin):
min_{w,b} (1/2)||w||^2  s.t.  y_i(w^T x_i + b) ≥ 1 for all i
• Dual:
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j  s.t.  α_i ≥ 0,  Σ_i α_i y_i = 0
What if inseparable? Then no (w, b) satisfies all the hard constraints, and the hard-margin primal is infeasible.
Soft-margin (Cortes & Vapnik'95)
• Hard-margin primal (hard constraint):
min_{w,b} (1/2)||w||^2  s.t.  y_i(w^T x_i + b) ≥ 1
• Soft-margin primal (soft constraint):
min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i  s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• ||w|| ∝ 1/margin, so the first term favors a large margin
• Σ_i ξ_i upper-bounds the training error
• C is a hyper-parameter trading margin against training error
• ŷ = w^T x + b is the prediction (no sign taken yet)
Zero-one loss
• Find a prediction rule f so that, on an unseen random X, your prediction sign(f(X)) has a small chance of differing from the true label Y:
min_f  P(sign(f(X)) ≠ Y)
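As a quick numpy illustration (the function name is mine, not from the slides), the empirical version of this risk is just the misclassification rate:

```python
import numpy as np

def zero_one_risk(f_X, y):
    # Empirical estimate of P(sign(f(X)) != Y); f_X holds real-valued predictions.
    return np.mean(np.sign(f_X) != y)
```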
The hinge loss
• Hinge: ℓ(t) = max(0, 1 − t) with t = y·ŷ; it upper-bounds the zero-one loss
• A correctly classified point with 0 < t < 1 still suffers loss (it sits inside the margin)!
• Squared hinge: max(0, 1 − t)^2
• Exponential loss: exp(−t)
• Logistic loss: log(1 + exp(−t))
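A minimal numpy sketch of these losses (function names are my own), checking numerically that the hinge upper-bounds the zero-one loss:

```python
import numpy as np

def zero_one(t):  return (t <= 0).astype(float)      # 1 iff misclassified
def hinge(t):     return np.maximum(0.0, 1.0 - t)
def sq_hinge(t):  return np.maximum(0.0, 1.0 - t) ** 2
def exp_loss(t):  return np.exp(-t)
def logistic(t):  return np.log1p(np.exp(-t))        # log(1 + e^{-t})

t = np.linspace(-2.0, 2.0, 9)                        # t = y * yhat
assert np.all(hinge(t) >= zero_one(t))               # hinge dominates zero-one
```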
Classification-calibration
• We want to minimize the zero-one loss, but end up minimizing some other (surrogate) loss — when is that safe?
Theorem (Bartlett, Jordan & McAuliffe'06). A convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ'(0) < 0.
• Classification calibration: the minimizer of the surrogate risk E[ℓ(Y f(X))] has the same sign as P(Y = 1 | X) − 1/2, i.e., it agrees with the Bayes rule.
Outline • Formulation • Dual • Optimization • Extension
Important optimization trick
• A pointwise maximum inside a minimization can be traded for an extra variable, minimizing jointly over x and t:
min_x max(f(x), g(x))  =  min_{x,t} t  s.t.  f(x) ≤ t,  g(x) ≤ t
• Applied to the hinge terms, min Σ_i max(0, 1 − y_i ŷ_i) becomes min_{w,b,ξ} Σ_i ξ_i s.t. ξ_i ≥ 1 − y_i ŷ_i and ξ_i ≥ 0 — exactly the soft-margin constraints.
Slack for "wrong" prediction
• At the optimum, ξ_i = max(0, 1 − y_i ŷ_i): zero slack for points outside the margin, growing linearly once a point is inside the margin or misclassified.
Lagrangian
L(w, b, ξ; α, β) = (1/2)||w||^2 + C Σ_i ξ_i + Σ_i α_i (1 − ξ_i − y_i(w^T x_i + b)) − Σ_i β_i ξ_i,  with α_i ≥ 0, β_i ≥ 0
Dual problem
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j  s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
• Only the dot products x_i^T x_j are needed!
The effect of C
• The primal searches over (w, b) ∈ R^d × R (plus slacks); the dual searches over α ∈ R^n
• C → 0? The slack penalty vanishes, so the margin term dominates and the data are essentially ignored
• C → ∞? Slack becomes infinitely expensive and we recover the hard-margin SVM
Karush-Kuhn-Tucker conditions
• Primal constraints on w, b and ξ:  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Dual constraints on α and β:  α_i ≥ 0,  β_i ≥ 0
• Complementary slackness:  α_i (1 − ξ_i − y_i(w^T x_i + b)) = 0,  β_i ξ_i = 0
• Stationarity:  w = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0,  α_i + β_i = C
Parsing the equations
• α_i = 0: then β_i = C > 0 forces ξ_i = 0, so x_i is on the correct side of the margin (y_i ŷ_i ≥ 1)
• 0 < α_i < C: then β_i > 0 forces ξ_i = 0, and complementary slackness gives y_i ŷ_i = 1, so x_i lies exactly on the margin
• α_i = C: then β_i = 0, so ξ_i = 1 − y_i ŷ_i may be positive: x_i is inside the margin or misclassified
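Rendered as a small numpy helper (a sketch; the tolerance and names are mine):

```python
import numpy as np

def parse_alphas(alpha, C, tol=1e-8):
    outside   = alpha <= tol                        # correct side, outside the margin
    on_margin = (alpha > tol) & (alpha < C - tol)   # exactly on the margin
    violating = alpha >= C - tol                    # inside the margin or misclassified
    return outside, on_margin, violating
```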
Support Vectors
• The training points with α_i > 0 are the support vectors; by stationarity, w = Σ_i α_i y_i x_i depends on them alone, so removing the other points leaves the solution unchanged.
Recover b
• Take any i such that 0 < α_i < C
• Then x_i is on the margin hyperplane: y_i(w^T x_i + b) = 1, so b = y_i − w^T x_i
• How to recover ξ? With b known, ξ_i = max(0, 1 − y_i(w^T x_i + b))
• What if there is no such i? Then b is only bracketed by inequalities from the bound support vectors, and any value in the bracket is optimal.
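A numpy sketch of this recovery, assuming a dual solution alpha from any QP solver, data X (n×d) and labels y in {−1, +1}; all names are mine. It assumes at least one free support vector exists:

```python
import numpy as np

def recover_from_dual(alpha, X, y, C, tol=1e-8):
    # Stationarity: w = sum_i alpha_i y_i x_i, a combination of support vectors only.
    w = (alpha * y) @ X
    # Any "free" support vector (0 < alpha_i < C) lies exactly on the margin,
    # so y_i (w.x_i + b) = 1 gives b; average over all of them for stability.
    free = (alpha > tol) & (alpha < C - tol)
    b = np.mean(y[free] - X[free] @ w)
    # With b known, the slacks follow: xi_i = max(0, 1 - y_i * yhat_i).
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w, b, xi
```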
More examples
Outline • Formulation • Dual • Optimization • Extension
Gradient Descent
w ← w − η_t g_t, where g_t is a (generalized) gradient of the training objective
• Step size (learning rate) η_t: constant if the objective is smooth; diminishing otherwise
• Each step costs O(nd): the (generalized) gradient sums over all n training points!
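A sketch of (sub)gradient descent on the soft-margin primal in its hinge form, min (1/2)||w||^2 + C Σ_i max(0, 1 − y_i(w^T x_i + b)); function and variable names are my own:

```python
import numpy as np

def svm_gd(X, y, C=1.0, eta=1e-3, steps=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margin = y * (X @ w + b)
        active = margin < 1                        # points still suffering hinge loss
        grad_w = w - C * (y[active] @ X[active])   # subgradient; O(nd) per step
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```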
Stochastic Gradient Descent (SGD)
• The full gradient averages over n samples; a single random sample gives an unbiased estimate and suffices, dropping the per-step cost from O(nd) to O(d)
• Use a diminishing step size, e.g., 1/sqrt(t) or 1/t
• Refinements: averaging, momentum, variance reduction, etc.
• Sampling w/o replacement works well in practice: cycle through the data, permuting in each pass
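The same objective with single-sample updates and a 1/sqrt(t) step size; the C·n factor makes the per-sample subgradient an unbiased estimate of the full one. Again a sketch with assumed names:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta0=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):         # sample w/o replacement: permute each pass
            t += 1
            eta = eta0 / np.sqrt(t)          # diminishing step size
            if y[i] * (X[i] @ w + b) < 1:    # this sample suffers hinge loss
                w -= eta * (w - C * n * y[i] * X[i])   # O(d) per step
                b += eta * C * n * y[i]
            else:
                w -= eta * w                 # only the regularizer contributes
    return w, b
```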
The derivative
• Hinge ℓ(t) = max(0, 1 − t): ℓ'(t) = −1 for t < 1 and 0 for t > 1; at t = 1 it is not differentiable, but any value in [−1, 0] is a valid subgradient
• All the other losses above are differentiable
• What about the perceptron? Its loss max(0, −t) is likewise non-differentiable at a single point (t = 0)
• What about the zero-one loss? Its derivative is 0 wherever it exists, so (sub)gradient methods get no signal from it
Solving the dual
• Projected gradient: ascend the dual objective, then project back onto the constraint set; since the objective is quadratic, a constant step size η_t = η can be chosen
• Each iteration costs O(n²) (a multiplication by the n×n Gram matrix)
• Faster algorithms exist: e.g., SMO-style solvers choose a pair α_p, α_q and derive a closed-form update
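A projected-gradient sketch on the dual. To keep the projection a simple clipping to [0, C], this version drops the bias b and with it the equality constraint Σ_i α_i y_i = 0 — a simplification of mine, as are the names:

```python
import numpy as np

def svm_dual_pg(X, y, C=1.0, steps=1000):
    n = X.shape[0]
    Z = y[:, None] * X
    Q = Z @ Z.T                              # Q_ij = y_i y_j x_i.x_j: dot products only
    eta = 1.0 / np.linalg.norm(Q, 2)         # constant step size is fine: quadratic objective
    alpha = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - Q @ alpha               # gradient of sum(alpha) - 0.5 alpha'Q alpha; O(n^2)
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, project onto [0, C]^n
    return alpha
```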
A little history on optimization
• Gradient descent first mentioned in (Cauchy, 1847)
• First rigorous convergence proof in (Curry, 1944)
• SGD proposed and analyzed by (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline • Formulation • Dual • Optimization • Extension
Multiclass (Crammer & Singer'01)
• One weight vector w_k per class; predict argmax_k w_k^T x
• Separate the prediction for the correct class from the predictions for the wrong classes by a "safety margin":
w_{y_i}^T x_i − w_k^T x_i ≥ 1  for all i and all k ≠ y_i
• Soft-margin version is similar (add slacks)
• Many other variants
• Calibration theory is more involved
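A sketch of the resulting multiclass hinge loss (soft-margin form) given a score matrix S with S[i, k] = w_k^T x_i; shapes and names are assumptions:

```python
import numpy as np

def multiclass_hinge(S, y):
    # S: (n, K) class scores; y: (n,) integer labels in {0, ..., K-1}.
    n = S.shape[0]
    correct = S[np.arange(n), y]             # prediction for the correct class
    S = S.copy()
    S[np.arange(n), y] = -np.inf             # mask the correct class...
    worst = S.max(axis=1)                    # ...to find the strongest wrong class
    return np.maximum(0.0, 1.0 + worst - correct).sum()
```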
Regression (Drucker et al.'97)
• Support vector regression: fit ŷ = w^T x + b, but charge nothing for residuals within ε of the target (the ε-insensitive loss):
min_{w,b} (1/2)||w||^2 + C Σ_i max(0, |y_i − (w^T x_i + b)| − ε)
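The ε-insensitive loss itself, as a one-line numpy sketch (names assumed):

```python
import numpy as np

def eps_insensitive(y, yhat, eps=0.1):
    # Zero inside the eps-tube around the target, linear outside it.
    return np.maximum(0.0, np.abs(y - yhat) - eps)
```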
Large-scale training (You, Demmel, et al.'17)
• Randomly partition the training data evenly across p nodes
• Train an SVM independently on each node
• Compute the center of the data on each node
• For a test sample (see the sketch below):
• Find the nearest center, hence the owning node / SVM
• Predict using that node's SVM
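A toy rendering of this divide-and-conquer scheme, using scikit-learn's SVC for the per-node models; the partitioning and routing code (and all names) are mine, not the paper's implementation:

```python
import numpy as np
from sklearn.svm import SVC

def train_partitioned(X, y, p=4, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    parts = np.array_split(idx, p)                          # random, even partition
    centers = np.stack([X[part].mean(axis=0) for part in parts])
    models = [SVC(kernel="rbf").fit(X[part], y[part]) for part in parts]
    return centers, models

def predict_partitioned(centers, models, X_test):
    # Route each test point to the node whose center is nearest.
    node = np.argmin(np.linalg.norm(X_test[:, None] - centers[None], axis=2), axis=1)
    y_pred = np.empty(len(X_test), dtype=int)
    for k, model in enumerate(models):
        mask = node == k
        if mask.any():
            y_pred[mask] = model.predict(X_test[mask])
    return y_pred
```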
Questions?