
Introduction to SVMs

Learn about SVMs, classifiers derived from statistical learning theory that use maximum margins and kernel methods to construct nonlinear decision boundaries efficiently.


Presentation Transcript


  1. Introduction to SVMs

  2. SVMs • Geometric view: maximizing the margin • Kernel methods: making nonlinear decision boundaries linear, efficiently • Capacity: structural risk minimization

  3. SVM History • SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis • SVM was first introduced by Boser, Guyon and Vapnik in COLT-92 • SVM became famous when, using pixel maps as input, it gave accuracy comparable to NNs with hand-designed features in a handwriting recognition task • SVM is closely related to: • Kernel machines (a generalization of SVMs), large margin classifiers, reproducing kernel Hilbert space, Gaussian process, Boosting

  4. Linear Classifiers • f(x, w, b) = sign(w . x - b) • (figure: data points labeled +1 and -1) • How would you classify this data?

  5. Linear Classifiers • f(x, w, b) = sign(w . x - b) • (figure: data points labeled +1 and -1) • How would you classify this data?

  6. Linear Classifiers • f(x, w, b) = sign(w . x - b) • (figure: data points labeled +1 and -1) • How would you classify this data?

  7. Linear Classifiers • f(x, w, b) = sign(w . x - b) • (figure: data points labeled +1 and -1) • How would you classify this data?

  8. Linear Classifiers • f(x, w, b) = sign(w . x - b) • (figure: data points labeled +1 and -1) • Any of these would be fine… but which is best?
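
A minimal sketch of the decision rule f(x, w, b) = sign(w . x - b) described on these slides; the weights, bias, and points below are made up for illustration.

```python
# Minimal sketch of the linear decision rule f(x, w, b) = sign(w . x - b).
import numpy as np

def linear_classify(X, w, b):
    """Label each row of X as +1 or -1 according to sign(w . x - b)."""
    scores = X @ w - b
    return np.where(scores >= 0, 1, -1)

w = np.array([2.0, -1.0])          # hypothetical weight vector
b = 0.5                            # hypothetical threshold
X = np.array([[1.0, 0.0],          # three toy points
              [0.0, 2.0],
              [3.0, 1.0]])
print(linear_classify(X, w, b))    # -> [ 1 -1  1]
```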

  9. Classifier Margin • f(x, w, b) = sign(w . x - b) • Define the margin of a linear classifier as the width by which the boundary could be widened before hitting a datapoint.

  10. Maximum Margin • f(x, w, b) = sign(w . x - b) • The maximum margin linear classifier is the linear classifier with the maximum margin. • This is the simplest kind of SVM, called a linear SVM (LSVM).

  11. Maximum Margin • f(x, w, b) = sign(w . x - b) • The maximum margin linear classifier is the linear classifier with the maximum margin. • This is the simplest kind of SVM, called a linear SVM (LSVM). • Support vectors are the datapoints that the margin pushes up against.

  12. Why Maximum Margin? • Intuitively this feels safest. • If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. • There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. • Empirically it works very, very well.

  13. A “Good” Separator • (figure: scatter of X and O points)

  14. Noise in the Observations • (figure: scatter of X and O points)

  15. Ruling Out Some Separators • (figure: scatter of X and O points)

  16. Lots of Noise • (figure: scatter of X and O points)

  17. Maximizing the Margin • (figure: scatter of X and O points)

  18. Specifying a line and margin • How do we represent this mathematically? • …in m input dimensions? • (figure: plus-plane, classifier boundary, and minus-plane between the “Predict Class = +1” zone and the “Predict Class = -1” zone)

  19. Specifying a line and margin • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • Classifier boundary: w . x + b = 0 • (figure: the parallel lines wx+b = +1, 0, -1 separating the “Predict Class = +1” and “Predict Class = -1” zones)

  20. Computing the margin width • M = margin width (distance between the plus-plane and the minus-plane) • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • Claim: the vector w is perpendicular to the plus-plane. Why? • How do we compute M in terms of w and b?

  21. Computing the margin width • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • Claim: the vector w is perpendicular to the plus-plane. Why? Let u and v be two vectors on the plus-plane. What is w . (u – v)? It is zero, since w . u + b = w . v + b = +1, so w is perpendicular to every direction lying in the plus-plane. • And so of course the vector w is also perpendicular to the minus-plane. • How do we compute M in terms of w and b?

  22. Computing the margin width • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • The vector w is perpendicular to the plus-plane. • Let x- be any point on the minus-plane (any location in R^m, not necessarily a datapoint). • Let x+ be the closest plus-plane point to x-. • How do we compute M in terms of w and b?

  23. Computing the margin width • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • The vector w is perpendicular to the plus-plane. • Let x- be any point on the minus-plane. • Let x+ be the closest plus-plane point to x-. • Claim: x+ = x- + λw for some value of λ. Why? • How do we compute M in terms of w and b?

  24. Computing the margin width • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • The vector w is perpendicular to the plus-plane. • Let x- be any point on the minus-plane. • Let x+ be the closest plus-plane point to x-. • Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in the direction of w. • How do we compute M in terms of w and b?

  25. Computing the margin width • What we know: • w . x+ + b = +1 • w . x- + b = -1 • x+ = x- + λw • |x+ - x-| = M • It’s now easy to get M in terms of w and b.

  26. Computing the margin width • What we know: • w . x+ + b = +1 • w . x- + b = -1 • x+ = x- + λw • |x+ - x-| = M • So: w . (x- + λw) + b = 1 => w . x- + b + λ w . w = 1 => -1 + λ w . w = 1 => λ = 2 / (w . w)

  27. Computing the margin width • What we know: • w . x+ + b = +1 • w . x- + b = -1 • x+ = x- + λw • λ = 2 / (w . w) • M = |x+ - x-| = |λw| = λ sqrt(w . w) = 2 sqrt(w . w) / (w . w) = 2 / sqrt(w . w)
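
The formula M = 2 / sqrt(w . w) can be checked numerically. The sketch below uses an arbitrary w and b (not from the slides), steps from a point on the minus-plane to the plus-plane along w, and confirms the width.

```python
# Numeric check of M = 2 / ||w||: take any x- on the minus-plane,
# step lambda * w to reach the plus-plane, and measure the distance.
import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector, ||w|| = 5
b = -2.0
x_minus = np.array([0.0, 0.25])   # satisfies w . x + b = -1
lam = 2.0 / (w @ w)               # lambda from  -1 + lambda * w.w = 1
x_plus = x_minus + lam * w        # closest point on the plus-plane
print(w @ x_plus + b)             # -> 1.0, so x_plus lies on the plus-plane
print(np.linalg.norm(x_plus - x_minus), 2 / np.linalg.norm(w))  # both 0.4
```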

  28. Learning the Maximum Margin Classifier • M = margin width = 2 / sqrt(w . w) • Given a guess of w and b we can: • Compute whether all data points are in the correct half-planes • Compute the width of the margin • So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton’s method?

  29. Don’t worry… it’s good for you… • Linear Programming: find w = argmax c . w subject to w . a_i <= b_i for i = 1, …, m and w_j >= 0 for j = 1, …, n • There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar’s algorithm.
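
For concreteness, a small LP of this shape can be solved with scipy.optimize.linprog; linprog minimizes, so the objective is negated to maximize c . w, and all of the numbers below are made up for illustration.

```python
# Toy linear program of the form on this slide, solved with SciPy.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])            # maximize c . w
A = np.array([[1.0, 1.0],           # rows a_i of the constraints w . a_i <= b_i
              [2.0, 0.5]])
b = np.array([1.0, 1.0])

res = linprog(c=-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)  # w_j >= 0
print(res.x, c @ res.x)             # optimal w and the objective value
```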

  30. Learning via Quadratic Programming • QP is a well-studied class of optimization problems: maximize a quadratic function of some real-valued variables subject to linear constraints.

  31. Quadratic Programming • Quadratic criterion: find u = argmax of c + d . u + (1/2) u^T R u • Subject to n additional linear inequality constraints: a_i . u <= b_i for i = 1, …, n • And subject to e additional linear equality constraints: a_i . u = b_i for i = n+1, …, n+e

  32. Quadratic Programming • Quadratic criterion: find u = argmax of c + d . u + (1/2) u^T R u • Subject to n additional linear inequality constraints and e additional linear equality constraints • There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly… you probably don’t want to write one yourself.)
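
A toy instance of such a constrained quadratic optimum is sketched below, written as a minimization (equivalent to maximizing the negated criterion) and solved with SciPy's general-purpose SLSQP solver standing in for a dedicated QP code; the matrix, vector, and constraints are invented for illustration.

```python
# Tiny quadratic program: minimize a quadratic in u subject to one
# linear inequality and one linear equality constraint.
import numpy as np
from scipy.optimize import minimize

R = np.array([[2.0, 0.0],     # positive-definite quadratic term
              [0.0, 1.0]])
d = np.array([-2.0, -3.0])    # linear term

objective = lambda u: 0.5 * u @ R @ u + d @ u
cons = [
    {"type": "ineq", "fun": lambda u: 4.0 - (u[0] + u[1])},  # u1 + u2 <= 4
    {"type": "eq",   "fun": lambda u: u[0] - u[1] - 1.0},    # u1 - u2 = 1
]
res = minimize(objective, x0=np.zeros(2), method="SLSQP", constraints=cons)
print(res.x)                  # constrained optimum, here approximately [2, 1]
```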

  33. Learning the Maximum Margin Classifier • M = 2 / sqrt(w . w) • Given a guess of w and b we can: • Compute whether all data points are in the correct half-planes • Compute the margin width • Assume R datapoints, each (x_k, y_k) where y_k = +/-1 • What should our quadratic optimization criterion be? Minimize w . w • How many constraints will we have? R • What should they be? w . x_k + b >= 1 if y_k = 1; w . x_k + b <= -1 if y_k = -1
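
A sketch of this hard-margin QP on a tiny, made-up separable dataset, again with SciPy's SLSQP as a stand-in for a real QP solver: minimize w . w subject to one constraint per datapoint.

```python
# Hard-margin primal: minimize w.w subject to yk (w . xk + b) >= 1 for all k.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],      # class +1 (toy points)
              [0.0, 0.0], [1.0, 0.0]])     # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                 # v = (w1, w2, b)
    w = v[:2]
    return w @ w                  # minimize w . w  (margin M = 2 / sqrt(w.w))

cons = [{"type": "ineq", "fun": (lambda v, k=k: y[k] * (X[k] @ v[:2] + v[2]) - 1.0)}
        for k in range(len(y))]   # one constraint per datapoint, R in total

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=cons)
w, b = res.x[:2], res.x[2]
print(w, b, 2 / np.linalg.norm(w))   # weight vector, bias, margin width
```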

  34. Uh-oh! • (figure: data, labeled +1 and -1, that is not linearly separable) • This is going to be a problem! What should we do? • Idea 1: Find the minimum w . w while minimizing the number of training set errors. • Problem: two things to minimize makes for an ill-defined optimization.

  35. Uh-oh! • This is going to be a problem! What should we do? • Idea 1.1: Minimize w . w + C (#train errors), where C is a tradeoff parameter. • There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

  36. Uh-oh! • This is going to be a problem! What should we do? • Idea 1.1: Minimize w . w + C (#train errors), where C is a tradeoff parameter. • The problem: this can’t be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn’t distinguish between disastrous errors and near misses.) • So… any other ideas?

  37. Uh-oh! • This is going to be a problem! What should we do? • Idea 2.0: Minimize w . w + C (distance of error points to their correct place)

  38. Learning Maximum Margin with Noise • M = 2 / sqrt(w . w) • Given a guess of w and b we can: • Compute the sum of distances of points to their correct zones • Compute the margin width • Assume R datapoints, each (x_k, y_k) where y_k = +/-1 • What should our quadratic optimization criterion be? • How many constraints will we have? • What should they be?

  39. Large-margin Decision Boundary • The decision boundary should be as far away from the data of both classes as possible • We should maximize the margin, m • The distance between the origin and the line w^T x = k is k / ||w|| • (figure: margin m between Class 1 and Class 2)
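
A quick numeric check of the distance formula on this slide, with an arbitrary w and k: the point of w^T x = k closest to the origin is (k / w . w) w, whose norm is |k| / ||w||.

```python
# Check that the distance from the origin to the line w'x = k is |k| / ||w||.
import numpy as np

w = np.array([3.0, 4.0])            # hypothetical normal vector, ||w|| = 5
k = 10.0
closest = (k / (w @ w)) * w         # projection of the origin onto the line
print(w @ closest)                  # -> 10.0, so the point lies on the line
print(np.linalg.norm(closest), abs(k) / np.linalg.norm(w))   # both 2.0
```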

  40. Finding the Decision Boundary • Let {x1, …, xn} be our data set and let y_i ∈ {1, -1} be the class label of x_i • The decision boundary should classify all points correctly => y_i (w^T x_i + b) >= 1 for all i • The decision boundary can be found by solving the following constrained optimization problem: minimize (1/2) ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i • This is a constrained optimization problem. Solving it requires some new tools. • Feel free to ignore the following several slides; what is important is the constrained optimization problem above.

  41. Back to the Original Problem • The Lagrangian is L = (1/2) w^T w - Σ_i α_i ( y_i (w^T x_i + b) - 1 ), with α_i >= 0 • Note that ||w||^2 = w^T w • Setting the gradient of L w.r.t. w and b to zero, we have w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0

  42. The Karush-Kuhn-Tucker conditions for this problem require α_i >= 0, y_i (w^T x_i + b) - 1 >= 0, and α_i ( y_i (w^T x_i + b) - 1 ) = 0 for every i.

  43. The Dual Problem • If we substitute w = Σ_i α_i y_i x_i into the Lagrangian, we have W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j • Note that Σ_i α_i y_i = 0 • This is a function of α_i only

  44. The Dual Problem • The new objective function is in terms of α_i only • It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w • The original problem is known as the primal problem • The objective function of the dual problem needs to be maximized! • The dual problem is therefore: maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i >= 0 (a property of the α_i when we introduce the Lagrange multipliers) and Σ_i α_i y_i = 0 (the result when we differentiate the original Lagrangian w.r.t. b)

  45. The Dual Problem • This is a quadratic programming (QP) problem • A global maximum of the α_i can always be found • w can be recovered by w = Σ_i α_i y_i x_i
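
A sketch of solving this dual QP on the same toy data used in the primal sketch above, then recovering w = Σ α_i y_i x_i and b from a support vector; SciPy's SLSQP again stands in for a dedicated QP solver.

```python
# Dual QP: maximize sum(a) - 0.5 a' Q a with Q_ij = yi yj xi.xj,
# subject to a_i >= 0 and sum(a_i y_i) = 0, then recover w and b.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = yi yj xi . xj

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize the negated dual
cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i a_i y_i = 0
bounds = [(0, None)] * len(y)                    # a_i >= 0
res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=bounds, constraints=cons)
alpha = res.x
w = (alpha * y) @ X                              # w = sum_i a_i y_i x_i
sv = np.argmax(alpha)                            # any support vector
b = y[sv] - w @ X[sv]                            # from y_sv (w . x_sv + b) = 1
print(alpha, w, b)
```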

  46. Characteristics of the Solution • Many of the α_i are zero • w is a linear combination of a small number of data points • This “sparse” representation can be viewed as data compression, as in the construction of a k-NN classifier • The x_i with non-zero α_i are called support vectors (SVs) • The decision boundary is determined only by the SVs • Let t_j (j = 1, …, s) be the indices of the s support vectors. We can write w = Σ_j α_{t_j} y_{t_j} x_{t_j} • For testing with a new data point z: compute w^T z + b = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise • Note: w need not be formed explicitly
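
A small sketch of this test rule: the α and b below are hand-checked values for the same toy points as the dual sketch above (only two α are nonzero), and a new point z is classified from the support vectors alone, without forming w.

```python
# Classify a new point z using only the support vectors (nonzero alpha).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.4, 0.0, 0.0, 0.4])     # multipliers; most are zero
b = -1.4                                   # bias for this toy solution

def classify(z):
    sv = alpha > 1e-6                      # only the support vectors contribute
    score = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b
    return 1 if score > 0 else 2           # class 1 if positive, class 2 otherwise

print(classify(np.array([2.5, 2.0])), classify(np.array([0.5, 0.2])))  # -> 1 2
```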

  47. A Geometrical Interpretation • (figure: Class 1 and Class 2 points with their Lagrange multipliers; most points have α = 0, while the support vectors have α_1 = 0.8, α_6 = 1.4, and α_8 = 0.6)

  48. Non-linearly Separable Problems • We allow an “error” ξ_i in classification; it is based on the output of the discriminant function w^T x + b • ξ_i approximates the number of misclassified samples • (figure: Class 1 and Class 2 points that cannot be separated without error)

  49. Learning Maximum Margin with Noise • M = 2 / sqrt(w . w) • Given a guess of w and b we can: • Compute the sum of distances of points to their correct zones • Compute the margin width • Assume R datapoints, each (x_k, y_k) where y_k = +/-1 • (figure: points on the wrong side of their margin line, with slack distances ε_2, ε_7, ε_11) • What should our quadratic optimization criterion be? Minimize w . w + C Σ_k ε_k • How many constraints will we have? R • What should they be? w . x_k + b >= 1 - ε_k if y_k = 1; w . x_k + b <= -1 + ε_k if y_k = -1

  50. Learning Maximum Margin with Noise • m = # input dimensions; R = # records • M = 2 / sqrt(w . w) • Given a guess of w and b we can: • Compute the sum of distances of points to their correct zones • Compute the margin width • Assume R datapoints, each (x_k, y_k) where y_k = +/-1 • Our original (noiseless data) QP had m + 1 variables: w_1, w_2, …, w_m, and b. Our new (noisy data) QP has m + 1 + R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R • What should our quadratic optimization criterion be? Minimize w . w + C Σ_k ε_k • How many constraints will we have? R • What should they be? w . x_k + b >= 1 - ε_k if y_k = 1; w . x_k + b <= -1 + ε_k if y_k = -1
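
For comparison with a production implementation, the sketch below fits scikit-learn's SVC with a linear kernel on synthetic data and shows how the tradeoff parameter C changes the margin width and the number of support vectors; the data and C values are arbitrary.

```python
# Soft-margin SVM: C trades margin width against the slack penalties above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.8, size=(20, 2)),   # class +1
               rng.normal(loc=[0, 0], scale=0.8, size=(20, 2))])  # class -1
y = np.array([1] * 20 + [-1] * 20)

for C in (0.1, 10.0):                       # small C: wider margin, more slack
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(C, 2 / np.linalg.norm(w), len(clf.support_))  # margin width, #SVs
```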
