
A Reassuring Introduction to Support Vector Machines




1. A Reassuring Introduction to Support Vector Machines • Mark Stamp

2. Supervised vs Unsupervised • Often use supervised learning… • …where training relies on labeled data • Training data must be pre-processed • In contrast, unsupervised learning… • …uses unlabeled data • No pre-processing required for training • Also semi-supervised algorithms • Supervised, but not too much?

3. HMM for Supervised Learning • Suppose we want to use HMM for malware detection • Train model on set of malware • All from one specific family • Data labeled as malware of that type • Test to see how well it distinguishes malware from benign • This is supervised learning

4. Unsupervised Learning? • Recall HMM for English text example • Using N = 2, we find hidden states correspond to consonants and vowels • We did not specify consonants/vowels • HMM extracted this info from raw data • Unsupervised or semi-supervised? • It seems to depend on your definition

5. Unsupervised Learning • Clustering • Good example of unsupervised learning • Other examples? • For “mixed” dataset, often the goal of clustering is to reveal structure • No pre-processing • Often no idea how to pre-process • Usually used in “data exploration” mode

6. Supervised Learning • SVM is one of the most popular supervised learning methods • Also, HMM, PHMM, PCA, ANN, etc., used for supervised learning • SVM is for binary classification • I.e., 2 classes, such as malware vs benign • SVM generalizes to multiple classes • As does LDA and some other techniques
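
As a concrete illustration of the SVM as a supervised, binary classifier, here is a minimal sketch (not part of the original slides) using scikit-learn's SVC; the tiny feature matrix and labels are made up.

```python
# Minimal sketch (not from the slides): a linear SVM as a supervised binary classifier,
# using scikit-learn's SVC. The tiny feature matrix and labels below are made up.
import numpy as np
from sklearn.svm import SVC

# Two features per sample; labels z are the two classes (1 vs -1, e.g., malware vs benign)
X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],   # class  1
              [7.0, 1.0], [8.0, 2.0], [7.5, 0.5]])  # class -1
z = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear")   # plain separating hyperplane, no kernel transform
clf.fit(X, z)                # supervised: training uses the labels z

# SVM directly yields class labels (not scores that need a threshold)
print(clf.predict([[3.0, 3.0], [8.0, 1.0]]))   # [ 1 -1]
```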

7. Support Vector Machine • According to another author… • “SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet” • We have something to say about each aspect of this… • Geometry, math, theory, and algorithms

8. Support Vector Machine • SVM based on four BIG ideas • 1. Separating hyperplane • 2. Maximize the “margin” • Maximize minimum separation between classes • 3. Work in a higher dimensional space • More “room”, so easier to separate • 4. Kernel trick • This is intimately related to 3 • Both 1 and 2 are fairly intuitive

9. SVM • SVMs can apply to any training data • Note that SVM yields classification… • … not a score, per se • With HMM, for example • We first train a model… • …then generate scores and set threshold • SVM directly gives classification • Skip the intermediate (testing) step

10. Separating Classes • Consider labeled data • Binary classifier • Red class is type “1” • Blue class is “-1” • And (x,y) are features • How to separate? • We’ll use a “hyperplane”… • …a line in this case

11. Separating Hyperplanes • Consider labeled data • Here, easy to separate • Draw a hyperplane to separate points • Classify new data based on separating hyperplane • Which hyperplane is better? Or best? Why?
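
A sketch of the "classify new data based on the separating hyperplane" step, with made-up values for the hyperplane parameters w and b: a point is labeled by which side of the hyperplane it falls on.

```python
# Sketch of classification with a separating hyperplane (values are made up):
# a new point is labeled by which side of the hyperplane w.x + b = 0 it falls on.
import numpy as np

w = np.array([1.0, -2.0])   # hypothetical normal vector of the hyperplane
b = 3.0                     # hypothetical offset

def classify(x):
    # sign of w.x + b: +1 on one side of the hyperplane, -1 on the other
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([4.0, 1.0])))   #  1, since 4 - 2 + 3 = 5 > 0
print(classify(np.array([0.0, 5.0])))   # -1, since 0 - 10 + 3 = -7 < 0
```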

12. Maximize Margin • Margin is min distance to misclassification • Maximize the margin • Yellow hyperplane is better than purple • Seems like a good idea • But, may not be possible • See next slide…

13. Separating… NOT • What about this case? • Yellow line not an option • Why not? • No longer “separating” • What to do? • Allow for some errors? • E.g., hyperplane need not completely separate

14. Soft Margin • Ideally, large margin and no errors • But allowing some misclassifications might increase the margin by a lot • I.e., relax “separating” requirement • How many errors to allow? • Let it be a user-defined parameter • Tradeoff? Errors vs larger margin • In practice, can use trial and error
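
A sketch of the soft-margin tradeoff in scikit-learn's SVC (not from the slides): the user-defined parameter there is C, where a small C tolerates more margin violations (larger margin) and a large C penalizes errors heavily (closer to a "hard" margin). The data below is synthetic, with a few labels flipped so that perfect separation is impossible.

```python
# Sketch: the soft-margin tradeoff via the C parameter of scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
z = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
z[:5] = -z[:5]                      # flip a few labels to force some unavoidable errors

# Trial and error over C, as the slide suggests
for C in (0.01, 0.1, 1.0, 10.0):
    acc = cross_val_score(SVC(kernel="linear", C=C), X, z, cv=5).mean()
    print(f"C = {C:>5}: cross-validated accuracy = {acc:.2f}")
```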

15. Feature Space • Transform data to “feature space” • Feature space in higher dimension • But what about curse of dimensionality? • Q: Why increase dimensionality??? • A: Easier to separate in feature space • Goal is to make data “linearly separable” • Want to separate classes with hyperplane • But not pay a price for high dimensionality

16. Input Space & Feature Space • Why transform? • Sometimes nonlinear can become linear… • [Figure: the map ϕ from input space to feature space]

17. Feature Space in Higher Dimension • An example of what can happen when transforming to a higher dimension
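
A small sketch of this idea (standard example, not taken from the slides): points inside vs. outside a circle cannot be separated by a line in the 2-d input space, but after the map ϕ(x1, x2) = (x1, x2, x1² + x2²) they are separated by a plane in the 3-d feature space.

```python
# Nonlinear in input space becomes linear in feature space: class is determined by
# radius, and the map phi appends the squared radius as a third coordinate.
import numpy as np

rng = np.random.RandomState(1)
X = rng.uniform(-2, 2, size=(200, 2))
z = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)   # inside vs outside the unit circle

def phi(X):
    # append the squared radius as a third feature
    return np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])

F = phi(X)
pred = np.where(F[:, 2] < 1.0, 1, -1)   # a single hyperplane (third coordinate = 1) in feature space
print((pred == z).all())                # True: linearly separable after the transform
```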

18. Feature Space • Usually, higher dimension is worse • From computational complexity POV… • ...and from statistical significance POV • But higher dimensional feature space can make data linearly separable • Can we have our cake and eat it too? • Linearly separable and easy to compute? • Yes! Thanks to the kernel trick

19. Kernel Trick • Enables us to work in input space • With results mapped to feature space • No work done explicitly in feature space • Computations in input space • Lower dimension, so computation easier • But, things “happen” in feature space • Higher dimension, so easier to separate • Very, very cool trick!
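
A small sketch of the trick itself (a standard example, not taken from the slides): for 2-d inputs, the polynomial kernel k(x, y) = (x · y)², computed entirely in input space, equals the ordinary dot product in a 3-d feature space with ϕ(x) = (x1², √2·x1·x2, x2²), so ϕ never has to be computed explicitly.

```python
# Kernel trick sketch: kernel evaluated in input space equals the dot product in feature space.
import numpy as np

def k(x, y):
    return np.dot(x, y) ** 2          # all work done in the (low-dimensional) input space

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k(x, y))                        # 1.0
print(np.dot(phi(x), phi(y)))         # 1.0, the same value, via the feature space
```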

20. Kernel Trick • Unfortunately, to understand kernel trick, must dig a little (a lot?) deeper • Makes all aspects of SVM clearer • We won’t cover every detail here • Just enough to get idea across • Well, maybe a little more than that… • We’ll need Lagrange multipliers • But first, constrained optimization

21. Constrained Optimization • General problem (in 2 variables) • Maximize: f(x,y) • Subject to: g(x,y) = c • Objective function f and constraint g • For example, • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • We’ll look at this example in detail

22. Specific Example • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • [Figure: graph of f(x,y)]
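
A numeric sanity check of this example (not on the slides), using scipy: maximize f by minimizing −f with an equality constraint. The numbers agree with the solution worked out on the later slides.

```python
# Maximize f(x,y) = 16 - (x^2 + y^2) subject to 2x - y = 4, numerically.
import numpy as np
from scipy.optimize import minimize

f = lambda v: 16 - (v[0]**2 + v[1]**2)
constraint = {"type": "eq", "fun": lambda v: 2*v[0] - v[1] - 4}

res = minimize(lambda v: -f(v), x0=np.zeros(2), constraints=[constraint])
print(res.x, f(res.x))   # approximately [1.6, -0.8] and 12.8, i.e., (8/5, -4/5) and 64/5
```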

23. Intersection • Intersection of f(x,y) and the constraint 2x – y = 4 • What is the solution to the problem?

24. Constrained Optimization • This example looks easy • But how to solve in general? • Recall, general case (in 2 variables) is • Maximize: f(x,y) • Subject to: g(x,y) = c • How to “simplify”? • Combine objective function f(x,y) and constraint g(x,y) = c into one equation!

25. Proposed Solution • Define J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • Recall the general problem… • Maximize: f(x,y) • Subject to: g(x,y) = c • Solution is given by max J(x,y) • Here, max is over (x,y)

26. Proposed Solution • We know how to solve maximization problems using calculus • So, we’ll use calculus to solve the problem max J(x,y), right? • WRONG! • The function J(x,y) is not at all “nice” • This function is not differentiable • It’s not even continuous!

27. Proposed Solution • Again, let J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • Then max J(x,y) is solution to problem • This is good • But we can’t solve this max problem • This is very bad • What to do???

28. New-and-Improved Solution • Let’s replace I(x,y) with a nice function • What are the nicest functions of all? • Linear functions (in the constraint) • To maximize f(x,y), subject to g(x,y) = c, we first define the Lagrangian L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Nice function in λ, so calculus applies • But, not just a max problem (next slide…)

29. New-and-Improved Solution • Maximize: f(x,y), subject to: g(x,y) = c • Again, the Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Observe that min L(x,y,λ) = J(x,y) • Where min is over λ • Recall that max J(x,y) solves problem • So max min L(x,y,λ) also solves problem • Advantage of this form of the problem?

30. Lagrange Multipliers • Maximize: f(x,y), subject to: g(x,y) = c • Lagrangian: L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Solution given by max min L(x,y,λ) • Note this is max wrt the (x,y) variables… • ...and min is wrt the λ parameter • So, solution is at a “saddle point” of the overall function of the (x,y,λ) variables • By definition of a saddle point

31. Saddle Points • Graph of L(x,λ) = 4 – x² + λ(x – 1) • Note, f(x) = 4 – x² and constraint is x = 1

32. New-and-Improved Solution • Maximize: f(x,y), subject to: g(x,y) = c • Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Solved by max min L(x,y,λ) • Calculus to the rescue! Set the partial derivatives ∂L/∂x, ∂L/∂y, ∂L/∂λ to 0 • Note that ∂L/∂λ = 0 implies g(x,y) = c • Constrained optimization converted to unconstrained optimization

33. More, More, More • Lagrangian generalizes to more variables and/or more constraints • Or, more succinctly, L(x,λ) = f(x) + Σ λi (gi(x) – ci) • Where x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm)

34. Another Example • Lots of good geometric examples • First, we do a non-geometric example • Consider discrete probability distribution on n points: p1,p2,p3,…,pn • What distribution has max entropy? • We want to maximize entropy function • Subject to constraint that the pj form a probability distribution

35. Maximize Entropy • Shannon entropy: –Σ pj log2 pj • Have a probability distribution, so… • Require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1 • We will solve this simplified problem: • Maximize: f(p1,..,pn) = –Σ pj log2 pj • Subject to constraint: Σ pj = 1 • How should we solve this? • Do you really have to ask?

36. Entropy Example • Recall L(x,y,λ) = f(x,y) + λ (g(x,y) – c) • Problem statement • Maximize f(p1,..,pn) = –Σ pj log2 pj • Subject to constraint Σ pj = 1 • In this case, Lagrangian is L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1) • Compute partial derivatives wrt each pj and the partial derivative wrt λ

37. Entropy Example • Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1) • Partial derivative wrt any pj yields –log2 pj – 1/ln(2) + λ = 0 (#) • And wrt λ yields the constraint Σ pj – 1 = 0, or Σ pj = 1 (##) • Equation (#) implies all pj are equal • With equation (##), all pj = 1/n • Conclusion?
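
The conclusion is that the uniform distribution, pj = 1/n, maximizes the entropy. A quick numeric check (not from the slides), comparing the uniform distribution against many randomly drawn distributions on the simplex:

```python
# Numeric check: among probability distributions on n points, the uniform distribution
# p_j = 1/n maximizes the Shannon entropy -sum p_j log2 p_j.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 5
rng = np.random.RandomState(0)
random_dists = rng.dirichlet(np.ones(n), size=10000)   # random points on the simplex

print(entropy(np.full(n, 1.0 / n)))                    # log2(5), about 2.32
print(max(entropy(p) for p in random_dists))           # strictly smaller
```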

38. Notation • Let x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm) • Again, we write the Lagrangian as L(x,λ) = f(x) + Σ λi (gi(x) – ci) • Note: L is a function of n+m variables • Can view the problem as… • Constraints gi define a feasible region • Maximize the objective function f over this feasible region

39. Lagrangian Duality • For Lagrange multipliers… • Primal problem: max min L(x,y,λ) • Where max is over (x,y) and min over λ • Dual problem: min max L(x,y,λ) • As above, max over (x,y) and min over λ, but with the order of operations reversed • We claim it’s easy to see that min max L(x,y,λ) ≥ max min L(x,y,λ) • Why is this true? Next slide...

40. Dual Problem • Recall J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • And max J(x,y) is a solution • Then L(x,y,λ) ≥ J(x,y) • And max L(x,y,λ) ≥ max J(x,y) for all λ • Therefore, min max L(x,y,λ) ≥ max J(x,y) • So, min max L(x,y,λ) ≥ max min L(x,y,λ)

41. Dual Problem • So, we have shown that the dual problem provides an upper bound • min max L(x,y,λ) ≥ max min L(x,y,λ) • That is, dual solution ≥ primal solution • But it’s even better than that • For the Lagrangian, equality holds • Why equality? • Because the Lagrangian is sufficiently “nice” (linear in λ, and concave in (x,y) for problems like ours), so there is no duality gap

42. Primal Problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Compute partial derivatives… • ∂L/∂x = -2x + 2λ = 0 • ∂L/∂y = -2y – λ = 0 • ∂L/∂λ = 2x – y – 4 = 0 • Result: (x,y,λ) = (8/5, -4/5, 8/5) • Which yields max of f(x,y) = 64/5
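
A symbolic check of this primal calculation (a sketch using sympy, not part of the slides): set all three partial derivatives of the Lagrangian to zero and solve.

```python
# Solve dL/dx = dL/dy = dL/dlambda = 0 for the example above.
import sympy as sp

x, y, lam = sp.symbols("x y lambda")
L = 16 - (x**2 + y**2) + lam * (2*x - y - 4)

sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)[0]
print(sol)                               # {x: 8/5, y: -4/5, lambda: 8/5}
print((16 - (x**2 + y**2)).subs(sol))    # 64/5
```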

43. Dual Problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Recall that dual problem is min max L(x,y,λ) • Where max is over (x,y), min is over λ • How can we solve this?

44. Dual Problem • Dual problem: min max L(x,y,λ) • So, can first take max of L over (x,y) • Then we are left with function L only in λ • To solve problem, then find min L(λ) • On next slide, we illustrate this for L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Same example as considered above

45. Dual Problem • Given L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Maximize over (x,y) by computing • ∂L/∂x = -2x + 2λ = 0 • ∂L/∂y = -2y – λ = 0 • Which implies x = λ and y = -λ/2 • Substitute these into L to obtain L(λ) = (5/4)λ² – 4λ + 16

46. Dual Problem • Original problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Solution can be found by minimizing L(λ) = (5/4)λ² – 4λ + 16 • Then L’(λ) = (5/2)λ – 4 = 0, which gives λ = 8/5 and (x,y) = (8/5, -4/5) • Same solution as the primal problem!
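
A symbolic check of the dual calculation (sympy sketch, not part of the slides): maximize L over (x,y), leaving a function of λ alone, then minimize that; the optimum matches the primal, as claimed.

```python
# Dual route: eliminate (x,y), then minimize the remaining function of lambda.
import sympy as sp

x, y, lam = sp.symbols("x y lambda")
L = 16 - (x**2 + y**2) + lam * (2*x - y - 4)

inner = sp.solve([sp.diff(L, x), sp.diff(L, y)], (x, y))   # x = lambda, y = -lambda/2
L_lam = sp.expand(L.subs(inner))                           # 5*lambda**2/4 - 4*lambda + 16
lam_star = sp.solve(sp.diff(L_lam, lam), lam)[0]           # 8/5
print(L_lam)
print(lam_star, L_lam.subs(lam, lam_star))                 # 8/5 and 64/5
```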

47. Summary of Dual Problem • Maximize L to find (x,y) in terms of λ • Then rewrite L as function of λ only • Finally, minimize L(λ) to solve problem • But, why all of the fuss? • Dual problem allows us to write the problem in much more user-friendly way • For SVM, we’ll work with this dual form of the Lagrangian

48. Lagrange Multipliers and SVM • Lagrange multipliers very cool indeed • But what does this have to do with SVM? • Can view (soft) margin computation as constrained optimization problem • In this form, kernel trick becomes clear • We can kill 2 birds with 1 stone • Make margin calculation clearer • Make kernel trick perfectly clear

49. Problem Setup • Let X1,X2,…,Xn be data pts (vectors) • Each Xi = (xi,yi) a point in the plane • In general, could be higher dimension • Let z1,z2,…,zn be corresponding class labels, where each zi ∈ {-1,1} • Where zi = 1 if classified as “red” type • And zi = -1 if classified as “blue” type • Note this is a binary classification

50. Geometric View • Equation of yellow (separating) line: w1x + w2y + b = 0 • Equation of red line: w1x + w2y + b = 1 • Equation of blue line: w1x + w2y + b = -1 • Margin m is length of green line • [Figure: the three lines in the (x,y) plane, with the margin m marked]
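
A sketch of this geometry with made-up numbers for w = (w1, w2) and b. With the red and blue lines scaled to w·x + b = +1 and -1 as on the slide, the perpendicular distance between those two lines is 2/||w||, so the distance from the separating line to either of them (the margin m) is 1/||w||.

```python
# Margin from hypothetical hyperplane parameters.
import numpy as np

w = np.array([3.0, 4.0])    # hypothetical weights, ||w|| = 5
b = -2.0                    # hypothetical offset

m = 1.0 / np.linalg.norm(w)
print(m)                    # 0.2

# A point on the red line w.x + b = 1 is indeed at distance m from the separating line
p = np.array([1.0, 0.0])    # 3*1 + 4*0 - 2 = 1
print(abs(np.dot(w, p) + b) / np.linalg.norm(w))   # 0.2
```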
