
Support Vector Machines



  1. Support Vector Machines • Mark Stamp SVM

  2. Supervised vs Unsupervised SVM • Often use supervised learning • That is, training relies on labeled data • Training data must be pre-processed • In contrast, unsupervised learning… • …uses unlabeled data • No pre-processing required for training • Also semi-supervised algorithms • Supervised, but not too much…

  3. HMM for Supervised Learning SVM • Suppose we want to use HMM for malware detection • Train model on set of malware • All from a particular family • Labeled as malware of that type • Test to see how well it distinguishes malware from benign • An example of supervised learning

  4. Semi-Supervised Learning SVM • Recall HMM for English text example • Using N = 2, we find hidden states correspond to… • Consonants and vowels • We did not specify consonants/vowels • HMM extracted this info from raw data • Semi-supervised learning? • It seems to depend on your definitions

  5. Unsupervised Learning SVM • Clustering • Good example of unsupervised learning • The only example? • For mixed dataset, goal of clustering is to reveal hidden structure • No pre-processing • Often no idea how to pre-process • Usually used in “data exploration” mode

  6. Supervised Learning SVM • English text example • Preprocess to mark consonants and vowels • Then train on this labeled data • SVM is one of the most popular supervised learning methods • We consider binary classification • I.e., 2 classes, such as consonant vs vowel • Other examples of binary classification?
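
To make the binary classification setup concrete, here is a minimal, hedged sketch; the choice of scikit-learn and the toy points and labels below are illustrative assumptions, not part of the slides.

    # Sketch: a binary SVM classifier on made-up 2-d data (assumes scikit-learn is installed)
    import numpy as np
    from sklearn.svm import SVC

    # Toy data: class +1 ("red") vs class -1 ("blue") -- points invented for illustration
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # red
                  [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])  # blue
    z = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel='linear')   # linear kernel: separate with a hyperplane in input space
    clf.fit(X, z)                # supervised training on labeled data
    print(clf.predict([[2.8, 2.9], [0.2, 0.4]]))   # expect roughly [ 1 -1 ]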

  7. Support Vector Machine SVM • SVM is based on 3 big ideas • 1) Maximize the “margin” • Max separation between classes • 2) Work in a higher dimensional space • More “room”, so easier to separate • 3) Kernel trick • This is intimately related to idea 2 • Both ideas 1 and 2 are fairly intuitive

  8. Separating Classes SVM • Consider labeled data • Binary classifier • Denote red class as 1 • And blue is class -1 • Easy to see separation • How to separate? • We’ll use a “hyperplane”… • …which is a line in 2-d

  9. Separating Hyperplanes SVM • Consider labeled data • Easy to separate • Draw a hyperplane to separate points • Classify new data based on separating hyperplane • But which hyperplane is better? Best? • And why?

  10. Maximize Margin SVM • Margin is min distance to misclassifications • Maximize the margin • So, yellow hyperplane is better than purple • Seems like a good idea • But, not always so easy • See next slide…

  11. Separating… NOT SVM • What about this case? • Yellow line not an option • Why not? • No longer “separating” • What to do? • Allow for some errors • Hyperplane need not completely separate

  12. Soft Margin SVM • Ideally, large margin and no errors • But allowing some misclassifications might increase the margin by a lot • Relax “separating” requirement • How many errors to allow? • That’ll be a user defined parameter • Tradeoff errors vs larger margin • In practice, find “best” by trial and error
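
The “how many errors to allow” parameter is usually called C in SVM libraries: a small C tolerates more misclassifications in exchange for a wider margin, while a large C penalizes errors heavily. A hedged sketch of that tradeoff, assuming scikit-learn and randomly generated, overlapping toy data:

    # Sketch: the soft-margin tradeoff via the C parameter (data and C values are arbitrary)
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    z = np.array([-1] * 50 + [1] * 50)              # overlapping classes: not separable

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='linear', C=C).fit(X, z)
        w = clf.coef_[0]
        margin = 2.0 / np.linalg.norm(w)            # m = 2 / ||W||, as in the later slides
        errors = int(np.sum(clf.predict(X) != z))   # training misclassifications
        print(f"C={C}: margin={margin:.3f}, training errors={errors}")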

  13. Feature Space SVM • Transform data to “feature space” • Feature space is in higher dimension • But what about curse of dimensionality? • Q: Why increase dimensionality??? • A: Easier to separate in feature space • Goal is to make data “linearly separable” • Want to separate classes with hyperplane • But not pay a price for high dimensionality

  14. Higher Dimensional Space SVM [Figure: map ϕ from input space to feature space (pretend it’s in a higher dimension)] • Why transform to “higher” dimension? • One advantage is that nonlinear separation can become linear

  15. Cool Picture SVM • A better example of what can happen by transforming to a higher dimension

  16. Feature Space SVM • Usually, higher dimension is bad news • From computational complexity POV • The so-called “curse of dimensionality” • But higher dimension feature space can make data linearly separable • Can we have our cake and eat it too? • Linearly separable and easy to compute • Yes, thanks to the kernel trick

  17. Kernel Trick SVM • Enables us to work in input space • With results mapped to feature space • No work done explicitly in feature space • Computations in input space • Lower dimension, so computation easier • Results actually in feature space • Higher dimension, so easier to separate • Very cool trick!
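
The kernel trick can be checked numerically: a kernel evaluated on points in the input space equals a dot product of explicitly mapped vectors in the higher-dimensional feature space. A sketch for the degree-2 polynomial kernel K(X,Y) = (X·Y + 1)²; the explicit map ϕ below is one standard choice, written out only for illustration (the whole point of the trick is that it is never computed in practice):

    # Sketch: kernel trick check for K(x, y) = (x . y + 1)^2 with 2-d input
    import numpy as np

    def kernel(x, y):
        # evaluated entirely in the 2-d input space
        return (np.dot(x, y) + 1) ** 2

    def phi(x):
        # explicit map into a 6-d feature space (one standard choice)
        x1, x2 = x
        s = np.sqrt(2)
        return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(kernel(x, y), np.dot(phi(x), phi(y)))   # both print 4.0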

  18. Kernel Trick SVM • Unfortunately, to understand kernel trick, must dig a little (lot?) deeper • Also makes other aspects clearer • We won’t cover every detail here • Just enough to get idea across • Well, maybe a little more than that… • We need Lagrange multipliers • But first, constrained optimization

  19. Constrained Optimization SVM [Figure: graph of f(x) = 4 – x² with the constraint line x = 1] • “No brainer” example • Maximize: f(x) = 4 – x² subject to x – 1 = 0 • Solution? • Max is at x = 1 • Max value is f(1) = 3 • Consider more general case next…

  20. Lagrange Multipliers SVM • Maximize f(x,y) subject to g(x,y) = c • Define the Lagrangian L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • “Stationary points” of L are possible solutions to original problem • All solutions must be stationary points • Not all stationary points are solutions • Generalize: More variables/constraints

  21. Lagrangian SVM • Consider L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • If g(x,y) = c, then constraint is satisfied and L(x,y,λ) = f(x,y) • Want to maximize over such (x,y) • If g(x,y) ≠ c then λ can “pull down” • Desired solution: max min L(x,y,λ) • Where max over (x,y) and min over λ

  22. Lagrangian SVM • The Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • And max min L(x,y,λ) solves original constrained optimization problem • Therefore, solution to original problem is a saddle point of L(x,y,λ) • That is, max in one direction and min in the other • Example on next slide

  23. Saddle Points SVM • Graph of L(x,λ) = 4 – x² + λ(x – 1) • “No brainer” example from previous slide
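
The stationary point (and the constrained maximum) for this “no brainer” example can be checked symbolically; a quick sketch using sympy, which is my own tool choice:

    # Sketch: stationary point of L(x, lambda) = 4 - x^2 + lambda*(x - 1)
    import sympy as sp

    x, lam = sp.symbols('x lam')
    L = 4 - x**2 + lam * (x - 1)
    sol = sp.solve([sp.diff(L, x), sp.diff(L, lam)], [x, lam])
    print(sol)           # {x: 1, lam: 2}
    print(L.subs(sol))   # 3, i.e., the constrained maximum f(1) = 4 - 1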

  24. Stationary Points SVM • Has nothing to do with fancy paper • That’s stationery, not stationary… • Stationary point means partial derivatives are all 0, that is dL/dx = 0 and dL/dy = 0 and dL/dλ = 0 • As mentioned, this generalizes to… • More variables in functions f and g • More constraints: Σ λi (gi(x,y) – ci)

  25. Another Example SVM • Lots of good geometric examples • We look at something different • Consider discrete probability distribution on n points: p1,p2,p3,…,pn • What distribution has max entropy? • Maximize entropy function • Subject to constraint that pj form a probability distribution

  26. Maximize Entropy SVM • Shannon entropy: –Σ pj log2 pj • Have a probability distribution, so… • Require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1 • We will solve this problem: • Maximize f(p1,…,pn) = –Σ pj log2 pj • Subject to constraint Σ pj = 1 • How should we solve this? • Do you really have to ask?

  27. Entropy Example SVM • Recall L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Problem statement • Maximize f(p1,…,pn) = –Σ pj log2 pj • Subject to constraint Σ pj = 1 • In this case, Lagrangian is L(p1,…,pn,λ) = –Σ pj log2 pj + λ(Σ pj – 1) • Compute partial derivatives wrt each pj and partial derivative wrt λ

  28. Entropy Example SVM • Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ(Σ pj – 1) • Partial derivative wrt any pj yields –log2 pj – 1/ln(2) + λ = 0 (#) • And wrt λ yields the constraint Σ pj – 1 = 0, or Σ pj = 1 (##) • Equation (#) implies all pj are equal • With equation (##), all pj = 1/n • Conclusion?
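
The conclusion (the uniform distribution maximizes entropy) can also be double-checked numerically; a hedged sketch using scipy’s generic constrained optimizer on a hypothetical n = 4 case:

    # Sketch: maximize -sum(p_j log2 p_j) subject to sum(p_j) = 1, with n = 4
    import numpy as np
    from scipy.optimize import minimize

    n = 4
    def neg_entropy(p):
        return np.sum(p * np.log2(p))   # minimizing this maximizes the entropy

    res = minimize(neg_entropy,
                   x0=np.array([0.7, 0.1, 0.1, 0.1]),
                   method='SLSQP',
                   bounds=[(1e-9, 1.0)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda p: np.sum(p) - 1}])
    print(res.x)   # approximately [0.25 0.25 0.25 0.25], i.e., p_j = 1/n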

  29. Notation SVM • Let x=(x1,x2,…,xn) and λ=(λ1,λ2,…,λm) • Then we write Lagrangian as L(x,λ) = f(x) + Σλi (gi(x) – ci) • Note: L is a function of n+m variables • Can view the problem as follows • The gi functions define a feasible region • Maximize f over this feasible region

  30. Lagrangian Duality SVM • For Lagrange multipliers… • Primal problem: max min L(x,y,λ) • Where max over (x,y) and min over λ • Dual problem: min max L(x,y,λ) • As above, max over (x,y) and min over λ • In general, min max F(x,y,λ) ≥ max min F(x,y,λ) • But for L(x,y,λ), equality holds

  31. Yet Another Example SVM • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • [Figure: graph of f(x,y)]

  32. Intersection SVM • [Figure: intersection of f(x,y) with the constraint 2x – y = 4] • What is the solution to the problem?

  33. Primal Problem SVM • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Compute partial derivatives… dL/dx = -2x + 2λ = 0 dL/dy = -2y – λ = 0 dL/dλ = 2x – y – 4 = 0 • Result: (x,y,λ) = (8/5, -4/5, 8/5) • And f(x,y) = 64/5

  34. Dual Problem SVM • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Recall that dual problem is min max L(x,y,λ) • Where max is over (x,y), min is over λ • How can we solve this?

  35. Dual Problem SVM • Dual problem: min max L(x,y,λ) • So, can first take max of L over (x,y) • Then we are left with function L only in λ • To solve problem, find min L(λ) • On next slide, we illustrate this for L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Same example as primal problem above

  36. Dual Problem SVM • Given L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Maximize over (x,y) by computing dL/dx = -2x + 2λ = 0 dL/dy = -2y – λ = 0 • Which implies x = λ and y = -λ/2 • Substitute these into L to obtain L(λ) = 5/4 λ² – 4λ + 16

  37. Dual Problem SVM • Original problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Solution can be found by minimizing L(λ) = 5/4 λ² – 4λ + 16 • Then L’(λ) = 5/2 λ – 4 = 0, which gives λ = 8/5 and (x,y) = (8/5, -4/5) • Same solution as the primal problem
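
Both routes, primal and dual, can be verified symbolically for this example; a sketch with sympy (again, the tool is my own choice):

    # Sketch: primal vs dual for f(x,y) = 16 - (x^2 + y^2) subject to 2x - y = 4
    import sympy as sp

    x, y, lam = sp.symbols('x y lam')
    L = 16 - (x**2 + y**2) + lam * (2*x - y - 4)

    # Primal: stationary point of L in all three variables
    primal = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam])
    print(primal)                                 # {x: 8/5, y: -4/5, lam: 8/5}

    # Dual: maximize over (x, y) first, then minimize the resulting L(lam)
    xy = sp.solve([sp.diff(L, x), sp.diff(L, y)], [x, y])   # x = lam, y = -lam/2
    L_lam = sp.expand(L.subs(xy))                 # 5*lam**2/4 - 4*lam + 16
    lam_star = sp.solve(sp.diff(L_lam, lam), lam)[0]
    print(lam_star, L_lam.subs(lam, lam_star))    # 8/5 and 64/5, matching the primal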

  38. Dual Problem SVM • Maximize to find (x,y) in terms of λ • Then rewrite L as a function of λ only • Finally, minimize L(λ) to solve the problem • But, why all of the fuss? • Dual problem allows us to write the problem in a more user-friendly way • In SVM, we’ll make use of L(λ)

  39. Lagrange Multipliers and SVM SVM • Lagrange multipliers very cool indeed • But what does this have to do with SVM? • Can view (soft) margin computation as constrained optimization problem • In this form, kernel trick will be clear • We can kill 2 birds with 1 stone • Make margin calculation clearer • Make kernel trick perfectly clear

  40. Problem Setup SVM • Let X1,X2,…,Xn be data points • Each Xi = (xi,yi) is a point in the plane • In general, could be higher dimension • Let z1,z2,…,zn be corresponding class labels, where each zi ∈ {-1,1} • Where zi = 1 if classified as “red” type • And zi = -1 if classified as “blue” type • Note that this is a binary classifier

  41. Geometric View SVM • Equation of yellow line w1x + w2y + b = 0 • Equation of red line w1x + w2y + b = 1 • Equation of blue line w1x + w2y + b = -1 • Margin is distance between red and blue

  42. Geometric View SVM • Any red point X=(x,y) must satisfy w1x + w2y + b ≥ 1 • Any blue point X=(x,y) must satisfy w1x + w2y + b ≤ -1 • Want inequalities all true after training

  43. Geometric View SVM • With lines defined… • Given new data point X = (x,y) to classify • “Red” provided that w1x + w2y + b > 0 • “Blue” provided that w1x + w2y + b < 0 • This is the scoring phase
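
The scoring phase is just a sign test on w1x + w2y + b; a tiny sketch where the weights are hypothetical stand-ins for whatever training produced:

    # Sketch: scoring a new point against a trained hyperplane (weights are made up)
    import numpy as np

    W = np.array([1.5, -2.0])   # (w1, w2) from training -- hypothetical values
    b = 0.5

    def classify(point):
        score = np.dot(W, point) + b
        return "red (+1)" if score > 0 else "blue (-1)"

    print(classify(np.array([2.0, 1.0])))   # 1.5*2 - 2*1 + 0.5 = 1.5 > 0, so red
    print(classify(np.array([0.0, 1.0])))   # -2 + 0.5 = -1.5 < 0, so blue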

  44. Geometric View SVM • The real question is... • How to find equation of the yellow line? • Given {Xi} and {zi} • Where Xi is a point in the plane • And zi is its classification • Finding yellow line is the training phase…

  45. Geometric View SVM • Distance from origin to line Ax + By + C = 0 is |C| / sqrt(A² + B²) • Origin to red line: |1-b| / ||W|| where W = (w1,w2) • Origin to blue line: |-1-b| / ||W|| • Margin is m = 2/||W||
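
A quick numeric check of the distances and of m = 2/||W||, with hypothetical weights chosen so that ||W|| = 5:

    # Sketch: origin-to-line distances and the margin 2/||W|| (values are made up)
    import numpy as np

    W = np.array([3.0, 4.0])   # hypothetical (w1, w2); ||W|| = 5
    b = 2.0

    d_red  = abs(1 - b)  / np.linalg.norm(W)   # origin to w1*x + w2*y + b = 1
    d_blue = abs(-1 - b) / np.linalg.norm(W)   # origin to w1*x + w2*y + b = -1
    margin = 2.0 / np.linalg.norm(W)           # distance between the parallel lines
    print(d_red, d_blue, margin)               # 0.2, 0.6, 0.4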

  46. Training Phase SVM • Given {Xi} and {zi}, find largest margin m that classifies all points correctly • That is, find red, blue lines in picture • Recall red line is of the form w1x + w2y + b = 1 • Blue line is of the form w1x + w2y + b = -1 • And maximize margin: m = 2/||W||

  47. Training SVM • Since zi ∈ {-1,1}, correct classification occurs provided zi(w1xi + w2yi + b) ≥ 1 for all i • Training problem to solve: • Maximize: m = 2/||W|| • Subject to constraints: zi(w1xi + w2yi + b) ≥ 1 for i=1,2,…,n • Can we determine W and b ?

  48. Training SVM • The problem on previous slide is equivalent to the following • Minimize: F(W) = ||W||² / 2 = (w1² + w2²) / 2 • Subject to constraints: 1 - zi(w1xi + w2yi + b) ≤ 0 for all i • Should be starting to look familiar…
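
This constrained minimization can be handed to a generic optimizer; a hedged sketch with scipy on a tiny made-up dataset (real SVM libraries use specialized quadratic-programming solvers instead):

    # Sketch: minimize ||W||^2 / 2 subject to z_i*(w1*x_i + w2*y_i + b) >= 1
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.5]])   # toy points
    z = np.array([1, 1, -1, -1])                                     # toy labels

    def objective(v):           # v = (w1, w2, b)
        return 0.5 * (v[0]**2 + v[1]**2)

    cons = [{'type': 'ineq',
             'fun': lambda v, Xi=Xi, zi=zi: zi * (v[0]*Xi[0] + v[1]*Xi[1] + v[2]) - 1}
            for Xi, zi in zip(X, z)]

    res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
    w1, w2, b = res.x
    print(w1, w2, b, 2.0 / np.hypot(w1, w2))   # hyperplane and the resulting margin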

  49. Lagrangian SVM • Pretend inequalities are equalities… L(w1,w2,b,λ) = (w1² + w2²) / 2 + Σ λi (1 - zi(w1xi + w2yi + b)) • Compute dL/dw1 = w1 - Σ λi zi xi = 0 dL/dw2 = w2 - Σ λi zi yi = 0 dL/db = Σ λi zi = 0 dL/dλi = 1 - zi(w1xi + w2yi + b) = 0

  50. Lagrangian SVM • Derivatives yield constraints and W = Σ λi zi Xi and Σ λi zi = 0 • Substituting these into L yields L(λ) = Σ λi – ½ ΣΣ λi λj zi zj Xi·Xj • Where “·” is the dot product: Xi·Xj = xi xj + yi yj • Here, L is only a function of λ • We still have the constraint Σ λi zi = 0 • Note: If we find λi then we know W
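
The note “if we find λi then we know W” can be seen in scikit-learn, where (for a fitted SVC) dual_coef_ holds λi zi for the support vectors; a hedged sketch reusing the toy data from the previous sketch:

    # Sketch: recover W = sum(lambda_i * z_i * X_i) from a fitted linear SVC
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.5]])
    z = np.array([1, 1, -1, -1])

    clf = SVC(kernel='linear', C=1000).fit(X, z)     # large C approximates a hard margin

    # dual_coef_ stores lambda_i * z_i for the support vectors only
    W_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
    print(W_from_dual)         # should match the weights below
    print(clf.coef_[0])        # scikit-learn's W
    print(clf.intercept_[0])   # b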
