
What is a Support Vector Machine?
CS 540, University of Wisconsin-Madison, C. R. Dyer



Presentation Transcript


    1. What is a Support Vector Machine?
       - An optimally defined surface
       - Typically nonlinear in the input space
       - Linear in a higher-dimensional space
       - Implicitly defined by a kernel function

    2. What are Support Vector Machines Used For?
       - Classification
       - Regression and data-fitting
       - Supervised and unsupervised learning

    3. Linear Classifiers

    4. Linear Classifiers (aka Linear Discriminant Functions)
       Definition: a linear classifier is a function that is a linear combination of the components of the input x, f(x) = wᵀx + b, where w is the weight vector and b is the bias.
       A two-category classifier then uses the rule: decide class c1 if f(x) > 0 and class c2 if f(x) < 0, i.e., decide c1 if wᵀx > -b and c2 otherwise.
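
A minimal sketch of this decision rule; the weights, bias, and inputs below are arbitrary values chosen only for illustration, not from the slides:

```python
import numpy as np

def linear_classifier(x, w, b):
    """Decide class c1 if w.x + b > 0, else class c2."""
    f = np.dot(w, x) + b
    return "c1" if f > 0 else "c2"

# Illustrative weight vector and bias (not from the slides).
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classifier(np.array([1.0, 1.0]), w, b))   # f = 1.5 > 0 -> c1
print(linear_classifier(np.array([-1.0, 1.0]), w, b))  # f = -2.5 < 0 -> c2
```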

    5. Linear Classifiers

    6. Linear Classifiers

    7. Linear Classifiers

    8. Linear Classifiers

    9. Classifier Margin

    10. Maximum Margin

    11. Maximum Margin

    12. Why Maximum Margin?

    13. Specifying a Line and Margin
       How do we represent this mathematically? … in d input dimensions?
       An example input: x = (x1, …, xd)ᵀ

    14. Specifying a Line and Margin
       Plus-plane  = { x : w · x + b = +1 }
       Minus-plane = { x : w · x + b = -1 }

    15. Computing the Margin
       Plus-plane  = { x : w · x + b = +1 }
       Minus-plane = { x : w · x + b = -1 }
       Claim: the vector w is perpendicular to the plus-plane.

    16. Computing the Margin
       Plus-plane  = { x : w · x + b = +1 }
       Minus-plane = { x : w · x + b = -1 }
       Claim: the vector w is perpendicular to the plus-plane. Why?

    17. Why is w perpendicular to the plus-plane? For any two points u and v on the plus-plane, w · u + b = +1 and w · v + b = +1, so w · (u − v) = 0; w is orthogonal to every direction lying within the plane.

    18. Computing the Margin
       Plus-plane  = { x : w · x + b = +1 }
       Minus-plane = { x : w · x + b = -1 }
       The vector w is perpendicular to the plus-plane.
       Let x- be any point on the minus-plane, and let x+ be the closest plus-plane point to x-.

    19. Computing the Margin
       Plus-plane  = { x : w · x + b = +1 }
       Minus-plane = { x : w · x + b = -1 }
       The vector w is perpendicular to the plus-plane.
       Let x- be any point on the minus-plane, and let x+ be the closest plus-plane point to x-.
       Claim: x+ = x- + λw for some value of λ. Why?

    20. Computing the Margin
       Why does x+ = x- + λw hold for some λ? The segment from x- to the closest point on the plus-plane is perpendicular to both planes, and w is also perpendicular to both planes, so getting from x- to x+ means travelling some distance in the direction of w.

    21. Computing the Margin
       What we know:
       w · x+ + b = +1
       w · x- + b = -1
       x+ = x- + λw
       |x+ - x-| = M
       It's now easy to get M in terms of w and b.

    22. Computing the Margin
       Substituting x+ = x- + λw into w · x+ + b = +1 gives (w · x- + b) + λ (w · w) = +1, i.e., -1 + λ (w · w) = +1, so λ = 2 / (w · w).

    23. Computing the Margin
       Then M = |x+ - x-| = |λw| = λ √(w · w) = 2 / √(w · w); the margin width is M = 2 / ‖w‖.
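
A quick numeric check of this result; the w and b below are arbitrary illustrative values, not from the slides:

```python
import numpy as np

# Arbitrary hyperplane parameters, chosen only for illustration.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

# Margin width from the formula derived above: M = 2 / ||w||
M = 2.0 / np.linalg.norm(w)
print(M)  # 0.4

# Geometric cross-check: take a point on the minus-plane and step lambda*w to the plus-plane.
x_minus = np.array([0.0, (-1.0 - b) / w[1]])   # satisfies w.x + b = -1
lam = 2.0 / np.dot(w, w)
x_plus = x_minus + lam * w                      # lands on the plus-plane
print(np.dot(w, x_plus) + b)                    # +1.0
print(np.linalg.norm(x_plus - x_minus))         # equals M = 0.4
```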

    24. Learning the Maximum Margin Classifier
       Given a guess of w and b we can:
       - Compute whether all data points are in the correct half-planes
       - Compute the width of the margin
       So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?

    25. Learning via Quadratic Programming
       QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.
       Minimize ½ w · w subject to:
       w · x + b ≥ +1 if x is in class 1
       w · x + b ≤ -1 if x is in class 2
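
A small sketch of this hard-margin QP using the cvxpy solver; the library choice and the toy data are assumptions of this example, not part of the slides:

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable 2-D data; y = +1 or -1 marks the two classes.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2) w.w  subject to  y_k (w.x_k + b) >= 1 for every example.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print("margin width:", 2.0 / np.linalg.norm(w.value))
```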

    26. Learning the Maximum Margin Classifier
       Given a guess of w and b we can:
       - Compute whether all data points are in the correct half-planes
       - Compute the margin width
       Assume N examples, each (xk, yk) where yk = ±1

    27. Learning the Maximum Margin Classifier (same setup as slide 26)

    28. Uh-oh! (the data are not linearly separable)

    29. Uh-oh!

    30. Uh-oh!

    31. Uh-oh!

    32. Uh-oh!

    33. Learning Maximum Margin with Noise
       Given a guess of w and b we can:
       - Compute the sum of distances of points to their correct zones
       - Compute the margin width
       Assume N examples, each (xk, yk) where yk = ±1

    34. Learning Maximum Margin with Noise (same setup as slide 33)

    35. Learning Maximum Margin with Noise (same setup)

    36. Learning Maximum Margin with Noise (same setup)

    37. Learning Maximum Margin with Noise (same setup)
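
A sketch of this soft-margin ("with noise") idea using scikit-learn's linear SVM; the parameter C trades margin width against the summed violation distances. The library, toy data, and C values here are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with one noisy +1 point on the wrong side.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [0.4, 0.3],
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1])

# Small C -> wider margin, more tolerated violations; large C -> narrower margin, fewer violations.
for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C}: margin width = {2.0 / np.linalg.norm(w):.3f}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```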

    38. An Equivalent QP

    39. An Equivalent QP

    40. An Equivalent QP
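
The transcript does not include the body of these slides; for reference, the standard equivalent (dual) QP usually presented at this point is:

       Maximize  Σk αk − ½ Σk Σl αk αl yk yl (xk · xl)
       subject to  0 ≤ αk ≤ C  and  Σk αk yk = 0,

with w = Σk αk yk xk. The examples with αk > 0 are the support vectors, and the data enter only through the dot products xk · xl, which is what makes the kernel trick on the later slides possible.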

    41. Suppose we're in 1 Dimension

    42. Suppose we're in 1 Dimension

    43. Harder 1-Dimensional Dataset

    44. Harder 1-Dimensional Dataset

    45. Harder 1-Dimensional Dataset
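
The figures for these 1-D slides are not in the transcript; the construction they usually illustrate is that a dataset which no single threshold can separate becomes separable after mapping each point x to (x, x²). A small sketch under that assumption, with toy data of my own:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: the -1 class sits between the +1 points, so no threshold on x separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Map each x to z = (x, x^2); in this 2-D space a straight line separates the classes.
Z = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear").fit(Z, y)
print(clf.score(Z, y))  # 1.0 on this toy set
```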

    46. (figure only)

    47. Project examples into some higher-dimensional space where the data are linearly separable, defined by z = Φ(x).
       Training depends only on dot products of the form Φ(xi) · Φ(xj).
       Example: K(xi, xj) = Φ(xi) · Φ(xj) = (xi · xj)²
       The dimensionality of the z space is generally much larger than the dimension of the input space x.
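
To make the example kernel concrete: for 2-D inputs, K(xi, xj) = (xi · xj)² equals the dot product in the 3-D feature space Φ(x) = (x1², √2·x1·x2, x2²), so the kernel evaluates Φ(xi) · Φ(xj) without ever forming Φ explicitly. A quick check:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel K(a, b) = (a . b)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print(np.dot(a, b) ** 2)        # kernel value: (1*3 + 2*(-1))^2 = 1
print(np.dot(phi(a), phi(b)))   # same value via the explicit 3-D feature map
```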

    48. Common SVM Basis Functions

    49. SVM Kernel Functions
       K(a, b) = (a · b + 1)^d is an example of an SVM kernel function (the polynomial kernel of degree d).
       Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
       - Radial-basis-style kernel function: K(a, b) = exp(-‖a − b‖² / (2σ²))
       - Neural-net-style kernel function: K(a, b) = tanh(κ a · b − δ)
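
These three kernel families correspond, for instance, to the "poly", "rbf", and "sigmoid" kernels in scikit-learn's SVC; the dataset and parameter values below are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy non-linear two-class problem.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

for kernel, params in [("poly", {"degree": 3}),                    # (gamma * a.b + coef0) ** degree
                       ("rbf", {"gamma": 1.0}),                    # exp(-gamma * ||a - b||^2)
                       ("sigmoid", {"gamma": 0.5, "coef0": -1.0})]:  # tanh(gamma * a.b + coef0)
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, clf.score(X, y))
```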

    50. The Federalist Papers

    51. Description of the Data

    52. Function Words Based on Relative Frequencies

    53. SLA Feature Selection for Classifying the Disputed Federalist Papers

    54. Hyperplane Classifier Using 3 Words

    55. Results: 3D Plot of Hyperplane

    56. Multi-Class Classification
       SVMs can only handle two-class outputs. What can be done?
       Answer: for N-class problems, learn N SVMs:
       - SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
       - SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
       - …
       - SVM N, fN, learns "Output = N" vs "Output ≠ N"

    57. Multi-Class Classification
       Ideally only one fi(x) > 0 and all the others are < 0, but in practice this is often not the case.
       Instead, to predict the output for a new input, evaluate every SVM and see which one puts the prediction furthest into the positive region:
       classify as class Ci if fi(x) = max over j of fj(x).
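
A sketch of this one-vs-rest scheme; scikit-learn's OneVsRestClassifier implements the same argmax-over-decision-values rule described above, and the iris dataset here is just an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# Trains one SVM f_i per class ("class i" vs "not class i") and predicts
# the class whose decision value f_i(x) is largest.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(ovr.decision_function(X[:2]))  # one column of f_i(x) values per class
print(ovr.predict(X[:2]))            # argmax over those columns
```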

    58. Summary
       Learning linear functions:
       - Pick the separating plane that maximizes the margin
       - The separating plane is defined in terms of the support vectors only
       Learning non-linear functions:
       - Project examples into a higher-dimensional space
       - Use kernel functions for efficiency
       SVMs generally avoid the over-fitting problem, and training is a global optimization with no local optima, but they can be expensive to apply, especially for multi-class problems.
