
Support Vector Machines and The Kernel Trick


Presentation Transcript


  1. Support Vector Machines and The Kernel Trick. William Cohen, 3-26-2007.

  2. The voted perceptron. For each instance xi, compute the prediction ŷi = sign(vk . xi). If the prediction is a mistake (ŷi ≠ yi), update: vk+1 = vk + yi xi.
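
As a concrete illustration of the update rule on this slide, here is a minimal NumPy sketch of the mistake-driven perceptron loop (the toy data is invented, and the "voted" bookkeeping of per-hypothesis vote counts is omitted; only the update rule is shown):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: v <- v + y_i * x_i on each mistake."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = np.sign(v @ x_i)     # compute the prediction vk . xi
            if y_hat != y_i:             # mistake?
                v = v + y_i * x_i        # vk+1 = vk + yi xi
    return v

# Tiny linearly separable toy set, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```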

  3. (3a) The guess v2 after the two positive examples: v2 = v1 + x2. (3b) The guess v2 after one positive and one negative example: v2 = v1 - x2. [Figure: the examples +x1, +x2, -x2 and the guesses v1, v2 drawn relative to the target direction u (and -u), with margin γ and width 2γ.]

  4. Perceptrons vs SVMs • For the voted perceptron to “work” (in this proof), we need to assume there is some unit-norm u (…or, u.u = ||u||² = 1) such that yi (u . xi) > γ for all i.

  5. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: γ, (x1,y1), (x2,y2), (x3,y3), … • Find: some w where • ||w||² = 1 and • for all i, yi (w . xi) > γ

  6. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: some w and γ such that • ||w|| = 1 and • for all i, yi (w . xi) > γ • The best possible w and γ

  7. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Maximize γ under the constraints • ||w||² = 1 and • for all i, yi (w . xi) > γ • Minimize ||w||² under the constraint • for all i, yi (w . xi) > 1 • Units are arbitrary: rescaling w by 1/γ turns the first problem into the second, so maximizing γ is the same as minimizing ||w||.
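
To make the "units are arbitrary" point concrete, here is a small hypothetical NumPy check (toy data invented): a unit-norm w achieving margin γ and the rescaled vector w/γ achieving margin 1 describe the same separator, and ||w/γ|| = 1/γ.

```python
import numpy as np

# Hypothetical separable toy data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)            # ||w|| = 1
gamma = np.min(y * (X @ w))          # margin achieved by the unit-norm w
print("margin gamma:", gamma)

w2 = w / gamma                       # rescaled vector for the second formulation
print("min_i y_i (w2 . x_i):", np.min(y * (X @ w2)))    # = 1 by construction
print("||w2|| =", np.linalg.norm(w2), " 1/gamma =", 1 / gamma)
```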

  8. SVMs and optimization • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: w minimizing ||w||² (the objective function) subject to yi (w . xi) > 1 for all i (the constraints). This is a constrained optimization problem. Famous example of constrained optimization: linear programming, where the objective function is linear and the constraints are linear (in)equalities …but here the objective is quadratic, so you need to use quadratic programming.
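
As a sketch of what "quadratic programming" means here, the following uses scipy.optimize.minimize (SLSQP) on a made-up toy set to minimize ||w||² subject to yi (w . xi) ≥ 1; a real SVM solver would use a specialized QP method instead of a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up separable toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

objective = lambda w: w @ w                                   # ||w||^2
constraints = [{"type": "ineq", "fun": lambda w, i=i: y[i] * (w @ X[i]) - 1}
               for i in range(len(y))]                        # y_i (w . x_i) >= 1

res = minimize(objective, x0=np.ones(X.shape[1]),
               constraints=constraints, method="SLSQP")
print("w =", res.x, " margin =", 1 / np.linalg.norm(res.x))
```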

  9. SVMs and optimization • Motivation for SVMs as “better perceptrons” • learners that minimize w.w under the constraint that for all i, yi (w . xi) > 1 • Questions: • What if the data isn’t separable? • Slack variables • Kernel trick • How do you solve this constrained optimization problem?

  10. SVMs and optimization • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: w and slack variables ξi ≥ 0 minimizing ||w||² + C Σi ξi subject to yi (w . xi) > 1 - ξi for all i.

  11. SVM with slack variables http://www.csie.ntu.edu.tw/~cjlin/libsvm/
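
The libsvm page linked above provides its own command-line tools; as one hedged alternative, here is a short scikit-learn sketch (scikit-learn's SVC wraps libsvm) fitting a soft-margin SVM, where the parameter C controls the slack penalty. The toy data below is invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, slightly overlapping 2-class data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)     # smaller C = more slack allowed
clf.fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```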

  12. The Kernel Trick

  13. The voted perceptron (recap). For each instance xi, compute the prediction ŷi = sign(vk . xi). If the prediction is a mistake (ŷi ≠ yi), update: vk+1 = vk + yi xi.

  14. The kernel trick • Remember: vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… so: vk . x = yi1 (xi1 . x) + … + yik (xik . x) – a sparse weighted sum of examples. Can think of this as a weighted sum of all examples with some of the weights being zero – the non-zero weighted examples are the support vectors.
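
To make the "sparse weighted sum of examples" explicit, here is a minimal kernel-perceptron sketch: instead of storing v, it stores one coefficient per training example (non-zero only for examples on which mistakes were made) and predicts purely through kernel evaluations. Names and data are illustrative.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Store a coefficient alpha_i per example; v = sum_i alpha_i y_i phi(x_i)."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            # v . phi(x_i) = sum_j alpha_j y_j K(x_j, x_i)
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i])
                        for j in range(len(y)))
            if np.sign(score) != y[i]:
                alpha[i] += 1            # one more "mistake vote" for example i
    return alpha

linear = lambda u, v: u @ v
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
alpha = kernel_perceptron(X, y, linear)
print("non-zero coefficients (the 'support vectors'):", np.nonzero(alpha)[0])
```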

  15. The kernel trick – con’t • Since vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… then vk . x = Σj yij (xij . x). • Consider a preprocessor that replaces every x with x’ to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v’ with v’ . x’ = Σj yij (x’ij . x’).

  16. The kernel trick – con’t • A voted perceptron over vectors like u, v is a linear function… Replacing u with u’ would lead to non-linear functions of the original variables – f(x, y, xy, x², …)

  17. The kernel trick – con’t • But notice… if we replace u . v with (u . v + 1)², the expansion contains all the pairwise products ui uj vi vj, plus the linear and constant terms…. Compare to u’ . v’, the inner product of the explicitly expanded vectors.

  18. The kernel trick – con’t • So – up to constants on the cross-product terms – (u . v + 1)² is the same as u’ . v’. Why not replace the computation of v’ . x’ with the computation of Σj yij K(xij, x), where K(u,v) = (u . v + 1)²?
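
A quick numeric check of this claim: for 2-d vectors, (u . v + 1)² equals the inner product of explicitly expanded feature vectors, provided the cross and linear terms carry √2 scalings (that is the "up to constants" caveat). The expansion below is one standard choice, written out by hand for illustration.

```python
import numpy as np

def expand(u):
    """Explicit feature map phi such that phi(u).phi(v) = (u.v + 1)^2 for 2-d u."""
    x1, x2 = u
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,            # cross term, scaled by sqrt(2)
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

u = np.array([0.5, -1.0])
v = np.array([2.0, 3.0])
print((u @ v + 1) ** 2)           # kernel value
print(expand(u) @ expand(v))      # same number, via the explicit expansion
```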

  19. The kernel trick – con’t • General idea: replace an expensive preprocessor x → x’ and the ordinary inner product with no preprocessor and a kernel function K(x, xi), where K(x, xi) = x’ . x’i. • Some popular kernels for numeric vectors x: the polynomial kernel K(u,v) = (u . v + 1)^d and the Gaussian (RBF) kernel K(u,v) = exp(-||u - v||² / 2σ²).
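
For reference, here are those two commonly used kernels written as plain Python functions; the degree d, offset c, and width sigma are hyperparameters you would choose.

```python
import numpy as np

def polynomial_kernel(u, v, d=2, c=1.0):
    """K(u, v) = (u.v + c)^d"""
    return (u @ v + c) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian / RBF kernel: K(u, v) = exp(-||u - v||^2 / (2 sigma^2))"""
    diff = u - v
    return np.exp(-(diff @ diff) / (2 * sigma ** 2))

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v))
```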

  20. Demo with An Applet http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html

  21. The kernel trick – con’t • Kernels work for other data structures also! • String kernels: • x and xi are strings, S = set of shared substrings, |s| = length of string s; by dynamic programming you can quickly compute something like K(x, xi) = Σs∈S λ^|s| (for a decay factor 0 < λ < 1). • There are also tree kernels, graph kernels, …..

  22. The kernel trick – con’t • Kernels work for other data structures also! • String kernels: • x and xi are strings, S = set of shared substrings, j, k are subsets of the positions inside x, xi, len(x,j) is the distance between the first position in j and the last, s < t means s is a substring of t; by dynamic programming you can quickly compute the kernel. • Example: x = “william”, j = {1,3,4}, x[j] = “wll”, len(x,j) = 4, and “wll” < “william”.
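
As a rough sketch of the simpler "shared substrings" idea from slide 21 (not the gap-weighted dynamic program described on this slide), the brute-force function below sums λ^|s| over contiguous substrings shared by two strings; the kernels in the literature compute such sums far more efficiently by dynamic programming.

```python
def substrings(s):
    """All contiguous, non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def string_kernel(x, z, lam=0.5):
    """Brute-force K(x, z) = sum over shared substrings s of lam ** len(s).

    Illustrative only: real string kernels (e.g. gap-weighted subsequence
    kernels) use dynamic programming instead of enumerating substrings.
    """
    shared = substrings(x) & substrings(z)
    return sum(lam ** len(s) for s in shared)

print(string_kernel("william", "will"))
```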

  23. The kernel trick – con’t • Even more general idea: use any function K that is • Continuous • Symmetric – i.e., K(u,v) = K(v,u) • “Positive semidefinite” – i.e., for any finite set of points, the Gram matrix M[i,j] = K(xi,xj) has no negative eigenvalues • Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., K(u,v) = u’ . v’ for some mapping u → u’. Terminology: K is a Mercer kernel. The set of all x’ is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi,xj) is a Gram matrix.
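
A small numeric illustration of the positive-semidefinite condition: build the Gram matrix M[i,j] = K(xi, xj) for a few points and check that its eigenvalues are (numerically) non-negative. The RBF kernel and the sample points are arbitrary choices for the sketch.

```python
import numpy as np

rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)   # a Mercer kernel

rng = np.random.RandomState(0)
points = rng.randn(5, 3)                                  # five arbitrary points

# Gram matrix M[i, j] = K(x_i, x_j)
M = np.array([[rbf(a, b) for b in points] for a in points])

eigvals = np.linalg.eigvalsh(M)        # M is symmetric, so use eigvalsh
print("eigenvalues:", eigvals)         # all >= 0 (up to floating-point error)
```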

  24. SVMs and optimization • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: w minimizing ½ w.w subject to yi (w . xi) ≥ 1 for all i (the primal form; the factor ½ doesn’t change the minimizer) • which is equivalent to finding: the αi ≥ 0 maximizing the Lagrangian dual Σi αi - ½ Σi Σj αi αj yi yj (xi . xj), with w = Σi αi yi xi.

  25. Lagrange multipliers • maximize f(x,y) = 2 - x² - 2y² subject to g(x,y) = x² + y² - 1 = 0

  26. Lagrange multipliers • maximize f(x,y) = 2 - x² - 2y² subject to g(x,y) = x² + y² - 1 = 0 • Claim: at the constrained maximum the gradient of f must be perpendicular to the constraint curve g = 0

  27. Lagrange multipliers • maximize f(x,y) = 2 - x² - 2y² subject to g(x,y) = x² + y² - 1 = 0 • Claim: at the constrained maximum the gradient of f must be perpendicular to the constraint curve g = 0, i.e., ∇f = λ ∇g for some multiplier λ.
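
For completeness, a small SymPy check of this example: solving ∇f = λ∇g together with the constraint gives stationary points at (±1, 0) with f = 1 (the constrained maximum) and at (0, ±1) with f = 0.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = 2 - x**2 - 2*y**2
g = x**2 + y**2 - 1                      # constraint g = 0

L = f - lam * g                          # the Lagrangian
stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)

for s in stationary:
    print(s, " f =", f.subs(s))          # maximum f = 1 at (x, y) = (+/-1, 0)
```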

  28. SVMs and optimization • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: w minimizing ½ w.w subject to yi (w . xi) ≥ 1 for all i (the primal form) • which is equivalent to finding: the αi ≥ 0 maximizing the Lagrangian dual Σi αi - ½ Σi Σj αi αj yi yj (xi . xj).

  29. SVMs and optimization • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: the solution to the primal/dual QP above. • Some key points: • Solving the QP directly (Vapnik’s original method) is possible but expensive. • The dual form can be expressed as constraints on each example, e.g. αi (yi (w . xi) - 1) = 0, so αi = 0 whenever yi (w . xi) > 1. • Fastest methods for SVM learning ignore most of the constraints, solve a subproblem containing a few ‘active constraints’, then cleverly pick a few additional constraints & repeat….. • These per-example conditions are the KKT (Karush-Kuhn-Tucker) conditions, or Kuhn-Tucker conditions, after Karush (1939) and Kuhn & Tucker (1951).
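
As a sketch of the dual form mentioned on this slide (hard-margin, no bias term, so the only constraints are αi ≥ 0), the code below maximizes the dual with a generic SciPy optimizer on made-up data and recovers w = Σi αi yi xi. Real SVM solvers exploit the structure of the problem (active-set or SMO-style methods) rather than calling a generic optimizer like this.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up separable toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
G = (y[:, None] * X) @ (y[:, None] * X).T        # G[i,j] = y_i y_j (x_i . x_j)

neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()   # minimize the negative dual
bounds = [(0, None)] * len(y)                    # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=bounds, method="L-BFGS-B")
alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 3))
print("w =", w, " constraint values y_i (w . x_i):", y * (X @ w))
```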

  30. More on SVMs and kernels • Many other types of algorithms can be “kernelized” • Gaussian processes, memory-based/nearest neighbor methods, …. • Work on optimization for linear SVMs is very active
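
As one example of "kernelizing" another method, the nearest-neighbor distance in feature space can be written purely in terms of kernel evaluations: ||φ(x) - φ(z)||² = K(x,x) - 2 K(x,z) + K(z,z). A minimal sketch with an invented RBF kernel and toy data:

```python
import numpy as np

rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)

def kernel_distance_sq(x, z, K):
    """||phi(x) - phi(z)||^2 expressed only through the kernel K."""
    return K(x, x) - 2 * K(x, z) + K(z, z)

def kernel_1nn(query, X, y, K=rbf):
    """1-nearest-neighbor in the (implicit) feature space induced by K."""
    d = [kernel_distance_sq(query, x_i, K) for x_i in X]
    return y[int(np.argmin(d))]

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(kernel_1nn(np.array([1.5, 1.0]), X, y))
```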
