
Support Vector Machines for Structured Classification and The Kernel Trick


Presentation Transcript


  1. Support Vector Machines for Structured Classification and The Kernel Trick William Cohen 3-6-2007

  2. Announcements • Don’t miss this one: • Lise Getoor, 2:30 in Newell-Simon 3305

  3. The voted perceptron: A sends B an instance xi; B computes ŷi = vk . xi and returns the sign as its guess; A reveals the true label yi; if the guess was a mistake, B updates vk+1 = vk + yi xi.
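
This protocol translates almost directly into code. Below is a minimal sketch of the voted perceptron in Python; the function names, epoch count, and the ±1 label encoding are illustrative choices, not from the slides. Each surviving weight vector vk gets a vote proportional to how long it went without a mistake.

```python
import numpy as np

def voted_perceptron(examples, epochs=10):
    """Voted perceptron over (x, y) pairs with y in {-1, +1} (illustrative sketch)."""
    dim = len(examples[0][0])
    v, c = np.zeros(dim), 0          # current weight vector v_k and its vote count
    survivors = []                   # all (v_k, c_k) pairs accumulated so far
    for _ in range(epochs):
        for x, y in examples:
            if np.sign(v @ x) == y:  # correct guess: v_k survives another round
                c += 1
            else:                    # mistake: record v_k, then v_{k+1} = v_k + y_i x_i
                survivors.append((v.copy(), c))
                v, c = v + y * x, 1
    survivors.append((v, c))
    return survivors

def predict(survivors, x):
    """Each surviving v_k votes with weight c_k."""
    return np.sign(sum(c * np.sign(v @ x) for v, c in survivors))
```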

  4. (Figure.) (3a) The guess v2 after the two positive examples: v2 = v1 + x2. (3b) The guess v2 after one positive and one negative example: v2 = v1 − x2. Both panels also show the target direction u (and −u), the examples +x1, +x2, −x2, and the margin band of width 2γ.

  5. Perceptrons vs SVMs • For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 and some margin γ > 0 such that yi (u . xi) ≥ γ for every example xi.

  6. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: γ, (x1,y1), (x2,y2), (x3,y3), … • Find: some w where • ||w|| = 1 and • for all i, yi (w . xi) > γ

  7. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: some w and γ such that • ||w|| = 1 and • for all i, yi (w . xi) > γ – in particular, the best possible w and γ

  8. Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Maximize γ under the constraints • ||w|| = 1 and • for all i, yi (w . xi) > γ • Equivalently: minimize ||w||2 under the constraint • for all i, yi (w . xi) ≥ 1 • Units are arbitrary: rescaling w rescales γ along with it, so fixing the right-hand side at 1 loses nothing – this is almost Thorsten's eq (5)-(6), SVM0
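
The “minimize ||w||2 subject to yi (w . xi) ≥ 1” formulation is a quadratic program, so it can be handed to any off-the-shelf solver. A minimal sketch, assuming cvxpy is available and the data are linearly separable (both are assumptions, not part of the slides):

```python
import numpy as np
import cvxpy as cp  # assumption: a generic convex solver; any QP package would do

def hard_margin_svm(X, y):
    """Minimize ||w||^2 subject to y_i (w . x_i) >= 1 for all i (SVM0-style, no bias term)."""
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(cp.sum_squares(w))      # ||w||^2
    constraints = [cp.multiply(y, X @ w) >= 1]      # y_i (w . x_i) >= 1, elementwise
    cp.Problem(objective, constraints).solve()
    return w.value
```

The achieved margin is γ = 1 / ||w||, which is why fixing the functional margin at 1 and minimizing ||w||2 is the same as fixing ||w|| = 1 and maximizing γ.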

  9. The voted perceptron for ranking: A sends B a set of instances x1, x2, x3, x4, …; B computes the scores ŷi = vk . xi and returns the index b* of the “best” xi; A reveals the index b of the correct instance; if b* ≠ b (a mistake), B updates vk+1 = vk + xb − xb*.
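
A minimal sketch of this ranking update; the data layout (each “query” is a list of candidate vectors plus the index of the correct one) is an assumption made for illustration.

```python
import numpy as np

def ranking_perceptron(queries, epochs=10):
    """queries: list of (candidates, b), where candidates is a list of vectors
    and b is the index of the correct candidate."""
    dim = len(queries[0][0][0])
    v = np.zeros(dim)
    for _ in range(epochs):
        for candidates, b in queries:
            b_star = int(np.argmax([v @ x for x in candidates]))  # the "best" x_i under v_k
            if b_star != b:                                        # mistake:
                v = v + candidates[b] - candidates[b_star]         # v_{k+1} = v_k + x_b - x_{b*}
    return v
```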

  10. (Figure, panel 3a again.) The guess v2 after the two positive examples: v2 = v1 + x2, shown with the target direction u and the margin band of width 2γ.

  11. The voted perceptron for NER: the instances are z1, z2, z3, z4, …; B computes ŷi = vk . zi, returns the index b* of the “best” zi, and on a mistake updates vk+1 = vk + zb − zb*, where zb = F(xi, yi) and zb* = F(xi, y*). • A sends B the Sha & Pereira paper and instructions for creating the instances: • A sends a word vector xi. B could then create the instances F(xi, y) for every candidate label sequence y… • …but instead B just returns the y* that gives the best score for the dot product vk . F(xi, y*), found using Viterbi. • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)

  12. The voted perceptron for NER (recap): B computes ŷi = vk . zi over the instances z1, z2, z3, z4, …, returns the index b* of the “best” zi, and on a mistake updates vk+1 = vk + zb − zb*. • A sends a word vector xi. • B just returns the y* that gives the best score for vk . F(xi, y*). • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)
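
Slides 11-12 describe the structured (Collins-style) perceptron. A sketch under the assumption that a feature function F(x, y) and a Viterbi-based argmax are supplied by the caller; both are placeholders here, not code from the lecture.

```python
import numpy as np

def structured_perceptron(data, F, viterbi_argmax, dim, epochs=5):
    """data          : (x_i, y_i) pairs, y_i the correct label sequence (e.g. a tuple of tags)
       F             : feature function F(x, y) -> numpy vector of length dim (assumed given)
       viterbi_argmax: returns the y* maximizing v . F(x, y) (assumed given)"""
    v = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            y_star = viterbi_argmax(v, x)           # B's best-scoring label sequence
            if y_star != y:                         # mistake on this sentence
                v = v + F(x, y) - F(x, y_star)      # v_{k+1} = v_k + F(x_i, y_i) - F(x_i, y*)
    return v
```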

  13. Thorsten’s notation vs mine

  14. SVM for ranking: assumptions

  15. SVM for ranking: assumptions • Minimize ||w||2 under the constraint • for all i, yi (w . xi) ≥ 1 • This assumption suggests the algorithm (Thorsten's eq (5)-(6), SVM0)
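
The slide keeps only the SVM0-style constraint. One common way to make the constraint ranking-specific is to require the correct candidate to outscore every other candidate by a margin of 1; this pairwise formulation is my assumption for illustration, not something shown on the slide.

```python
import numpy as np
import cvxpy as cp  # assumption: the same generic convex solver as above

def ranking_svm(queries):
    """queries: (candidates, b) pairs as in the ranking perceptron sketch.
    Minimize ||w||^2 s.t. w . (x_b - x_i) >= 1 for every incorrect candidate x_i.
    Hard constraints assume the rankings are separable; a slack version would relax this."""
    dim = len(queries[0][0][0])
    w = cp.Variable(dim)
    constraints = [w @ (np.asarray(cands[b]) - np.asarray(x)) >= 1
                   for cands, b in queries
                   for i, x in enumerate(cands) if i != b]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value
```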

  16. The Kernel Trick

  17. (Recall) The voted perceptron: A sends B an instance xi; B computes ŷi = vk . xi and returns the sign as its guess; A reveals the true label yi; if the guess was a mistake, B updates vk+1 = vk + yi xi.

  18. The kernel trick • Remember: vk = yi1 xi1 + yi2 xi2 + … + yik xik, where i1, …, ik are the mistakes… • so: vk . x = yi1 (xi1 . x) + yi2 (xi2 . x) + … + yik (xik . x)

  19. The kernel trick – con't • Since vk = yi1 xi1 + … + yik xik, where i1, …, ik are the mistakes… • then vk . x = yi1 (xi1 . x) + … + yik (xik . x). • Now consider a preprocessor that replaces every x with x' to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v' over the expanded features: v'k . x' = yi1 (x'i1 . x') + … + yik (x'ik . x').

  20. The kernel trick – con't • A voted perceptron over vectors like u, v learns a linear function… • Replacing u with u' would lead to non-linear functions of the original variables – f(x, y, xy, x2, …)

  21. The kernel trick – con't • But notice… if we replace the inner product u . v with (u . v + 1)2 and expand, we get 1 + 2 Σi ui vi + Σi,j ui uj vi vj… • Compare this to the explicit inner product u' . v' of the expanded vectors: the same terms appear, up to constants on the cross-product terms.

  22. The kernel trick – con't • So – up to constants on the cross-product terms – (u . v + 1)2 is the inner product u' . v' of the expanded vectors. • Why not replace the computation of v'k . x' = Σj yij (x'ij . x') with the computation of Σj yij K(xij, x), where K(u, v) = (u . v + 1)2?
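
A quick numeric check of that identity. The √2 scaling in the explicit feature map is the conventional choice that makes the equality exact rather than “up to constants”; the particular vectors are arbitrary.

```python
import numpy as np

def phi(u):
    """Explicit quadratic feature map for a 2-d vector u."""
    u1, u2 = u
    return np.array([1.0,
                     np.sqrt(2) * u1, np.sqrt(2) * u2,
                     u1**2, u2**2,
                     np.sqrt(2) * u1 * u2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))     # explicit preprocessing, then an ordinary inner product
print((u @ v + 1) ** 2)    # the kernel: same value, no preprocessing (4.0 both times)
```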

  23. The kernel trick – con't • General idea: replace an expensive preprocessor x → x' and an ordinary inner product with no preprocessor and a function K(x, xi), where K(x, xi) = x' . x'i. • This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
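
Putting slides 18-23 together gives the kernelized perceptron: the weight vector is never formed explicitly; only the mistakes and kernel evaluations K(xi, x) are stored. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def kernel_perceptron(examples, K, epochs=10):
    """Perceptron that only touches the data through K(x_i, x)."""
    mistakes = []                                        # (x_j, y_j) pairs we erred on
    def score(x):                                        # implicitly v_k . x'
        return sum(y_j * K(x_j, x) for x_j, y_j in mistakes)
    for _ in range(epochs):
        for x, y in examples:
            if np.sign(score(x)) != y:                   # mistake...
                mistakes.append((x, y))                  # ...implicitly v_{k+1} = v_k + y x'
    return score                                         # the learned decision function

# e.g. with the quadratic kernel from the previous slides:
# score = kernel_perceptron(examples, K=lambda u, v: (u @ v + 1) ** 2)
```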

  24. The kernel trick – con’t • Even more general idea: use any function K that is • Continuous • Symmetric—i.e., K(u,v)=K(v,u) • “Positive semidefinite”—i.e., K(u,v)≥0 • Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., Terminology: K is a Mercer kernel. The set of all x’ is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j]=K(xi,xj) is a Gram matrix.
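
A small check of the positive-semidefiniteness condition as defined here: build the Gram matrix M[i,j] = K(xi, xj) for a sample of points and confirm its eigenvalues are (numerically) non-negative. The random data and the tolerance are arbitrary illustration choices.

```python
import numpy as np

def gram_matrix(X, K):
    """Gram matrix M[i, j] = K(x_i, x_j)."""
    return np.array([[K(xi, xj) for xj in X] for xi in X])

K = lambda u, v: (u @ v + 1) ** 2          # the quadratic kernel used above
X = np.random.randn(20, 5)                 # 20 arbitrary points in R^5
M = gram_matrix(X, K)
print(np.all(np.linalg.eigvalsh(M) >= -1e-9))   # True: a Mercer kernel gives a PSD Gram matrix
```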
