This lecture by William Cohen covers Support Vector Machines (SVMs) and their application to structured classification via the kernel trick. It compares perceptrons with SVMs, motivates computing optimal weights under margin constraints, and shows how kernels allow non-linear classification and learning over structured data. (The deck opens with an announcement for a seminar by Lise Getoor.)
Support Vector Machines for Structured Classification and the Kernel Trick • William Cohen • 3-6-2007
Announcements • Don’t miss this one: • Lise Getoor, 2:30 in Newell-Simon 3305
The voted perceptron • A sends B an instance xi • B computes ŷi = sign(vk · xi) and returns ŷi • A reveals the true label yi • If mistake: vk+1 = vk + yi xi
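The mistake-driven update on this slide can be sketched in code. This is an averaged-perceptron variant, a common stand-in for the full voting scheme; the data and function names are illustrative, not from the slides.

```python
import numpy as np

def perceptron_train(examples, epochs=10):
    """Mistake-driven perceptron; examples are (x, y) with y in {-1, +1}."""
    dim = len(examples[0][0])
    v = np.zeros(dim)          # current weight vector v_k
    v_sum = np.zeros(dim)      # running sum of weight vectors, for averaging
    for _ in range(epochs):
        for x, y in examples:
            if y * np.dot(v, x) <= 0:   # mistake
                v = v + y * x           # v_{k+1} = v_k + y_i x_i
            v_sum += v
    # averaging the intermediate vectors approximates the vote
    return v_sum / (epochs * len(examples))

# toy linearly separable data
data = [(np.array([2.0, 1.0]), 1), (np.array([-1.0, -2.0]), -1),
        (np.array([1.0, 2.0]), 1), (np.array([-2.0, -1.0]), -1)]
w = perceptron_train(data)
preds = [int(np.sign(np.dot(w, x))) for x, _ in data]   # all four correct
```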
[Figure: (3a) the guess v2 after two positive examples, v2 = v1 + x2; (3b) the guess v2 after one positive and one negative example, v2 = v1 − x2. The target separator u keeps every example at margin > γ, shown as a band of width 2γ between u and −u.]
Perceptrons vs SVMs • For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 and a margin γ > 0 such that, for all i, yi (u · xi) > γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: γ, (x1,y1), (x2,y2), (x3,y3), … • Find: some w where • ||w|| = 1 and • for all i, yi (w · xi) > γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: some w and γ such that • ||w|| = 1 and • for all i, yi (w · xi) > γ • (i.e., find the best possible w and γ)
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Maximize γ under the constraints • ||w|| = 1 and • for all i, yi (w · xi) > γ • Equivalently: minimize ||w||2 under the constraint • for all i, yi (w · xi) > 1 • Units are arbitrary: rescaling w rescales γ, so we can fix the margin at 1 and shrink w instead (Thorsten’s eq (5-6), SVM0)
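The equivalence of the two formulations on this slide is a rescaling argument; a sketch:

```latex
% Margin maximization:
\max_{\|w\|=1} \gamma \quad \text{s.t.}\quad y_i \, (w \cdot x_i) > \gamma \;\; \forall i
% Substitute w' = w/\gamma, so that \|w'\| = 1/\gamma. Maximizing \gamma is then
% the same as minimizing \|w'\|, and dividing each constraint by \gamma gives
\min \|w'\|^2 \quad \text{s.t.}\quad y_i \, (w' \cdot x_i) > 1 \;\; \forall i
```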
The voted perceptron for ranking • A sends B a list of instances x1, x2, x3, x4, … • B computes ŷi = vk · xi for each, and returns the index b* of the “best” (highest-scoring) xi • A reveals the correct index b • If mistake: vk+1 = vk + xb − xb*
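One round of this ranking game can be sketched as follows; the function name and toy data are illustrative, not from the slides.

```python
import numpy as np

def ranking_perceptron_step(v, instances, b):
    """One round: B picks the highest-scoring instance under v_k;
    if its index b* is not the correct index b, update toward x_b."""
    scores = [np.dot(v, x) for x in instances]
    b_star = int(np.argmax(scores))                # index of the "best" x_i
    if b_star != b:                                # mistake
        v = v + instances[b] - instances[b_star]   # v_{k+1} = v_k + x_b - x_{b*}
    return v, b_star

v = np.zeros(2)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
v, guess = ranking_perceptron_step(v, xs, b=1)     # first guess is wrong, so v updates
```

After the update, re-running the round with the same instances ranks the correct one first.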
The voted perceptron for NER • A sends B the Sha & Pereira paper and instructions for creating the instances: • A sends a word vector xi. Then B could create the instances z = F(xi,y) for every candidate label sequence y… • …but instead B just returns the y* that gives the best score for the dot product vk · F(xi,y*), found efficiently using Viterbi. • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi,yi) − F(xi,y*)
The voted perceptron for NER (summary) • A sends a word vector xi. • B returns the y* that gives the best score for vk · F(xi,y*). • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi,yi) − F(xi,y*)
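A toy sketch of this structured-perceptron loop. The feature function F, the label set, and the brute-force argmax (standing in for Viterbi on this tiny label set) are all illustrative assumptions, not the Sha & Pereira setup itself.

```python
import itertools

LABELS = ["O", "NAME"]   # hypothetical tag set for the sketch

def F(xs, ys):
    """Joint features F(x, y): per-word label indicators plus label bigrams."""
    feats = {}
    for word, label in zip(xs, ys):
        feats[(word, label)] = feats.get((word, label), 0) + 1
    for a, b in zip(ys, ys[1:]):
        feats[("bigram", a, b)] = feats.get(("bigram", a, b), 0) + 1
    return feats

def score(v, feats):
    return sum(v.get(k, 0.0) * c for k, c in feats.items())

def predict(v, xs):
    """y* = argmax_y v . F(x, y), by enumeration (Viterbi in a real system)."""
    return max(itertools.product(LABELS, repeat=len(xs)),
               key=lambda ys: score(v, F(xs, list(ys))))

def perceptron_update(v, xs, ys):
    y_star = list(predict(v, xs))
    if y_star != ys:                       # mistake: v <- v + F(x, y) - F(x, y*)
        for k, c in F(xs, ys).items():
            v[k] = v.get(k, 0.0) + c
        for k, c in F(xs, y_star).items():
            v[k] = v.get(k, 0.0) - c
    return v

v = {}
xs = ["john", "runs"]
ys = ["NAME", "O"]
for _ in range(3):
    v = perceptron_update(v, xs, ys)       # converges to predicting ys
```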
SVM for ranking: assumptions • Minimize ||w||2 under the constraint • for all i, yi (w · xi) > 1 (Thorsten’s eq (5-6), SVM0) • As in the classification case, the margin assumption suggests the algorithm
The voted perceptron (again) • A sends B an instance xi • B computes ŷi = sign(vk · xi) and returns ŷi • A reveals the true label yi • If mistake: vk+1 = vk + yi xi
The kernel trick • Remember: vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… • so: vk · x = yi1 (xi1 · x) + … + yik (xik · x) • The classifier can be written entirely in terms of inner products with the mistaken examples
The kernel trick – con’t • Since vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… • then vk · x = Σj yij (xij · x) • Consider a preprocessor that replaces every x with x′ to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v′ over this expanded space
The kernel trick – con’t • A voted perceptron over vectors like u, v is a linear function of the inputs… • Replacing u with u′ (the expanded vector) would lead to non-linear functions: f(x, y, xy, x2, …)
The kernel trick – con’t • But notice… if we replace the inner product u · v with (u · v + 1)2 … • Expanding: (u · v + 1)2 = Σi Σj ui uj vi vj + 2 Σi ui vi + 1 • Compare to u′ · v′ over the expanded vectors: the same terms appear
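The identity on this slide can be checked numerically: an explicit quadratic feature map φ (all pairwise products, the original coordinates scaled by √2, and a constant 1) reproduces (u · v + 1)2 exactly. The map and test vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner product equals (u.v + 1)^2:
    pairwise products x_i x_j, the originals scaled by sqrt(2), and a 1."""
    pairs = np.outer(x, x).ravel()          # x_i x_j for every pair (i, j)
    return np.concatenate([pairs, np.sqrt(2) * x, [1.0]])

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])

lhs = (np.dot(u, v) + 1) ** 2               # kernel computed in the input space
rhs = np.dot(phi(u), phi(v))                # same value via the explicit map
```

Here u · v = 4.5, so both sides equal 5.5² = 30.25; the kernel avoids ever building the quadratically larger φ vectors.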
The kernel trick – con’t • So, up to constants on the cross-product terms: • Why not replace the computation of vk · x′ = Σj yij (x′ij · x′) • with the computation of Σj yij K(xij, x), where K(u, v) = (u · v + 1)2?
The kernel trick – con’t • General idea: replace an expensive preprocessor x → x′ and an ordinary inner product with no preprocessor and a function K(x, xi) where K(x, xi) = x′ · x′i • This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
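Putting the last few slides together, a minimal kernel perceptron: store the mistake indices i1,…,ik instead of a weight vector, and score with Σj yij K(xij, x). The XOR-style data and polynomial kernel are illustrative; note XOR is not linearly separable, but this kernel separates it.

```python
import numpy as np

def K(u, v):
    """The polynomial kernel (u.v + 1)^2 from the preceding slides."""
    return (np.dot(u, v) + 1) ** 2

def kernel_perceptron(examples, epochs=10):
    """Kernel perceptron: keep the mistakes i_1..i_k rather than v_k;
    the score of x is sum_j y_{i_j} K(x_{i_j}, x)."""
    mistakes = []                              # indices of mistaken examples
    for _ in range(epochs):
        for i, (x, y) in enumerate(examples):
            s = sum(examples[j][1] * K(examples[j][0], x) for j in mistakes)
            if y * s <= 0:                     # mistake
                mistakes.append(i)
    return mistakes

def kernel_predict(examples, mistakes, x):
    s = sum(examples[j][1] * K(examples[j][0], x) for j in mistakes)
    return 1 if s > 0 else -1

# XOR-like data: not linearly separable in the input space
data = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, -1.0]), 1),
        (np.array([1.0, -1.0]), -1), (np.array([-1.0, 1.0]), -1)]
ms = kernel_perceptron(data)
preds = [kernel_predict(data, ms, x) for x, _ in data]   # all four correct
```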
The kernel trick – con’t • Even more general idea: use any function K that is • Continuous • Symmetric, i.e., K(u,v) = K(v,u) • “Positive semidefinite”, i.e., for any x1,…,xn the matrix M[i,j] = K(xi,xj) has no negative eigenvalues • Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., K(u,v) = u′ · v′ for some mapping u → u′ • Terminology: K is a Mercer kernel. The set of all x′ lives in a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi,xj) is a Gram matrix.
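The Mercer conditions can be spot-checked numerically: build the Gram matrix for some arbitrary points and confirm it is symmetric with no (numerically) negative eigenvalues. The random points and the particular kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # five arbitrary points x_i

def K(u, v):
    return (np.dot(u, v) + 1) ** 2          # the polynomial kernel used above

# Gram matrix M[i, j] = K(x_i, x_j)
M = np.array([[K(xi, xj) for xj in X] for xi in X])

symmetric = np.allclose(M, M.T)
eigenvalues = np.linalg.eigvalsh(M)
psd = bool(np.all(eigenvalues > -1e-9))     # all eigenvalues >= 0, up to roundoff
```

A spot check on one point set is of course not a proof; Mercer's theorem is what guarantees this holds for every choice of points.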