This lecture by William Cohen covers Support Vector Machines (SVMs) and their application to structured classification via the kernel trick. It compares perceptrons with SVMs, motivates computing optimal weights under margin constraints, and shows how kernels allow non-linear classification and learning over structured data. (The deck opens with an announcement for a seminar by Lise Getoor.)
Support Vector Machines for Structured Classification and the Kernel Trick • William Cohen • 3-6-2007
Announcements • Don’t miss this one: • Lise Getoor, 2:30 in Newell-Simon 3305
The voted perceptron • A sends B an instance xi • B computes ŷi = sign(vk · xi) and returns ŷi • A reveals the true label yi • If mistake: vk+1 = vk + yi xi
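The mistake-driven update on this slide can be sketched in code. This is an averaged-perceptron variant, a common stand-in for the full voting scheme; the data and function names are illustrative, not from the slides.

```python
import numpy as np

def perceptron_train(examples, epochs=10):
    """Mistake-driven perceptron; examples are (x, y) with y in {-1, +1}."""
    dim = len(examples[0][0])
    v = np.zeros(dim)          # current weight vector v_k
    v_sum = np.zeros(dim)      # running sum of weight vectors, for averaging
    for _ in range(epochs):
        for x, y in examples:
            if y * np.dot(v, x) <= 0:   # mistake
                v = v + y * x           # v_{k+1} = v_k + y_i x_i
            v_sum += v
    # averaging the intermediate vectors approximates the vote
    return v_sum / (epochs * len(examples))

# toy linearly separable data
data = [(np.array([2.0, 1.0]), 1), (np.array([-1.0, -2.0]), -1),
        (np.array([1.0, 2.0]), 1), (np.array([-2.0, -1.0]), -1)]
w = perceptron_train(data)
preds = [int(np.sign(np.dot(w, x))) for x, _ in data]   # all four correct
```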
[Figure: (3a) the guess v2 after two positive examples, v2 = v1 + x2; (3b) the guess v2 after one positive and one negative example, v2 = v1 − x2. The target separator u keeps every example at margin > γ, shown as a band of width 2γ between u and −u.]
Perceptrons vs SVMs • For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 and a margin γ > 0 such that, for all i, yi (u · xi) > γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: γ, (x1,y1), (x2,y2), (x3,y3), … • Find: some w where • ||w|| = 1 and • for all i, yi (w · xi) > γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: some w and γ such that • ||w|| = 1 and • for all i, yi (w · xi) > γ • (i.e., find the best possible w and γ)
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Maximize γ under the constraints • ||w|| = 1 and • for all i, yi (w · xi) > γ • Equivalently: minimize ||w||2 under the constraint • for all i, yi (w · xi) > 1 • Units are arbitrary: rescaling w rescales γ, so we can fix the margin at 1 and shrink w instead (Thorsten’s eq (5-6), SVM0)
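The equivalence of the two formulations on this slide is a rescaling argument; a sketch:

```latex
% Margin maximization:
\max_{\|w\|=1} \gamma \quad \text{s.t.}\quad y_i \, (w \cdot x_i) > \gamma \;\; \forall i
% Substitute w' = w/\gamma, so that \|w'\| = 1/\gamma. Maximizing \gamma is then
% the same as minimizing \|w'\|, and dividing each constraint by \gamma gives
\min \|w'\|^2 \quad \text{s.t.}\quad y_i \, (w' \cdot x_i) > 1 \;\; \forall i
```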
The voted perceptron for ranking • A sends B a list of instances x1, x2, x3, x4, … • B computes ŷi = vk · xi for each, and returns the index b* of the “best” (highest-scoring) xi • A reveals the correct index b • If mistake: vk+1 = vk + xb − xb*
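One round of this ranking game can be sketched as follows; the function name and toy data are illustrative, not from the slides.

```python
import numpy as np

def ranking_perceptron_step(v, instances, b):
    """One round: B picks the highest-scoring instance under v_k;
    if its index b* is not the correct index b, update toward x_b."""
    scores = [np.dot(v, x) for x in instances]
    b_star = int(np.argmax(scores))                # index of the "best" x_i
    if b_star != b:                                # mistake
        v = v + instances[b] - instances[b_star]   # v_{k+1} = v_k + x_b - x_{b*}
    return v, b_star

v = np.zeros(2)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
v, guess = ranking_perceptron_step(v, xs, b=1)     # first guess is wrong, so v updates
```

After the update, re-running the round with the same instances ranks the correct one first.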
The voted perceptron for NER • A sends B the Sha & Pereira paper and instructions for creating the instances: • A sends a word vector xi. Then B could create the instances z = F(xi,y) for every candidate label sequence y… • …but instead B just returns the y* that gives the best score for the dot product vk · F(xi,y*), found efficiently using Viterbi. • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi,yi) − F(xi,y*)
The voted perceptron for NER (summary) • A sends a word vector xi. • B returns the y* that gives the best score for vk · F(xi,y*). • A sends B the correct label sequence yi. • On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi,yi) − F(xi,y*)
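A toy sketch of this structured-perceptron loop. The feature function F, the label set, and the brute-force argmax (standing in for Viterbi on this tiny label set) are all illustrative assumptions, not the Sha & Pereira setup itself.

```python
import itertools

LABELS = ["O", "NAME"]   # hypothetical tag set for the sketch

def F(xs, ys):
    """Joint features F(x, y): per-word label indicators plus label bigrams."""
    feats = {}
    for word, label in zip(xs, ys):
        feats[(word, label)] = feats.get((word, label), 0) + 1
    for a, b in zip(ys, ys[1:]):
        feats[("bigram", a, b)] = feats.get(("bigram", a, b), 0) + 1
    return feats

def score(v, feats):
    return sum(v.get(k, 0.0) * c for k, c in feats.items())

def predict(v, xs):
    """y* = argmax_y v . F(x, y), by enumeration (Viterbi in a real system)."""
    return max(itertools.product(LABELS, repeat=len(xs)),
               key=lambda ys: score(v, F(xs, list(ys))))

def perceptron_update(v, xs, ys):
    y_star = list(predict(v, xs))
    if y_star != ys:                       # mistake: v <- v + F(x, y) - F(x, y*)
        for k, c in F(xs, ys).items():
            v[k] = v.get(k, 0.0) + c
        for k, c in F(xs, y_star).items():
            v[k] = v.get(k, 0.0) - c
    return v

v = {}
xs = ["john", "runs"]
ys = ["NAME", "O"]
for _ in range(3):
    v = perceptron_update(v, xs, ys)       # converges to predicting ys
```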
SVM for ranking: assumptions • Minimize ||w||2 under the constraint • for all i, yi (w · xi) > 1 (Thorsten’s eq (5-6), SVM0) • As in the classification case, the margin assumption suggests the algorithm
The voted perceptron (again) • A sends B an instance xi • B computes ŷi = sign(vk · xi) and returns ŷi • A reveals the true label yi • If mistake: vk+1 = vk + yi xi
The kernel trick • Remember: vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… • so: vk · x = yi1 (xi1 · x) + … + yik (xik · x) • The classifier can be written entirely in terms of inner products with the mistaken examples
The kernel trick – con’t • Since vk = yi1 xi1 + … + yik xik, where i1,…,ik are the mistakes… • then vk · x = Σj yij (xij · x) • Consider a preprocessor that replaces every x with x′ to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v′ over this expanded space
The kernel trick – con’t • A voted perceptron over vectors like u, v is a linear function of the inputs… • Replacing u with u′ (the expanded vector) would lead to non-linear functions: f(x, y, xy, x2, …)
The kernel trick – con’t • But notice… if we replace the inner product u · v with (u · v + 1)2 … • Expanding: (u · v + 1)2 = Σi Σj ui uj vi vj + 2 Σi ui vi + 1 • Compare to u′ · v′ over the expanded vectors: the same terms appear
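The identity on this slide can be checked numerically: an explicit quadratic feature map φ (all pairwise products, the original coordinates scaled by √2, and a constant 1) reproduces (u · v + 1)2 exactly. The map and test vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner product equals (u.v + 1)^2:
    pairwise products x_i x_j, the originals scaled by sqrt(2), and a 1."""
    pairs = np.outer(x, x).ravel()          # x_i x_j for every pair (i, j)
    return np.concatenate([pairs, np.sqrt(2) * x, [1.0]])

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])

lhs = (np.dot(u, v) + 1) ** 2               # kernel computed in the input space
rhs = np.dot(phi(u), phi(v))                # same value via the explicit map
```

Here u · v = 4.5, so both sides equal 5.5² = 30.25; the kernel avoids ever building the quadratically larger φ vectors.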
The kernel trick – con’t • So, up to constants on the cross-product terms: • Why not replace the computation of vk · x′ = Σj yij (x′ij · x′) • with the computation of Σj yij K(xij, x), where K(u, v) = (u · v + 1)2?
The kernel trick – con’t • General idea: replace an expensive preprocessor x → x′ and an ordinary inner product with no preprocessor and a function K(x, xi) where K(x, xi) = x′ · x′i • This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
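Putting the last few slides together, a minimal kernel perceptron: store the mistake indices i1,…,ik instead of a weight vector, and score with Σj yij K(xij, x). The XOR-style data and polynomial kernel are illustrative; note XOR is not linearly separable, but this kernel separates it.

```python
import numpy as np

def K(u, v):
    """The polynomial kernel (u.v + 1)^2 from the preceding slides."""
    return (np.dot(u, v) + 1) ** 2

def kernel_perceptron(examples, epochs=10):
    """Kernel perceptron: keep the mistakes i_1..i_k rather than v_k;
    the score of x is sum_j y_{i_j} K(x_{i_j}, x)."""
    mistakes = []                              # indices of mistaken examples
    for _ in range(epochs):
        for i, (x, y) in enumerate(examples):
            s = sum(examples[j][1] * K(examples[j][0], x) for j in mistakes)
            if y * s <= 0:                     # mistake
                mistakes.append(i)
    return mistakes

def kernel_predict(examples, mistakes, x):
    s = sum(examples[j][1] * K(examples[j][0], x) for j in mistakes)
    return 1 if s > 0 else -1

# XOR-like data: not linearly separable in the input space
data = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, -1.0]), 1),
        (np.array([1.0, -1.0]), -1), (np.array([-1.0, 1.0]), -1)]
ms = kernel_perceptron(data)
preds = [kernel_predict(data, ms, x) for x, _ in data]   # all four correct
```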
The kernel trick – con’t • Even more general idea: use any function K that is • Continuous • Symmetric, i.e., K(u,v) = K(v,u) • “Positive semidefinite”, i.e., for any x1,…,xn the matrix M[i,j] = K(xi,xj) has no negative eigenvalues • Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., K(u,v) = u′ · v′ for some mapping u → u′ • Terminology: K is a Mercer kernel. The set of all x′ lives in a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi,xj) is a Gram matrix.
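The Mercer conditions can be spot-checked numerically: build the Gram matrix for some arbitrary points and confirm it is symmetric with no (numerically) negative eigenvalues. The random points and the particular kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # five arbitrary points x_i

def K(u, v):
    return (np.dot(u, v) + 1) ** 2          # the polynomial kernel used above

# Gram matrix M[i, j] = K(x_i, x_j)
M = np.array([[K(xi, xj) for xj in X] for xi in X])

symmetric = np.allclose(M, M.T)
eigenvalues = np.linalg.eigvalsh(M)
psd = bool(np.all(eigenvalues > -1e-9))     # all eigenvalues >= 0, up to roundoff
```

A spot check on one point set is of course not a proof; Mercer's theorem is what guarantees this holds for every choice of points.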