
Kernels for dummies Tea talk September 22, 2014

Understand the concept of kernels without complicated equations: how kernel methods turn optimization problems over functions into linear algebra, with smoothness built in.


Presentation Transcript


  1. Kernels for dummies Tea talk September 22, 2014

  2. I find kernel talks confusing. • Equations seem to appear out of nowhere. • It’s hard for me to extract the main message. • I still don’t know what an RKHS is. • Which is a little strange, considering I’ve known about Hilbert spaces since before most of you were born.

  3. So today we’re going to attempt to demystify kernels. My claim: if you understand linear algebra, you’ll understand kernels. There will be no RKHSs in this talk. And no Polish spaces. And no mean embeddings. But also no proofs; basically I’m going to avoid the hard stuff. Instead, just the intuition!

  4. We’ll start with a pretty standard problem: • You have samples from some distribution, p(x). • You want to minimize • F(∫dx f(x) p(x)) • with respect to f(x). • I’ll give examples later, but just about every kernel talk you have ever heard considers this problem, or a slight generalization.

  5. Our problem: f̃(x) = arg min_{f(x)} F(∫dx f(x) p(x)), where all we have are samples x ~ p(x).

  6. Our problem: f̃(x) = arg min_{f(x)} F(∫dx f(x) p(x)), with p(x) replaced by its empirical estimate, p(x) = (1/n) Σ_i δ(x – x_i).

  7. Our problem: f̃(x) = arg min_{f(x)} F(∫dx f(x) p(x)), where ∫dx f(x) (1/n) Σ_i δ(x – x_i) = (1/n) Σ_i f(x_i). Take f(x) = +∞ · δ(x – x_1) and ∫dx f(x) p(x) = +∞; take f(x) = –∞ · δ(x – x_1) and ∫dx f(x) p(x) = –∞. By suitably adjusting f(x), ∫dx f(x) p(x) can range from –∞ to +∞. Therefore, we have to regularize f(x): we need a smoothness constraint.
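A tiny numerical check of that unboundedness (a sketch with my own toy numbers, not from the slides): with the empirical p(x), the objective is just (1/n) Σ_i f(x_i), and a spike at one sample point drives it wherever you like.

```python
# Minimal sketch: with p(x) = (1/n) sum_i delta(x - x_i), the integral
# int dx f(x) p(x) equals (1/n) sum_i f(x_i), so spiking f at one sample
# point x_1 makes the objective as large (or as negative) as we please.
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=10)                          # the samples x_i defining the empirical p

def objective(f):
    return np.mean(f(xs))                         # (1/n) sum_i f(x_i)

for c in (1.0, 1e3, -1e6):
    spike = lambda x, c=c: c * np.exp(-(x - xs[0]) ** 2 / 1e-6)   # tall, narrow bump at x_1
    print(c, objective(spike))                    # scales with c: no smoothness, no bound
```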

  8. If you’re Bayesian, you put a prior over f(x). If you’re a kernel person, you demand that ∫dx dy f(x) K⁻¹(x, y) f(y) is in some sense small. K(x, y) is a kernel; for example, K(x, y) = exp(-(x–y)²/2).

  9. An aside: ⟨f, g⟩ = ∫dx dy f(x) K⁻¹(x, y) g(y).

  10. If you’re Bayesian, you put a prior over f(x). If you’re a kernel person, you demand that ∫dx dy f(x) K⁻¹(x, y) f(y) is in some sense small. K(x, y) is a kernel; for example, K(x, y) = exp(-(x–y)²/2).

  11. This raises two questions: • 1. How do we make sense of K⁻¹(x, y)? • 2. What does ∫dx dy f(x) K⁻¹(x, y) f(y) have to do with smoothness?

  12. 1. Making sense of K⁻¹(x, y): ∫dy K⁻¹(x, y) K(y, z) = δ(x – z) defines K⁻¹(x, y). Think of K as a matrix and K⁻¹ as its inverse. K has an uncountably infinite number of indices, but otherwise it’s a very standard matrix. K⁻¹ exists if all the eigenvalues of K are positive. An aside: K⁻¹ doesn’t really exist. But that’s irrelevant.
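To make “K is just a matrix” concrete, here is a minimal NumPy sketch (the grid, the Gaussian kernel, and the small ridge are my own choices): on a finite grid K becomes an ordinary Gram matrix, its eigenvalues are positive but span many orders of magnitude, and that is exactly why K⁻¹ “doesn’t really exist” without a little numerical help.

```python
# Sketch: discretize x, so K(x, y) = exp(-(x - y)^2 / 2) becomes a 50 x 50 matrix.
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)      # K_mn = K(x_m, x_n)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.max(), eigvals.min())                    # positive in theory; the smallest sit at roundoff level

# A tiny ridge keeps the inverse well behaved -- the finite-grid analogue of the
# remark that K^{-1} doesn't really exist, but that this is irrelevant in practice.
K_inv = np.linalg.inv(K + 1e-6 * np.eye(len(x)))
```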

  13. 2. What does ∫dx dy f(x) K⁻¹(x, y) f(y) have to do with smoothness? I’ll answer for a specific case: translation-invariant kernels, K(x, y) = K(x – y).

  14. ∫dx dy f(x) K⁻¹(x, y) f(y) • = ∫dx dy f(x) K⁻¹(x – y) f(y) [translation invariance] • = ∫dk |f(k)|² / K(k) [Fourier transform], where f(k) and K(k) are the Fourier transforms of f(x) and K(x).

  15. ∫dx dy f(x) K⁻¹(x, y) f(y) • = ∫dx dy f(x) K⁻¹(x – y) f(y) [translation invariance] • = ∫dk |f(k)|² / K(k). • For smooth kernels, K(k) falls off rapidly with k. • For the above integral to be small, f(k) must fall off rapidly with k. • In other words, f(x) must be smooth.

  16. ∫dx dy f(x) K⁻¹(x, y) f(y) • = ∫dx dy f(x) K⁻¹(x – y) f(y) • = ∫dk |f(k)|² / K(k). • Example: K(x) = exp(-x²/2) ⇒ K(k) ∝ exp(-k²/2), so the penalty becomes ∫dk |f(k)|² exp(+k²/2).
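Here is a hedged numerical sketch of that Fourier-domain penalty (the grid, the two test functions, and the frequency band are my own choices): a wiggly f pays a vastly larger value of ∫dk |f(k)|² / K(k) than a smooth one with the same envelope.

```python
# Sketch of the Fourier-domain penalty for the Gaussian kernel.
import numpy as np

x = np.linspace(-10, 10, 2048)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(len(x), d=dx)     # angular frequencies
band = np.abs(k) < 8.0                           # band where the FFT resolves the true spectra
Kk = np.exp(-0.5 * k ** 2)                       # Fourier transform of exp(-x^2/2), up to a constant

def penalty(f):
    fk = np.fft.fft(f) * dx                      # discrete stand-in for the Fourier transform of f
    return np.sum(np.abs(fk[band]) ** 2 / Kk[band]) * (k[1] - k[0])

smooth = np.exp(-0.5 * x ** 2)
wiggly = smooth * np.cos(5 * x)                  # same envelope, high-frequency content
print(penalty(smooth), penalty(wiggly))          # the wiggly function is penalized far more
```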

  17. More generally, let ∫dy K(x, y) g_k(y) = λ(k) g_k(x). Then ∫dx dy f(x) K⁻¹(x, y) f(y) = ∫dk [∫dx f(x) g_k(x)]² / λ(k). Typically, λ(k) falls off with k, and the g_k(x) become increasingly rough with k.

  18. Finally, let’s link this to linear algebra: ∫dx f(x) g(x) ↔ f·g, ∫dx f(x) A(x, y) ↔ (f·A)(y), so ∫dx dy f(x) K⁻¹(x, y) f(y) = f·K⁻¹·f. Compare to: Σ_i x_i y_i = x·y, Σ_j A_ij x_j = (A x)_i, Σ_ij x_i A⁻¹_ij x_j = x·A⁻¹·x. Integrals are glorified sums!
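In case the dictionary is too terse, a one-screen sketch of “integrals are glorified sums” (the functions and grid are my own): discretize x and the functional inner product is literally a dot product.

```python
# Sketch: int dx f(x) g(x) becomes an ordinary dot product f . g (times dx).
import numpy as np

x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x ** 2)
g = np.cos(x)

print(np.dot(f, g) * dx)                  # approximates int dx exp(-x^2/2) cos(x)
print(np.sqrt(2 * np.pi) * np.exp(-0.5))  # the exact value, sqrt(2*pi) * e^{-1/2}
```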

  19. Our problem: f̃ = arg min_f F(f·p) such that f·K⁻¹·f is small. Two notions of small, both leading to d/df [ F(f·p) + λ f·K⁻¹·f ] = 0: Lagrange multipliers (f·K⁻¹·f = constant), or λ held fixed. An aside: λ f·K⁻¹·f can often be thought of as coming from a prior.

  20. d [ F(fp) +  fK-1f ] = 0 df is easy to solve: F(fp) p + 2K-1f = 0 => f = – Remember: p(x) = i (x-xi) => Kp(x) = iK(x, xi) F(fp) Kp 2 1 n 1 n

  21. The more general problem: {f̃_1, f̃_2, …} = arg min_{f_1, f_2, …} F(f_1·p_1, f_2·p_2, …) such that the f_i·K⁻¹·f_i are small. Almost all kernel-related problems fall into this class. Those problems are fully specified by: the functional, F(f_1·p_1, f_2·p_2, …), to be minimized, and what one means by small (e.g., f_i·K⁻¹·f_i = c_i). The rest is (typically very straightforward) algebra.

  22. Three examples: 1. A “witness” function. 2. Ridge regression. 3. Kernel PCA (which is a little bit different).

  23. 1. A “witness” function. Maximize [f·(p–q)]² = [∫dx f(x) (p(x) – q(x))]² subject to the constraint f·K⁻¹·f = ∫dx dy f(x) K⁻¹(x, y) f(y) = 1. Here p and q are sums of delta functions (empirical distributions).

  24. 1. A “witness” function. Maximize [f·(p–q)]² subject to the constraint f·K⁻¹·f = 1. Lagrange multipliers: d/df ( [f·(p–q)]² – λ f·K⁻¹·f ) = 0 ⇒ f = K·(p–q) / [(p–q)·K·(p–q)]^{1/2} ⇒ [f·(p–q)]² = (p–q)·K·(p–q).

  25. 1. A “witness” function. p·K·p = ∫dx dy p(x) K(x, y) p(y); with p(x) = (1/n) Σ_j δ(x – x_j), this is (1/n²) Σ_ij K(x_i, x_j). We didn’t mention RKHSs. We didn’t mention mean embeddings. All we did was linear algebra.
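A minimal NumPy sketch of those double sums (the sample sets, sample sizes, and Gaussian kernel are my own choices): (p–q)·K·(p–q) is just three averages of kernel evaluations, and K·(p–q) evaluated at a point gives the unnormalized witness function.

```python
# Sketch: empirical (p-q).K.(p-q) and the unnormalized witness K.(p-q).
import numpy as np

def gauss_K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=100)        # samples from p
Y = rng.normal(0.5, 1.0, size=100)        # samples from q

# (p-q).K.(p-q) = p.K.p + q.K.q - 2 p.K.q, each a double average of K values
print(gauss_K(X, X).mean() + gauss_K(Y, Y).mean() - 2.0 * gauss_K(X, Y).mean())

# The unnormalized witness function, f(x) proportional to [K.(p-q)](x)
witness = lambda x: gauss_K(np.atleast_1d(x), X).mean(axis=1) - gauss_K(np.atleast_1d(x), Y).mean(axis=1)
print(witness(np.array([-1.0, 0.0, 1.0])))
```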

  26. 1. A “witness” function. d/df ( [f·(p–q)]² – λ f·K⁻¹·f ) = 0 ⇒ f = K·(p–q) / [(p–q)·K·(p–q)]^{1/2} ⇒ [f·(p–q)]² = (p–q)·K·(p–q). ~50% of Arthur’s Gatsby job talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.

  27. 2. Ridge regression. Minimize Σ_i (y_i – f·p_i)² + λ f·K⁻¹·f with respect to f. Here i labels observations, the y_i are observed (they’re scalars), we have samples from the distributions p_i(x), and λ is fixed. This is ridge regression (with a kernel twist).

  28. 2. Ridge regression. Solution (very straightforward algebra): f* = Σ_i α_i K·p_i, with α_i = Σ_j (B + λI)⁻¹_ij y_j, where I is the identity matrix and B_ij = p_i·K·p_j = (1/(n_i n_j)) Σ_mn K(x^(i)_m, x^(j)_n), the x^(i)_m being the n_i samples from p_i.
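To see the formulas run, here is a hedged sketch of the simplest special case, where each p_i is a single sample x_i (so n_i = 1 and B_ij = K(x_i, x_j)); the data and λ below are my own toy choices, and this reduces to ordinary kernel ridge regression.

```python
# Sketch: kernel ridge regression, f*(t) = sum_i alpha_i K(t, x_i).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + 0.1 * rng.normal(size=30)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)     # B_ij = K(x_i, x_j)
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)  # alpha = (B + lambda I)^{-1} y

def f_star(t):
    # f*(t) = sum_i alpha_i K(t, x_i)
    return np.exp(-0.5 * (np.atleast_1d(t)[:, None] - x[None, :]) ** 2) @ alpha

print(np.round(f_star(np.array([0.0, 1.5])), 3), np.round(np.sin([0.0, 1.5]), 3))
```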

  29. 2. Ridge regression. Solution (very straightforward algebra): f* = Σ_i α_i K·p_i, α_i = Σ_j (B + λI)⁻¹_ij y_j. We didn’t mention RKHSs. We didn’t mention mean embeddings. All we did was linear algebra.

  30. 2. Ridge regression. Minimize Σ_i (y_i – f·p_i)² + λ f·K⁻¹·f with respect to f: f* = Σ_i α_i K·p_i, α_i = Σ_j (B + λI)⁻¹_ij y_j. ~50% of Zoltan’s second-to-last research talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.

  31. 3. Kernel PCA (which is a little bit different). We have a set of points (in, for instance, Euclidean space), z_i, i = 1, …, n. We want to project them into a higher-dimensional space and do PCA in that space. Why not go to the extreme and project them into an infinite-dimensional space? f_i(x) = K(z_i, x).

  32. 3. Kernel PCA (which is a little bit different). Now we have a set of points (in function space), f_i, i = 1, …, n. We want to find a lower-dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize Σ_i (f_i – Σ_j A_ij v_j) · (f_i – Σ_j A_ij v_j) with respect to A_ij and v_j.

  33. 3. Kernel PCA (which is a little bit different). Remember, (f_i – Σ_j A_ij v_j) · (f_i – Σ_j A_ij v_j) is shorthand for ∫dx (f_i(x) – Σ_j A_ij v_j(x)) (f_i(x) – Σ_j A_ij v_j(x)). But we can mess with the norm to emphasize smoothness: ∫dx dy (f_i(x) – Σ_j A_ij v_j(x)) Q⁻¹(x, y) (f_i(y) – Σ_j A_ij v_j(y)).

  34. 3. Kernel PCA (which is a little bit different). We then minimize Σ_i (f_i – Σ_j A_ij v_j) · Q⁻¹ · (f_i – Σ_j A_ij v_j) with respect to A_ij and v_j. If we set Q = K, we get standard kernel PCA. That’s the most convenient choice, because it makes it easy to compute A_ij and v_j. I don’t know if there are any other justifications.
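A minimal sketch of the Q = K case (standard kernel PCA), with my own toy 2-D points and the Gaussian kernel: the usual recipe is to eigendecompose the doubly centred Gram matrix and read off the component scores.

```python
# Sketch: kernel PCA via the centred Gram matrix K_ij = K(z_i, z_j).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))                              # the points z_i

sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)                                # Gram matrix of the f_i

n = len(Z)
H = np.eye(n) - np.ones((n, n)) / n                        # centering matrix
Kc = H @ K @ H                                             # centre in feature space

eigvals, eigvecs = np.linalg.eigh(Kc)
scores = eigvecs[:, ::-1][:, :2] * np.sqrt(np.maximum(eigvals[::-1][:2], 0))
print(scores[:3])                                          # projections onto the first two components
```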

  35. Summary. Most (almost all?) kernel problems are of the form {f̃_1, f̃_2, …} = arg min_{f_1, f_2, …} F(f_1·p_1, f_2·p_2, …) such that the f_i·K⁻¹·f_i are small. Specify the functional, F(f_1·p_1, f_2·p_2, …), to be minimized and what one means by small (e.g., f_i·K⁻¹·f_i = c_i), and the rest is (typically very straightforward) algebra.

  36. The typical problem: d/df [ F(f·p) + λ f·K⁻¹·f ] = 0. The solution (two lines of algebra): f = – [F′(f·p) / 2λ] K·p.

  37. There is no reason (I can find) to mention RKHSs or mean embeddings. All quantities one needs arise very naturally as the solution to the problem one has proposed.
