
Learning a Kernel Matrix for Nonlinear Dimensionality Reduction

By K. Weinberger, F. Sha, and L. Saul

Presented by Michael Barnathan

The Problem:
  • Data lies on or near a manifold.
    • Lower dimensionality than overall space.
    • Locally Euclidean.
    • Example: data lying on a 2D surface embedded in R3; a small patch of a sphere looks flat (locally Euclidean).
  • Goal: Learn a kernel that will let us work in the lower-dimensional space.
    • “Unfold” the manifold.
    • First we need to know what it is!
      • Its dimensionality.
      • How it can vary.

Figure: a sphere as an example of a 2D manifold (image from Wikipedia).

Background Assumptions:
  • Kernel Trick
    • Mercer’s Theorem: a continuous, symmetric, positive semidefinite kernel function can be represented as a dot (inner) product in a high-dimensional space (Wikipedia; implied in paper).
    • So we replace the dot product with a kernel function.
      • Equivalently, work with the Gram (kernel) matrix Knm = φ(xn)ᵀφ(xm) = k(xn, xm).
      • The kernel provides an implicit mapping into a high-dimensional space.
      • Consequence of Cover’s theorem: a nonlinear problem is more likely to become linearly separable there.
    • Example: SVMs replace xiᵀxj with φ(xi)ᵀφ(xj) = k(xi, xj).
  • Linear Dimensionality Reduction Techniques:
    • SVD, derived techniques (PCA, ICA, etc.) remove linear correlations.
    • This reduces the dimensionality.
  • Now combine these!
    • Kernel PCA for nonlinear dimensionality reduction!
    • Map input to a higher dimension using a kernel, then use PCA.
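To make the kernel-PCA step concrete, here is a minimal sketch using scikit-learn's KernelPCA on a toy Swiss-roll dataset (illustrative code, not from the paper; the RBF kernel and gamma value are arbitrary choices):

```python
# Minimal kernel-PCA sketch (illustrative; not the paper's code).
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=800, random_state=0)   # 3-D nonlinear manifold

# Map inputs into a feature space via a kernel, then do PCA there.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05)
Y = kpca.fit_transform(X)                                # low-dimensional embedding
print(Y.shape)                                           # (800, 2)
```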
The (More Specific) Problem:
  • Data described by a manifold.
  • Using kernel PCA, discover the manifold.
  • There’s only one detail missing:
  • How do we find the appropriate kernel?
  • This forms the basis of the paper’s approach.
  • It is also a motivation for the paper…
Motivation:
  • Exploits properties of the data, not just its space.
  • Relates kernel discovery to manifold learning.
    • With the right kernel, kernel PCA will allow us to discover the manifold.
    • So it has implications for both fields.
      • Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning.
  • Unlike previous methods, this approach is unsupervised; the kernel is learned automatically.
  • Not specific to PCA; the learned kernel matrix can be used with other kernel methods.
Methodology – Idea:
  • Semidefinite programming (optimization)
    • Look for a locally isometric mapping that “unfolds” the manifold.
      • Preserves distances and angles between nearby points.
      • Acts as a rotation plus translation on each neighborhood.
    • Fix the distances and angles between each point and its k nearest neighbors (a neighborhood-graph sketch follows this slide).
  • Intuition:
    • Represent points as a lattice of “steel balls”.
    • Neighborhoods connected by “rigid rods” that fix angles and distance (local isometry constraint).
    • Now pull the balls as far apart as possible (obj. function).
    • The lattice flattens -> Lower dimensionality!
  • The “balls” and “rods” represent the manifold...
    • If the data is well-sampled (Wikipedia).
    • Shouldn’t be a problem in practice.
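A minimal sketch of the neighborhood step, assuming scikit-learn's kneighbors_graph; the function name and the choice k = 4 (matching the later experiments) are illustrative:

```python
# Build the k-nearest-neighbor graph whose edges act as the "rigid rods".
import numpy as np
from sklearn.neighbors import kneighbors_graph

def neighborhood_graph(X, k=4):
    # Symmetric 0/1 adjacency: i ~ j if either is among the other's k nearest neighbors.
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    return np.maximum(A, A.T).astype(bool)
```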
Optimization Constraints:
  • Isometry:
    • Let G be the Gram matrix of the inputs, Gij = xiᵀxj, and K the learned kernel matrix, Kij = φ(xi)ᵀφ(xj).
    • For every pair xi, xj that are neighbors of each other, or are both neighbors of a common point (pair enumeration sketched after this slide):
    • Kii + Kjj − Kij − Kji = Gii + Gjj − Gij − Gji.
    • Expanding both sides, this says ||φ(xi) − φ(xj)||² = ||xi − xj||²: local distances (and hence angles within each neighborhood) are preserved.
  • Positive Semidefiniteness (required for kernel trick).
    • No negative eigenvalues.
  • Centered on the origin: Σi φ(xi) = 0, equivalently Σij Kij = 0.
    • So eigenvalues measure variance of PCs.
    • Dataset can be centered if not already.
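A sketch of how the constrained pairs could be enumerated (pairs that are neighbors, or neighbors of a common point), assuming the boolean adjacency matrix A from the earlier neighborhood sketch; this is illustrative code, not the authors':

```python
# Pairs (i, j) on which the isometry constraint is imposed:
# i and j are neighbors, or both are neighbors of some common point.
import numpy as np

def constrained_pairs(A):
    Ai = A.astype(int)
    common = Ai @ Ai                       # common[i, j] > 0 iff i and j share a neighbor
    mask = A | (common > 0)
    i_idx, j_idx = np.nonzero(np.triu(mask, k=1))   # keep each unordered pair once
    return list(zip(i_idx.tolist(), j_idx.tolist()))
```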
Objective Function
  • We want to maximize pairwise distances.
  • This is an inversion of SSE/MSE!
  • So we maximize T = (1/2N) Σij ||φ(xi) − φ(xj)||².
  • Which is just Tr(K)!
  • Proof (not given in paper): ||φ(xi) − φ(xj)||² = Kii + Kjj − 2Kij; summing over all i, j gives 2N Tr(K) − 2 Σij Kij, and the centering constraint makes Σij Kij = 0, so T = Tr(K).
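A quick numerical sanity check of this identity on random, centered feature vectors (illustrative only):

```python
# Check: for a centered Gram matrix K, (1/2N) * sum_ij ||phi_i - phi_j||^2 == trace(K).
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
Phi -= Phi.mean(axis=0)                   # center the features -> sum_ij K_ij = 0
K = Phi @ Phi.T

N = K.shape[0]
d2 = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
print(np.allclose(d2.sum() / (2 * N), np.trace(K)))         # True
```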
Semidefinite Embedding (SDE)
  • Maximize Tr(K) subject to:
    • K ⪰ 0 (positive semidefinite).
    • Σij Kij = 0 (centered on the origin).
    • Kii + Kjj − Kij − Kji = Gii + Gjj − Gij − Gji for all i, j that are neighbors of each other or of a common point.
  • This optimization is convex, so it has no local optima.
  • Use semidefinite programming to perform the optimization (no SDP details in paper).
  • Once we have the optimal kernel, perform kPCA.
  • This technique (SDE) is this paper’s contribution.
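A compact sketch of the SDP in CVXPY, not the authors' implementation (they used off-the-shelf SDP solvers); G is the input Gram matrix and pairs the constrained pairs from the earlier sketches:

```python
# Semidefinite Embedding (SDE) as a semidefinite program, sketched with CVXPY.
import cvxpy as cp

def sde_kernel(G, pairs):
    n = G.shape[0]
    K = cp.Variable((n, n), PSD=True)            # K >= 0 (positive semidefinite)
    constraints = [cp.sum(K) == 0]               # centering: sum_ij K_ij = 0
    for i, j in pairs:                           # local isometry constraints
        constraints.append(K[i, i] + K[j, j] - K[i, j] - K[j, i]
                           == G[i, i] + G[j, j] - G[i, j] - G[j, i])
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()
    return K.value
```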
Experimental Setup
  • Four kernels:
    • SDE (proposed)
    • Linear
    • Polynomial
    • Gaussian
  • “Swiss Roll” Dataset.
    • 23 dimensions.
      • 3 meaningful dimensions (an approximate generation is sketched after this slide).
      • 20 filled with small-amplitude noise.
    • 800 inputs.
    • k = 4 (neighborhood size), p = 4 (polynomial degree), σ = 1.45 (Gaussian width, the σ of 4-neighborhoods).
  • “Teapot” Dataset.
    • Same teapot, rotated 0 ≤ i < 360 degrees.
    • 23,028 dimensions (76 x 101 x 3).
    • Only one degree of freedom (angle of rotation).
    • 400 inputs.
    • k = 4, p = 4, σ = 1541.
  • “The handwriting dataset”.
    • No dimensionality or parameters specified (16x16x1 = 256D?)
    • 953 images. No images or kernel matrix shown.
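For reference, a roughly comparable “Swiss Roll” input matrix could be generated as follows; the noise scale and sampling are assumptions, since the paper's exact construction is not reproduced here:

```python
# Approximate "Swiss Roll" inputs: 3 meaningful dimensions plus 20 low-amplitude
# noise dimensions, 800 samples (noise scale assumed).
import numpy as np
from sklearn.datasets import make_swiss_roll

rng = np.random.default_rng(0)
X3, _ = make_swiss_roll(n_samples=800, random_state=0)   # the 3 meaningful dims
noise = 0.01 * rng.normal(size=(800, 20))                # 20 small-noise dims
X = np.hstack([X3, noise])                               # 800 x 23 input matrix
```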
Results – Dimensionality Reduction
  • Two measures:
    • Learned kernel matrices (SDE), shown as images in the paper.
    • “Eigenspectra”:
      • Variance captured by individual eigenvalues.
      • Normalized by trace (sum of eigenvalues).
      • Seems to indicate manifold dimensionality.

Figure: learned kernel matrices and eigenspectra for the “Swiss Roll”, “Teapot”, and “Digits” datasets (not reproduced here).
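A small sketch of the eigenspectrum diagnostic: eigenvalues of the learned kernel, normalized by the trace (illustrative helper, not from the paper):

```python
# Normalized eigenspectrum of a learned kernel matrix K.
import numpy as np

def eigenspectrum(K):
    # Eigenvalues in decreasing order, normalized by the trace:
    # the fraction of total variance captured by each principal component.
    vals = np.linalg.eigvalsh(K)[::-1]
    return vals / np.trace(K)
```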

Results – Large Margin Classification
  • Used SDE kernels with SVMs.
  • Results were very poor.
    • Lowering dimensionality can impair separability.

Error rates were measured with a 90/10 training/test split, averaged over 10 experiments (table not reproduced here).

After unfolding, the classes are no longer linearly separable, so the SVM decision boundary fares poorly.
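The classification setup can be mirrored with scikit-learn's precomputed-kernel SVM; here K stands for a learned kernel matrix over all points and is an assumed input (a sketch, not the paper's code):

```python
# Using a learned kernel matrix with an SVM via a precomputed kernel (sketch).
import numpy as np
from sklearn.svm import SVC

def svm_error_rate(K, y, train_idx, test_idx):
    # K: kernel matrix over all points (e.g., a learned SDE kernel); y: class labels.
    clf = SVC(kernel="precomputed")
    clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])    # train-by-train block
    pred = clf.predict(K[np.ix_(test_idx, train_idx)])        # test-by-train block
    return float((pred != y[test_idx]).mean())
```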

Strengths and Weaknesses
  • Strengths:
    • Unsupervised convex kernel optimization.
    • Generalizes well in theory.
    • Relates manifold learning and kernel learning.
    • Easy to implement; just solve optimization.
    • Intuitive (stretching a string).
  • Weaknesses:
    • May not generalize well in practice (SVMs).
      • Implicit assumption: lower dimensionality is better.
      • Not always the case (as in SVMs due to separability in higher dimensions).
    • Robustness – what if a neighborhood contains an outlier?
    • Offline algorithm – the entire Gram matrix is required.
      • Only a problem if N is large.
    • Paper doesn’t mention SDP details.
      • No algorithm analysis, complexity, etc. Complexity is “relatively high”.
      • In fact, no proof of convergence (according to the authors’ other 2004 paper).
        • Isomap, LLE, et al. already have such proofs.
Possible Improvements
  • Introduce slack variables for robustness.
    • “Rods” would no longer be rigid, but would be penalized for “bending”.
    • Would introduce a “C” parameter, as in SVMs (a sketch follows after this list).
  • Incrementally accept minors of K for large values of N, use incremental kernel PCA.
  • Convolve SDE kernel with others for SVMs?
    • SDE unfolds manifold, other kernel makes the problem linearly separable again.
    • Only makes sense if SDE simplifies the problem.
  • Analyze complexity of SDP.
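A hedged sketch of the first suggestion: soften each isometry constraint with a slack variable and penalize total slack with a tunable C, by analogy with soft-margin SVMs. This is the presenter's proposal, not anything in the paper, and the CVXPY formulation below is only one way to express it:

```python
# Soft (slack) variant of the SDE constraints, sketched with CVXPY.
import cvxpy as cp

def soft_sde_kernel(G, pairs, C=1.0):
    n = G.shape[0]
    K = cp.Variable((n, n), PSD=True)
    xi = cp.Variable(len(pairs), nonneg=True)        # one slack per constrained pair
    constraints = [cp.sum(K) == 0]
    for m, (i, j) in enumerate(pairs):
        lhs = K[i, i] + K[j, j] - K[i, j] - K[j, i]
        rhs = G[i, i] + G[j, j] - G[i, j] - G[j, i]
        constraints.append(cp.abs(lhs - rhs) <= xi[m])   # "rods" may bend, but only by xi
    cp.Problem(cp.Maximize(cp.trace(K) - C * cp.sum(xi)), constraints).solve()
    return K.value
```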
Conclusions
  • Using SDP, SDE can learn kernel matrices to “unfold” data embedded in manifolds.
    • With no free parameters beyond the neighborhood size k.
  • Kernel PCA then reduces dimensionality.
  • Excellent for nonlinear dimensionality reduction / manifold learning.
    • Dramatic results when the input dimensionality far exceeds the manifold dimensionality.
  • Poorly suited for SVM classification.