
Low-Rank Kernel Learning with Bregman Matrix Divergences
Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon
Journal of Machine Learning Research 10 (2009) 341-376

Presented by:

Peng Zhang

4/15/2011


Outline

  • Motivation

  • Major Contributions

  • Preliminaries

  • Algorithms

  • Discussions

  • Experiments

  • Conclusions


Motivation

  • Low-rank matrix nearness problems

    • Learning low-rank positive semidefinite (kernel) matrices for machine learning applications

    • Divergences (distances) between data objects

    • Find divergence measures suited to such matrices

      • Efficiency matters

  • Positive semidefinite (PSD), and often low-rank, matrices are common in machine learning with kernel methods

    • Existing learning techniques enforce the PSD constraint explicitly, resulting in expensive computations

  • Bypass this explicit constraint: find divergences that enforce positive semidefiniteness automatically


Major Contributions

  • Goal

    • Efficient algorithms that can find a PSD (kernel) matrix as ‘close’ as possible to some input PSD matrix under equality or inequality constraints

  • Proposals

    • Use the LogDet and von Neumann matrix divergences as the nearness measures for PSD matrix learning

    • Use cyclic Bregman projections to satisfy the constraints under these divergences

      • Computationally efficient, scaling linearly with the number of data points n and quadratically with the rank of the input matrix

  • Properties of the proposed algorithms

    • Range-space preserving property: the output has the same range space, and hence the same rank, as the input

    • The rank therefore does not change during the updates

    • Computationally efficient

      • Running times are linear in the number of data points and quadratic in the rank of the kernel (per iteration)


Preliminaries

  • Kernel methods

    • Work with inner products in feature space

    • The only information needed is the kernel matrix K

      • K is always PSD

    • If K is low-rank, a low-rank decomposition can be used to improve computational efficiency (see the sketch below)

This motivates low-rank kernel matrix learning.
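
As a minimal illustration of why the low-rank factorization helps (a sketch with made-up sizes, not code from the paper): once K = GGᵀ for an n × r factor G, individual kernel entries and kernel-vector products never require forming the n × n matrix.

```python
import numpy as np

n, r = 1000, 20                    # n data points, rank-r kernel (illustrative sizes)
G = np.random.randn(n, r)          # low-rank factor: K = G @ G.T is never formed

def kernel_entry(i, j):
    """K[i, j] computed from the factor in O(r) time."""
    return G[i] @ G[j]

def kernel_vector_product(v):
    """K @ v in O(n r) time instead of O(n^2)."""
    return G @ (G.T @ v)
```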


Preliminaries

  • Bregman vector divergences

  • Extension to Bregman matrix divergences

Intuitively, a Bregman divergence can be thought of as the difference between the value of the generating function φ at a point x and the value of the first-order Taylor expansion of φ around a point y, evaluated at x.
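
For reference, the standard definitions (for a strictly convex, differentiable generating function, as in the paper) are

$$D_\varphi(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{y})^T(\mathbf{x} - \mathbf{y}),$$

$$D_\Phi(X, Y) = \Phi(X) - \Phi(Y) - \operatorname{tr}\!\big(\nabla \Phi(Y)^T (X - Y)\big),$$

where the matrix form is obtained by replacing the vector inner product with the trace inner product over symmetric matrices.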


Preliminaries

  • Special Bregman matrix divergences

    • The von Neumann divergence (DvN)

    • The LogDet divergence (Dld)

These definitions assume full-rank matrices; see the formulas below.
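
Written out for full-rank n × n positive definite matrices, the two divergences are

$$D_{vN}(X, Y) = \operatorname{tr}\big(X \log X - X \log Y - X + Y\big),$$

$$D_{\ell d}(X, Y) = \operatorname{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n,$$

generated by the quantum entropy $\Phi(X) = \operatorname{tr}(X \log X - X)$ and the Burg entropy $\Phi(X) = -\log\det X$, respectively ($\log X$ denotes the matrix logarithm).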


Preliminaries

  • Important properties of DvN and Dld

    • X ranges over positive definite matrices

      • No explicit positive-definiteness constraint is needed

    • Range-space preserving property

    • Scale invariance of LogDet

    • Transformation invariance (see below)

    • Others

      • Going beyond the transductive setting: the learned kernel can be evaluated on new data points
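
In particular, LogDet is invariant under congruence transformations with any invertible M, and scale invariance is the special case $M = \sqrt{\alpha}\, I$:

$$D_{\ell d}(M^T X M,\; M^T Y M) = D_{\ell d}(X, Y), \qquad D_{\ell d}(\alpha X, \alpha Y) = D_{\ell d}(X, Y) \quad (\alpha > 0).$$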


Preliminaries

  • Spectral Bregman matrix divergences

    • The generating convex function

      • Defined through the eigenvalues via a scalar convex function

  • The Bregman matrix divergence can then be expressed through the eigenvalues and eigenvectors (see below)
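
Concretely, if $\Phi(X) = \sum_i \varphi(\lambda_i)$ for a scalar convex function φ applied to the eigenvalues, and $X = V \Lambda V^T$, $Y = U \Theta U^T$ are eigendecompositions, then expanding the definition gives

$$D_\Phi(X, Y) = \sum_{i,j} (\mathbf{v}_i^T \mathbf{u}_j)^2 \Big(\varphi(\lambda_i) - \varphi(\theta_j) - \varphi'(\theta_j)(\lambda_i - \theta_j)\Big),$$

so the divergence depends only on the eigenvalues and on how the two sets of eigenvectors align.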


Preliminaries

  • The kernel matrix learning problem of this paper

    • Non-convex in general, because of the rank constraint

    • Effectively convex when using LogDet/von Neumann, because the rank constraint is enforced implicitly

    • The constraints of interest are squared Euclidean distances between points in feature space

    • Each constraint matrix A is then rank one, so the problem can be written as shown below

    • Overall: learn a kernel matrix over all data points from side information (labels or constraints)
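
In symbols, a reconstruction consistent with this setup (K_0 is the input kernel, the (A_i, b_i) encode the side information, and r is the target rank):

$$\min_{K \succeq 0} \; D(K, K_0) \quad \text{s.t.} \quad \operatorname{tr}(K A_i) \le b_i, \;\; i = 1, \dots, c, \qquad \operatorname{rank}(K) \le r.$$

For a squared-distance constraint between points i and j, $A = (\mathbf{e}_i - \mathbf{e}_j)(\mathbf{e}_i - \mathbf{e}_j)^T$ is rank one and $\operatorname{tr}(K A) = K_{ii} + K_{jj} - 2 K_{ij}$.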


Preliminaries

  • Bregman projections

    • A method for solving the 'no rank constraint' version of the previous problem

      • Choose one constraint at a time

      • Perform a Bregman projection so that the current solution satisfies that constraint (see the projection step below)

      • With the LogDet and von Neumann divergences, these projections can be computed efficiently

      • Convergence is guaranteed, but may require many iterations
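
Each step is a projection of the current iterate onto a single constraint, measured in the chosen divergence:

$$K_{t+1} = \operatorname*{arg\,min}_{K \,:\, \operatorname{tr}(K A_i) \le b_i} D(K, K_t).$$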


Preliminaries

  • Bregman divergences for low-rank matrices

    • Must deal with matrices that have zero eigenvalues

      • Infinite divergences can occur, because the generating functions involve logarithms of eigenvalues

      • Finiteness of the divergence therefore implies a rank constraint

Range: the divergence stays finite only when the range space of the first argument is compatible with that of the second.

Rank: finiteness therefore implicitly constrains the rank.


Preliminaries

  • Rank-deficient LogDet and von Neumann divergences

  • Rank-deficient Bregman projections

    • von Neumann: (projection formula given in the paper)

    • LogDet: (projection formula given in the paper)


Algorithm Using LogDet

  • Cyclic projection algorithm using the LogDet divergence

    • An update is performed for each projection (see the sketch below)

    • The update can be simplified to a closed form

    • The range space is unchanged, and no eigendecomposition is required

    • Equation (21) of the paper costs O(n^2) operations per iteration

  • Improving update efficiency with a factored n x r matrix G

    • The update can be done using a Cholesky rank-one update

    • O(r^3) complexity

  • Further improving update efficiency to O(r^2)

    • Combines the Cholesky rank-one update with matrix multiplication
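
As a hedged reconstruction for a single rank-one constraint $A = \mathbf{z}\mathbf{z}^T$ with target value b (equality case, ignoring the dual-variable bookkeeping that inequality constraints require), the LogDet projection has the closed form

$$X_{t+1} = X_t + \beta\, X_t \mathbf{z} \mathbf{z}^T X_t, \qquad \beta = \frac{b - \mathbf{z}^T X_t \mathbf{z}}{(\mathbf{z}^T X_t \mathbf{z})^2},$$

which follows from the projection condition $\nabla\Phi(X_{t+1}) = \nabla\Phi(X_t) + \alpha\, \mathbf{z}\mathbf{z}^T$ with $\Phi(X) = -\log\det X$ and the Sherman-Morrison formula. This is the O(n^2) full-matrix form that the factored updates below accelerate.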


Algorithm Using LogDet

  • Factored form: X = GG^T, with each projection contributing a small factor L from a Cholesky factorization; overall G = G0 B, where B is the product of the L factors from every iteration and X0 = G0 G0^T

  • L can be determined implicitly (a numerical sketch of the factored update follows)
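
A minimal numerical sketch of the factored idea (my notation, assuming an equality constraint z' X z = b; it recomputes a full r x r Cholesky factor, i.e., the O(r^3) variant rather than the paper's implicit O(r^2) update):

```python
import numpy as np

def logdet_projection_factored(G, z, b):
    """One LogDet projection of X = G @ G.T onto the constraint z' X z = b.

    Returns an updated factor G_new with X_new = G_new @ G_new.T.
    Assumes b > 0 and z' X z > 0.
    """
    v = G.T @ z                          # r-dimensional vector
    p = v @ v                            # current value z' X z
    beta = (b - p) / p**2                # projection parameter (equality case)
    r = G.shape[1]
    # X_new = G (I + beta v v') G', so factor the small r x r middle matrix
    L = np.linalg.cholesky(np.eye(r) + beta * np.outer(v, v))
    return G @ L                         # range space of X is unchanged
```

After the update, z' X_new z = b, and the range space of X is preserved because G is only multiplied on the right by an invertible r x r factor.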


Algorithm Using LogDet

  • What’re the constraints? Convergence?

O(cr^2)

Convergence is checked by how much v has changed

May require large number of iterations

O(nr^2)


Algorithm Using von Neumann

  • Cyclic projection algorithm using the von Neumann divergence

    • An update is performed for each projection (sketched below)

    • The update can be modified to a more efficient form

    • To calculate the projection parameter, the unique root of a scalar function must be found
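
A sketch of the corresponding von Neumann projection in the full-rank case (my notation): since the gradient of the generating function is $\log X$, projecting onto a single rank-one constraint takes the form

$$X_{t+1} = \exp\!\big(\log X_t + \alpha\, \mathbf{z}\mathbf{z}^T\big),$$

where exp and log are matrix functions and α must be chosen so that $\operatorname{tr}(X_{t+1}\mathbf{z}\mathbf{z}^T) = b$. There is no closed form for α, which is why a root finder is needed and why this algorithm is somewhat slower than the LogDet one.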


Algorithm Using von Neumann

  • Slightly slower than Algorithm 2

The root finder slows down the process

O(r^2)


Discussions

  • Limitations of Algorithm 2 and Algorithm 3

    • The initial kernel matrix must be low-rank

      • So they are not applicable for dimensionality reduction

    • The number of iterations may be large

      • The paper only optimizes the computations within each iteration

      • Reducing the total number of iterations is a topic for future work

  • Handling new data points

    • Transductive setting

      • All data points are available up front

      • Some of the points have labels or other supervision

      • When a new data point is added, the entire kernel matrix must be re-learned

    • Workaround

      • View B as a linear transformation

      • Apply B to new points


Discussions

  • Generalizations to more constraints

    • Slack variables

      • When the number of constraints is large, the Bregman divergence minimization problem may have no feasible solution

      • Introduce slack variables

      • Constraints may then be violated, but violations are penalized

    • Similarity constraints (see the forms below)

    • Distance constraints

    • O(r^2) per projection

    • If arbitrary linear constraints are applied, O(nr)
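
Up to notation, the constraint forms referred to above are similarity constraints on individual kernel entries and squared-distance constraints:

$$K_{ij} \ge b \;\;\text{or}\;\; K_{ij} \le b, \qquad\qquad K_{ii} + K_{jj} - 2 K_{ij} \le b.$$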


Discussions

  • Special cases

    • The DefiniteBoost optimization problem

    • Online PCA

    • The nearest correlation matrix problem

  • Minimizing the LogDet divergence and semidefinite programming (SDP)

    • The SDP relaxation of the min-balanced-cut problem

    • Can be solved via LogDet divergence minimization


Experiments

  • Transductive learning and clustering

  • Data sets

    • Digits

      • Handwritten samples of digits 3, 8, and 9 from the UCI repository

    • GyrB

      • Protein data set with three bacteria species

    • Spambase

      • 4601 email messages with 57 attributes, spam/not spam labels

    • Nursery

      • 12960 instances with 8 attributes and 5 class labels

  • Classification

    • k-nearest neighbor classifier

  • Clustering

    • Kernel k-means algorithm

    • Use the normalized mutual information (NMI) measure (a toy evaluation sketch follows)
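
A toy sketch of this evaluation protocol (illustrative data and parameter choices, not the paper's experimental setup): for an exact factor K = G Gᵀ, kernel k-means on K reduces to ordinary k-means on the rows of G, and the kernel induces distances for the k-NN classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative stand-ins: a low-rank factor G (K = G @ G.T) and ground-truth labels y.
G = np.random.randn(300, 8)
y = np.random.randint(0, 3, size=300)

# Clustering quality: k-means on the rows of G (= kernel k-means for K = G G'), scored by NMI.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(G)
print("NMI:", normalized_mutual_info_score(y, clusters))

# Classification: k-NN on the kernel-induced distances d(i, j)^2 = K_ii + K_jj - 2 K_ij.
K = G @ G.T
d = np.diag(K)
D = np.sqrt(np.maximum(d[:, None] + d[None, :] - 2.0 * K, 0.0))
perm = np.random.default_rng(0).permutation(len(y))
tr, te = perm[:200], perm[200:]
knn = KNeighborsClassifier(n_neighbors=5, metric="precomputed")
knn.fit(D[np.ix_(tr, tr)], y[tr])
print("k-NN accuracy:", knn.score(D[np.ix_(te, tr)], y[te]))
```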


Experiments

  • Learn a kernel matrix using only constraints

    • The low-rank kernels learned by the proposed algorithms attain accurate clustering and classification

    • The original data is used to form the initial kernel matrix

    • The more constraints used, the more accurate the results

    • Convergence

      • von Neumann divergence

        • Convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints

      • LogDet divergence

        • Between 17 and 354 cycles


Simulation Results

Significant improvements

0.948 classification accuracy

For DefiniteBoost, 3220 cycles to convergence


Simulation Results

Rank 57 vs. rank 8

LogDet needs fewer constraints

LogDet converges much more slowly

(Future work)

But it often has a lower overall running time


Simulation Results

  • Metric learning and large-scale experiments

    • Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data

    • Compare the proposed algorithms with metric learning algorithms

      • Metric learning by collapsing classes (MCML)

      • Large-margin nearest neighbor metric learning (LMNN)

      • Squared Euclidean distance baseline


Conclusions

  • Developed LogDet- and von Neumann-divergence-based algorithms for low-rank matrix nearness problems

  • Running times are linear in the number of data points and quadratic in the rank of the kernel (per iteration)

  • The algorithms can be used in conjunction with a number of kernel-based learning algorithms


Thank you