Presentation Transcript

Low-Rank Kernel Learning with Bregman Matrix Divergences
Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon
Journal of Machine Learning Research 10 (2009) 341-376

Presented by: Peng Zhang
4/15/2011

Outline
  • Motivation
  • Major Contributions
  • Preliminaries
  • Algorithms
  • Discussions
  • Experiments
  • Conclusions
Motivation
  • Low-rank matrix nearness problems
    • Learning low-rank positive semidefinite (kernel) matrices for machine learning applications
    • Divergence (distance) between data objects
    • Find divergence measures well suited to such matrices
      • Efficiency
  • Positive semidefinite (PSD), often low-rank, matrices are common in machine learning with kernel methods
    • Current learning techniques require an explicit positive semidefiniteness constraint, resulting in expensive computations
  • Bypass this constraint by using divergences that enforce positive semidefiniteness automatically
Major Contributions
  • Goal
    • Efficient algorithms that can find a PSD (kernel) matrix as ‘close’ as possible to some input PSD matrix under equality or inequality constraints
  • Proposals
    • Use LogDet divergence/von Neumann divergence constraints in PSD matrix learning
    • Use Bregman projections for the divergences
      • Computationally efficient, scaling linearly with the number of data points n and quadratically with the rank of the input matrix
  • Properties of the proposed algorithms
    • Range-space preserving property (rank of output = rank of input)
    • Do not decrease rank
    • Computationally efficient
      • Running times are linear in the number of data points and quadratic in the rank of the kernel (per iteration)
Preliminaries
  • Kernel methods
    • Inner products in feature space
    • Only information needed is kernel matrix K
      • K is always PSD
    • If K is low rank
    • Use a low-rank decomposition to improve computational efficiency (see the sketch below)

Low rank kernel matrix learning
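As a rough, generic illustration of this setting (not from the slides), a rank-r kernel can be stored and manipulated through an n x r factor G with K = G Gᵀ. The helper below is hypothetical and simply truncates a linear kernel to rank r via an SVD:

```python
import numpy as np

# Hypothetical helper: build a rank-r factor G of a linear kernel so that K = G @ G.T.
# This is only meant to illustrate the "low-rank kernel = n x r factor" idea.
def lowrank_linear_kernel_factor(X, r):
    Xc = X - X.mean(axis=0)                       # center the data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]                       # n x r factor G

X = np.random.randn(100, 20)                      # 100 points in 20 dimensions
G = lowrank_linear_kernel_factor(X, r=5)
K = G @ G.T                                       # PSD kernel matrix of rank 5
print(K.shape, np.linalg.matrix_rank(K))          # (100, 100) 5
```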


Intuitively, these can be thought of as the difference between the value of the generating convex function F at point x and the value of the first-order Taylor expansion of F around point y, evaluated at point x (sketched numerically below).

Preliminaries
  • Bregman vector divergences
  • Extension to Bregman matrix divergences
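A small numerical sketch of the vector definition above (not from the slides; plain NumPy): plugging in two classic generating functions recovers the squared Euclidean distance and the relative entropy.

```python
import numpy as np

# D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>: the gap between F(x) and the
# first-order Taylor expansion of F around y, evaluated at x.
def bregman_divergence(F, grad_F, x, y):
    return F(x) - F(y) - grad_F(y) @ (x - y)

# F(x) = ||x||^2  ->  squared Euclidean distance.
sq_norm, sq_norm_grad = lambda x: x @ x, lambda x: 2.0 * x

# F(x) = sum_i x_i log x_i  ->  (unnormalized) relative entropy / KL divergence.
neg_entropy, neg_entropy_grad = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0

x, y = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(bregman_divergence(sq_norm, sq_norm_grad, x, y))          # equals ||x - y||^2
print(bregman_divergence(neg_entropy, neg_entropy_grad, x, y))  # equals KL(x || y) here
```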
Preliminaries
  • Special Bregman matrix divergences
    • The von Neumann divergence (DvN)
    • The LogDet divergence (Dld)

All of the above are defined for full-rank matrices
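For reference, a direct numerical evaluation of the two full-rank divergences, D_vN(X, Y) = tr(X log X - X log Y - X + Y) and D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n (a small NumPy/SciPy sketch, not an optimized implementation):

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(X, Y):
    # D_vN(X, Y) = tr(X log X - X log Y - X + Y), for positive definite X, Y
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_divergence(X, Y):
    # D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n, for positive definite X, Y
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

A, B = np.random.randn(5, 5), np.random.randn(5, 5)
X, Y = A @ A.T + np.eye(5), B @ B.T + np.eye(5)   # two random positive definite matrices
print(von_neumann_divergence(X, Y), logdet_divergence(X, Y))  # both >= 0, zero iff X == Y
```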

Preliminaries
  • Important properties of DvN and Dld
    • X is defined over positive definite matrices
      • No explicit positive-definiteness constraint needs to be imposed
    • Range-space preserving property
    • Scale-invariance of LogDet
    • Transformation invariance (see the identities below)
    • Others
      • Go beyond the transductive setting: the kernel function can be evaluated on new data points
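The invariance identities referenced above did not survive the transcript; for the LogDet divergence they are:

```latex
% Scale and transformation invariance of the LogDet divergence:
\[
  D_{\ell d}(\alpha X, \alpha Y) = D_{\ell d}(X, Y) \quad (\alpha > 0),
  \qquad
  D_{\ell d}(M^{\top} X M,\, M^{\top} Y M) = D_{\ell d}(X, Y) \quad \text{for any invertible } M.
\]
```

Both follow directly from the definition, since the quantities tr(X Y^{-1}) and log det(X Y^{-1}) are unchanged by these operations.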
Preliminaries
  • Spectral Bregman matrix divergence
    • Generating convex function
      • Built from the eigenvalues via a scalar convex function
  • The matrix divergence can then be written in terms of the eigenvalues and eigenvectors of both matrices (see below)
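The formulas on this slide were dropped in the transcript; for reference, the spectral construction used in the paper is:

```latex
% Spectral (separable) Bregman matrix divergence: apply a scalar convex function
% \varphi to the eigenvalues, then express the divergence through both
% eigendecompositions X = V \Lambda V^T and Y = U \Theta U^T.
\[
  \Phi(X) = \sum_i \varphi\big(\lambda_i(X)\big),
  \qquad
  D_{\Phi}(X, Y) = \sum_{i,j} \big(v_i^{\top} u_j\big)^2
     \Big( \varphi(\lambda_i) - \varphi(\theta_j) - \varphi'(\theta_j)\,(\lambda_i - \theta_j) \Big).
\]
```

Taking φ(t) = t log t − t gives the von Neumann divergence, and φ(t) = −log t gives the LogDet divergence.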
Preliminaries
  • Kernel matrix learning problem of this paper
    • Non-convex in general because of the explicit rank constraint
    • Convex when using the LogDet or von Neumann divergence, because the rank constraint is enforced implicitly
    • The constraints of interest are squared Euclidean distances between points
    • Each such constraint matrix A is rank one, and the problem can be written compactly (see below)
    • Learn a kernel matrix over all data points from side information (labels or constraints)
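The problem statement itself is not preserved in the transcript; reconstructed from the paper's formulation, it reads roughly as:

```latex
% Low-rank kernel learning problem (reconstruction of the paper's formulation):
\[
  \min_{K \succeq 0}\; D_{\Phi}(K, K_0)
  \quad \text{s.t.}\quad \operatorname{tr}(K A_i) \le b_i,\; i = 1,\dots,c,
  \qquad \operatorname{rank}(K) \le r .
\]
% A squared-Euclidean distance constraint between points i and j,
% K_{ii} + K_{jj} - 2 K_{ij} \le b, corresponds to A = (e_i - e_j)(e_i - e_j)^{\top},
% which is rank one. With the LogDet or von Neumann divergence the rank constraint is
% enforced implicitly and can be dropped, making the problem convex.
```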
Preliminaries
  • Bregman projections
    • A method for solving the version of the previous problem without the explicit rank constraint
      • Choose one constraint at a time
      • Perform a Bregman projection so that the current solution satisfies that constraint (a concrete LogDet version is sketched in the Algorithm Using LogDet section)
      • Using LogDet and von Neumann divergences, projections can be computed efficiently
      • Convergence guaranteed, but may require many iterations
Preliminaries
  • Bregman divergences for low rank matrices
    • Deal with matrices with 0 eigenvalues
      • Infinite divergences can occur when eigenvalues are zero (the log and inverse terms blow up)
      • A finite divergence therefore implicitly imposes a rank (range-space) constraint

(Slide callouts on the range and rank conditions; the formulas were not preserved in the transcript.)

Preliminaries
  • Rank deficient LogDet and von Neumann Divergences
  • Rank deficient Bregman projections
    • von Neumann:
    • LogDet:
Algorithm Using LogDet
  • Cyclic projection algorithm using LogDet divergence
    • Update for each projection
    • Can be simplified to
    • Range space is unchanged, no eigen-decomposition required
    • Equation (21) costs O(n^2) operations per iteration (a toy version is sketched after this list)
  • Improving update efficiency with a factored n x r matrix G
    • This update can be done using Cholesky rank-one update
    • O(r^3) complexity
  • Further improve update efficiency to O(r^2)
    • Combines Cholesky rank-one update with matrix multiplication
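To make the cyclic scheme concrete, here is a minimal full-matrix sketch (my own toy version, not the paper's Algorithm 2) of the LogDet projection onto an equality distance constraint tr(X z zᵀ) = b, using the simplified rank-one update form; inequality constraints would additionally need dual correction variables, which are omitted:

```python
import numpy as np

def logdet_projection(X, z, b):
    """Bregman projection of X onto {X : z^T X z = b} under the LogDet divergence.
    The update is a rank-one correction X + beta * X z z^T X: the range space of X
    is unchanged and no eigendecomposition is needed (O(n^2) per projection)."""
    p = z @ X @ z                        # current value of the constraint
    beta = (b - p) / (p * p)             # chosen so that z^T X' z = b exactly
    Xz = X @ z
    return X + beta * np.outer(Xz, Xz)

def cyclic_logdet(X0, constraints, cycles=50):
    """Cycle through the constraints, projecting onto one at a time."""
    X = X0.copy()
    for _ in range(cycles):
        for z, b in constraints:
            X = logdet_projection(X, z, b)
    return X

# Toy usage: force the squared distance between points 0 and 1 to equal 1.
n = 6
X0 = 2.0 * np.eye(n)
z = np.zeros(n); z[0], z[1] = 1.0, -1.0   # distance constraint vector e_0 - e_1
X = cyclic_logdet(X0, [(z, 1.0)])
print(z @ X @ z)                          # ~1.0
```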
Algorithm Using LogDet
  • Each projection contributes a small factor L, obtained from a Cholesky rank-one update; the overall factor is G = G0 B, where B is the product of all L matrices from every iteration and X0 = G0 G0^T
  • L can be determined implicitly (a simple factored version is sketched below)
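A correspondingly simple factored variant (again a sketch, not the paper's implicit O(r^2) scheme): keep only the n x r factor G with X = G Gᵀ and fold each projection into G through a small Cholesky factor. Multiplying that factor into G every time costs O(nr^2); the paper avoids this by accumulating the factors in B instead.

```python
import numpy as np

def logdet_projection_factored(G, z, b):
    """Same LogDet projection as before, expressed on the factor G (X = G G^T).
    The correction X + beta * X z z^T X becomes G (I + beta v v^T) G^T with v = G^T z,
    and I + beta v v^T is split by a Cholesky factorization (a rank-one update of I)."""
    v = G.T @ z                           # r-vector: the constraint seen in factor space
    p = v @ v                             # equals z^T X z
    beta = (b - p) / (p * p)
    L = np.linalg.cholesky(np.eye(G.shape[1]) + beta * np.outer(v, v))
    return G @ L                          # new factor; X' = G' G'^T satisfies the constraint

n, r = 100, 5
G = np.random.randn(n, r)
z = np.zeros(n); z[0], z[1] = 1.0, -1.0
G2 = logdet_projection_factored(G, z, 1.0)
print(z @ (G2 @ G2.T) @ z)                # ~1.0
```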
Algorithm Using LogDet
  • What’re the constraints? Convergence?

O(cr^2)

Convergence is checked by how much v has changed

May require large number of iterations

O(nr^2)

Algorithm Using von Neumann
  • Cyclic projection algorithm using von Neumann divergence
    • Update for each projection
    • This can be modified to
    • To calculate the projection parameter, find the unique root of a one-dimensional function (sketched below)
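A rough dense sketch of one such projection onto tr(X z zᵀ) = b, assuming the update form X' = exp(log X + α z zᵀ) with α obtained from a one-dimensional root finder; the dense matrix exp/log and the fixed root bracket are simplifications for illustration (the paper works in the factored eigenbasis to keep the cost low):

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import brentq

def von_neumann_projection(X, z, b, bracket=50.0):
    """Project X onto {X : z^T X z = b} under the von Neumann divergence.
    alpha is the unique root of a scalar function; the bracket is problem-dependent."""
    logX = logm(X)
    gap = lambda a: z @ expm(logX + a * np.outer(z, z)) @ z - b
    alpha = brentq(gap, -bracket, bracket)        # 1-D root finding for the step size
    return expm(logX + alpha * np.outer(z, z)).real

n = 5
X = 2.0 * np.eye(n)
z = np.zeros(n); z[0], z[1] = 1.0, -1.0
Xp = von_neumann_projection(X, z, 1.0)
print(z @ Xp @ z)                                 # ~1.0
```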
Algorithm Using von Neumann
  • Slightly slower than Algorithm 2 (the LogDet algorithm)
    • The root finder slows each projection down
    • Still O(r^2) per projection
Discussions
  • Limitations of Algorithm 2 and Algorithm 3
    • The initial kernel matrix must be low-rank
      • Not applicable for dimensionality reduction
    • Number of iterations may be large
      • This paper only optimized the computations for each iteration
      • Reducing the total number of iterations is left as future work
  • Handling new data points
    • Transductive setting
      • All data points are available up front
      • Some of the points have labels or other supervision
      • When a new data point is added, the entire kernel matrix must be re-learned
    • Workaround
      • View B as a linear transformation
      • Apply B to new points
Discussions
  • Generalizations to more constraints
    • Slack variables
      • When the number of constraints is large, there may be no feasible solution to the Bregman divergence minimization problem
      • Introduce slack variables (one common formulation is sketched after this list)
      • Constraints may then be violated, but violations are penalized
    • Similarity constraints
      • , or
    • Distance constraints
    • O(r^2) per projection
    • If arbitrary linear constraints are applied, O(nr)
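The slack formulation itself is not preserved in the transcript; a common form, following the authors' related metric-learning work (the exact form used here is my assumption), penalizes deviations of the constraint right-hand sides with the same divergence:

```latex
% Slack-penalized version (assumed form): let \xi_0 hold the original right-hand sides b_i.
\[
  \min_{K \succeq 0,\;\xi}\;
     D_{\Phi}(K, K_0) \;+\; \gamma\, D_{\Phi}\!\big(\operatorname{diag}(\xi), \operatorname{diag}(\xi_0)\big)
  \quad \text{s.t.}\quad \operatorname{tr}(K A_i) \le \xi_i,\; i = 1, \dots, c .
\]
% gamma trades off constraint satisfaction against closeness to K_0.
```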
Discussions
  • Special cases
    • DefiniteBoost optimization problem
    • Online-PCA
    • Nearest correlation matrix problem
  • Minimizing LogDet divergence and semidefinite programming (SDP)
    • SDP relaxation of min-balanced-cut problem
    • Can be solved by LogDet divergence
Experiments
  • Transductive learning and clustering
  • Data sets
    • Digits
      • Handwritten samples of digits 3, 8 and 9 from the UCI repository
    • GyrB
      • Protein data set with three bacteria species
    • Spambase
      • 4601 email messages with 57 attributes, spam/not spam labels
    • Nursery
      • 12960 instances with 8 attributes and 5 class labels
  • Classification
    • k-nearest neighbor classifier
  • Clustering
    • Kernel k-means algorithm
    • Use normalized mutual information (NMI) measure
Experiments
  • Learn a kernel matrix only using constraints
    • Low rank kernels learned by proposed algorithms attain accurate clustering and classification
    • Use original data to get initial kernel matrix
    • The more constraints used, the more accurate the results
    • Convergence
      • von Neumann divergence
        • Convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints
      • LogDet divergence
        • Between 17 and 354 cycles
Simulation Results

Significant improvements

0.948 classification accuracy

For DefiniteBoost, 3220 cycles to convergence

Simulation Results

(Plots: rank 57 vs. rank 8)

LogDet needs fewer constraints

LogDet converges much more slowly

(Future work)

But it often has lower overall running time

Simulation Results
  • Metric learning and large scale experiments
    • Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data
    • Compare proposed algorithms with metric learning algorithms
      • Metric learning by collapsing classes (MCML)
      • Large-margin nearest neighbor metric learning (LMNN)
      • Squared Euclidean Baseline
Conclusions
  • Developed LogDet/von Neumann divergence based algorithms for low-rank matrix nearness problems
  • Running times are linear in the number of data points and quadratic in the rank of the kernel
  • The algorithms can be used in conjunction with a number of kernel-based learning algorithms