
Low-Rank Kernel Learning with Bregman Matrix Divergences
Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon
Journal of Machine Learning Research 10 (2009) 341-376

Presented by:

Peng Zhang

4/15/2011

Outline

- Motivation
- Major Contributions
- Preliminaries
- Algorithms
- Discussions
- Experiments
- Conclusions

Motivation

- Low-rank matrix nearness problems
- Learning low-rank positive semidefinite (kernel) matrices for machine learning applications
- Divergence (distance) between data objects
- Find divergence measures suited to such matrices
- Efficiency
- Positive semidefinite (PSD), often low-rank, kernel matrices are ubiquitous in machine learning with kernel methods
- Existing learning techniques enforce the PSD constraint explicitly, resulting in expensive computations
- Goal: bypass the explicit constraint by choosing divergences that enforce positive semidefiniteness automatically

Major Contributions

- Goal
- Efficient algorithms that can find a PSD (kernel) matrix as ‘close’ as possible to some input PSD matrix under equality or inequality constraints
- Proposals
- Use LogDet divergence/von Neumann divergence constraints in PSD matrix learning
- Use Bregman projections for the divergences
- Computationally efficient: scales linearly with the number of data points n and quadratically with the rank of the input matrix
- Properties of the proposed algorithms
- Range-space preserving: the output has the same range space, and hence the same rank, as the input
- Computationally efficient: one iteration runs in time linear in the number of data points and quadratic in the rank of the kernel

Preliminaries

- Kernel methods
- Work with inner products in feature space
- The only information needed is the kernel matrix K
- K is always PSD
- If K is low rank, a low-rank decomposition K = GG^T (with G an n x r matrix, r much smaller than n) improves computational efficiency

Low rank kernel matrix learning

Preliminaries

- Bregman vector divergences: D_phi(x, y) = phi(x) - phi(y) - grad phi(y)^T (x - y), for a strictly convex, differentiable generating function phi
- Intuitively, this is the difference between the value of phi at point x and the value of the first-order Taylor expansion of phi around point y, evaluated at point x
- Extension to Bregman matrix divergences: D_phi(X, Y) = phi(X) - phi(Y) - tr(grad phi(Y)^T (X - Y))
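The vector definition above can be sketched numerically; a minimal example assuming NumPy, with the generating function phi chosen as the squared norm purely for illustration:

```python
import numpy as np

# Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>:
# the gap between phi(x) and the first-order Taylor expansion of phi
# around y, evaluated at x.
def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 recovers the squared Euclidean distance.
phi = lambda v: np.dot(v, v)
grad = lambda v: 2.0 * v
x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
d = bregman(phi, grad, x, y)  # equals ||x - y||^2 = 5
```

Other choices of phi (negative entropy, -log det) give the other divergences used in the paper.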

Preliminaries

- Special Bregman matrix divergences
- The von Neumann divergence: DvN(X, Y) = tr(X log X - X log Y - X + Y)
- The LogDet divergence: Dld(X, Y) = tr(XY^-1) - log det(XY^-1) - n

Both defined here for full-rank (positive definite) matrices
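As a sketch (NumPy only, symmetric positive definite inputs assumed), both divergences can be computed directly from their trace forms:

```python
import numpy as np

def matlog(S):
    # Matrix log of a symmetric positive definite matrix via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def von_neumann_div(X, Y):
    # D_vN(X, Y) = tr(X log X - X log Y - X + Y)
    return np.trace(X @ (matlog(X) - matlog(Y)) - X + Y)

def logdet_div(X, Y):
    # D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - X.shape[0]

X = np.diag([2.0, 3.0])
Y = np.eye(2)
d_vn = von_neumann_div(X, Y)  # 2 ln 2 + 3 ln 3 - 3
d_ld = logdet_div(X, Y)       # 3 - ln 6
```

Both are zero exactly when X = Y and, like Bregman divergences in general, are not symmetric in their arguments.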

Preliminaries

- Important properties of DvN and Dld
- X is defined over positive definite matrices, so no explicit positive definiteness constraint is needed
- Range-space preserving property
- Scale-invariance of LogDet: Dld(aX, aY) = Dld(X, Y) for a > 0
- Transformation invariance: Dld(M^T X M, M^T Y M) = Dld(X, Y) for any invertible M
- These properties make it possible to go beyond the transductive setting and evaluate the learned kernel function on new data points
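The two LogDet invariances can be checked numerically; a standalone sketch (the helper re-derives D_ld inline, and M is just a random congruence transform chosen for illustration):

```python
import numpy as np

def logdet_div(X, Y):
    # D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - X.shape[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); X = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + np.eye(4)
M = rng.standard_normal((4, 4))  # generic (almost surely invertible) transform

d = logdet_div(X, Y)
scale_ok = np.isclose(logdet_div(3.0 * X, 3.0 * Y), d)          # scale-invariance
congr_ok = np.isclose(logdet_div(M.T @ X @ M, M.T @ Y @ M), d)  # transformation invariance
```

Transformation invariance is what lets a kernel learned on transformed data agree with one learned on the original data.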

Preliminaries

- Spectral Bregman matrix divergence
- Generating convex function phi(X) = sum_i f(lambda_i): a function of the eigenvalues lambda_i of X through a scalar convex function f
- The Bregman matrix divergence can then be expressed through the eigenvalues and eigenvectors of X = V Lambda V^T and Y = U Theta U^T:
  D_phi(X, Y) = sum_i f(lambda_i) - sum_j f(theta_j) - sum_{i,j} (v_i^T u_j)^2 f'(theta_j)(lambda_i - theta_j)

Preliminaries

- Kernel matrix learning problem of this paper: minimize D_phi(K, K0) subject to linear constraints tr(K A_i) <= b_i and a rank constraint
- With an explicit rank constraint, the problem is non-convex
- It becomes convex when using LogDet/von Neumann, because the rank constraint is then enforced implicitly and can be dropped
- The constraints of interest are squared Euclidean distances between points in feature space
- Each such A_i is rank one, and the problem becomes: learn a kernel matrix over all data points from side information (labels or pairwise constraints)
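To make the rank-one constraint concrete: with z = e_i - e_j and A = zz^T, the linear functional tr(KA) is exactly the squared feature-space distance between points i and j. A small NumPy check (the kernel here is a toy rank-2 example):

```python
import numpy as np

# With z = e_i - e_j and A = z z^T (rank one), the linear constraint
# tr(K A) = K_ii + K_jj - 2 K_ij equals the squared feature-space
# distance ||phi(x_i) - phi(x_j)||^2.
n, i, j = 5, 1, 3
rng = np.random.default_rng(0)
G = rng.standard_normal((n, 2))
K = G @ G.T  # a toy rank-2 PSD kernel

z = np.zeros(n)
z[i], z[j] = 1.0, -1.0
lhs = np.trace(K @ np.outer(z, z))
rhs = K[i, i] + K[j, j] - 2 * K[i, j]
dist = np.sum((G[i] - G[j]) ** 2)
```

So a distance constraint on a pair of points is a single linear constraint on K.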

Preliminaries

- Bregman projections
- A method to solve the version of the previous problem without the explicit rank constraint
- Cyclic: choose one constraint at a time
- Perform a Bregman projection so that the current solution satisfies that constraint exactly
- Using the LogDet and von Neumann divergences, these projections can be computed efficiently
- Convergence is guaranteed, but may require many cycles through the constraints
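A minimal sketch of the cyclic scheme, assuming the closed-form LogDet projection for a rank-one equality constraint z^T X z = b (obtained by solving grad phi(X') = grad phi(X) + alpha zz^T and applying Sherman-Morrison); note the paper's actual algorithms work on a factored form rather than the full matrix:

```python
import numpy as np

# Closed-form LogDet projection onto the hyperplane {X : z^T X z = b}:
#   X' = X + beta (X z)(X z)^T,  beta = (b - p) / p^2,  p = z^T X z.
def logdet_projection(X, z, b):
    Xz = X @ z
    p = z @ Xz
    beta = (b - p) / (p * p)
    return X + beta * np.outer(Xz, Xz)

# One constraint per step, cycling until converged.
def cyclic_projections(X, constraints, n_cycles=100):
    for _ in range(n_cycles):
        for z, b in constraints:
            X = logdet_projection(X, z, b)
    return X
```

Each projection satisfies its own constraint exactly and leaves the range space of X unchanged; cycling converges when the constraints are feasible.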

Preliminaries

- Bregman divergences for low-rank matrices
- Must deal with matrices that have zero eigenvalues
- Infinite divergences can occur, because log 0 is undefined and singular matrices have no inverse
- A finite divergence therefore implies constraints on the matrices:
- Range: Dld(X, Y) is finite only when range(X) = range(Y); DvN(X, Y) is finite only when range(X) is contained in range(Y)
- Rank: equal range spaces imply equal ranks, so finiteness acts as an implicit rank constraint
Preliminaries

- Rank-deficient LogDet and von Neumann divergences: obtained by restricting the full-rank definitions to the shared range space
- Rank-deficient Bregman projections exist for both:
- von Neumann: computed via an eigendecomposition restricted to the range space
- LogDet: computed in closed form, preserving the range space

Algorithm Using LogDet

- Cyclic projection algorithm using the LogDet divergence
- Update for each projection: for a rank-one constraint matrix A = zz^T, the update simplifies to Kt+1 = Kt + beta (Kt z)(Kt z)^T, with beta chosen so the constraint holds with equality
- The range space is unchanged, and no eigendecomposition is required
- The full-matrix update (21) costs O(n^2) operations per iteration
- Update efficiency is improved by maintaining a factored n x r matrix G with K = GG^T
- The factored update can be done using a Cholesky rank-one update: O(r^3) complexity
- Further improved to O(r^2) per update by combining the Cholesky rank-one update with implicit matrix multiplication
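The Cholesky rank-one update itself is a standard primitive that runs in O(r^2); a generic sketch (not the paper's exact factored routine) that updates L in place of re-factorizing at O(r^3):

```python
import numpy as np

def chol_update(L, x, sign=1.0):
    # Rank-one Cholesky update/downdate: returns L' with
    # L' L'^T = L L^T + sign * x x^T  (sign=+1 update, sign=-1 downdate),
    # in O(r^2) operations instead of a full O(r^3) re-factorization.
    L, x = L.copy(), x.copy()
    r = len(x)
    for k in range(r):
        rkk = np.sqrt(L[k, k] ** 2 + sign * x[k] ** 2)
        c, s = rkk / L[k, k], x[k] / L[k, k]
        L[k, k] = rkk
        if k + 1 < r:
            L[k + 1:, k] = (L[k + 1:, k] + sign * s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L
```

Each iteration touches one column, so the total work is quadratic in r.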

Algorithm Using LogDet

- Each projection updates the factor as Gt+1 = Gt L, with L an r x r matrix; hence the final factor is G = G0 B, where B is the product of all the L matrices from every iteration and K0 = G0 G0^T
- L can be determined implicitly, without forming the full kernel matrix

Algorithm Using LogDet

- What are the constraints, and how is convergence checked?
- Each projection costs O(r^2), so one cycle over c constraints costs O(cr^2)
- Convergence is checked by how much the vector of dual variables v has changed
- May require a large number of cycles
- Reconstructing the final factored kernel from G0 and B costs O(nr^2)

Algorithm Using von Neumann

- Cyclic projection algorithm using the von Neumann divergence
- Update for each projection: Kt+1 = exp(log Kt + alpha A), since grad phi(K) = log K for the von Neumann divergence
- This can be modified to work on the factored form, avoiding full matrix exponentials
- To calculate the projection parameter alpha, find the unique root of a scalar function of alpha (no closed form exists)
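Since the projection parameter has no closed form, each von Neumann projection calls a one-dimensional root finder on a monotone scalar function. A generic bisection sketch (the stand-in f below is illustrative only; the actual function comes from the eigenvalues of the updated matrix):

```python
import numpy as np

def find_root(f, lo, hi, tol=1e-12):
    # Bisection for the unique root of a continuous function with a sign
    # change on [lo, hi]; each von Neumann projection needs one such solve.
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if flo * f(mid) <= 0:
            hi = mid       # root lies in [lo, mid]
        else:
            lo, flo = mid, f(mid)  # root lies in [mid, hi]
    return 0.5 * (lo + hi)

# Stand-in monotone function with root at ln 2.
alpha = find_root(lambda a: np.exp(a) - 2.0, 0.0, 2.0)
```

This repeated scalar solve is what makes the von Neumann algorithm slightly slower than its LogDet counterpart.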

Algorithm Using von Neumann

- Slightly slower than Algorithm 2 (the LogDet algorithm)
- The scalar root finder needed at each projection slows down the process
- Still O(r^2) per projection

Discussions

- Limitations of Algorithm 2 and Algorithm 3
- The initial kernel matrix must be low-rank
- Not applicable for dimensionality reduction
- Number of iterations may be large
- This paper only optimized the computations for each iteration
- Reducing the total number of iterations is future topic
- Handling new data points
- Transductive setting: all data points are available up front, and some of them have labels or other supervision
- Naively, when a new data point is added, the entire kernel matrix must be re-learned
- Workaround: view B as a learned linear transformation and apply it to the new points

Discussions

- Generalizations to more constraints
- Slack variables
- When the number of constraints is large, the Bregman divergence minimization problem may have no feasible solution
- Introducing slack variables allows constraints to be violated, at a penalty
- Similarity constraints: Kij >= b for similar pairs, or Kij <= b for dissimilar pairs
- Distance constraints: O(r^2) per projection
- If arbitrary linear constraints are applied, the cost is O(nr)

Discussions

- Special cases
- DefiniteBoost optimization problem
- Online-PCA
- Nearest correlation matrix problem
- Connection between LogDet divergence minimization and semidefinite programming (SDP)
- The SDP relaxation of the min-balanced-cut problem can be solved via LogDet divergence minimization

Experiments

- Transductive learning and clustering
- Data sets
- Digits
- Handwritten samples of digits 3, 8, and 9 from the UCI repository
- GyrB
- Protein data set with three bacteria species
- Spambase
- 4601 email messages with 57 attributes, spam/not spam labels
- Nursery
- 12960 instances with 8 attributes and 5 class labels
- Classification
- k-nearest neighbor classifier
- Clustering
- Kernel k-means algorithm
- Use normalized mutual information (NMI) measure
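NMI is the mutual information between the predicted and true cluster labels, normalized by their entropies, which makes it invariant to permutations of cluster ids. A small NumPy sketch (assuming the common sqrt(H(U) H(V)) normalization):

```python
import numpy as np

def nmi(labels_true, labels_pred):
    # Normalized mutual information: I(U; V) / sqrt(H(U) H(V)).
    a, b = np.asarray(labels_true), np.asarray(labels_pred)
    ua, ub = np.unique(a), np.unique(b)
    # Joint empirical distribution over (true, predicted) label pairs.
    P = np.array([[np.mean((a == x) & (b == y)) for y in ub] for x in ua])
    px, py = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    I = np.sum(P[nz] * np.log(P[nz] / np.outer(px, py)[nz]))
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return I / np.sqrt(H(px) * H(py))
```

A perfect clustering scores 1 under any relabeling of the clusters; an uninformative one scores near 0.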

Experiments

- Learn a kernel matrix using only the constraints
- Low-rank kernels learned by the proposed algorithms attain accurate clustering and classification
- The original data is used to form the initial kernel matrix
- The more constraints used, the more accurate the results
- Convergence
- von Neumann divergence: convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints
- LogDet divergence: between 17 and 354 cycles

Simulation Results

- Significant improvements
- 0.948 classification accuracy
- For comparison, DefiniteBoost needed 3220 cycles to reach convergence

Simulation Results

- Rank 57 vs. rank 8
- LogDet needs fewer constraints
- LogDet converges much more slowly; reducing its iteration count is left as future work
- But it often has lower overall running time

Simulation Results

- Metric learning and large-scale experiments
- Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data
- The proposed algorithms are compared with metric learning algorithms:
- Metric learning by collapsing classes (MCML)
- Large-margin nearest neighbor metric learning (LMNN)
- Squared Euclidean distance baseline

Conclusions

- Developed LogDet/von Neumann divergence based algorithms for low-rank matrix nearness problems
- Running times are linear in the number of data points and quadratic in the rank of the kernel (per iteration)
- The algorithms can be used in conjunction with a number of kernel-based learning algorithms
