Low-Rank Kernel Learning with Bregman Matrix Divergences
Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon
Journal of Machine Learning Research 10 (2009) 341-376

Presented by: Peng Zhang, 4/15/2011

- Motivation
- Major Contributions
- Preliminaries
- Algorithms
- Discussions
- Experiments
- Conclusions

- Low-rank matrix nearness problems
- Learning low-rank positive semidefinite (kernel) matrices for machine learning applications
- Divergence (distance) measures between data objects
- Finding divergence measures suited to such matrices
- Efficiency

- Positive semidefinite (PSD) matrices, often low-rank, are common in machine learning with kernel methods
- Current learning techniques enforce the PSD constraint explicitly, resulting in expensive computations

- Bypass this explicit constraint by finding divergences that enforce the PSD property automatically

- Goal
- Efficient algorithms that can find a PSD (kernel) matrix as ‘close’ as possible to some input PSD matrix under equality or inequality constraints

- Proposals
- Use LogDet divergence/von Neumann divergence constraints in PSD matrix learning
- Use Bregman projections for the divergences
- Computationally efficient, scaling linearly with the number of data points n and quadratically with the rank of the input matrix

- Properties of the proposed algorithms
- Range-space preserving: the output has the same range space (and hence the same rank) as the input
- Computationally efficient
- Running times are linear in the number of data points and quadratic in the rank of the kernel (per iteration)

- Kernel methods
- Inner products in feature space
- Only information needed is kernel matrix K
- K is always PSD

- If K is low rank
- Use low rank decomposition to improve computational efficiency
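As a minimal sketch of why the low-rank factorization helps (a toy linear kernel on hypothetical data, not an example from the paper): an n x n kernel matrix of rank r can be stored and manipulated through an n x r factor G with K = GGᵀ.

```python
import numpy as np

# Hypothetical toy data: n = 6 points in r = 2 dimensions,
# so the linear kernel K = X X^T has rank at most 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))

K = X @ X.T                      # full n x n kernel matrix
rank = np.linalg.matrix_rank(K)  # rank is 2 despite K being 6 x 6

# Low-rank factorization K = G G^T: kernel algorithms can work with
# G (O(nr) storage) instead of the full K (O(n^2) storage).
G = X                            # for a linear kernel, G is just the data
assert np.allclose(K, G @ G.T)
```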

Low rank kernel matrix learning

Intuitively, a Bregman divergence can be thought of as the difference between the value of F at a point x and the value of the first-order Taylor expansion of F around a point y, evaluated at x.

- Bregman vector divergences
- Extension to Bregman matrix divergences

- Special Bregman matrix divergences
- The von Neumann divergence (DvN)
- The LogDet divergence (Dld)

These definitions assume full-rank matrices
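A sketch of the two divergences for full-rank (positive definite) matrices, assuming the standard definitions D_vN(X, Y) = tr(X log X - X log Y - X + Y) and D_ld(X, Y) = tr(XY⁻¹) - log det(XY⁻¹) - n; the matrix logarithm is computed through an eigendecomposition:

```python
import numpy as np

def matlog(A):
    # matrix logarithm of a symmetric positive definite matrix
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def d_vn(X, Y):
    """von Neumann divergence: tr(X log X - X log Y - X + Y)."""
    return np.trace(X @ matlog(X) - X @ matlog(Y) - X + Y)

def d_ld(X, Y):
    """LogDet divergence: tr(X Y^-1) - log det(X Y^-1) - n."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.log(np.linalg.det(XYinv)) - n

X = np.array([[2.0, 0.5], [0.5, 1.0]])
Y = np.eye(2)
# Both divergences are zero iff X == Y, and nonnegative otherwise.
assert np.isclose(d_vn(X, X), 0.0) and np.isclose(d_ld(X, X), 0.0)
assert d_ld(X, Y) > 0 and d_vn(X, Y) > 0
```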

- Important properties of DvN and Dld
- Both divergences are defined over positive definite matrices X
- No explicit positive-definiteness constraint needs to be imposed

- Range-space preserving property
- Scale-invariance of LogDet
- Transformation invariance
- Others
- Beyond transductive setting, evaluate kernel function over new data points
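The scale-invariance property of LogDet listed above can be verified numerically; in fact D_ld is invariant under any congruence transformation X → MᵀXM for invertible M, of which scaling is a special case. A minimal check (toy matrices, not from the paper):

```python
import numpy as np

def d_ld(X, Y):
    """LogDet divergence: tr(X Y^-1) - log det(X Y^-1) - n."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.log(np.linalg.det(XYinv)) - n

X = np.array([[2.0, 0.5], [0.5, 1.0]])
Y = np.array([[3.0, 1.0], [1.0, 2.0]])

# Scale invariance: D_ld(aX, aY) = D_ld(X, Y) for any a > 0.
assert np.isclose(d_ld(5.0 * X, 5.0 * Y), d_ld(X, Y))

# More generally, invariance under congruence X -> M^T X M
# for any invertible M.
M = np.array([[1.0, 2.0], [0.0, 1.0]])
assert np.isclose(d_ld(M.T @ X @ M, M.T @ Y @ M), d_ld(X, Y))
```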


- Spectral Bregman matrix divergences
- Generated by a convex function applied to the eigenvalues
- The resulting matrix divergence is expressed through the eigenvalues and eigenvectors of the two matrices

- Kernel matrix learning problem of this paper
- Learn a kernel matrix over all data points from side information (labels or constraints)
- Non-convex in general because of the rank constraint
- Convex when using the LogDet/von Neumann divergences, because the rank constraint is enforced implicitly
- Constraints of interest: squared Euclidean distances between points, for which the constraint matrix A is rank one
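The rank-one structure of a distance constraint can be made concrete. A minimal sketch (toy kernel matrix, hypothetical indices i and j): with A = (eᵢ - eⱼ)(eᵢ - eⱼ)ᵀ, the quantity tr(KA) equals the squared feature-space distance Kᵢᵢ + Kⱼⱼ - 2Kᵢⱼ.

```python
import numpy as np

# For a squared-distance constraint between points i and j, the
# constraint matrix A = (e_i - e_j)(e_i - e_j)^T is rank one, and
# tr(K A) = K_ii + K_jj - 2 K_ij, the squared distance in feature space.
n, i, j = 4, 0, 2
K = np.array([[2.0, 0.5, 1.0, 0.0],
              [0.5, 1.0, 0.3, 0.2],
              [1.0, 0.3, 2.0, 0.1],
              [0.0, 0.2, 0.1, 1.0]])

z = np.zeros(n)
z[i], z[j] = 1.0, -1.0
A = np.outer(z, z)          # rank-one constraint matrix

assert np.linalg.matrix_rank(A) == 1
assert np.isclose(np.trace(K @ A), K[i, i] + K[j, j] - 2 * K[i, j])
```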

- Bregman projections
- A method to solve the ‘no rank constraint’ version of the previous problem
- Choose one constraint each time
- Perform a Bregman projection so that the current solution satisfies that constraint
- Using LogDet and von Neumann divergences, projections can be computed efficiently
- Convergence guaranteed, but may require many iterations
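The cyclic-projection idea can be sketched for the LogDet divergence with equality constraints of the form zᵀXz = b (a simplified sketch, not the paper's exact algorithm): the projection onto one such constraint has the closed form X ← X + β XzzᵀX, with β chosen so the constraint holds, and the update keeps X positive definite automatically.

```python
import numpy as np

def logdet_project(X, z, b):
    """Bregman projection of X (PSD) onto {X : z^T X z = b} under the
    LogDet divergence; closed form X + beta * X z z^T X, where
    beta = (b - p) / p^2 with p = z^T X z makes the constraint exact.
    Since 1 + beta * p = b / p > 0, positive definiteness is preserved."""
    p = z @ X @ z
    beta = (b - p) / (p * p)
    Xz = X @ z
    return X + beta * np.outer(Xz, Xz)

def cyclic_projections(X, constraints, n_cycles=200):
    """Cycle through (z, b) constraints, projecting onto one at a time."""
    for _ in range(n_cycles):
        for z, b in constraints:
            X = logdet_project(X, z, b)
    return X

X0 = np.eye(3)
constraints = [(np.array([1.0, -1.0, 0.0]), 0.5),
               (np.array([0.0, 1.0, -1.0]), 1.5)]
X = cyclic_projections(X0, constraints)
for z, b in constraints:
    assert np.isclose(z @ X @ z, b, atol=1e-6)
assert np.all(np.linalg.eigvalsh(X) > 0)   # still positive definite
```

Each projection fixes one constraint exactly while slightly perturbing the others, which is why many cycles may be needed, as the slide notes.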


- Bregman divergences for low-rank matrices
- Must handle matrices with zero eigenvalues
- Infinite divergences can occur when the range spaces of the two matrices do not match
- A finite divergence therefore implies an implicit rank (range-space) constraint


- Rank deficient LogDet and von Neumann Divergences
- Rank deficient Bregman projections
- von Neumann:
- LogDet:

- Cyclic projection algorithm using LogDet divergence
- Update for each projection
- Can be simplified to
- Range space is unchanged, no eigen-decomposition required
- The update (Equation 21 in the paper) costs O(n^2) operations per iteration

- Improving update efficiency with factored n x r matrix G
- This update can be done using Cholesky rank-one update
- O(r^3) complexity

- Further improve update efficiency to O(r^2)
- Combines Cholesky rank-one update with matrix multiplication
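The Cholesky rank-one update mentioned above can be sketched as follows: given the factor L of a matrix LLᵀ, compute in O(r²) the factor of LLᵀ + vvᵀ. This is a generic Givens-rotation version of the update, not necessarily the paper's exact routine.

```python
import numpy as np

def chol_rank_one_update(L, v):
    """Return the Cholesky factor of L L^T + v v^T in O(r^2),
    by eliminating v against the columns of L with Givens rotations."""
    L = L.copy()
    v = v.copy()
    r = L.shape[0]
    for k in range(r):
        a, b = L[k, k], v[k]
        rho = np.hypot(a, b)           # new diagonal entry
        c, s = a / rho, b / rho        # rotation that zeroes v[k]
        L[k, k] = rho
        if k + 1 < r:
            t = L[k+1:, k].copy()
            L[k+1:, k] = c * t + s * v[k+1:]
            v[k+1:] = c * v[k+1:] - s * t
    return L

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L0 = np.linalg.cholesky(A)
v = np.array([1.0, 0.5])
L1 = chol_rank_one_update(L0, v)
assert np.allclose(L1 @ L1.T, A + np.outer(v, v))
```

The point is that the updated factor is obtained directly from the old one, without refactorizing the r x r matrix from scratch (which would cost O(r³)).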

- G = G0 B, where B is the product of the L factors (from the Cholesky updates) over all iterations, and X0 = G0 G0^T
- L can be determined implicitly

- What are the constraints? How fast is convergence?

O(cr^2)

Convergence is checked by how much v has changed

May require a large number of iterations

O(nr^2)

- Cyclic projection algorithm using von Neumann divergence
- Update for each projection
- This can be modified for efficiency
- To calculate the projection parameter, find the unique root of a scalar function
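A simplified sketch of the von Neumann projection and its root-finding step (assuming the projection takes the form X' = exp(log X - α zzᵀ) for a constraint zᵀXz = b, with α found by bisection; the paper's actual routine is more refined):

```python
import numpy as np

def expm_sym(A):
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def logm_sym(A):
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def vn_project(X, z, b, lo=-50.0, hi=50.0, iters=200):
    """von Neumann Bregman projection onto {X : z^T X z = b}:
    X' = exp(log X - alpha * z z^T), with alpha found by bisection."""
    L = logm_sym(X)
    A = np.outer(z, z)
    def g(alpha):
        return z @ expm_sym(L - alpha * A) @ z - b
    # g is monotone decreasing in alpha; bisect for its unique root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return expm_sym(L - 0.5 * (lo + hi) * A)

X = 2.0 * np.eye(2)
z = np.array([1.0, 0.0])
Xp = vn_project(X, z, 1.0)
assert np.isclose(z @ Xp @ z, 1.0, atol=1e-6)
```

The repeated eigendecompositions inside the root finder are exactly why this projection is slower than the LogDet one, which has a closed form.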

- Slightly slower than Algorithm 2

The root finder slows down the process

O(r^2)

- Limitations of Algorithm 2 and Algorithm 3
- The initial kernel matrix must be low-rank
- Not applicable for dimensionality reduction

- The number of iterations may be large
- This paper only optimizes the computation within each iteration
- Reducing the total number of iterations is a topic for future work

- Handling new data points
- Transductive setting
- All data points are up front
- Some of the points have labels or other supervisions
- When a new data point is added, the entire kernel matrix must be re-learned

- Workaround
- View B as a linear transformation
- Apply B to new data points


- Generalizations to more constraints
- Slack variables
- When the number of constraints is large, the Bregman divergence minimization problem may have no feasible solution
- Introduce slack variables
- Slack allows constraints to be violated, at a penalty

- Similarity constraints

- Distance constraints
- O(r^2) per projection
- If arbitrary linear constraints are applied, O(nr)


- Special cases
- DefiniteBoost optimization problem
- Online-PCA
- Nearest correlation matrix problem

- Minimizing LogDet divergence and semidefinite programming (SDP)
- SDP relaxation of min-balanced-cut problem
- Can be solved by LogDet divergence

- Transductive learning and clustering
- Data sets
- Digits
- Handwritten samples of digits 3, 8, and 9 from the UCI repository

- GyrB
- Protein data set with three bacteria species

- Spambase
- 4601 email messages with 57 attributes and spam/not-spam labels

- Nursery
- 12960 instances with 8 attributes and 5 class labels

- Classification
- k-nearest neighbor classifier

- Clustering
- Kernel k-means algorithm
- Use normalized mutual information (NMI) measure
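Both evaluations above need only the learned kernel matrix, since squared feature-space distances follow from d(i, j)² = Kᵢᵢ + Kⱼⱼ - 2Kᵢⱼ. A minimal 1-nearest-neighbor sketch on a toy kernel (hypothetical values, not the experimental data):

```python
import numpy as np

# Given only a learned kernel matrix K, pairwise squared distances
# follow from d(i, j)^2 = K_ii + K_jj - 2 K_ij, which is all a
# k-nearest-neighbor classifier needs.
K = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
labels = np.array([0, 0, 1])

diag = np.diag(K)
D2 = diag[:, None] + diag[None, :] - 2 * K   # pairwise squared distances

# 1-NN prediction for point 0 (excluding itself):
others = [1, 2]
nearest = others[int(np.argmin(D2[0, others]))]
assert labels[nearest] == 0     # point 1 is closest to point 0
```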

- Learn a kernel matrix using only constraints
- Low-rank kernels learned by the proposed algorithms attain accurate clustering and classification
- Use the original data to form the initial kernel matrix
- The more constraints used, the more accurate the results
- Convergence
- von Neumann divergence
- Convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints

- LogDet divergence
- Between 17 and 354 cycles


Significant improvements

0.948 classification accuracy

For DefiniteBoost, 3220 cycles to convergence

Rank 57 vs. rank 8

LogDet needs fewer constraints

LogDet converges much more slowly

(Future work)

But it often has lower overall running time

- Metric learning and large scale experiments
- Learning a low-rank kernel with same range-space is equivalent to learning linear transformation of input data
- Compare proposed algorithms with metric learning algorithms
- Metric learning by collapsing classes (MCML)
- Large-margin nearest neighbor metric learning (LMNN)
- Squared Euclidean Baseline

- Developed LogDet/von Neumann divergence based algorithms for low-rank matrix nearness problems
- Running times are linear in number of data points and quadratic in the rank of the kernel
- The algorithms can be used in conjunction with a number of kernel-based learning algorithms

Thank you