Matrix Factorization Models, Algorithms and Applications

- By **jody**



Outline

- Problem Definition
- Overview
- Taxonomy by targeted tasks
- Taxonomy by models
- Taxonomy by algorithms

- Representative work
- Summary and Discussion


[Figure: entity a is mapped to factor u(a), entity b to factor v(b); the observed entry M(a,b) of matrix M is modeled as u(a)ᵀ D v(b).]

Problem Definition

- Matrix Factorization: for a given matrix M, find a compact (low-rank) approximation
- M may be partially observed (i.e., entries may be missing)
- In the simplest form:
(U, V) = argmin ||M − UᵀV||F2

- An identity function f(x) = x is used as the link function
- U, V and D interact in a multiplicative fashion
- D is assumed to be an identity matrix
- Euclidean distance is used as the measure of goodness
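To make the simplest form concrete, the objective can be minimized by stochastic gradient steps over the observed entries only, which handles the partially-observed case directly. A minimal sketch in NumPy; the function name `factorize` and all constants (rank, learning rate, regularization) are my own choices, not from the slides:

```python
import numpy as np

def factorize(M, mask, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit M ≈ U^T V over observed entries only (mask[i, j] = 1 if observed)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((k, m))
    V = 0.1 * rng.standard_normal((k, n))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            u, v = U[:, i].copy(), V[:, j]
            e = M[i, j] - u @ v                   # residual on one entry
            U[:, i] += lr * (e * v - reg * u)     # gradient step with L2 shrinkage
            V[:, j] += lr * (e * u - reg * v)
    return U, V
```

The small L2 terms are not in the bare objective above but keep the factors from growing without bound when entries are missing.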

Problem Definition (cont)

- Matrix Co-Factorization: for a given set of related matrices {M}, find a coupled set of compact (low-rank) approximations
- Each M represents an observed interaction between two entities
- Multi-View MF:
- Joint MF:
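As an illustration of the coupling idea, the sketch below fits two matrices M ≈ UᵀV and F ≈ UᵀW that share the factor U, using full-batch gradient descent for brevity. The function name `joint_factorize`, the rank, and the step sizes are my own choices:

```python
import numpy as np

def joint_factorize(M, F, k=2, lr=0.02, reg=0.01, epochs=2000, seed=0):
    """Couple M ≈ U^T V and F ≈ U^T W through a shared factor U."""
    rng = np.random.default_rng(seed)
    a, _ = M.shape
    U = 0.1 * rng.standard_normal((k, a))
    V = 0.1 * rng.standard_normal((k, M.shape[1]))
    W = 0.1 * rng.standard_normal((k, F.shape[1]))
    for _ in range(epochs):
        Em = M - U.T @ V                             # residual of each view
        Ef = F - U.T @ W
        U += lr * (V @ Em.T + W @ Ef.T - reg * U)    # U receives gradients from both views
        V += lr * (U @ Em - reg * V)
        W += lr * (U @ Ef - reg * W)
    return U, V, W
```

The key line is the U update: because U appears in both reconstruction terms, its gradient sums the contributions of the two views.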


Overview: MF taxonomy

- Targeted tasks:
- Dimensionality reduction
- PCA and other spectral algorithms
- LSI and other SVD algorithms
- NMF and other (convex) optimization algorithms
- PPCA and other statistical models

- Clustering
- k-means, mean-shift, min-cut, normalized cut, NMF, etc.
- Gaussian mixture model, Bi-Gaussian model, etc

- Factor analysis (e.g., profiling, decomposition)
- ICA, CCA, NMF, etc.
- SVD, MMMF, etc.

- Codebook learning
- Sparse coding, k-means, NMF, LDA, etc.

- Topic modeling
- LSI, LDA, PLSI, etc.

- Graph mining
- Random walk, PageRank, HITS, etc.

- Prediction
- Classification, Regression
- Link prediction, matrix completion, community detection
- Collaborative filtering, recommendation, learning to rank
- Domain adaptation, multi-task learning


Overview: MF taxonomy

- Models:
- Computational: by optimization
models differ in objective and regularizer design

- Objective:
- L2 error minimization (least square, Frobenius in matrix form)
- L1 error minimization (least absolute deviation)
- Hinge, logistic, log, cosine loss
- Huber loss, ε-loss, etc.
- Information-theoretic loss: entropy, mutual information, KL-divergence
- Exponential family loss and Bregman divergence: logistic, log, etc.
- Graph laplacian and smoothness
- Joint loss of fitting error and prediction accuracy

- Regularizer:
- L2 norm, L1 norm, Ky-Fan (e.g. nuclear)
- Graph laplacian and smoothness
- Lower & upper bound constraints (e.g., positivity constraint)
- Other constraint: linear constraint (e.g., probabilistic wellness), quadratic constraint (e.g., covariance), orthogonal constraint

- Statistic: by inference
models differ in factorization, prior and conditional design

- Factorization:
- Symmetric: p(a,b) = Σz p(z) p(a|z) p(b|z)
- Asymmetric: p(a,b) = p(a) Σz p(z|a) p(b|z)

- Conditional: usually exponential family
- Gaussian, Laplacian, Multinomial, Bernoulli, Poisson

- Prior:
- Conjugate prior
- Popularly picked ones: Gaussian, Laplacian (or exponential), Dirichlet
- Non-informative prior: max entropy prior, etc.
- Nonparametric: ARD, Chinese restaurant process, Indian buffet process, etc.


Overview: MF taxonomy

- Models:
- Connection between the two lines [Collins et al, NIPS 02; Long et al, KDD 07; Singh et al, ECML 08]

| Statistic: conditional | Computational: loss function |
| --- | --- |
| Gaussian | L2 |
| Laplacian | L1 |
| Bernoulli | Logistic |
| Exponential family | Bregman/KL |
| … | … |

| Statistic: prior | Computational: regularization |
| --- | --- |
| Gaussian | L2 |
| Laplacian / Exponential | L1 |
| Gaussian Random Field | Laplacian smoothness |
| … | … |

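The correspondence above can be made explicit: MAP estimation under a Gaussian likelihood and Gaussian priors reproduces the L2-loss, L2-regularized computational objective. A standard derivation, sketched here with generic variance parameters σ², τ² of my own choosing:

```latex
\max_{U,V}\ \log p(U,V \mid M)
  \;=\; \max_{U,V}\ \log p(M \mid U,V) + \log p(U) + \log p(V) + \text{const}

% Gaussian conditional M_{ij} \sim \mathcal{N}(u_i^\top v_j,\ \sigma^2):
-\log p(M \mid U,V) \;=\; \tfrac{1}{2\sigma^2} \sum_{ij} (M_{ij} - u_i^\top v_j)^2 + \text{const}

% Gaussian prior u_i \sim \mathcal{N}(0,\ \tau^2 I):
-\log p(U) \;=\; \tfrac{1}{2\tau^2} \|U\|_F^2 + \text{const}

% Hence MAP is equivalent to the computational form
\min_{U,V}\ \|M - U^\top V\|_F^2 + \lambda \bigl( \|U\|_F^2 + \|V\|_F^2 \bigr),
\qquad \lambda = \sigma^2 / \tau^2

% and a Laplace prior p(u) \propto e^{-|u|/b} yields an L1 regularizer instead.
```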
Overview: MF taxonomy

- Algorithm:
- Deterministic:
- Spectral analysis
- Matrix decomposition: SVD, QR, LU
- Solving linear system
- Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, SDP, etc.
- Alternating coordinate descent
- LARS, IRLS
- EM
- Mean field, Variational Bayesian, Expectation Propagation, collapsed VB
- …

- Stochastic:
- Stochastic gradient descent (back propagation, message passing)
- Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
- Random walk
- Simulated annealing, annealing EM
- Randomized projection
- …
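The alternating coordinate descent entry can be sketched as alternating least squares (ALS) for the L2-regularized objective: with one factor fixed, the other has a closed-form ridge-regression solution. The function name `als` and the constants are my own choices:

```python
import numpy as np

def als(M, k=2, reg=0.01, iters=100, seed=0):
    """Alternate closed-form ridge solves for min ||M - U^T V||_F^2 + reg(||U||^2 + ||V||^2)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.standard_normal((k, m))
    V = rng.standard_normal((k, n))
    I = np.eye(k)
    for _ in range(iters):
        U = np.linalg.solve(V @ V.T + reg * I, V @ M.T)  # fix V, solve for U
        V = np.linalg.solve(U @ U.T + reg * I, U @ M)    # fix U, solve for V
    return U, V
```

Each half-step decreases the objective, so the alternation converges to a stationary point (not necessarily a global optimum, since the joint problem is non-convex).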



Representative Work

- Spectral dimensionality reduction / clustering:
- PCA:
- L2 loss + orthogonal constraint
min ||M – UTV||F2, subject to: VTV = I

- Solve by spectral analysis of MTM
- Analogous to factor-based collaborative filtering

- Laplacian eigenmap:
- Laplacian smoothness + orthogonal constraint
min Σij wij ||vi − vj||2, subject to: VᵀV = I

- Graph encodes neighboring info, e.g., heat, kNN
- Analogous to neighbor-based collaborative filtering


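The PCA objective above has a closed-form solution; a minimal sketch via SVD of the centered data matrix (the function name `pca` is mine), which is equivalent to the spectral analysis of MᵀM mentioned on the slide:

```python
import numpy as np

def pca(M, k):
    """Top-k principal directions of M (rows = samples) via SVD of the centered data."""
    X = M - M.mean(axis=0)                  # center each feature
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:k]                     # orthonormal rows: the constraint V^T V = I holds
    scores = X @ components.T               # low-dimensional representation
    return components, scores
```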

Representative Work

- Plain CF:
- Factor based (Ky-Fan):
- L2 loss + L2 regularization
min ||M – UTV||F2 + C1||U||F2 +C2||V||F2

- Solve by SVD
- Analogous to LSI, PCA, etc.

- Neighbor based (item or query oriented):
- Not explicitly perform factorization, but could be viewed equivalently as
min Σij wij ||ui Mi − uj Mi||2, subject to: Σv uiv = 1, uiv ≥ 0

- Graph encodes neighboring info, e.g., heat, kNN
- Analogous to k-mean, Laplacian eigenmap



Representative Work

- Joint factor-neighbor based CF:
[Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD’08]

- L2 loss + L2 regularization
- Neighbor graph constructed by Pearson correlation
- Solve by stochastic gradient descent
- Analogous to locality (Laplacian) regularized PCA.
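A simplified sketch of the stochastic-gradient recipe used here: a baselines-plus-factors model (r ≈ μ + bu + bi + pu·qi), without Koren's neighborhood term. The function name `sgd_cf` and the constants are my own choices, not the paper's:

```python
import numpy as np

def sgd_cf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=200, seed=0):
    """SGD on (user, item, rating) triples for r ≈ mu + b_u + b_i + p_u . q_i."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])        # global mean baseline
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])   # prediction error
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            P[u], Q[i] = (P[u] + lr * (e * Q[i] - reg * P[u]),
                          Q[i] + lr * (e * P[u] - reg * Q[i]))
    return mu, bu, bi, P, Q
```

The full model in the paper adds a neighborhood correction term to the same SGD loop.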

Representative Work

- Max-margin matrix factorization:
- Max-margin dimensionality reduction: [a lot of work here]
- Hinge loss + L2 regularization
min Σij h(yij − uiᵀDvj) + C1||D||F2 + C2||U||F2 + C3||V||F2

- Solve by SDP, cutting plane, etc.

- Max-Margin Matrix Factorization:[Srebro et al, NIPS 2005, ALT 2005]
- Hinge loss + Ky-Fan
min Σij h(mij − uiᵀvj) + C1||U||F2 + C2||V||F2

- Note: no constraint for the rank of U or V
- Solve by SDP

- CoFi-Rank: [Weimer et al, NIPS 2009]
- NDCG + Ky-Fan
min Σij n(mij − uiᵀvj) + C1||U||F2 + C2||V||F2

- Note: no constraint for the rank of U or V
- Solve by SDP, bundle methods


Representative Work

- Sparse coding:
[Lee et al, NIPS 2007] [Lee et al, IJCAI 2009]

- L2 sparse coding :
- L2 loss + L1 regularization
min ||M – UTV||F2+C1||U||1 +C2||V||F2

- Solve by LARS, IRLS, gradient descent with sign searching

- Exponential family sparse coding:
- Bregman divergence + L1 regularization
min Σab D(Mab || g(uaᵀvb)) + C1||U||1 + C2||V||F2

- Solve by gradient descent with sign searching

- Sparse is good --- my guess:
- More compact usually implies more predictive
- Sparsity poses a stronger prior, making local optima more distinguishable
- Shorter description length (the principle of MDL)

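For the L2-loss + L1 case, the codes U for a fixed dictionary V can be found by iterative soft-thresholding (ISTA), one member of the solver family listed above. A minimal sketch; the function name `ista_codes` is mine, and a full sparse coder would also alternate updates of the dictionary V:

```python
import numpy as np

def ista_codes(M, V, lam=0.1, iters=300):
    """Sparse codes: min_U ||M - U^T V||_F^2 + lam * ||U||_1 for a fixed dictionary V."""
    L = 2 * np.linalg.norm(V @ V.T, 2)        # Lipschitz constant of the smooth term
    step = 1.0 / L
    U = np.zeros((V.shape[0], M.shape[0]))
    for _ in range(iters):
        grad = -2 * V @ (M - U.T @ V).T       # gradient of the L2 term w.r.t. U
        Z = U - step * grad                   # gradient step ...
        U = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)  # ... then soft-threshold
    return U
```

The soft-threshold is the proximal operator of the L1 penalty; it is what zeroes out small coefficients and produces sparsity.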

Representative Work

- NMF, LDA, and Exponential PCA
- NMF: [Lee et al, NIPS 2001]
- L2 loss + nonnegative constraint
min ||M – UTV||F2, subject to: U≥0, V ≥0

- Solve by SDP, projected gradient descent, interior point

- LDA: [Blei et al, NIPS 2002]
- Asymmetric + Multinomial conditional + conjugate (Dirichlet) prior
ua~ Dir(α), zab~Disc(ua), Mab~Mult(V, zab)

- Variational Bayesian, EP, Gibbs sampling, collapsed VB/GS

- Exponential PCA: [Collins et al, NIPS 2002]
- Bregman divergence + orthogonal constraint
min Σab D(Mab || g(uaᵀvb)), subject to: VᵀV = I

- Solved by gradient descent

- Essentially, these are equivalent to each other

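The NMF objective above also admits the classic multiplicative update rules, which preserve nonnegativity without any explicit projection. A minimal sketch writing the approximation as M ≈ W H (i.e., W = Uᵀ); the constants are my own choices:

```python
import numpy as np

def nmf(M, k=2, iters=1000, seed=0, eps=1e-9):
    """Multiplicative updates for min ||M - W H||_F^2, W >= 0, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W = rng.random((m, k)) + 0.1              # strictly positive init
    H = rng.random((k, n)) + 0.1
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)  # elementwise ratios keep H nonnegative
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because every update multiplies by a nonnegative ratio, the constraint is maintained automatically; the small `eps` only guards against division by zero.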

Representative Work

- Link analysis:
- Factor based / bi-clustering:[a lot of papers in co-clustering and social network analysis]
- L2 loss + L2 regularization
min ||M – UTDV||F2+C1||U||1 +C2||V||F2 +C3||D||F2

- To further simplify, assume diagonal or even identity D
- Modern models use logistic regression

- Bayesian Co-clustering [Shan et al ICDM 2008]
Or Mixed membership stochastic block model [Airoldi et al, NIPS 2008]

- Symmetric + Bernoulli conditional + Dirichlet prior
ui ~ Dir(α), zi ~ Disc(ui), Mij ~ Bernoulli(sigmoid(ziᵀDzj))

- Nonparametric feature model:[Miller et al, NIPS 2010]
- Symmetric + Bernoulli conditional + Nonparametric prior
zi ~ IBP(α), Mij ~ Bernoulli(sigmoid(ziᵀDzj))

- In essence, equivalent


Representative Work

- Joint Link & content analysis:
- Collective factorization:
- L2 loss + L2 regularization [Long et al, ICML 2006, AAAI 2008; Zhou et al, WWW 2008]
- Or Laplacian smoothness loss + orthogonal[Zhou et al ICML 2007]
- Shared representation matrix
min ||M – UTDU||F2 +||F – UTB||+C1||U||1 +C2||B||F2 +C3||D||F2

- Relational topic model:[Chang et al, AISTATS 2009, KDD 2009]
- For M: Symmetric + Bernoulli conditional + Dirichlet Prior
- For F: Asymmetric + Multinomial conditional + Dirichlet Prior
- Shared representation matrix
ui ~ Dir(α), zif ~ Disc(ui), Fif ~ Mult(B, zif), Mij ~ Bernoulli(sigmoid(ziᵀDzj))

- Regression based latent factor model:[Agarwal et al, KDD 2009]
- For M: Symmetric + Gaussian conditional + Gaussian Prior
- For F: Linear regression (Gaussian)
zi ~ Gaussian(BxFi, σI), Mij ~ Gaussian(ziᵀzj)

- fLDA model:[Agarwal et al, WSDM 2009]
- LDA content factorization + Gaussian factorization model

- In essence, equivalent


[Figure: a three-way tensor entry Mijk is modeled from mode factors ui, vj, wk — either as a sum of pairwise interactions (ui·vj + ui·wk + vj·wk) or as a full three-way product.]
Representative Work

- Tensor factorization / hypergraph mining and personalized CF:
- Two-way model: [Rendle et al WSDM 2010, WWW 2010]
min Σijk (Mijk − uiᵀDvj − uiᵀDwk − vjᵀDwk)2 + C(||U||F2 + ||V||F2 + ||W||F2 + ||D||F2)

- Full factorization: [Symeonidis et al RecSys 2008, Rendle et al KDD 2009]
min Σijk (Mijk − ⟨ui, vj, wk⟩)2 + C(||U||F2 + ||V||F2 + ||W||F2)
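The full (CP-style) factorization can be fit with the same SGD recipe as in the matrix case, now over observed (i, j, k, value) entries. A minimal sketch; the function name `cp_sgd` and the constants are my own choices:

```python
import numpy as np

def cp_sgd(entries, dims, k=2, lr=0.02, reg=0.001, epochs=4000, seed=0):
    """SGD on observed (i, j, l, value) entries of a 3-way tensor,
    modeling M_ijl ≈ <u_i, v_j, w_l> = sum_f U[i,f] V[j,f] W[l,f]."""
    rng = np.random.default_rng(seed)
    U = 0.5 * rng.standard_normal((dims[0], k))
    V = 0.5 * rng.standard_normal((dims[1], k))
    W = 0.5 * rng.standard_normal((dims[2], k))
    for _ in range(epochs):
        for i, j, l, x in entries:
            e = x - np.sum(U[i] * V[j] * W[l])     # residual of the 3-way inner product
            U[i], V[j], W[l] = (U[i] + lr * (e * V[j] * W[l] - reg * U[i]),
                                V[j] + lr * (e * U[i] * W[l] - reg * V[j]),
                                W[l] + lr * (e * U[i] * V[j] - reg * W[l]))
    return U, V, W
```

Each factor's gradient is the residual times the elementwise product of the other two factors, mirroring the two-factor case.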



Summary and discussion

- Recipe for designing an MF model:
- Step 1: Understand your task / data:
- What is the goal of my task?
- What is the underlying mechanism in the task?
- Knowledge, patterns, heuristics, clues…

- What data are available to support my task?
- Are all the available data sources reliable and useful to achieve the goal? Any preprocessing/aggregation needed?
- What is the basic characteristic of my data?
- Symmetric, directional
- positive, fractional, centered, bounded
- positive definite, triangle inequality
- Which distribution is appropriate to interpret my data?

- Any special concerns for the task?
- Task requirement: is there a need for online operation?
- Resource constraints: computational cost, labeled data, …


Summary and discussion

- Recipe for designing an MF model:
- Step 2: Choose an appropriate model:
- Computational or statistic?
- Computational models are generally efficient, easy to implement, and usable as off-the-shelf black boxes (no fancy skills needed)…
- Statistical models are usually interpretable, robust to overfitting, prior-knowledge-friendly, and promising if properly designed…

- If computational:
- Which loss function?
- L2, most popular, most efficient, generally promising
- Evident heavy noise: L1, Huber, ε-loss
- Dominant locality: Laplacian smoothness
- Specific distribution: Bregman divergence (also use a link function)
- Measurable prediction quality: wrap the prediction objective
- Readily translated knowledge, heuristic, clue:

- What regularization?
- L2, most popular, most efficient
- Any constraints to retain?
- Sparsity: L1
- Dominant locality: Laplacian smoothness
- Readily translated knowledge, heuristic, clue


Summary and discussion

- Recipe for designing an MF model:
- Step 2: Choose an appropriate model (cont):

- If statistic:
- How to decompose the joint pdf?
- To reflect the underlying mechanism
- To efficiently parameterize

- What’s the appropriate model for each pdf factor?
- To encode prior knowledge/underlying mechanism
- To reflect the data distribution

- What’s the appropriate prior for Bayesian treatment?
- Conjugate:
- Sparsity: Laplacian, exponential
- Nonparametric prior
- No idea? Choose none, or a noninformative prior


Summary and discussion

- Recipe for designing an MF model:
- Step 3: Choose or derive an algorithm:
- To meet task requirement and/or resource constraints
- To ease implementation
- To achieve the best of the performance
Deterministic:

- Spectral analysis
- Matrix decomposition: SVD, QR, LU
- Solving linear system
- Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, etc.
- Alternating coordinate descent
- LARS, IRLS
- EM
- Mean field, Variational Bayesian, Expectation Propagation, collapsed VB
- …
Stochastic:

- Stochastic gradient descent (back propagation, message passing)
- Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
- Random walk
- Simulated annealing, annealing EM
- Randomized projection
- …


Summary and discussion

- Other thoughts
- Link propagation:
- Friendship / correlation
- Preprocessing:
- Propagate S (self-propagation or based on an auxiliary similarity matrix)
- S is required to be a (row-)stochastic matrix (positive entries, each row sums to 1)

- Postprocessing:
- Propagate P (using S or an auxiliary similarity matrix)
- Both S and P are required to be stochastic matrices
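A minimal sketch of the normalization and propagation step described above; the function names are my own, and `S` can be any nonnegative similarity matrix:

```python
import numpy as np

def row_stochastic(S, eps=1e-12):
    """Normalize a nonnegative similarity matrix so each row sums to 1."""
    S = np.maximum(S, 0)                              # clip any negative similarities
    return S / (S.sum(axis=1, keepdims=True) + eps)   # eps guards empty rows

def propagate(P, S, steps=1):
    """Propagate scores P along the normalized links: P <- S P, repeated."""
    S = row_stochastic(S)
    for _ in range(steps):
        P = S @ P
    return P
```

Because S is row-stochastic, each propagation step replaces a node's scores with a convex combination of its neighbors' scores, so the score mass is conserved.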



Summary and discussion

- Other thoughts
- Smoothness:
- Friendship / neighborhood
- Correlation, same-category


[Figure: Laplacian smoothness (more parameters, but could be made parameter-free) vs. spectral smoothness via low-pass filtering (a single parameter).]

Any comments would be appreciated!
