Matrix Factorization: Models, Algorithms and Applications


Outline

  • Problem Definition

  • Overview

    • Taxonomy by targeted tasks

    • Taxonomy by models

    • Taxonomy by algorithms

  • Representative work

  • Summary and Discussion


Outline

  • Problem Definition

  • Overview

    • Taxonomy by targeted tasks

    • Taxonomy by models

    • Taxonomy by algorithms

  • Representative work

  • Summary and Discussion



[Diagram: an entry M(a,b) of matrix M is modeled from row factor u(a), column factor v(b) and core matrix D]

Problem Definition

  • Matrix Factorization: for a given matrix M, find a compact (low-rank) approximation

    • M may be partially observed (i.e., some entries are missing)

    • In the simplest form:

      (U,V) = argmin ||M − UᵀV||_F²

      • the identity function f(x) = x is used as the link function

      • U, V and D interact multiplicatively

      • D is assumed to be the identity matrix

      • Euclidean distance is used as the measure of goodness
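
As a concrete illustration of this simplest form, here is a minimal sketch (my addition, not from the slides) that fits U and V by stochastic gradient descent over the observed entries of M; the rank, learning rate and regularization weight are arbitrary choices.

```python
import numpy as np

def factorize(M, mask, rank=5, lr=0.01, reg=0.1, epochs=100, seed=0):
    """Minimal MF sketch: min ||mask*(M - U^T V)||_F^2 + reg*(||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((rank, n))   # column u_a for row entity a
    V = 0.1 * rng.standard_normal((rank, m))   # column v_b for column entity b
    rows, cols = np.nonzero(mask)              # indices of the observed entries
    for _ in range(epochs):
        for a, b in zip(rows, cols):
            err = M[a, b] - U[:, a] @ V[:, b]
            U[:, a] += lr * (err * V[:, b] - reg * U[:, a])
            V[:, b] += lr * (err * U[:, a] - reg * V[:, b])
    return U, V

# Usage: a partially observed 4 x 5 matrix
M = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0, 1.0, 3.0])
mask = np.random.default_rng(1).random(M.shape) < 0.8
U, V = factorize(M, mask, rank=2)
print(np.round(U.T @ V, 2))
```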



[Diagram: a set of related matrices coupling entities a, b and c]

Problem Definition (cont)

  • Matrix Co-Factorization

    For a given set of related matrices {M}, find a coupled set of compact (low-rank) approximations

    • Each M represents an observed interaction between two entity types

    • Multi-View MF:

    • Joint MF:


Outline

  • Problem Definition

  • Overview

    • Taxonomy by targeted tasks

    • Taxonomy by models

    • Taxonomy by algorithms

  • Representative work

  • Summary and Discussion


Overview: MF taxonomy

  • Targeted tasks:

    • Dimensionality reduction

      • PCA and other spectral algorithms

      • LSI and other SVD algorithms

      • NMF and other (convex) optimization algorithms

      • PPCA and other statistical models

    • Clustering

      • K-means, mean-shift, min-cut, normalized cut, NMF, etc.

      • Gaussian mixture model, Bi-Gaussian model, etc.

    • Factor analysis (e.g., profiling, decomposition)

      • ICA, CCA, NMF, etc.

      • SVD, MMMF, etc.

    • Codebook learning

      • Sparse coding, k-means, NMF, LDA, etc.

    • Topic modeling

      • LSI, LDA, PLSI, etc.

    • Graph mining

      • Random walk, PageRank, HITS, etc.

    • Prediction

      • Classification, Regression

      • Link prediction, matrix completion, community detection

      • Collaborative filtering, recommendation, learning to rank

      • Domain adaptation, multi-task learning


Overview: MF taxonomy

  • Models:

    • Computational: by optimization

      models differ in objective and regularizer design

      • Objective:

        • L2 error minimization (least squares; Frobenius norm in matrix form)

        • L1 error minimization (least absolute deviation)

        • Hinge, logistic, log, cosine loss

        • Huber loss, ε-loss, etc.

        • Information-theoretic loss: entropy, mutual information, KL-divergence

        • Exponential family loss and Bregman divergence: logistic, log, etc.

        • Graph Laplacian and smoothness

        • Joint loss of fitting error and prediction accuracy

      • Regularizer:

        • L2 norm, L1 norm, Ky Fan norms (e.g., nuclear norm)

        • Graph Laplacian and smoothness

        • Lower & upper bound constraints (e.g., positivity constraint)

        • Other constraints: linear constraints (e.g., probability normalization), quadratic constraints (e.g., covariance), orthogonality constraints

    • Statistical: by inference

      models differ in factorization, prior and conditional design

      • Factorization:

        • Symmetric: p(a,b) = Σ_z p(z) p(a|z) p(b|z)

        • Asymmetric: p(a,b) = p(a) Σ_z p(z|a) p(b|z)

      • Conditional: usually exponential family

        • Gaussian, Laplacian, Multinomial, Bernoulli, Poisson

      • Prior:

        • Conjugate prior

        • Commonly used: Gaussian, Laplacian (or exponential), Dirichlet

        • Non-informative prior: max entropy prior, etc.

        • Nonparametric: ARD, Chinese restaurant process, Indian buffet process, etc.


Overview: MF taxonomy

  • Models:

    • Connection between the two lines [Collins et al., NIPS 2002; Long et al., KDD 2007; Singh et al., ECML 2008]

      • loss function (computational) ↔ conditional (statistical): L2 ↔ Gaussian, L1 ↔ Laplacian, Logistic ↔ Bernoulli, Bregman/KL ↔ Exponential family

      • regularization (computational) ↔ prior (statistical): L2 ↔ Gaussian, L1 ↔ Laplacian/Exponential, Laplacian smoothness ↔ Gaussian random field
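
To make the correspondence concrete, the Gaussian/L2 row follows from a standard MAP argument (my addition, not from the slides): with a Gaussian conditional on the observed entries Ω and Gaussian priors on the factor columns, MAP estimation reduces to L2-regularized least squares.

```latex
\begin{aligned}
(U,V)_{\text{MAP}}
&= \arg\max_{U,V}\ \prod_{(a,b)\in\Omega}\mathcal{N}\!\left(M_{ab}\mid u_a^{\top}v_b,\ \sigma^2\right)
   \prod_a \mathcal{N}\!\left(u_a\mid 0,\ \sigma_u^2 I\right)
   \prod_b \mathcal{N}\!\left(v_b\mid 0,\ \sigma_v^2 I\right)\\
&= \arg\min_{U,V}\ \sum_{(a,b)\in\Omega}\left(M_{ab}-u_a^{\top}v_b\right)^2
   + \frac{\sigma^2}{\sigma_u^2}\lVert U\rVert_F^2
   + \frac{\sigma^2}{\sigma_v^2}\lVert V\rVert_F^2 .
\end{aligned}
```

The same pattern gives the other rows: a Laplacian prior yields an L1 penalty, and a Bernoulli conditional with a sigmoid link yields the logistic loss.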


Overview: MF taxonomy

  • Algorithm:

    • Deterministic:

      • Spectral analysis

      • Matrix decomposition: SVD, QR, LU

      • Solving linear system

      • Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, SDP, etc.

      • Alternating coordinate descent

      • LARS, IRLS

      • EM

      • Mean field, Variational Bayesian, Expectation Propagation, collapsed VB

    • Stochastic:

      • Stochastic gradient descent (back propagation, message passing)

      • Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC

      • Random walk

      • Simulated annealing, annealing EM

      • Randomized projection
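
As an illustration of the alternating coordinate descent entry above, here is a minimal alternating-least-squares sketch (my addition, assuming a fully observed M; rank and regularization weight are arbitrary) for min ||M − UᵀV||_F² + reg (||U||_F² + ||V||_F²): with one factor fixed, the other has a closed-form ridge-regression update.

```python
import numpy as np

def als(M, rank=5, reg=0.1, iters=20, seed=0):
    """Alternating least squares for min ||M - U^T V||_F^2 + reg*(||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((rank, n))
    V = 0.1 * rng.standard_normal((rank, m))
    I = np.eye(rank)
    for _ in range(iters):
        # Fix V, solve for U:  (V V^T + reg I) U = V M^T
        U = np.linalg.solve(V @ V.T + reg * I, V @ M.T)
        # Fix U, solve for V:  (U U^T + reg I) V = U M
        V = np.linalg.solve(U @ U.T + reg * I, U @ M)
    return U, V
```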


Outline

  • Problem Definition

  • Overview

    • Taxonomy by targeted tasks

    • Taxonomy by models

    • Taxonomy by algorithms

  • Representative work

  • Summary and Discussion


Representative Work

  • Spectral dimensionality reduction / clustering:

    • PCA:

      • L2 loss + orthogonal constraint

        min ||M − UᵀV||_F², subject to: VᵀV = I

      • Solve by spectral analysis of MᵀM

      • Analogous to factor-based collaborative filtering

    • Laplacian eigenmap:

      • Laplacian smoothness + orthogonal constraint

        min Σ_ij w_ij ||u_i V − u_j V||², subject to: VᵀV = I

      • Graph encodes neighborhood info, e.g., heat kernel, kNN

      • Analogous to neighbor-based collaborative filtering
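
A minimal sketch of the PCA case (my addition, assuming a fully observed M): after centering, the optimal V under the orthogonality constraint is spanned by the top right singular vectors of M, i.e., the top eigenvectors of MᵀM.

```python
import numpy as np

def pca_factorization(M, rank=2):
    """PCA as MF: min ||M - U^T V||_F^2 with orthonormal rows of V, via SVD of centered M."""
    Mc = M - M.mean(axis=0, keepdims=True)           # center each column
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    V = Vt[:rank]                                    # top-k right singular vectors (k x m)
    U = V @ Mc.T                                     # projections, so Mc is approx U^T V
    return U, V
```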


Representative Work

  • Plain CF:

    • Factor based (Ky-Fan):

      • L2 loss + L2 regularization

        min ||M − UᵀV||_F² + C1 ||U||_F² + C2 ||V||_F²

      • Solve by SVD

      • Analogous to LSI, PCA, etc.

    • Neighbor based (item or query oriented):

      • Does not explicitly perform factorization, but can be viewed equivalently as

        min Σ_ij w_ij ||u_i M_i − u_j M_i||², subject to: Σ_v u_iv = 1, u_iv ≥ 0

      • Graph encodes neighborhood info, e.g., heat kernel, kNN

      • Analogous to k-means, Laplacian eigenmap
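
For the neighbor-based view, here is a minimal item-based sketch (my addition; cosine similarity and k = 2 neighbors are arbitrary choices): a missing rating is predicted as a similarity-weighted average of the user's ratings on the most similar items.

```python
import numpy as np

def item_based_predict(R, user, item, k=2):
    """Predict R[user, item] from the user's ratings on the k most similar items."""
    rated = np.nonzero(R[user])[0]                     # items this user has rated
    rated = rated[rated != item]
    def cos(a, b):                                     # cosine similarity of two item columns
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom > 0 else 0.0
    sims = np.array([cos(R[:, item], R[:, j]) for j in rated])
    top = np.argsort(sims)[-k:]                        # the k nearest rated items
    w = sims[top]
    if w.sum() <= 0:
        return R[user, rated].mean() if len(rated) else 0.0
    return float(w @ R[user, rated[top]] / w.sum())

# Usage: a tiny user-item rating matrix, predicting the missing entry (0, 3)
R = np.array([[5, 4, 0, 0],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], dtype=float)
print(item_based_predict(R, user=0, item=3))
```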


Representative Work

  • Joint factor-neighbor based CF:

    [Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD’08]

    • L2 loss + L2 regularization

    • Neighbor graph constructed by Pearson correlation

    • Solve by stochastic gradient descent

    • Analogous to locality (Laplacian) regularized PCA.


Representative Work

  • Max-margin matrix factorization:

    • Max-margin dimensionality reduction: [a lot of work here]

      • Hinge loss + L2 regularization

        min Σ_ij h(y_ij − u_iᵀ D v_j) + C1 ||D||_F² + C2 ||U||_F² + C3 ||V||_F²

      • Solve by SDP, cutting plane, etc.

    • Max-Margin Matrix Factorization:[Srebro et al, NIPS 2005, ALT 2005]

      • Hinge loss + Ky-Fan

        min Σ_ij h(m_ij − u_iᵀ v_j) + C1 ||U||_F² + C2 ||V||_F²

      • Note: no constraint on the rank of U or V

      • Solve by SDP

    • CoFi-Rank: [Weimer et al, NIPS 2007]

      • NDCG + Ky-Fan

        min Σ_ij n(m_ij − u_iᵀ v_j) + C1 ||U||_F² + C2 ||V||_F²

      • Note: no constraint on the rank of U or V

      • Solve by SDP, bundle methods
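
For intuition, a minimal hinge-loss MF sketch (my addition, using plain subgradient descent rather than the SDP formulations above): binary entries y_ij ∈ {−1, +1} are fit by minimizing Σ max(0, 1 − y_ij · u_iᵀ v_j) plus L2 penalties over the observed entries.

```python
import numpy as np

def hinge_mf(Y, mask, rank=3, lr=0.05, reg=0.05, epochs=200, seed=0):
    """Hinge-loss MF: sum over observed (i,j) of max(0, 1 - y_ij * u_i^T v_j) + L2 penalties."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((rank, n))
    V = 0.1 * rng.standard_normal((rank, m))
    observed = list(zip(*np.nonzero(mask)))
    for _ in range(epochs):
        for i, j in observed:
            margin = Y[i, j] * (U[:, i] @ V[:, j])
            gU, gV = reg * U[:, i], reg * V[:, j]
            if margin < 1:                       # subgradient of the hinge term
                gU -= Y[i, j] * V[:, j]
                gV -= Y[i, j] * U[:, i]
            U[:, i] -= lr * gU
            V[:, j] -= lr * gV
    return U, V
```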


Representative Work

  • Sparse coding:

    [Lee et al, NIPS 2007] [Lee et al, IJCAI 2009]

    • L2 sparse coding :

      • L2 loss + L1 regularization

        min ||M − UᵀV||_F² + C1 ||U||_1 + C2 ||V||_F²

      • Solve by LARS, IRLS, gradient descent with sign searching

    • Exponential family sparse coding:

      • Bregman divergence + L1 regularization

        min Σ_ab D(M_ab || g(u_aᵀ v_b)) + C1 ||U||_1 + C2 ||V||_F²

      • Solve by gradient descent with sign searching

    • Why sparsity is good --- my guess:

      • more compact usually implies more predictive

      • sparsity imposes a stronger prior, making local optima more distinguishable

      • shorter description length (the principle of MDL)
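
A minimal sketch of the L2 sparse-coding objective (my addition): alternate ISTA-style soft-thresholding steps on the sparse codes U with a closed-form ridge update of the dictionary V.

```python
import numpy as np

def soft_threshold(X, t):
    """Proximal operator of the L1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def sparse_coding(M, rank=10, c1=0.1, c2=0.1, iters=50, inner=20, seed=0):
    """min ||M - U^T V||_F^2 + c1*||U||_1 + c2*||V||_F^2 by alternating ISTA / ridge steps."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = np.zeros((rank, n))
    V = 0.1 * rng.standard_normal((rank, m))
    I = np.eye(rank)
    for _ in range(iters):
        L = 2 * np.linalg.norm(V @ V.T, 2) + 1e-8     # Lipschitz constant of the smooth part
        for _ in range(inner):                        # ISTA steps on the codes U (V fixed)
            grad = 2 * V @ (V.T @ U - M.T)            # gradient of ||M - U^T V||_F^2 w.r.t. U
            U = soft_threshold(U - grad / L, c1 / L)
        V = np.linalg.solve(U @ U.T + c2 * I, U @ M)  # ridge update of the dictionary V (U fixed)
    return U, V
```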


Representative Work

  • NMF, LDA, and Exponential PCA

    • NMF: [Lee et al, NIPS 2001]

      • L2 loss + nonnegative constraint

        min ||M − UᵀV||_F², subject to: U ≥ 0, V ≥ 0

      • Solve by SDP, projected gradient descent, interior point

    • LDA: [Blei et al, NIPS 2002]

      • Asymmetric + Multinomial conditional + conjugate (Dirichlet) prior

        u_a ~ Dir(α), z_ab ~ Disc(u_a), M_ab ~ Mult(V, z_ab)

      • Variational Bayesian, EP, Gibbs sampling, collapsed VB/GS

    • Exponential PCA: [Collins et al, NIPS 2002]

      • Bregman divergence + orthogonal constraint

        min Σ_ab D(M_ab || g(u_aᵀ v_b)), subject to: VᵀV = I

      • Solved by gradient descent

    • Essentially, these are equivalent to each other
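
Besides the solvers listed above, the classic Lee-Seung multiplicative updates give a very compact NMF routine. A minimal sketch (my addition), writing the factorization as M ≈ W H with W = Uᵀ and assuming M is elementwise nonnegative:

```python
import numpy as np

def nmf(M, rank=5, iters=200, eps=1e-9, seed=0):
    """NMF via Lee-Seung multiplicative updates for min ||M - W H||_F^2, W >= 0, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (M @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```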


Representative Work

  • Link analysis:

    • Factor based / bi-clustering: [a lot of papers in co-clustering and social network analysis]

      • L2 loss + L2 regularization

        min ||M − UᵀDV||_F² + C1 ||U||_F² + C2 ||V||_F² + C3 ||D||_F²

      • To further simplify, assume diagonal or even identity D

      • Modern models use logistic regression

    • Bayesian Co-clustering [Shan et al ICDM 2008]

      Or Mixed membership stochastic block model [Airoldi et al, NIPS 2008]

      • Symmetric + Bernoulli conditional + Dirichlet prior

        u_i ~ Dir(α), z_i ~ Disc(u_i), M_ij ~ sigmoid(z_iᵀ D z_j)

    • Nonparametric feature model: [Miller et al, NIPS 2009]

      • Symmetric + Bernoulli conditional + Nonparametric prior

        z_i ~ IBP(α), M_ij ~ sigmoid(z_iᵀ D z_j)

    • In essence, equivalent


Representative Work

  • Joint Link & content analysis:

    • Collective factorization:

      • L2 loss + L2 regularization [Long et al, ICML 2006, AAAI 2008; Zhou et al, WWW 2008]

      • Or Laplacian smoothness loss + orthogonal constraint [Zhou et al, ICML 2007]

      • Shared representation matrix

        min ||M − UᵀDU||_F² + ||F − UᵀB||_F² + C1 ||U||_F² + C2 ||B||_F² + C3 ||D||_F²

    • Relational topic model: [Chang et al, AISTATS 2009, KDD 2009]

      • For M: Symmetric + Bernoulli conditional + Dirichlet Prior

      • For F: Asymmetric + Multinomial conditional + Dirichlet Prior

      • Shared representation matrix

        u_i ~ Dir(α), z_if ~ Disc(u_i), F_if ~ Mult(B, z_if), M_ij ~ sigmoid(z_iᵀ D z_j)

    • Regression-based latent factor model: [Agarwal et al, KDD 2009]

      • For M: Symmetric + Gaussian conditional + Gaussian Prior

      • For F: Linear regression (Gaussian)

        z_i ~ Gaussian(B F_i, σI), M_ij ~ Gaussian(z_iᵀ z_j)

    • fLDA model:[Agarwal et al, WSDM 2009]

      • LDA content factorization + Gaussian factorization model

    • In essence, equivalent
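
A minimal collective-factorization sketch (my addition, simplified by fixing D = I): a link matrix M ≈ UᵀU and a content matrix F ≈ UᵀB share the representation U and are fit jointly by gradient descent; learning rate and regularization weight are arbitrary.

```python
import numpy as np

def collective_mf(M, F, rank=5, lr=0.005, reg=0.1, epochs=500, seed=0):
    """Jointly factorize a link matrix M ~ U^T U and a content matrix F ~ U^T B with shared U."""
    rng = np.random.default_rng(seed)
    n, d = F.shape                        # n entities, d content features; M is n x n
    U = 0.1 * rng.standard_normal((rank, n))
    B = 0.1 * rng.standard_normal((rank, d))
    for _ in range(epochs):
        Em = U.T @ U - M                  # link reconstruction error
        Ef = U.T @ B - F                  # content reconstruction error
        gU = 2 * (U @ Em + U @ Em.T) + 2 * B @ Ef.T + 2 * reg * U
        gB = 2 * U @ Ef + 2 * reg * B
        U -= lr * gU
        B -= lr * gB
    return U, B
```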


[Diagram: tensor entry M_ijk modeled from pairwise interactions of factors u_i, v_j, w_k (two-way model) and from their three-way product (full factorization)]

Representative Work

  • Tensor factorization/hypergraph mining and personalized CF:

    • Two-way model: [Rendle et al WSDM 2010, WWW 2010]

      min Σ_ijk (M_ijk − u_iᵀ D v_j − u_iᵀ D w_k − v_jᵀ D w_k)² + C (||U||_F² + ||V||_F² + ||W||_F² + ||D||_F²)

    • Full factorization: [Symeonidis et al RecSys 2008, Rendle et al KDD 2009]

      min Σ_ijk (M_ijk − ⟨u_i, v_j, w_k⟩)² + C (||U||_F² + ||V||_F² + ||W||_F²)
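
A minimal sketch of the full-factorization model (my addition): each observed tensor entry is approximated by the three-way product ⟨u_i, v_j, w_k⟩ = Σ_r u_ir v_jr w_kr, and the factors are fit by stochastic gradient descent over the observed entries.

```python
import numpy as np

def cp_sgd(entries, shape, rank=4, lr=0.02, reg=0.05, epochs=200, seed=0):
    """Fit M_ijk ~ <u_i, v_j, w_k> from a list of observed (i, j, k, value) tuples."""
    rng = np.random.default_rng(seed)
    n, m, p = shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    W = 0.1 * rng.standard_normal((p, rank))
    for _ in range(epochs):
        for i, j, k, val in entries:
            err = val - np.sum(U[i] * V[j] * W[k])   # residual of <u_i, v_j, w_k>
            gU = -err * V[j] * W[k] + reg * U[i]
            gV = -err * U[i] * W[k] + reg * V[j]
            gW = -err * U[i] * V[j] + reg * W[k]
            U[i] -= lr * gU
            V[j] -= lr * gV
            W[k] -= lr * gW
    return U, V, W

# Usage: a few observed entries of a 3 x 3 x 2 tensor
entries = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 2, 0, 3.0), (2, 1, 1, 0.5)]
U, V, W = cp_sgd(entries, shape=(3, 3, 2))
```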


Outline

  • Problem Definition

  • Overview

    • Taxonomy by targeted tasks

    • Taxonomy by models

    • Taxonomy by algorithms

  • Representative work

  • Summary and Discussion


Summary and discussion

  • Recipe for designing an MF model:

    • Step 1: Understand your task / data:

      • What is the goal of my task?

      • What is the underlying mechanism in the task?

        • Knowledge, patterns, heuristics, clues…

      • What data are available to support my task?

      • Are all the available data sources reliable and useful to achieve the goal? Any preprocessing/aggregation needed?

      • What is the basic characteristic of my data?

        • Symmetric, directional

        • positive, fractional, centralized, bounded

        • positive definite, triangle inequality

        • Which distribution is appropriate to interpret my data?

      • Any special concerns for the task?

        • Task requirement: is there a need for online operation?

        • Resource constraints: computational cost, labeled data, …


Summary and discussion

  • Recipe for designing an MF model:

    • Step 2: Choose an appropriate model:

      • Computational or statistical?

        • Computational models are generally efficient, easy to implement, and usable as off-the-shelf black boxes (no need for fancy skills)…

        • Statistical models are usually interpretable, robust to overfitting, friendly to prior knowledge, and promising if properly designed…

      • If computational:

        • Which loss function?

          • L2, most popular, most efficient, generally promising

          • Evidently heavy noise: L1, Huber, ε-insensitive loss

          • Dominant locality: Laplacian smoothness

          • Specific distribution: Bregman divergence (also use a link function)

          • Measurable prediction quality: wrap the prediction objective into the loss

          • Readily translated knowledge, heuristics, clues

        • What regularization?

          • L2, most popular, most efficient

          • Any constraints to retain?

          • Sparsity: L1

          • Dominant locality: Laplacian smoothness

          • Readily translated knowledge, heuristics, clues


Summary and discussion

  • Recipe for designing an MF model:

    • Step 2: Choose an appropriate model (cont):

      • Computational or statistical?

        • Computational models are generally efficient, easy to implement, and usable as off-the-shelf black boxes (no need for fancy skills)…

        • Statistical models are usually interpretable, robust to overfitting, friendly to prior knowledge, and promising if properly designed…

      • If statistic:

        • How to decompose the joint pdf?

          • To reflect the underlying mechanism

          • To efficiently parameterize

        • What’s the appropriate model for each pdf factor?

          • To encode prior knowledge/underlying mechanism

          • To reflect the data distribution

        • What’s the appropriate prior for Bayesian treatment?

          • Conjugate prior

          • Sparsity: Laplacian, exponential

          • Nonparametric prior

          • No idea? Choose none or noninformative


Summary and discussion

  • Recipe for designing an MF model:

    • Step 3: Choose or derive an algorithm:

      • To meet task requirements and/or resource constraints

      • To ease implementation

      • To achieve the best performance

      • Deterministic:

      • Spectral analysis

      • Matrix decomposition: SVD, QR, LU

      • Solving linear system

      • Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, etc.

      • Alternating coordinate descent

      • LARS, IRLS

      • EM

      • Mean field, Variational Bayesian, Expectation Propagation, collapsed VB

      • Stochastic:

      • Stochastic gradient descent (back propagation, message passing)

      • Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC

      • Random walk

      • Simulated annealing, annealing EM

      • Randomized projection


Summary and discussion

  • Other thoughts

    • Link propagation:

      • Friendship / correlation

      • Preprocessing:

        • Propagate S (self-propagation or based on an auxiliary similarity matrix)

        • S is required to be a stochastic matrix (positive entries, row sums = 1)

      • Postprocessing:

        • Propagate P (using S or an auxiliary similarity matrix)

        • Both S and P are required to be stochastic matrices
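
A minimal propagation sketch (my addition; the mixing weight alpha is an arbitrary choice): a prediction matrix P is repeatedly mixed with its neighbors' values through a row-stochastic similarity matrix S.

```python
import numpy as np

def propagate(P, S, alpha=0.5, steps=10):
    """Smooth predictions P through a row-stochastic similarity matrix S."""
    assert (S >= 0).all() and np.allclose(S.sum(axis=1), 1.0)   # nonnegative entries, row sums = 1
    for _ in range(steps):
        P = alpha * S @ P + (1 - alpha) * P    # mix each row with its neighbors' rows
    return P
```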



Summary and discussion

  • Other thoughts

    • Smoothness:

      • Friendship / neighborhood

      • Correlation, same-category

[Figure annotations: more parameters, but could be parameter-free · applying low-pass filtering · single parameter · spectral smoothness]



Thanks!

Any comments would be appreciated!