
CMU

Density Ratio Estimation: A New Versatile Tool for Machine Learning

Department of Computer Science

Tokyo Institute of Technology

Masashi Sugiyama

[email protected]

http://sugiyama-www.cs.titech.ac.jp/~sugi



Overview of My Talk (1)

  • Consider the ratio of two probability densities: r(x) = p_nu(x) / p_de(x) (numerator over denominator).

  • If the ratio is estimated, various machine learning problems can be solved!

    • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test
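As a minimal illustration (my own sketch, not from the slides): for two unit-variance Gaussians the ratio is available in closed form, which is handy for sanity-checking the estimators discussed later.

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    # Univariate Gaussian density N(mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def density_ratio(x, mu_nu=1.0, mu_de=0.0):
    # r(x) = p_nu(x) / p_de(x); for unit-variance Gaussians this simplifies
    # to exp(x * (mu_nu - mu_de) - (mu_nu**2 - mu_de**2) / 2)
    return gauss_pdf(x, mu_nu) / gauss_pdf(x, mu_de)

print(density_ratio(0.0))  # exp(-1/2) ≈ 0.6065
```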



Overview of My Talk (2)

Vapnik said: “When solving a problem of interest, one should not solve a more general problem as an intermediate step.”

  • Estimating density ratio is substantially easier than estimating densities!

  • Various direct density ratio estimation methods have been developed recently.

[Diagram: knowing the densities implies knowing the ratio, but not the converse]



Organization of My Talk

  • Usage of Density Ratios:

    • Covariate shift adaptation

    • Outlier detection

    • Mutual information estimation

    • Conditional probability estimation

  • Methods of Density Ratio Estimation



Covariate Shift Adaptation

Shimodaira (JSPI2000)

  • Training/test input distributions are different, but target function remains unchanged:

    “extrapolation”

[Figure: training samples, test samples, the two input densities, and the fixed target function]


Adaptation Using Density Ratios

  • Ordinary least-squares is not consistent under covariate shift.

  • Density-ratio weighted least-squares is consistent.

Applicable to any likelihood-based method!



Model Selection

  • Controlling bias-variance trade-off is important in covariate shift adaptation.

    • No weighting: low-variance, high-bias

    • Density ratio weighting: low-bias, high-variance

  • Density ratio weighting plays an important role for unbiased model selection:

    • Akaike information criterion (regular models)

    • Subspace information criterion (linear models)

    • Cross-validation (arbitrary models)

Shimodaira (JSPI2000)

Sugiyama & Müller (Stat&Dec.2005)

Sugiyama, Krauledat & Müller (JMLR2007)



Active Learning

  • Covariate shift naturally occurs in active learning scenarios:

    • The training input distribution is designed by users.

    • The test input distribution is determined by the environment.

  • Density ratio weighting plays an important role for low-bias active learning.

    • Population-based methods:

    • Pool-based methods:

Wiens (JSPI2000)

Kanamori & Shimodaira (JSPI2003)

Sugiyama (JMLR2006)

Kanamori (NeuroComp2007),

Sugiyama & Nakajima (MLJ2009)



Real-world Applications

  • Age prediction from faces (with NECsoft):

    • Illumination and angle change (KLS&CV)

  • Speaker identification (with Yamaha):

    • Voice quality change (KLR&CV)

  • Brain-computer interface:

    • Mental condition change (LDA&CV)

Ueki, Sugiyama & Ihara (ICPR2010)

Yamada, Sugiyama & Matsui (SP2010)

Sugiyama, Krauledat & Müller (JMLR2007)

Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)



Real-world Applications (cont.)

Example segmentation: こんな・失敗・は・ご・愛敬・だ・よ・。 (“A slip-up like this is charming, you know.”)

  • Japanese text segmentation (with IBM):

    • Domain adaptation from general conversation to medicine (CRF&CV)

  • Semiconductor wafer alignment (with Nikon):

    • Optical marker allocation (AL)

  • Robot control:

    • Dynamic motion update (KLS&CV)

    • Optimal exploration (AL)

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)

Sugiyama & Nakajima (MLJ2009)

Hachiya, Akiyama, Sugiyama & Peters (NN2009)

Hachiya, Peters & Sugiyama (ECML2009)

Akiyama, Hachiya, & Sugiyama (NN2010)



Organization of My Talk

  • Usage of Density Ratios:

    • Covariate shift adaptation

    • Outlier detection

    • Mutual information estimation

    • Conditional probability estimation

  • Methods of Density Ratio Estimation



Inlier-based Outlier Detection

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008)

Smola, Song & Teo (AISTATS2009)

  • Test samples whose tendency differs from that of the regular (inlier) samples are regarded as outliers.

[Figure: inliers and an outlier among the test samples]
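A hedged toy sketch of the scoring rule (my own construction; both densities are written in closed form purely for illustration, whereas the direct methods later in the talk avoid density estimation altogether): the ratio of the inlier density to the test density is small exactly at atypical points.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Univariate Gaussian density N(mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Inlier-based outlier detection: score each test point by the ratio of the
# inlier (training) density to the test density; small values flag points
# that are unlikely under the regular data.
def outlier_score(x, p_inlier, p_test):
    return p_inlier(x) / p_test(x)

# Toy setup: inliers ~ N(0,1); the test set is a mixture of 95% inliers
# and 5% outliers drawn from N(5,1).
p_in = lambda x: gauss_pdf(x, 0.0, 1.0)
p_te = lambda x: 0.95 * gauss_pdf(x, 0.0, 1.0) + 0.05 * gauss_pdf(x, 5.0, 1.0)

print(outlier_score(0.0, p_in, p_te))  # ≈ 1.05: typical inlier
print(outlier_score(5.0, p_in, p_te))  # ≪ 1: outlier
```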



Real-world Applications

  • Steel plant diagnosis (with JFE Steel)

  • Printer roller quality control (with Canon)

  • Loan customer inspection (with IBM)

  • Sleep therapy from biometric data

Takimoto, Matsugu & Sugiyama (DMSS2009)

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)

Kawahara & Sugiyama (SDM2009)



Organization of My Talk

  • Usage of Density Ratios:

    • Covariate shift adaptation

    • Outlier detection

    • Mutual information estimation

    • Conditional probability estimation

  • Methods of Density Ratio Estimation



Mutual Information Estimation

  • Mutual information (MI):

    MI(X, Y) = ∬ p(x, y) log [ p(x, y) / ( p(x) p(y) ) ] dx dy

  • MI as an independence measure: MI(X, Y) = 0 if and only if x and y are statistically independent.

  • MI can be approximated using the density ratio r(x, y) = p(x, y) / ( p(x) p(y) ).

Suzuki, Sugiyama & Tanaka (ISIT2009)
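A minimal numerical check of the identity MI = E_{p(x,y)}[log r(x, y)] (my own sketch; the true ratio is used in place of an estimated one, and the correlated-Gaussian example, with MI = −½ log(1 − ρ²), is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated bivariate standard Gaussian: MI = -0.5 * log(1 - rho^2)
rho = 0.8
n = 200000
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

# log density ratio log[p(x,y) / (p(x)p(y))] for this model, in closed form
log_r = (-0.5 * np.log(1 - rho ** 2)
         - (rho ** 2 * x ** 2 + rho ** 2 * y ** 2 - 2 * rho * x * y)
         / (2 * (1 - rho ** 2)))

mi_hat = np.mean(log_r)   # sample average of log r over joint samples
mi_true = -0.5 * np.log(1 - rho ** 2)
print(mi_hat, mi_true)
```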



Usage of MI Estimator

Suzuki, Sugiyama, Sese & Kanamori

(BMC Bioinformatics 2009)

  • Independence between input and output:

    • Feature selection

    • Sufficient dimension reduction

  • Independence between inputs:

    • Independent component analysis

  • Independence between input and residual:

    • Causal inference

Suzuki & Sugiyama (AISTATS2010)

Suzuki & Sugiyama

(ICA2009)

Yamada & Sugiyama (AAAI2010)

[Diagram: feature selection and sufficient dimension reduction use independence between input x and output y; independent component analysis uses independence between inputs x and x′; causal inference uses independence between input x and residual ε]



Organization of My Talk

  • Usage of Density Ratios:

    • Covariate shift adaptation

    • Outlier detection

    • Mutual information estimation

    • Conditional probability estimation

  • Methods of Density Ratio Estimation



Conditional Probability Estimation

  • When the output is continuous:

    • Conditional density estimation

    • Mobile robots’ transition estimation

  • When the output is categorical:

    • Probabilistic classification

Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010)

[Figure: a pattern is classified as Class 1 with probability 80% and Class 2 with probability 20%]

Sugiyama (Submitted)



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment matching Approach

    • Ratio matching Approach

    • Comparison

    • Dimensionality Reduction



Density Ratio Estimation

  • Density ratios are shown to be versatile.

  • In practice, however, the density ratio should be estimated from data.

  • Naïve approach: Estimate two densities separately and take their ratio



Vapnik’s Principle

“When solving a problem, don’t solve more difficult problems as an intermediate step.”

  • Estimating density ratio is substantially easier than estimating densities!

  • We estimate density ratio without going through density estimation.

[Diagram: knowing the densities implies knowing the ratio, but not the converse]



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

    • Comparison

    • Dimensionality Reduction



Probabilistic Classification

  • Separate numerator and denominator samples by a probabilistic classifier: assigning label y = +1 to numerator samples and y = −1 to denominator samples gives r(x) = ( p(y = +1 | x) / p(y = −1 | x) ) × ( n_de / n_nu ).

  • Logistic regression achieves the minimum asymptotic variance when the model is correct.

Qin (Biometrika1998)
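A hedged sketch of this approach (my own plain-numpy logistic regression by full-batch gradient descent, not the exact procedure of the cited work; the two Gaussian sampling densities are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Label numerator samples (from N(1,1)) as y=1 and denominator samples
# (from N(0,1)) as y=0, then train a logistic-regression classifier.
n = 20000
x_nu = rng.normal(1.0, 1.0, n)
x_de = rng.normal(0.0, 1.0, n)
X = np.stack([np.ones(2 * n), np.concatenate([x_nu, x_de])], axis=1)
y = np.concatenate([np.ones(n), np.zeros(n)])

# Full-batch gradient descent on the (averaged) logistic loss
theta = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    theta -= 0.5 * X.T @ (p - y) / len(y)

# With equal class sizes, r(x) = p(y=1|x)/p(y=0|x) = exp(theta0 + theta1*x).
# The true log-ratio here is x - 1/2, so theta should approach (-0.5, 1).
ratio = lambda x: np.exp(theta[0] + theta[1] * x)
print(theta, ratio(0.0))  # ratio(0) ≈ exp(-0.5) ≈ 0.61
```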



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

    • Comparison

    • Dimensionality Reduction



Moment Matching

  • Match moments of p_nu(x) and w(x) p_de(x):

    • Ex) Matching the mean: choose w so that ∫ x w(x) p_de(x) dx = ∫ x p_nu(x) dx.



Kernel Mean Matching

Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)

  • Matching a finite number of moments does not yield the true ratio even asymptotically.

  • Kernel mean matching: all the moments are efficiently matched using a Gaussian RKHS:

    • Ratio values at samples can be estimated via quadratic program.

    • Variants for learning entire ratio function and other loss functions are also available.

Kanamori, Suzuki & Sugiyama (arXiv2009)
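A hedged sketch of the idea (my own unconstrained, ridge-regularized simplification; the actual KMM of the slide solves a constrained quadratic program, and the Gaussian sampling densities and kernel width are assumptions for illustration): choose weights on denominator samples so the weighted kernel mean embedding matches the numerator's.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, width=1.0):
    # Gaussian kernel matrix between two 1-d sample vectors
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

# Numerator samples ~ N(1,1), denominator samples ~ N(0,1)
n_nu, n_de = 500, 500
x_nu = rng.normal(1.0, 1.0, n_nu)
x_de = rng.normal(0.0, 1.0, n_de)

# Minimize || (1/n_de) sum_i w_i k(x_i_de, .) - (1/n_nu) sum_j k(x_j_nu, .) ||^2
# in the RKHS, plus a ridge term for stability; the minimizer is linear-solvable.
K_dd = gauss_kernel(x_de, x_de)
K_dn = gauss_kernel(x_de, x_nu)
lam = 1e-3
A = K_dd / n_de ** 2 + lam * np.eye(n_de)
b = K_dn.sum(axis=1) / (n_de * n_nu)
w = np.linalg.solve(A, b)

# Weights should be larger where the numerator density exceeds the denominator's.
print(w[x_de > 0.5].mean(), w[x_de < -0.5].mean())
```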



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

      • Kullback-Leibler Divergence

      • Squared Distance

    • Comparison

    • Dimensionality Reduction



Ratio Matching

Sugiyama, Suzuki & Kanamori (RIMS2010)

  • Match a ratio model w(x) with the true ratio r(x) = p_nu(x)/p_de(x) under the Bregman divergence:

    BR_f(r ∥ w) = ∫ p_de(x) [ f(r(x)) − f(w(x)) − f′(w(x)) ( r(x) − w(x) ) ] dx

  • Its empirical approximation is given by

    BR̂_f(w) = (1/n_de) Σ_i [ f′(w(x_i^de)) w(x_i^de) − f(w(x_i^de)) ] − (1/n_nu) Σ_j f′(w(x_j^nu))

    (constant is ignored)



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

      • Kullback-Leibler Divergence

      • Squared Distance

    • Comparison

    • Dimensionality Reduction



Kullback-Leibler Divergence (KL)

Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007)

Nguyen, Wainwright & Jordan (NIPS2007)

  • Put f(t) = t log t − t. Then BR yields the Kullback-Leibler criterion:

    KL(w) = (1/n_de) Σ_i w(x_i^de) − (1/n_nu) Σ_j log w(x_j^nu)

  • For a linear model w(x) = Σ_l θ_l φ_l(x) (e.g., Gaussian kernels), minimization of KL is convex.
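A KLIEP-style sketch of this optimization (my own minimal version under assumed Gaussian sampling densities, kernel width, learning rate, and iteration count; here the normalization over denominator samples is enforced by projection and parameters are kept nonnegative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, c, width=1.0):
    return np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * width ** 2))

x_nu = rng.normal(1.0, 1.0, 500)   # numerator samples
x_de = rng.normal(0.0, 1.0, 500)   # denominator samples
centers = x_nu[:100]               # kernel centers at numerator samples

K_nu = gauss_kernel(x_nu, centers)
K_de = gauss_kernel(x_de, centers)

# Projected gradient ascent: maximize the mean log-ratio over numerator
# samples subject to the ratio averaging to one over denominator samples.
theta = np.ones(len(centers))
for _ in range(500):
    r_nu = K_nu @ theta
    grad = K_nu.T @ (1.0 / r_nu) / len(x_nu)   # gradient of mean log w(x_nu)
    theta += 0.1 * grad
    theta = np.maximum(theta, 0.0)             # keep the ratio model nonnegative
    theta /= np.mean(K_de @ theta)             # enforce the normalization constraint

ratio = lambda x: gauss_kernel(np.atleast_1d(x), centers) @ theta
print(float(ratio(1.0)), float(ratio(-1.0)))  # larger where p_nu > p_de
```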



Properties

Nguyen, Wainwright & Jordan (NIPS2007)

Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008)

  • Parametric case:

    • The learned parameter converges to the optimal value at order O_p(n^{−1/2}).

  • Non-parametric case:

    • The learned function converges to the optimal function at a rate slightly slower than O_p(n^{−1/2}) (depending on the covering number or the bracketing entropy of the function class).



Experiments

  • Setup: d-dimensional Gaussians with identity covariance:

    • Denominator: mean (0, 0, 0, …, 0)

    • Numerator: mean (1, 0, 0, …, 0)

  • Ratio of kernel density estimators (RKDE):

    • Estimate two densities separately and take ratio.

    • Gaussian width is chosen by CV.

  • KL method:

    • Estimate density ratio directly.

    • Gaussian width is chosen by CV.



Accuracy as a Functionof Input Dimensionality

  • RKDE: “curse of dimensionality”

  • KL: works well

[Figure: normalized MSE vs. input dimensionality; RKDE degrades with dimension while KL remains accurate]



Variations

  • EM algorithms for KL with the following models are also available:

    • Log-linear models

    • Gaussian mixture models

    • Probabilistic PCA mixture models

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)

Yamada & Sugiyama (IEICE2009)

Yamada, Sugiyama, Wichern & Jaak (Submitted)



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

      • Kullback-Leibler Divergence

      • Squared Distance

    • Comparison

    • Dimensionality Reduction



Squared Distance (SQ)

Kanamori, Hido & Sugiyama (JMLR2009)

  • Put f(t) = (t − 1)² / 2. Then BR yields the squared-distance criterion:

    SQ(w) = (1/(2 n_de)) Σ_i w(x_i^de)² − (1/n_nu) Σ_j w(x_j^nu) + (const.)

  • For a linear model w(x) = Σ_l θ_l φ_l(x), the minimizer of SQ is analytic:

    θ̂ = (Ĥ + λI)^{−1} ĥ, where Ĥ = (1/n_de) Σ_i φ(x_i^de) φ(x_i^de)^T and ĥ = (1/n_nu) Σ_j φ(x_j^nu)

  • Computationally efficient!
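A minimal sketch in the spirit of the cited JMLR2009 method (my own unconstrained version with ridge regularizer; the Gaussian sampling densities, kernel width, and λ are assumptions for illustration), showing that the whole fit is one linear solve:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, c, width=1.0):
    return np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * width ** 2))

x_nu = rng.normal(1.0, 1.0, 1000)  # numerator samples
x_de = rng.normal(0.0, 1.0, 1000)  # denominator samples
centers = x_nu[:100]               # kernel basis functions phi_l

Phi_de = gauss_kernel(x_de, centers)
Phi_nu = gauss_kernel(x_nu, centers)

# Analytic SQ minimizer: theta = (H + lam*I)^{-1} h, where H is the mean of
# phi(x)phi(x)^T over denominator samples and h is the mean of phi(x) over
# numerator samples.
H = Phi_de.T @ Phi_de / len(x_de)
h = Phi_nu.mean(axis=0)
lam = 0.01
theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)

ratio = lambda x: gauss_kernel(np.atleast_1d(x), centers) @ theta
print(float(ratio(1.0)), float(ratio(-1.0)))  # true values: e^{0.5}, e^{-1.5}
```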



Properties

Kanamori, Hido & Sugiyama (JMLR2009)

Kanamori, Suzuki & Sugiyama (arXiv2009)

  • Optimal parametric/non-parametric rates of convergence are achieved.

  • Leave-one-out cross-validation score can be computed analytically.

  • SQ has the smallest condition number.

  • The analytic solution allows computation of the derivative of the mutual information estimator:

    • Sufficient dimension reduction

    • Independent component analysis

    • Causal inference



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

    • Comparison

    • Dimensionality Reduction



Density Ratio Estimation Methods

  • SQ would be preferable in practice.



Organization of My Talk

  • Usage of Density Ratios

  • Methods of Density Ratio Estimation:

    • Probabilistic Classification Approach

    • Moment Matching Approach

    • Ratio Matching Approach

    • Comparison

    • Dimensionality Reduction



Direct Density Ratio Estimationwith Dimensionality Reduction (D3)

  • Direct density ratio estimation without going through density estimation is promising.

  • However, for high-dimensional data, density ratio estimation is still challenging.

  • We combine direct density ratio estimation with dimensionality reduction!



Hetero-distributional Subspace (HS)

  • Key assumption: p_nu(x) and p_de(x) differ only within a subspace (called the HS).

  • This allows us to estimate the density ratio only within the low-dimensional HS!

[Figure: the HS inside the full input space; the decomposition is given by a full-rank, orthogonal transformation]



HS Search Based onSupervised Dimension Reduction

Sugiyama, Kawanabe & Chui (NN2010)

  • In the HS, p_nu and p_de are different, so numerator and denominator samples would be separable there.

  • We may use any supervised dimension reduction method for HS search.

  • Local Fisher discriminant analysis (LFDA):

    • Can handle within-class multimodality

    • No limitation on reduced dimensions

    • Analytic solution available

Sugiyama (JMLR2007)



Pros and Cons of SupervisedDimension Reduction Approach

  • Pros:

    • SQ+LFDA gives analytic solution for D3.

  • Cons:

    • Rather restrictive implicit assumption that the component inside the HS and the component orthogonal to it are independent.

    • Separability does not always lead to the HS (e.g., two overlapping but different densities).

[Figure: separable directions need not coincide with the HS]



Characterization of HS

Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010)

  • The HS is given as the maximizer of the Pearson divergence PE between p_nu and p_de within the subspace.

  • Thus, we can identify the HS by finding the maximizer of PE with respect to the subspace.

  • PE can be analytically approximated by SQ!



HS Search

Amari (NeuralComp1998)

  • Gradient descent:

  • Natural gradient descent:

    • Utilize the Stiefel-manifold structure for efficiency

  • Givens rotation:

    • Choose a pair of axes and rotate in the subspace

  • Subspace rotation:

    • Ignore rotation within subspaces for efficiency

Plumbley (NeuroComp2005)

[Figure: rotation across the hetero-distributional subspace changes it; rotation within the subspace can be ignored]



Experiments

[Figure: true ratio (2d), samples (2d), plain SQ estimate (2d), and D3-SQ estimate (2d); as dimensionality is increased by adding noisy dimensions, plain SQ degrades while D3-SQ remains accurate]



Conclusions

  • Many ML tasks can be solved via density ratio estimation:

    • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test

  • Directly estimating density ratios without going through density estimation is the key.

    • LR, KMM, KL, SQ, and D3.



Books

  • Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009.

  • Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!)

  • Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)



The World of Density Ratios

Real-world applications: Brain-computer interface, Robot control, Image understanding, Speech recognition, Natural language processing, Bioinformatics

Machine learning algorithms: Covariate shift adaptation, Outlier detection, Conditional probability estimation, Mutual information estimation

Density ratio estimation: Fundamental algorithms (LR, KMM, KL, SQ); Large-scale, High-dimensionality, Stabilization, Robustification

Theoretical analysis: Convergence, Information criteria, Numerical stability

