Loading in 5 sec....

Density Ratio Estimation: A New Versatile Tool for Machine LearningPowerPoint Presentation

Density Ratio Estimation: A New Versatile Tool for Machine Learning

Download Presentation

Density Ratio Estimation: A New Versatile Tool for Machine Learning

Loading in 2 Seconds...

- 152 Views
- Uploaded on
- Presentation posted in: General

Density Ratio Estimation: A New Versatile Tool for Machine Learning

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

CMU

Density Ratio Estimation:A New Versatile Toolfor Machine Learning

Department of Computer Science

Tokyo Institute of Technology

Masashi Sugiyama

sugi@cs.titech.ac.jp

http://sugiyama-www.cs.titech.ac.jp/~sugi

- Consider the ratio of two probability densities.
- If the ratio is estimated, various machine learning problems can be solved!
- Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test

Vapnik said: When solving a problem of interest,

one should not solve a more general problem

as an intermediate step

- Estimating density ratio is substantially easier than estimating densities!
- Various direct density ratio estimation methods have been developed recently.

Knowing densities

Knowing ratio

- Usage of Density Ratios:
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation

- Methods of Density Ratio Estimation

Shimodaira (JSPI2000)

- Training/test input distributions are different, but target function remains unchanged:
“extrapolation”

Input density

Function

Training

samples

Test

samples

Target

function

Ordinary least-squares is not consistent.

- Density-ratio weighted least-squares is consistent.

Applicable to any likelihood-based methods!

- Controlling bias-variance trade-off is important in covariate shift adaptation.
- No weighting: low-variance, high-bias
- Density ratio weighting: low-bias, high-variance

- Density ratio weighting plays an important role for unbiased model selection:
- Akaike information criterion (regular models)
- Subspace information criterion (linear models)
- Cross-validation (arbitrary models)

Shimodaira (JSPI2000)

Sugiyama & Müller (Stat&Dec.2005)

Sugiyama, Krauledat & Müller (JMLR2007)

- Covariate shift naturally occurs in active learning scenarios:
- is designed by users.
- is determined by environment.

- Density ratio weighting plays an important role for low-bias active learning.
- Population-based methods:
- Pool-based methods:

Wiens (JSPI2000)

Kanamori & Shimodaira (JSPI2003)

Sugiyama, (JMLR2006)

Kanamori, (NeuroComp2007),

Sugiyama & Nakajima (MLJ2009)

- Age prediction from faces (with NECsoft):
- Illumination and angle change (KLS&CV)

- Speaker identification (with Yamaha):
- Voice quality change (KLR&CV)

- Brain-computer interface:
- Mental condition change (LDA&CV)

Ueki, Sugiyama & Ihara (ICPR2010)

Yamada, Sugiyama & Matsui (SP2010)

Sugiyama, Krauledat & Müller (JMLR2007)

Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)

こんな・失敗・は・ご・愛敬・だ・よ・．

- Japanese text segmentation (with IBM):
- Domain adaptation from general conversation to medicine (CRF&CV)

- Semiconductor wafer alignment (with Nikon):
- Optical marker allocation (AL)

- Robot control:
- Dynamic motion update (KLS&CV)
- Optimal exploration (AL)

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)

Sugiyama & Nakajima (MLJ2009)

Hachiya, Akiyama, Sugiyama & Peters (NN2009)

Hachiya, Peters & Sugiyama (ECML2009)

Akiyama, Hachiya, & Sugiyama (NN2010)

- Usage of Density Ratios:
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation

- Methods of Density Ratio Estimation

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008)

Smola, Song & Teo (AISTATS2009)

- Test samples having different tendency from regular ones are regarded as outliers.

Outlier

- Steel plant diagnosis (with JFE Steel)
- Printer roller quality control (with Canon)
- Loan customer inspection (with IBM)
- Sleep therapy from biometric data

Takimoto, Matsugu & Sugiyama (DMSS2009)

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)

Kawahara & Sugiyama (SDM2009)

- Usage of Density Ratios:
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation

- Methods of Density Ratio Estimation

- Mutual information (MI):
- MI as an independence measure:
- MI can be approximated using density ratio:

andare

statistically

independent

Suzuki, Sugiyama & Tanaka (ISIT2009)

Suzuki, Sugiyama, Sese & Kanamori

(BMC Bioinformatics 2009)

- Independence between input and output:
- Feature selection
- Sufficient dimension reduction

- Independence between inputs:
- Independent component analysis

- Independence between input and residual:
- Causal inference

Suzuki & Sugiyama (AISTATS2010)

Suzuki & Sugiyama

(ICA2009)

Yamada & Sugiyama (AAAI2010)

y

Causal inference

ε

Feature selection,

Sufficient dimension reduction

x

x’

Independent component analysis

- Usage of Density Ratios:
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation

- Methods of Density Ratio Estimation

- When is continuous:
- Conditional density estimation
- Mobile robots’ transition estimation

- When is categorical:
- Probabilistic classification

Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010)

80%

Class 1

Pattern

Class 2

20%

Sugiyama (Submitted)

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment matching Approach
- Ratio matching Approach
- Comparison
- Dimensionality Reduction

- Density ratios are shown to be versatile.
- In practice, however, the density ratio should be estimated from data.
- Naïve approach: Estimate two densities separately and take their ratio

When solving a problem, don’t solve

more difficult problems as an intermediate step

- Estimating density ratio is substantially easier than estimating densities!
- We estimate density ratio without going through density estimation.

Knowing densities

Knowing ratio

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction

- Separate numerator and denominator samples by a probabilistic classifier:
- Logistic regression achieves the minimum asymptotic variance when model is correct.

Qin (Biometrika1998)

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction

- Match moments of and :
- Ex) Matching the mean:

Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)

- Matching a finite number of moments does not yield the true ratio even asymptotically.
- Kernel mean matching: All the moments are efficiently matched using Gaussian RKHS:
- Ratio values at samples can be estimated via quadratic program.
- Variants for learning entire ratio function and other loss functions are also available.

Kanamori, Suzuki & Sugiyama (arXiv2009)

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance

- Comparison
- Dimensionality Reduction

Sugiyama, Suzuki & Kanamori (RIMS2010)

- Match a ratio model with true ratio under the Bregman divergence:
- Its empirical approximation is given by

(Constant is ignored)

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance

- Comparison
- Dimensionality Reduction

Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007)

Nguyen, Wainwright & Jordan (NIPS2007)

- Put . Then BR yields
- For a linear model ,
minimization of KL is convex.

(ex. Gauss kernel)

Nguyen, Wainwright & Jordan (NIPS2007)

Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008)

- Parametric case:
- Learned parameter converge to the optimal value with order .

- Non-parametric case:
- Learned function converges to the optimal function with order slightly slower than (depending on the covering number or the bracketing entropy of the function class).

- Setup: d-dim. Gaussian with covariance identity and
- Denominator:mean (0,0,0,…,0)
- Numerator:mean (1,0,0,…,0)

- Ratio of kernel density estimators (RKDE):
- Estimate two densities separately and take ratio.
- Gaussian width is chosen by CV.

- KL method:
- Estimate density ratio directly.
- Gaussian width is chosen by CV.

- RKDE: “curse of dimensionality”
- KL: works well

RKDE

Normalized MSE

KL

Dim.

- EM algorithms for KL with
- Log-linear models
- Gaussian mixture models
- Probabilistic PCA mixture models
are also available.

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)

Yamada & Sugiyama (IEICE2009)

Yamada, Sugiyama, Wichern & Jaak (Submitted)

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance

- Comparison
- Dimensionality Reduction

Kanamori, Hido & Sugiyama (JMLR2009)

- Put . Then BR yields
- For a linear model ,
minimizer of SQ is analytic:

- Computationally efficient!

Kanamori, Hido & Sugiyama (JMLR2009)

Kanamori, Suzuki & Sugiyama (arXiv2009)

- Optimal parametric/non-parametric rates of convergence are achieved.
- Leave-one-out cross-validation score can be computed analytically.
- SQ has the smallest condition number.
- Analytic solution allows computation of the derivative of mutual information estimator:
- Sufficient dimension reduction
- Independent component analysis
- Causal inference

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction

- SQ would be preferable in practice.

- Usage of Density Ratios
- Methods of Density Ratio Estimation:
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction

- Directly density ratio estimation without going through density estimation is promising.
- However, for high-dimensional data, density ratio estimation is still challenging.
- We combine direct density ratio estimation with dimensionality reduction!

- Key assumption: and are different only in a subspace (called HS).
- This allows us to estimate the density ratio only within the low-dimensional HS!

HS

: Full-rank and orthogonal

Sugiyama, Kawanabe & Chui (NN2010)

- In HS, and are different, so and would be separable.
- We may use any supervised dimension reduction method for HS search.
- Local Fisher discriminant analysis (LFDA):
- Can handle within-class multimodality
- No limitation on reduced dimensions
- Analytic solution available

Sugiyama (JMLR2007)

- Pros:
- SQ+LFDA gives analytic solution for D3.

- Cons:
- Rather restrictive implicit assumption that and are independent.
- Separability does not always lead to HS (e.g., two overlapped, but different densities)

Separable

HS

Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010)

- HS is given as the maximizer of the Pearson divergence :
- Thus, we can identify HS by finding maximizer of PE with respect to .
- PE can be analytically approximated by SQ!

Amari (NeuralComp1998)

- Gradient descent:
- Natural gradient descent:
- Utilize the Stiefel-manifold structure for efficiency

- Givens rotation:
- Choose a pair of axes and rotate in the subspace

- Subspace rotation:
- Ignore rotation within subspaces for efficiency

Plumbley (NeuroComp2005)

Rotation across

the subspace

Rotation within

the subspace

Hetero-distributional

subspace

True ratio (2d)

Samples (2d)

Plain SQ (2d)

Increasing dimensionality

(by adding noisy dims)

Plain SQ

D3-SQ (2d)

D3-SQ

- Many ML tasks can be solved via density ratio estimation:
- Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test

- Directly estimating density ratios without going through density estimation is the key.
- LR, KMM, KL, SQ, and D3 .

- Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009.
- Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!)
- Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)

Real-world applications:

Brain-computer interface, Robot control, Image understanding,

Speech recognition, Natural language processing, Bioinformatics

Machine learning algorithms:

Covariate shift adaptation, Outlier detection,

Conditional probability estimation, Mutual information estimation

Density ratio estimation:

Fundamental algorithms (LR, KMM, KL, SQ)

Large-scale, High-dimensionality, Stabilization, Robustification

Theoretical analysis:

Convergence, Information criteria, Numerical stability