
# Density Ratio Estimation: A New Versatile Tool for Machine Learning





Presentation Transcript

### Density Ratio Estimation: A New Versatile Tool for Machine Learning

Department of Computer Science

Tokyo Institute of Technology

Masashi Sugiyama

[email protected]

http://sugiyama-www.cs.titech.ac.jp/~sugi

• Consider the ratio of two probability densities.

• If the ratio is estimated, various machine learning problems can be solved!

• Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test

Vapnik said: “When solving a problem of interest, one should not solve a more general problem as an intermediate step.”

• Estimating density ratio is substantially easier than estimating densities!

• Various direct density ratio estimation methods have been developed recently.

[Figure: knowing the two densities implies knowing their ratio, but knowing the ratio does not require knowing the densities]

• Usage of Density Ratios:

• Outlier detection

• Mutual information estimation

• Conditional probability estimation

• Methods of Density Ratio Estimation

Shimodaira (JSPI2000)

• Training/test input distributions are different, but target function remains unchanged:

“extrapolation”

Input density

Function

Training

samples

Test

samples

Target

function

• Ordinary least-squares is not consistent.

• Density-ratio weighted least-squares is consistent.

Applicable to any likelihood-based methods!
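The weighting scheme above can be sketched numerically. All choices below (densities, means, the sine target, the linear model) are illustrative and not from the slides; the true importance weights are used because the synthetic densities are known, whereas in practice the ratio would itself be estimated from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative covariate-shift setup: training inputs from N(0.5, 0.5^2),
# while test inputs would follow N(0, 0.3^2); the regression target is a sine.
n = 2000
x_tr = rng.normal(0.5, 0.5, n)
y_tr = np.sin(np.pi * x_tr) + rng.normal(0.0, 0.1, n)

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# True importance weights w(x) = p_test(x) / p_train(x), known here because
# the densities are synthetic; in practice the ratio is estimated directly.
w = gauss_pdf(x_tr, 0.0, 0.3) / gauss_pdf(x_tr, 0.5, 0.5)

# Linear model fitted by ordinary vs. density-ratio weighted least squares
X = np.stack([np.ones(n), x_tr], axis=1)
theta_ols = np.linalg.lstsq(X, y_tr, rcond=None)[0]
theta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
```

Reweighting shifts the effective input distribution toward the test region: the weighted sample mean of the training inputs moves from the training mean (0.5) toward the test mean (0).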

• No weighting: low-variance, high-bias

• Density ratio weighting: low-bias, high-variance

• Density ratio weighting plays an important role for unbiased model selection:

• Akaike information criterion (regular models)

• Subspace information criterion (linear models)

• Cross-validation (arbitrary models)

Shimodaira (JSPI2000)

Sugiyama & Müller (Stat&Dec.2005)

Sugiyama, Krauledat & Müller (JMLR2007)

• Covariate shift naturally occurs in active learning scenarios:

• The training input distribution is designed by users.

• The test input distribution is determined by the environment.

• Density ratio weighting plays an important role for low-bias active learning.

• Population-based methods:

• Pool-based methods:

Wiens (JSPI2000)

Kanamori & Shimodaira (JSPI2003)

Sugiyama (JMLR2006)

Kanamori (NeuroComp2007)

Sugiyama & Nakajima (MLJ2009)

• Age prediction from faces (with NECsoft):

• Illumination and angle change (KLS&CV)

• Speaker identification (with Yamaha):

• Voice quality change (KLR&CV)

• Brain-computer interface:

• Mental condition change (LDA&CV)

Ueki, Sugiyama & Ihara (ICPR2010)

Sugiyama, Krauledat & Müller (JMLR2007)

Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)

Example segmentation: こんな・失敗・は・ご・愛敬・だ・よ・． (“A mistake like this is forgivable,” split into words)

• Japanese text segmentation (with IBM):

• Domain adaptation from general conversation to medicine (CRF&CV)

• Semiconductor wafer alignment (with Nikon):

• Optical marker allocation (AL)

• Robot control:

• Dynamic motion update (KLS&CV)

• Optimal exploration (AL)

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)

Sugiyama & Nakajima (MLJ2009)

Hachiya, Akiyama, Sugiyama & Peters (NN2009)

Hachiya, Peters & Sugiyama (ECML2009)

Akiyama, Hachiya & Sugiyama (NN2010)

• Usage of Density Ratios:

• Outlier detection

• Mutual information estimation

• Conditional probability estimation

• Methods of Density Ratio Estimation

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008)

Smola, Song & Teo (AISTATS2009)

• Test samples having different tendency from regular ones are regarded as outliers.

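The ratio-based outlier score can be sketched on toy data. This is my own illustration, not the slides' algorithm: both densities are known in closed form here only because the data are synthetic, whereas in practice the ratio would be estimated directly; the threshold 0.5 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Evaluation set: mostly regular N(0,1) samples plus a few anomalies near x = 5
x_eval = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 5.0)])

# Score = ratio of the regular density to the evaluation-set density
# (a 95/5 mixture).  Outliers lie where the regular density is relatively
# small, so their ratio score is close to zero.
p_regular = gauss_pdf(x_eval, 0.0, 1.0)
p_eval = 0.95 * gauss_pdf(x_eval, 0.0, 1.0) + 0.05 * gauss_pdf(x_eval, 5.0, 0.1)
score = p_regular / p_eval

is_outlier = score < 0.5   # regular points score around 1/0.95, anomalies near 0
```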

• Steel plant diagnosis (with JFE Steel)

• Printer roller quality control (with Canon)

• Loan customer inspection (with IBM)

• Sleep therapy from biometric data

Takimoto, Matsugu & Sugiyama (DMSS2009)

Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)

Kawahara & Sugiyama (SDM2009)

• Usage of Density Ratios:

• Outlier detection

• Mutual information estimation

• Conditional probability estimation

• Methods of Density Ratio Estimation

• Mutual information (MI): the expectation of the log density ratio, MI = E_{p(x,y)}[ log p(x, y) / (p(x) p(y)) ].

• MI as an independence measure: MI = 0 if and only if x and y are statistically independent.

• MI can be approximated using the density ratio r(x, y) = p(x, y) / (p(x) p(y)).
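As a numerical check (my own illustration, not from the slides): for a bivariate Gaussian with correlation rho, the density ratio is available in closed form and MI = -0.5 log(1 - rho^2), so averaging the log ratio over samples recovers MI.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5
n = 200_000

# Sample correlated standard-normal pairs (x, y)
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# log r(x, y) = log p(x, y) - log p(x) - log p(y) for this Gaussian
log_r = (-(x ** 2 - 2 * rho * x * y + y ** 2) / (2 * (1 - rho ** 2))
         - 0.5 * np.log(1 - rho ** 2)
         + 0.5 * (x ** 2 + y ** 2))

mi_hat = log_r.mean()                    # sample average of the log ratio
mi_true = -0.5 * np.log(1 - rho ** 2)    # closed-form MI of the Gaussian
```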

Suzuki, Sugiyama & Tanaka (ISIT2009)

Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009)

• Independence between input and output:

• Feature selection

• Sufficient dimension reduction

• Independence between inputs:

• Independent component analysis

• Independence between input and residual:

• Causal inference

Suzuki & Sugiyama (AISTATS2010)

Suzuki & Sugiyama (ICA2009)


• Usage of Density Ratios:

• Outlier detection

• Mutual information estimation

• Conditional probability estimation

• Methods of Density Ratio Estimation

• When the output is continuous:

• Conditional density estimation

• Mobile robots’ transition estimation

• When the output is categorical:

• Probabilistic classification

Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010)

[Figure: probabilistic classification — a pattern assigned to class 1 with probability 80% and to class 2 with probability 20%]

Sugiyama (Submitted)

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment matching Approach

• Ratio matching Approach

• Comparison

• Dimensionality Reduction

• Density ratios are shown to be versatile.

• In practice, however, the density ratio should be estimated from data.

• Naïve approach: Estimate two densities separately and take their ratio

When solving a problem, don’t solve more difficult problems as an intermediate step.

• Estimating density ratio is substantially easier than estimating densities!

• We estimate density ratio without going through density estimation.

[Figure: knowing the two densities implies knowing their ratio, but knowing the ratio does not require knowing the densities]

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Comparison

• Dimensionality Reduction

• Separate numerator and denominator samples by a probabilistic classifier:

• Logistic regression achieves the minimum asymptotic variance when the model is correctly specified.

Qin (Biometrika1998)
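The classification approach can be sketched as follows. The sampling setup, feature, and step size are illustrative assumptions: with numerator samples from N(1,1) and denominator samples from N(0,1), the true ratio is exp(x - 0.5), so a logistic model with a linear feature happens to be correctly specified.

```python
import numpy as np

rng = np.random.default_rng(0)
x_nu = rng.normal(1.0, 1.0, 500)   # numerator samples
x_de = rng.normal(0.0, 1.0, 500)   # denominator samples

# Label numerator samples 1 and denominator samples 0, then fit logistic
# regression by plain batch gradient descent on the logistic loss.
X = np.concatenate([x_nu, x_de])
y = np.concatenate([np.ones(len(x_nu)), np.zeros(len(x_de))])
Phi = np.stack([np.ones_like(X), X], axis=1)   # bias + linear feature

w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.1 * Phi.T @ (p - y) / len(y)

def ratio(x):
    # r(x) = (n_de / n_nu) * p(y=1|x) / p(y=0|x); the posterior odds of the
    # fitted logistic model are exp(w0 + w1 * x)
    return (len(x_de) / len(x_nu)) * np.exp(w[0] + w[1] * x)
```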

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Comparison

• Dimensionality Reduction

• Match moments of the numerator and denominator densities:

• Ex) Matching the mean:

Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)

• Matching a finite number of moments does not yield the true ratio even asymptotically.

• Kernel mean matching: all the moments are efficiently matched using a Gaussian RKHS:

• Ratio values at samples can be estimated via a quadratic program.

• Variants for learning entire ratio function and other loss functions are also available.

Kanamori, Suzuki & Sugiyama (arXiv2009)

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Kullback-Leibler Divergence

• Squared Distance

• Comparison

• Dimensionality Reduction

Sugiyama, Suzuki & Kanamori (RIMS2010)

• Match a ratio model r(x) with the true ratio under the Bregman divergence BR_f:

• Its empirical approximation is given by
(1/n_de) Σ_j [ f′(r(x_j^de)) r(x_j^de) − f(r(x_j^de)) ] − (1/n_nu) Σ_i f′(r(x_i^nu))
(a constant term is ignored).

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Kullback-Leibler Divergence

• Squared Distance

• Comparison

• Dimensionality Reduction

Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007)

Nguyen, Wainwright & Jordan (NIPS2007)

• Put f(t) = t log t − t. Then BR yields the Kullback–Leibler (KL) criterion: minimize (1/n_de) Σ_j r(x_j^de) − (1/n_nu) Σ_i log r(x_i^nu).

• For a linear-in-parameters model r(x) = Σ_b α_b φ_b(x) (e.g., Gaussian kernels), minimization of KL is convex.
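A gradient-based sketch of the KL approach: maximize the numerator log-model term subject to the normalization constraint that the model averages to one over denominator samples. Basis centers, kernel width, step size, and iteration count below are all illustrative choices, not the published settings.

```python
import numpy as np

rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 300)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 300)   # numerator samples

# Gaussian basis functions centered at a subset of numerator samples
centers = x_nu[:50]

def design(x, sigma=0.5):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_nu = design(x_nu)           # n_nu x b
Phi_de = design(x_de)           # n_de x b
b_vec = Phi_de.mean(axis=0)     # alpha @ b_vec == 1 enforces E_de[r] = 1

alpha = np.ones(len(centers))
alpha /= alpha @ b_vec
for _ in range(500):
    # gradient of (1/n_nu) sum_i log(phi(x_i^nu) @ alpha)
    grad = (Phi_nu / (Phi_nu @ alpha)[:, None]).mean(axis=0)
    alpha += 1e-3 * grad             # ascent step
    alpha = np.maximum(alpha, 0.0)   # non-negativity
    alpha /= alpha @ b_vec           # project back onto the normalization constraint

def ratio(x):
    return design(x) @ alpha
```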

Nguyen, Wainwright & Jordan (NIPS2007)

Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008)

• Parametric case:

• The learned parameter converges to the optimal value at the parametric rate O_p(n^{-1/2}).

• Non-parametric case:

• The learned function converges to the optimal function at a rate slightly slower than O_p(n^{-1/2}) (depending on the covering number or the bracketing entropy of the function class).

• Setup: d-dimensional Gaussians with identity covariance:

• Denominator: mean (0, 0, …, 0)

• Numerator: mean (1, 0, …, 0)

• Ratio of kernel density estimators (RKDE):

• Estimate the two densities separately and take their ratio.

• Gaussian width is chosen by CV.

• KL method:

• Estimate density ratio directly.

• Gaussian width is chosen by CV.

Accuracy as a Function of Input Dimensionality

• RKDE: suffers from the “curse of dimensionality”

• KL: works well

[Figure: normalized MSE vs. input dimensionality for RKDE and KL]

• EM algorithms for KL with log-linear models, Gaussian mixture models, and probabilistic PCA mixture models are also available.

Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)

Yamada, Sugiyama, Wichern & Jaak (Submitted)

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Kullback-Leibler Divergence

• Squared Distance

• Comparison

• Dimensionality Reduction

Kanamori, Hido & Sugiyama (JMLR2009)

• Put f(t) = (t − 1)² / 2. Then BR yields the squared-distance (SQ) criterion: minimize (1/(2 n_de)) Σ_j r(x_j^de)² − (1/n_nu) Σ_i r(x_i^nu).

• For a linear-in-parameters model r(x) = Σ_b α_b φ_b(x), the minimizer of SQ is analytic: α̂ = (Ĥ + λI)⁻¹ ĥ, where Ĥ is the second moment of the basis under the denominator samples and ĥ is its mean under the numerator samples.

• Computationally efficient!
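The analytic SQ solution is short enough to sketch in full. Basis centers, kernel width, and the regularizer below are illustrative choices; the ratio N(1,1)/N(0,1) being estimated equals exp(x - 0.5).

```python
import numpy as np

rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 500)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 500)   # numerator samples

# Gaussian basis functions centered at a subset of numerator samples
centers = x_nu[:100]

def design(x, sigma=1.0):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_de = design(x_de)
Phi_nu = design(x_nu)

H = Phi_de.T @ Phi_de / len(x_de)   # basis second moment under the denominator
h = Phi_nu.mean(axis=0)             # basis mean under the numerator
lam = 0.1                           # ridge regularizer
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)   # analytic minimizer

def ratio(x):
    return np.maximum(design(x) @ alpha, 0.0)   # clip negative ratio values
```

No iterative optimization is needed: one linear solve gives the whole ratio function, which is what makes leave-one-out cross-validation analytic as well.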

Kanamori, Hido & Sugiyama (JMLR2009)

Kanamori, Suzuki & Sugiyama (arXiv2009)

• Optimal parametric/non-parametric rates of convergence are achieved.

• Leave-one-out cross-validation score can be computed analytically.

• SQ has the smallest condition number.

• Analytic solution allows computation of the derivative of the mutual information estimator:

• Sufficient dimension reduction

• Independent component analysis

• Causal inference

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Comparison

• Dimensionality Reduction

• SQ would be preferable in practice.

• Usage of Density Ratios

• Methods of Density Ratio Estimation:

• Probabilistic Classification Approach

• Moment Matching Approach

• Ratio Matching Approach

• Comparison

• Dimensionality Reduction

Direct Density Ratio Estimation with Dimensionality Reduction (D3)

• Direct density ratio estimation without going through density estimation is promising.

• However, for high-dimensional data, density ratio estimation is still challenging.

• We combine direct density ratio estimation with dimensionality reduction!

• Key assumption: the numerator and denominator densities differ only within a low-dimensional subspace, called the hetero-distributional subspace (HS).

• This allows us to estimate the density ratio only within the low-dimensional HS!

[Figure: the HS and its complement, obtained by a full-rank orthogonal transformation]

HS Search Based onSupervised Dimension Reduction

Sugiyama, Kawanabe & Chui (NN2010)

• Within the HS, the two densities differ, so numerator and denominator samples would be separable there.

• We may use any supervised dimension reduction method for HS search.

• Local Fisher discriminant analysis (LFDA):

• Can handle within-class multimodality

• No limitation on reduced dimensions

• Analytic solution available

Sugiyama (JMLR2007)

Pros and Cons of Supervised Dimension Reduction Approach

• Pros:

• SQ+LFDA gives an analytic solution for D3.

• Cons:

• Rather restrictive implicit assumption that the components within and orthogonal to the HS are independent.

• Separability does not always lead to the HS (e.g., two overlapping but different densities).


Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010)

• The HS is given as the maximizer of the Pearson divergence (PE) between the two distributions:

• Thus, we can identify the HS by maximizing PE with respect to the subspace.

• PE can be analytically approximated by SQ!

Amari (NeuralComp1998)

• Utilize the Stiefel-manifold structure for efficiency

• Givens rotation:

• Choose a pair of axes and rotate in the subspace

• Subspace rotation:

• Ignore rotation within subspaces for efficiency

Plumbley (NeuroComp2005)

[Figure: Givens rotations across vs. within the hetero-distributional subspace]

[Figure: true ratio, samples, and estimates in 2d — plain SQ degrades as dimensionality increases, while D3-SQ remains accurate]

• Many ML tasks can be solved via density ratio estimation:

• Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test

• Directly estimating density ratios without going through density estimation is the key.

• Methods: LR, KMM, KL, SQ, and D3.

• Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009.

• Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!)

• Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)

Real-world applications: brain-computer interface, robot control, image understanding, speech recognition, natural language processing, bioinformatics.

Machine learning algorithms: