Minimum information inference
Minimum Information Inference. Naftali Tishby Amir Globerson ICNC, CSE The Hebrew University TAU, Jan. 2, 2005. Talk outline. Classification with probabilistic models: Generative vs. Discriminative The Minimum Information Principle Generalization error bounds

Presentation Transcript


Minimum Information Inference

Naftali Tishby

Amir Globerson

ICNC, CSE

The Hebrew University

TAU, Jan. 2, 2005


Talk outline

  • Classification with probabilistic models: Generative vs. Discriminative

  • The Minimum Information Principle

    • Generalization error bounds

    • Game theoretic motivation

    • Joint typicality

  • The MinMI algorithms

  • Empirical evaluations

  • Related extensions: SDR and IB


The Classification Problem

  • Learn how to classify (complex) observations X into (simple) classes Y

    • Given labeled examples (xi,yi)

    • Use them to construct a classifier y=g(x)

  • What is a good classifier?

    • Denote by p*(x,y) the true underlying law

    • Want to minimize the generalization error


Problem …

  • Truth: p*(x,y)

  • Observed: (xi,yi), i=1…n

  • Learned: y=g(x)

  • Generalization: can’t be computed directly


Choosing a classifier

  • Need to limit the search to some set of rules. If every rule is possible we will surely over-fit. Use a family gθ(x), where θ is a parameter.

  • Would be nice if the true rule is in gθ(x)

  • How do we choose θ in gθ(x)?


Common approach: Empirical Risk Minimization

  • A reasonable strategy. Find the classifier which minimizes the empirical (sample) error:

  • Does not necessarily provide the best generalization, although theoretical bounds exist.

  • Computationally hard to minimize directly. Many works minimize upper bounds on the error.

  • Here we focus on a different strategy.
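The formula on the empirical risk slide was lost in extraction. As a hedged reconstruction, assuming the 0-1 loss and writing \mathcal{G} for the chosen family of rules (both the loss and the notation are assumptions, not taken from the slides), the generalization error and its empirical counterpart can be written as:

```latex
\begin{align}
\epsilon(g) &= \Pr_{(x,y)\sim p^*}\left[\, g(x) \neq y \,\right], \\
\hat{\epsilon}_n(g) &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[\, g(x_i) \neq y_i \,\right], \qquad
\hat{g} = \arg\min_{g \in \mathcal{G}} \hat{\epsilon}_n(g).
\end{align}
```

The first quantity depends on the unknown p*(x,y); the second depends only on the sample, which is why it can be minimized in practice.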


Probabilistic models for classification

  • Had we known p*(x,y), the optimal predictor would be g*(x) = argmax_y p*(y|x)

  • But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.


Generative Models

  • Assume p(x|y) has some parametric form, e.g. a Gaussian.

  • Each y has a different set of parameters θy

  • How do we estimate θy, p(y)? Maximum Likelihood!


Generative Models - Estimation

  • Easy to see that p(y) should be set to the empirical frequency of the classes

  • The parameters θy are obtained by collecting all x values for the class y, and computing a maximum likelihood estimate.


Example: Gaussians

  • Assume the class conditional distribution is Gaussian

  • Then μy, σy² are the empirical mean and variance of the samples in class y.

[Figure: class-conditional Gaussians for y=1 and y=2]
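The Gaussian example can be sketched in a few lines. This is a minimal illustration for 1-D data, not code from the talk; the function names `fit_gaussian_classes` and `predict` and the toy sample are hypothetical.

```python
import numpy as np

def fit_gaussian_classes(x, y):
    """Return {label: (prior, mean, variance)} from 1-D samples x with labels y."""
    params = {}
    for label in np.unique(y):
        xs = x[y == label]
        # ML estimates: class frequency, sample mean, (biased) sample variance.
        params[label] = (len(xs) / len(x), xs.mean(), xs.var())
    return params

def predict(params, x_new):
    """Pick the class maximizing log p(y) + log N(x; mean_y, var_y)."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        log_lik = -0.5 * np.log(2 * np.pi * var) - (x_new - mean) ** 2 / (2 * var)
        scores.append(np.log(prior) + log_lik)
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Toy sample (made up for illustration): two well separated classes.
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
y = np.array([1, 1, 1, 2, 2, 2])
params = fit_gaussian_classes(x, y)
predicted = predict(params, np.array([1.5, 8.5]))
```

Estimation here is closed form: every parameter is an empirical mean of some function of the samples in one class.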


Example: Naïve Bayes

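  • Say X=[X1,…,Xn] is an n-dimensional observation

  • Assume the features are independent given the class: p(x|y) = Πi p(xi|y)

  • Parameters are p(xi=k|y), calculated by counting how many times xi=k in class y.

  • These are empirical means of indicator functions: p(xi=k|y) is the class-y sample mean of 1[xi=k]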
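The counting procedure can be sketched as follows. This is a minimal illustration without smoothing, assuming discrete features coded as integers 0..n_values-1; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_naive_bayes(X, y, n_values):
    """X: [n_samples, n_features] ints in 0..n_values-1.
    Returns (labels, priors, cond) where cond[c, i, k] estimates p(x_i = k | y = labels[c])."""
    labels = np.unique(y)
    priors = np.array([(y == c).mean() for c in labels])
    cond = np.zeros((len(labels), X.shape[1], n_values))
    for c_idx, c in enumerate(labels):
        Xc = X[y == c]
        for k in range(n_values):
            # Empirical mean of the indicator 1[x_i = k] within class c.
            cond[c_idx, :, k] = (Xc == k).mean(axis=0)
    return labels, priors, cond

def predict(labels, priors, cond, x):
    """Classify one observation x (int array of length n_features)."""
    idx = np.arange(len(x))
    with np.errstate(divide="ignore"):  # unseen values give log(0) = -inf
        log_joint = np.log(priors) + np.log(cond[:, idx, x]).sum(axis=1)
    return labels[np.argmax(log_joint)]

# Toy data (made up for illustration): two binary features, two classes.
X = np.array([[0, 1], [0, 1], [1, 1], [1, 0], [1, 0], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
labels, priors, cond = fit_naive_bayes(X, y, n_values=2)
label_for_01 = predict(labels, priors, cond, np.array([0, 1]))
```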


Generative Classifiers: Advantages

  • Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA).

  • Estimation is easy. Closed form solutions in many cases (through empirical means).

  • The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).

  • Performance is not bad at all.


Discriminative Classifiers

  • But, to classify we need only p(y|x).

    Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.

  • Assume a parametric form for p(y|x):


Discriminative Models - Estimation

  • Choose yto maximize conditional likelihood

  • Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).


Example: Logistic regression

  • Assume p(x|y) are Gaussians with different means and the same variance. Then p(y|x) = exp(ay·x + by) / Z(x)

  • Goal is to estimate ay, by

  • This is called logistic regression, since the log of the (unnormalized) distribution is linear in x
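A short derivation shows where the linear form comes from. It is written for scalar x with class means \mu_y and a shared variance \sigma^2; these symbols are assumptions, since the slide's own equation did not survive extraction.

```latex
\begin{align}
p(y \mid x) &\propto p(y)\, \exp\!\left( -\frac{(x-\mu_y)^2}{2\sigma^2} \right)
 = p(y)\, e^{-x^2/2\sigma^2} \exp\!\left( \frac{\mu_y}{\sigma^2}\, x - \frac{\mu_y^2}{2\sigma^2} \right), \\
p(y \mid x) &= \frac{\exp(a_y x + b_y)}{\sum_{y'} \exp(a_{y'} x + b_{y'})}, \qquad
a_y = \frac{\mu_y}{\sigma^2}, \quad b_y = \log p(y) - \frac{\mu_y^2}{2\sigma^2}.
\end{align}
```

The factor e^{-x^2/2\sigma^2} is the same for every class, so it cancels in the normalization. With class-dependent variances the quadratic term would not cancel and the model would no longer be linear in x.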


Discriminative Naïve Bayes

  • Assuming p(x|y) is in the Naïve Bayes class, the discriminative distribution is p(y|x) = exp(Σi ψi(xi,y)) / Z(x)

  • Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z.

  • Also called a conditional first order loglinear model.


Discriminative: Advantages

  • Estimates only the relevant distributions

    (important when X is very complex).

  • Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).

  • Can be shown to minimize an upper bound on the classification error.


The best of both worlds…

  • Generative models (often) employ empirical means which are easy and reliable to estimate.

  • But they model each class separately, so poor discrimination is obtained.

  • We would like a discriminative approach based on empirical means.


Learning from Expected values (observations, in physics)

  • Assume we have some “interesting” observables:

  • And we are given their sample empirical means for different classes Y, e.g. class two moments:

  • How can we use this information to build a classifier?

  • Idea: Look for models which yield the observed expectations, but contain no other information.


The MaxEnt approach

  • The Entropy H(X,Y) is a measure of uncertainty (and typicality!)

  • Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …)

  • “Least Committed” to the observations, most typical.

  • Yields “nice” exponential forms:
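The exponential form itself was lost in extraction. A hedged reconstruction, assuming observables \vec{\phi}(x) with class-conditional means \vec{a}(y), a fixed marginal \bar{p}(y), and Lagrange multipliers \vec{\lambda}(y) (all notation here is an assumption):

```latex
\begin{align}
\max_{p}\; H(X,Y) \quad &\text{s.t.} \quad \sum_x p(x \mid y)\, \vec{\phi}(x) = \vec{a}(y) \;\; \forall y, \qquad p(y) = \bar{p}(y), \\
p_{ME}(x \mid y) &= \frac{1}{Z(y)} \exp\!\big( \vec{\lambda}(y) \cdot \vec{\phi}(x) \big), \qquad
Z(y) = \sum_x \exp\!\big( \vec{\lambda}(y) \cdot \vec{\phi}(x) \big).
\end{align}
```

With the marginal fixed, H(X,Y) = H(Y) + \sum_y \bar{p}(y) H(X \mid Y=y), so each class conditional is maximized separately. For \vec{\phi}(x) = (x, x^2) on the real line this gives a Gaussian per class.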


Occam’s in Classification

  • Minimum assumptions about X and Y imply independence.

  • Because X behaves differently for different Y they cannot be independent

  • How can we quantify their level of dependence?

[Figure: class-conditional densities p(x|y=1) and p(x|y=2) over X, with means m1 and m2]


Mutual Information (Shannon 48)

  • The measure of the information shared by two variables

  • X and Y are independent iff I(X;Y)=0

  • Bounds the classification error: eBayes ≤ 0.5 (H(Y) - I(X;Y)) (Hellman and Raviv 1970).

  • Why not minimize it subject to the observation constraints?
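The quantities on this slide are easy to compute for a small discrete joint distribution. The sketch below is illustrative only: the joint table is made up and the function names are hypothetical. It computes I(X;Y) in bits, the Bayes error, and the right-hand side of the bound quoted above.

```python
import numpy as np

def mutual_information_bits(p_xy):
    """I(X;Y) in bits for a joint table p_xy[x, y] that sums to 1."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0  # 0 log 0 is taken as 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def bayes_error(p_xy):
    """Error of the optimal rule, which predicts argmax_y p(x, y) for each x."""
    return float(1.0 - p_xy.max(axis=1).sum())

# Made-up joint distribution over two x values and two classes.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
info = mutual_information_bits(p_xy)
bound = 0.5 * (entropy_bits(p_xy.sum(axis=0)) - info)
error = bayes_error(p_xy)
```

For an independent table, such as `np.outer([0.5, 0.5], [0.3, 0.7])`, `mutual_information_bits` returns zero up to rounding, matching the independence bullet.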


More for Mutual Information…

  • I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.

  • Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).

  • Quantifies independence in a model free way

  • Has a natural multivariate extension - I(X1,…,Xn).


MinMI: Problem Setting

  • Given a sample (x1,y1),…,(xn,yn)

  • For each y, calculate the empirical expected value of φ(X)

  • Calculate empirical marginal p(y)

  • Find the minimum Mutual Information distribution with the given empirical expected values

  • The value of the minimum information is precisely the information in the observations!


MinMI Formulation

  • The (convex) set of constraints

  • The information minimizing distribution

  • A convex problem. No local minima!
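The two formulas on this slide did not survive extraction. A hedged reconstruction from the problem setting, with \vec{\phi}(x) for the observables, \vec{a}(y) for their class-conditional empirical means and \bar{p}(y) for the empirical marginal (notation assumed):

```latex
\begin{align}
\mathcal{F}(\vec{a}) &= \Big\{\, p(x,y) \;:\; \sum_x p(x \mid y)\, \vec{\phi}(x) = \vec{a}(y) \;\; \forall y, \quad p(y) = \bar{p}(y) \,\Big\}, \\
p_{MI} &= \arg\min_{p \in \mathcal{F}(\vec{a})} I[p(x,y)], \qquad
I[p] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}.
\end{align}
```

The constraints are linear in p(x|y), and for a fixed marginal p(y) the mutual information is convex in p(x|y), which is why there are no local minima.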


  • The problem is convex given p(y) for any empirical means, without specifying p(x).

  • The minimization generates an auxiliary sparse pMI(x): support alignments.


Characterizing

  • The solution form

  • Where ψ(y) are Lagrange multipliers and γ(y) is set by normalization

  • Via Bayes

  • Can be used for classification. But how do we find it?
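The solution form can be recovered by setting the derivative of the Lagrangian with respect to p(x|y) to zero, which makes log p(x|y)/p(x) linear in the observables. The symbols \vec{\psi}(y) for the multipliers and \gamma(y) for the normalization term are assumed names, since the slide's equations were lost:

```latex
\begin{align}
p_{MI}(x \mid y) &= p_{MI}(x)\, \exp\!\big( \vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y) \big), \\
p_{MI}(y \mid x) &= \bar{p}(y)\, \exp\!\big( \vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y) \big)
 \qquad \text{for } p_{MI}(x) > 0 .
\end{align}
```

The second line follows from Bayes' rule, p(y|x) = p(x|y) p(y) / p(x), in which the unknown marginal p_{MI}(x) cancels.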


Careful… I cheated…

  • What if pMI(x)=0?

  • No legal pMI(y|x) …

  • But we can still define:

  • Can show that it is subnormalized:

  • And use f(y|x) for classification!

  • Solutions are actually very sparse. Many pMI(x) are zero. “Support Assignments”…


A dual formulation

  • Using convex duality we can show that MinMI can be formulated as

  • Called a geometric program

  • Strict inequalities for x such that p(x)=0

  • Avoids dealing with p(x) at all!
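A hedged reconstruction of the dual, using assumed names \vec{\psi}(y) and \gamma(y) for the dual variables, \vec{a}(y) for the class-conditional means and \bar{p}(y) for the class marginal:

```latex
\begin{align}
I_{\min} = \max_{\vec{\psi}(y),\, \gamma(y)} \;& \sum_y \bar{p}(y) \big[\, \vec{\psi}(y) \cdot \vec{a}(y) + \gamma(y) \,\big] \\
\text{s.t.} \;\;& \sum_y \bar{p}(y)\, \exp\!\big( \vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y) \big) \le 1 \qquad \forall x .
\end{align}
```

Each constraint is a sum of exponentials of affine functions of the variables, which is the structure of a geometric program. Only \bar{p}(y), \vec{a}(y) and \vec{\phi}(x) appear; p(x) does not.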


A generalization bound

  • If the estimated means are equal to their true expected values, we can show that the generalization error satisfies

[Figure: -log2 fMI(y|x) and fMI(y|x), shown for Y=1]


A Game Theoretic Interpretation

  • Among all distributions in F(a), why choose MinMI?

  • The MinMI classifier minimizes the worst case loss in the class

  • The loss is an upper bound on generalization error

  • Minimize a worst case upper bound


MinMI and Joint Typicality

Given a sequence x^n, the probability that another, independently drawn sequence y^n appears to be drawn from their joint distribution (is jointly typical with it) is asymptotically 2^(-n I(X;Y)).

Suggesting Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.


I-Projections (Csiszar 75, Amari 82,…)

  • The I-projection of a distribution q(x) on a set F

  • For a set defined by linear constraints:

  • Can be calculated using Generalized Iterative Scaling or Gradient methods.
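The definitions on this slide were lost in extraction. In standard notation (assumed here), with D_{KL} the Kullback-Leibler divergence and \vec{\lambda} a vector of Lagrange multipliers:

```latex
\begin{align}
p^{*} &= \arg\min_{p \in \mathcal{F}} D_{KL}[\, p \,\|\, q \,], \qquad
D_{KL}[\, p \,\|\, q \,] = \sum_x p(x) \log \frac{p(x)}{q(x)}, \\
\mathcal{F} &= \Big\{\, p \;:\; \sum_x p(x)\, \vec{\phi}(x) = \vec{a} \,\Big\}
\;\;\Longrightarrow\;\;
p^{*}(x) = \frac{1}{Z(\vec{\lambda})}\, q(x)\, \exp\!\big( \vec{\lambda} \cdot \vec{\phi}(x) \big).
\end{align}
```

The multipliers \vec{\lambda} are chosen so that the constraint holds. The projection keeps the support of q: wherever q(x) = 0, p^{*}(x) = 0 as well.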

Looks familiar?


The MinMI Algorithm

  • Initialize p0(x)

  • Iterate

    • For all y: set pt+1(x|y) to be the I-projection of pt(x) on the constraint set for class y

    • Marginalize: pt+1(x) = Σy p(y) pt+1(x|y)
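The iteration can be sketched for a small finite X. This is an illustration under stated assumptions, not the authors' implementation: the I-projection is computed by plain gradient descent on its convex dual (the slides mention Generalized Iterative Scaling or gradient methods), the initial marginal is taken to be uniform, features are assumed scaled to [0, 1] so the fixed step size is stable, and all function names and the toy problem are hypothetical.

```python
import numpy as np

def _tilt(log_q, phi, lam):
    """Return p(x) proportional to q(x) * exp(phi(x) . lam)."""
    logits = log_q + phi @ lam
    p = np.exp(logits - logits.max())
    return p / p.sum()

def i_projection(q, phi, a, steps=500, lr=2.0):
    """I-project q onto {p : sum_x p(x) phi(x) = a} by gradient descent on the
    dual objective log Z(lam) - lam . a (step size assumes phi in [0, 1])."""
    lam = np.zeros(phi.shape[1])
    log_q = np.log(np.maximum(q, 1e-300))  # floor avoids log(0)
    for _ in range(steps):
        lam -= lr * (_tilt(log_q, phi, lam) @ phi - a)  # gradient: E_p[phi] - a
    return _tilt(log_q, phi, lam)

def mutual_information(p_y, p_x_given_y):
    """I(X;Y) in nats for p(y) [n_y] and conditionals p(x|y) [n_y, n_x]."""
    p_x = p_y @ p_x_given_y
    ratio = np.where(p_x_given_y > 0, p_x_given_y / np.maximum(p_x, 1e-300), 1.0)
    return float((p_y[:, None] * p_x_given_y * np.log(ratio)).sum())

def minmi(phi, a, p_y, n_iter=30):
    """phi: [n_x, d] features, a: [n_y, d] class-conditional means, p_y: [n_y]."""
    p_x = np.full(phi.shape[0], 1.0 / phi.shape[0])  # assumed uniform start
    history = []
    for _ in range(n_iter):
        # Project the current marginal onto each class's constraint set.
        p_x_given_y = np.stack([i_projection(p_x, phi, a_y) for a_y in a])
        p_x = p_y @ p_x_given_y  # marginalize
        history.append(mutual_information(p_y, p_x_given_y))
    return p_x_given_y, p_x, history

# Toy problem: X on a grid in [0, 1], one observable phi(x) = x, two classes.
xs = np.linspace(0.0, 1.0, 11)
phi = xs[:, None]
a = np.array([[0.3], [0.7]])
p_y = np.array([0.5, 0.5])
cond, marg, hist = minmi(phi, a, p_y)
```

Each pass is an alternating minimization, so the recorded information values in `hist` should not increase, and the marginal `marg` gradually concentrates on a few points of the grid.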



Example: Two moments

  • Observations are class conditional mean and variance.

  • MaxEnt solution would be p(X|y) a Gaussian.

  • MinMI solutions are far from Gaussians and discriminate much better.

[Figure: MaxEnt and MinMI solutions]


Example: Conditional Marginals

  • Recall in Naïve Bayes we used the empirical means of:

  • Can use these means for MinMI.


Naïve Bayes Analogs

Naïve Bayes

Discriminative 1st Order LogLinear


Experiments

  • 12 UCI datasets, discrete features only. Used singleton marginal constraints.

  • Compared to Naïve Bayes and 1st order LogLinear model.

  • Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.


Generalization error for full sample


Related ideas

  • Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)

  • Efficient representations of X with respect to Y:

    The Information Bottleneck approach.

  • Bounding the information in neural codes from very sparse statistics.

  • Statistical extension of Support Vector Machines.


Conclusions

  • MinMI outperforms the discriminative model for small sample sizes

  • It also outperforms the generative model.

  • Presented a method for inferring classifiers based on simple sample means.

  • Unlike generative models, provides generalization guarantees.

