
# Discriminative Classifiers - PowerPoint PPT Presentation





## Presentation Transcript

• Model the classification / decision surface directly (rather than modeling class membership and deriving the decision)

• LTU (linear threshold unit, also “perceptron”)

• LMS (least mean square) algorithm

• Fisher discriminant

• SVMs

• And now: Logistic Regression (http://www.cs.cmu.edu/%7Etom/NewChapters.html)

CS446-Fall ’06

• Assume a binary classification Y with Pr(Y|X) monotonic in the features X. Is the subject likely to:

• Suffer a heart attack within 1 yr? Given: Number of previous heart attacks

• Be over six feet tall? Given: Gender and heights of parents

• Multivariate and ordinal responses are also possible

• Characterize whether Y=0 or Y=1 is more likely given X

• Odds ratio characterizes a possible decision surface

• Assign Y = 0 if P(Y=0 | X) / P(Y=1 | X) > 1; assign Y = 1 otherwise


Logistic Regression

[Figure: binary labels Y plotted against a one-dimensional feature X]

• Relative class proportion changes with X

• For one-dimensional X, Y might look like:

• How to model the decision surface?


• Odds ratio: P(Y=0 | X) / P(Y=1 | X)

• Model the log of the odds ratio as a linear function of the features

• P(Y=1 | X) = 1 – P(Y=0 | X); Let P be P(Y=0 | X)

• ln(odds) = ln(P / (1 – P)) = logit(P)

• Assuming the logit is linear: ln(P / (1 – P)) = w0 + w1x1 + w2x2 +…+ wnxn

• Exponentiate, multiply by (1 − P), and collect like terms:

P = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn))

and (1 − P) = 1 / (1 + exp(w0 + w1x1 + … + wnxn))

Not quite standard (usually these are reversed) – remember for later…
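A minimal numeric sketch of this derivation, assuming the slide's (reversed) convention that P = P(Y=0 | X) carries the exponential; `class_probs` and the weights are made up for illustration:

```python
import math

def class_probs(x, w0, w):
    """Return (P(Y=0|X), P(Y=1|X)) under the slide's convention, where
    the logit ln(P / (1 - P)) = w0 + sum_i wi*xi and P = P(Y=0|X)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    p0 = math.exp(z) / (1.0 + math.exp(z))   # P
    p1 = 1.0 / (1.0 + math.exp(z))           # 1 - P
    return p0, p1

p0, p1 = class_probs([2.0, -1.0], w0=0.5, w=[1.0, 3.0])
assert abs(p0 + p1 - 1.0) < 1e-12              # the two forms sum to 1
assert abs(math.log(p0 / p1) - (-0.5)) < 1e-9  # log-odds recover z = 0.5+2-3
```

Exponentiating the linear logit and solving for P is all that separates the linear model from the two probability expressions.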


• Thus, we impute a form to P(Y=0 | X) and P(Y=1 | X)

• Consider P(Y=0 | X) = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn)):

• At one extreme the exponent approaches −∞

• P(Y=0 | X) approaches 0

• At the other it approaches +∞

• P(Y=0 | X) approaches 1

• It transitions in the middle, as does P(Y=1 | X), which is just 1 − P(Y=0 | X)


Class Membership Probability Functions

• We can graph P(Y=0 | X) and P(Y=1 | X)

[Figure: P(Y=1 | X) and P(Y=0 | X) plotted against a one-dimensional X, with w0 = −5, w1 = 15]

• Classification boundary: the odds ratio is 1 here, and the logit is 0, since ln(1) = 0
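With the weights shown (w0 = −5, w1 = 15) and the slide's convention that P(Y=0 | X) carries the exponential, the boundary sits where the logit is zero, i.e. x = 1/3; a quick check with a hypothetical helper:

```python
import math

w0, w1 = -5.0, 15.0

def p0(x):
    """P(Y=0 | X=x) = exp(z) / (1 + exp(z)) with z = w0 + w1*x."""
    z = w0 + w1 * x
    return math.exp(z) / (1.0 + math.exp(z))

x_boundary = -w0 / w1                      # logit w0 + w1*x = 0  =>  x = 1/3
assert abs(x_boundary - 1 / 3) < 1e-12
assert abs(p0(x_boundary) - 0.5) < 1e-9    # odds ratio is 1 at the boundary
assert p0(1.0) > 0.99 and p0(0.0) < 0.01   # steep transition on either side
```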


• The w‘s determine the behavior of the classifier

• wi, i = 1, …, n, independently control the steepness for each feature

• w0 repositions the classification transition

• Choose the best w‘s for the training data


What are the Best w‘s for the Training Data?

• The training data are assumed to be independent

• So we want to maximize ∏l P(Yl | Xl, W), where Yl is the class and Xl are the features of the l-th training example

• Equivalently, to expose the underlying linearity of example independence, maximize the log likelihood ∑l ln P(Yl | Xl, W)

• Thus, we want the maximum likelihood estimation of w for the training data

• Now Mitchell changes representation; so will we: treat the class Yl as the number 0 or 1, so that P(Yl | Xl, W) = P(Yl=1 | Xl, W)^Yl · P(Yl=0 | Xl, W)^(1−Yl)


• Note the likelihood itself is invariant under this representation change

• Consider the sum l(W) = ∑l Yl ln P(Yl=1 | Xl, W) + (1 − Yl) ln P(Yl=0 | Xl, W)

• We wish to maximize this sum over W

• There is no closed-form solution, but we can iterate using the gradient


• We want to maximize l(W) = ∑l Yl ln P(Yl=1 | Xl, W) + (1 − Yl) ln P(Yl=0 | Xl, W)

• Let P̂(Yl=1 | Xl, W) be the probability of Y = 1 given X for the current W, so P̂(Yl=1 | Xl, W) = exp(w0 + ∑i wiXil) / (1 + exp(w0 + ∑i wiXil))

• Also, to treat w0 consistently, introduce X0 ≡ 1

• Then the gradient components can be written ∂l(W)/∂wi = ∑l Xil (Yl − P̂(Yl=1 | Xl, W))


• We can view (Yl − P̂(Yl=1 | Xl, W)) as the prediction error

• This is multiplied by the feature value Xil (this should look familiar)

• Weight update rule: wi ← wi + η ∑l Xil (Yl − P̂(Yl=1 | Xl, W)), where η is a step size / learning rate (this should also look familiar)
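A sketch of batch gradient ascent with this error-times-feature update; the toy one-dimensional data, step size, and iteration count are invented for illustration:

```python
import math

def p1(x, w):
    """P-hat(Y=1 | x, W) = exp(w.x) / (1 + exp(w.x)); x[0] == 1 covers w0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

def log_likelihood(data, w):
    return sum(math.log(p1(x, w) if y == 1 else 1.0 - p1(x, w)) for x, y in data)

# Toy 1-D training set; X0 = 1 is the constant feature for w0.
data = [([1.0, 0.1], 0), ([1.0, 0.2], 0), ([1.0, 0.4], 0),
        ([1.0, 0.6], 1), ([1.0, 0.8], 1), ([1.0, 0.9], 1)]

w, eta = [0.0, 0.0], 0.1
before = log_likelihood(data, w)
for _ in range(200):                      # batch gradient ascent
    grad = [sum(x[i] * (y - p1(x, w)) for x, y in data) for i in range(len(w))]
    w = [wi + eta * gi for wi, gi in zip(w, grad)]
after = log_likelihood(data, w)

assert after > before                     # the likelihood climbed
assert p1([1.0, 0.9], w) > 0.5 and p1([1.0, 0.1], w) < 0.5
```

Each weight moves by the summed prediction error weighted by its feature, exactly the gradient component above.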


• Recall from the LMS gradient algorithm: Error(W) = ½ ∑l (Yl − W·Xl)²

• Partial of the error w.r.t. weight i: ∂Error/∂wi = −∑l Xil (Yl − W·Xl)

• Yielding the weight update rule: wi ← wi + η ∑l Xil (Yl − W·Xl)
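For comparison, a minimal LMS sketch on an invented, exactly linear target: the update has the same error-times-feature shape, with the raw linear output W·X in place of the probability:

```python
# LMS sketch: same error-times-feature update, raw linear output as prediction.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]  # y = 1 + 2x

w, eta = [0.0, 0.0], 0.05
for _ in range(2000):
    for x, y in data:                      # incremental (per-example) updates
        out = sum(wi * xi for wi, xi in zip(w, x))          # prediction W.X
        w = [wi + eta * (y - out) * xi for wi, xi in zip(w, x)]

assert abs(w[0] - 1.0) < 1e-2 and abs(w[1] - 2.0) < 1e-2    # recovers y = 1 + 2x
```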


• What if the training data are linearly separable? The likelihood keeps improving as the weights grow without bound

• What if the margin shrinks because of only a few data points (or just one)?

• What if the data are not quite linearly separable, but only because of a few data points?

• Recall SVMs

• We would like to prefer a large margin

• Prefer less steep slopes

• Even if it means misclassifying some points


• Penalize complexity

• What is complexity? The magnitude of W

• Optimization problem becomes: maximize ∑l ln P(Yl | Xl, W) − (λ/2) ‖W‖²

• Update rule becomes: wi ← wi + η ∑l Xil (Yl − P̂(Yl=1 | Xl, W)) − η λ wi
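A sketch of the penalized update, assuming the usual L2 penalty (an extra −ηλwi term per step); the separable toy data and λ values are invented. A heavier penalty yields smaller weights, i.e. a less steep slope:

```python
import math

def p1(x, w):                             # P-hat(Y=1 | x, W)
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

def fit(data, lam, eta=0.1, iters=500):
    """Gradient ascent on the penalized likelihood:
    wi <- wi + eta * sum_l Xil (Yl - P-hat) - eta * lam * wi."""
    w = [0.0, 0.0]
    for _ in range(iters):
        grad = [sum(x[i] * (y - p1(x, w)) for x, y in data) - lam * w[i]
                for i in range(len(w))]
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w

# Linearly separable toy data: without a penalty the weights keep growing.
data = [([1.0, -1.0], 0), ([1.0, -0.5], 0), ([1.0, 0.5], 1), ([1.0, 1.0], 1)]

norm = lambda w: math.sqrt(sum(wi * wi for wi in w))
w_light, w_heavy = fit(data, lam=0.01), fit(data, lam=1.0)
assert norm(w_heavy) < norm(w_light)      # heavier penalty => less steep slope
```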


The Bayes Optimal Classifier: getting away from generative models. Our first ensemble method!

• H is a parameterized hypothesis space

• hML is the maximum likelihood hypothesis given some data.

• Can we do better (higher expected accuracy) than hML?

• Yes! We expect hMAP to outperform hML IF…

• There is an interesting prior P(h) (i.e., not uniform)

• Can we do better (higher expected accuracy) than hMAP?

• Yes! Bayes Optimal will outperform hMAP IF…(some assumptions)


Getting a second opinion from another doctor, a third opinion from yet another…

One doctor is most confident. He is hML

One doctor is most reliable / accurate. She is hMAP

But she may only be a little more trustworthy than the others.

What if hMAP says “+” but *all* other h ∈ H say “−”?

• If P(hMAP |D) < 0.5, perhaps we should prefer “-”

• Think of each hi as casting a weighted vote

• Weight each hi by how likely it is to be correct given the training data.

• Not just by P(h) which is already reflected in hMAP

• Rather by P(h|D)

• The most reliable joint opinion may contradict hMAP


• Assume a space of 3 hypotheses with posteriors P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3

• Given a new instance, assume that

h1(x) = 1, h2(x) = 0, h3(x) = 0

• In this case, P(f(x) = 1) = 0.4 and P(f(x) = 0) = 0.6, but hMAP(x) = 1

• We want to determine the most probable classification by combining the prediction of all hypotheses

• We can weight each by its posterior probability (there are additional lurking assumptions…)
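A sketch of the weighted vote for this example. The individual posteriors are an assumption: only h1 predicts 1 and P(f(x)=1) = 0.4 forces P(h1|D) = 0.4, and the remaining 0.6 is assumed split evenly between h2 and h3:

```python
# Weighted vote over the hypothesis space (Bayes optimal classification).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h | D); split assumed
predictions = {"h1": 1, "h2": 0, "h3": 0}        # each h's label for the new x

p_one = sum(p for h, p in posteriors.items() if predictions[h] == 1)
p_zero = sum(p for h, p in posteriors.items() if predictions[h] == 0)

h_map = max(posteriors, key=posteriors.get)      # the MAP hypothesis: h1
bayes_optimal = 1 if p_one > p_zero else 0

assert abs(p_one - 0.4) < 1e-12 and abs(p_zero - 0.6) < 1e-12
assert predictions[h_map] == 1                   # hMAP alone predicts 1...
assert bayes_optimal == 0                        # ...but the joint vote says 0
```

The joint, posterior-weighted opinion contradicts hMAP, which is exactly the point of the slide.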
