Discriminative Classifiers

Presentation Transcript

Discriminative Classifiers

- Model the classification / decision surface directly (rather than modeling class membership and deriving the decision)
- LTU (linear threshold unit, also “perceptron”)
- LMS (least mean square) algorithm
- Fisher discriminant
- SVMs
- And now: Logistic Regression (http://www.cs.cmu.edu/%7Etom/NewChapters.html)


Logistic Regression

- Multivariate and ordinal versions are also possible
- Characterize whether Y=0 or Y=1 is more likely given X
- The odds ratio characterizes a possible decision surface
- Assign Y=0 if the odds ratio P(Y=0 | X) / P(Y=1 | X) is greater than 1; assign Y=1 otherwise

- Assume binary classification Y with Pr(Y|X) monotonic in the features X. Is the subject likely to:
- Suffer a heart attack within 1 yr? Given: Number of previous heart attacks
- Be over six feet tall? Given: Gender and heights of parents
- Receive an A in CS446? Given: Grade in CS273



Logistic Regression

- Relative class proportion changes with X
- For one-dimensional X, Y might look like: [figure: Y (0 or 1) plotted against X]
- How to model the decision surface?


Logit Function

- Odds ratio: P(Y=0 | X) / P(Y=1 | X)
- Model the log of the odds ratio as a linear function of the features
- P(Y=1 | X) = 1 – P(Y=0 | X); let P be P(Y=0 | X)
- ln(odds) = ln(P / (1 – P)) = logit(P)
- Assuming the logit is linear: ln(P / (1 – P)) = w0 + w1x1 + w2x2 + … + wnxn
- Exponentiate, multiply by (1 – P), collect like terms:
  P = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn))
  and (1 – P) = 1 / (1 + exp(w0 + w1x1 + … + wnxn))

Not quite standard (usually these are reversed) – remember for later…
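A minimal Python sketch of these (reversed) class probabilities for a single feature; the weights and inputs below are illustrative values, not from the lecture:

```python
import math

def class_probs(x, w0, w1):
    """Return (P(Y=0|X), P(Y=1|X)) under the slide's convention,
    where the logit of P(Y=0|X) is the linear function w0 + w1*x."""
    z = w0 + w1 * x
    p0 = math.exp(z) / (1.0 + math.exp(z))   # P(Y=0 | X)
    p1 = 1.0 / (1.0 + math.exp(z))           # P(Y=1 | X) = 1 - P(Y=0 | X)
    return p0, p1

# Illustrative (hypothetical) weights: w0 = -2, w1 = 4
print(class_probs(0.0, -2.0, 4.0))   # logit -2: P(Y=0|X) ~ 0.12
print(class_probs(1.0, -2.0, 4.0))   # logit +2: P(Y=0|X) ~ 0.88
```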


Form of the Probabilities

- Thus, we impute a form to P(Y=0 | X) and P(Y=1 | X)
- Consider P(Y=0 | X) = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn)):
- At one extreme the exponent approaches –∞
- P(Y=0 | X) approaches 0
- At the other it approaches +∞
- P(Y=0 | X) approaches 1
- It transitions in the middle, as does P(Y=1 | X), which is just 1 – P(Y=0 | X)
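A quick numeric check of these limits, using the imputed form; the logit values below are arbitrary illustrations:

```python
import math

def p0(z):
    """P(Y=0 | X) as a function of the logit z = w0 + w1*x1 + ... + wn*xn."""
    return math.exp(z) / (1.0 + math.exp(z))

# As the exponent heads toward -inf, P(Y=0|X) -> 0; toward +inf, it -> 1
for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(z, round(p0(z), 4))   # ~0.0, 0.1192, 0.5, 0.8808, ~1.0
```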


Class Membership Probability Functions

- We can graph P(Y=0 | X) and P(Y=1 | X) against X, e.g. with w0 = –5, w1 = 15
- Classification boundary: the odds ratio is 1 there, so the logit is 0: ln(1) = 0
- [figure: the two curves P(Y=0 | X) and P(Y=1 | X) crossing at the boundary]
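A small worked check of this boundary, assuming the same form as above with the slide's w0 = –5 and w1 = 15: the logit is zero at x = –w0/w1 = 1/3, where both class probabilities are 0.5 and the odds ratio is 1.

```python
import math

w0, w1 = -5.0, 15.0

def p_y0(x):
    z = w0 + w1 * x                      # logit of P(Y=0 | X)
    return math.exp(z) / (1.0 + math.exp(z))

boundary = -w0 / w1                      # logit = 0  =>  x = 1/3
print(boundary, p_y0(boundary))          # 0.333..., 0.5 (odds ratio is 1 here)
print(p_y0(0.0), p_y0(1.0))              # ~0.0067 and ~0.99995
```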


Modeling Control / Flexibility

- The w's determine the behavior of the classifier
- wi, i = 1, …, n, independently control the steepness for each feature
- w0 repositions the classification transition
- Choose the best w's for the training data
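A tiny illustration of that control, with arbitrarily chosen weights: the first two settings share the boundary x = 1/3 but differ in steepness (w1), while the third keeps the slope and uses w0 to move the boundary to x = 2/3:

```python
import math

def p0(x, w0, w1):
    z = w0 + w1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# (w0, w1): same boundary but steeper, then same slope but shifted boundary
for w0, w1 in [(-5.0, 15.0), (-15.0, 45.0), (-10.0, 15.0)]:
    probs = [round(p0(x, w0, w1), 3) for x in (0.2, 1/3, 0.5, 0.7)]
    print(w0, w1, probs)
```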


What are the Best w's for the Training Data?

- The training data are assumed to be independent
- So we want the W that maximizes Πl P(Y^l | X^l, W), where Y^l is the class and X^l are the features of the l'th training example
- Equivalently, to expose the underlying linearity of example independence, maximize Σl ln P(Y^l | X^l, W)
- Thus, we want the maximum likelihood estimate of W for the training data
- Now Mitchell changes representation; so will we


Training

- Note that the likelihood is invariant under the representation change
- Consider the sum (the conditional log-likelihood): l(W) = Σl [ Y^l ln P(Y^l = 1 | X^l, W) + (1 – Y^l) ln P(Y^l = 0 | X^l, W) ]
- We wish to maximize this sum over W
- There is no closed-form solution, but we can iterate using the gradient (a small sketch of the sum follows)
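Before turning to the gradient, a minimal sketch of this sum in Python. It assumes Y ∈ {0, 1} and the orientation P(Y=1 | X, W) = exp(W·X) / (1 + exp(W·X)) (the earlier convention flipped, as the "remember for later" note hints); the tiny data set and candidate W are hypothetical:

```python
import math

def p1(x_row, w):
    """P(Y=1 | x, w) = exp(w.x) / (1 + exp(w.x)); x_row already includes x0 = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x_row))
    return math.exp(z) / (1.0 + math.exp(z))

def log_likelihood(X, Y, w):
    """Sum over l of [ Y^l ln P(Y=1|X^l,w) + (1 - Y^l) ln P(Y=0|X^l,w) ]."""
    total = 0.0
    for x, y in zip(X, Y):
        p = p1([1.0] + list(x), w)       # x0 = 1 carries the w0 term
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Tiny hypothetical training set: Y tends to 1 for larger x
X = [[0.1], [0.4], [0.6], [0.9]]
Y = [0, 0, 1, 1]
print(log_likelihood(X, Y, [-2.0, 4.0]))   # one candidate W; higher is better
```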


Weight Update Rule

- Want to maximize l(W) = Σl [ Y^l ln P(Y^l = 1 | X^l, W) + (1 – Y^l) ln P(Y^l = 0 | X^l, W) ]
- Let P̂^l be the probability of Y = 1 given X^l for the current W, so P̂^l = P(Y = 1 | X^l, W)
- Also, to treat w0 consistently, introduce X0 = 1
- Then the gradient components can be written ∂l(W)/∂wi = Σl Xi^l (Y^l – P̂^l)


Gradient Interpretation

- We can view (Y^l – P̂^l) as the prediction error
- This is multiplied by the feature value Xi^l (this should look familiar)
- Weight update rule: wi ← wi + η Σl Xi^l (Y^l – P̂^l), where η is a step size / learning rate (this should also look familiar); a full training-loop sketch follows
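Putting the pieces together, a compact batch gradient-ascent sketch of this update rule. The learning rate η, iteration count, and data are illustrative assumptions, and P(Y=1 | X, W) is again taken as exp(W·X) / (1 + exp(W·X)) so that the error term is (Y^l – P̂^l):

```python
import math

def p1(x_row, w):
    """P(Y=1 | x, w) = exp(w.x) / (1 + exp(w.x)); x_row includes x0 = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x_row))
    return math.exp(z) / (1.0 + math.exp(z))

def train(X, Y, eta=0.1, iters=1000):
    """Batch gradient ascent on the conditional log-likelihood."""
    X = [[1.0] + list(x) for x in X]          # introduce x0 = 1 for w0
    w = [0.0] * len(X[0])
    for _ in range(iters):
        # gradient component i: sum over l of Xi^l * (Y^l - P(Y=1 | X^l, w))
        grads = [sum(x[i] * (y - p1(x, w)) for x, y in zip(X, Y))
                 for i in range(len(w))]
        w = [wi + eta * gi for wi, gi in zip(w, grads)]   # step of size eta
    return w

# Hypothetical, slightly overlapping 1-D data (Y tends to 1 for larger x)
X = [[0.1], [0.3], [0.4], [0.5], [0.7], [0.9]]
Y = [0, 0, 1, 0, 1, 1]
print(train(X, Y))   # w0 < 0, w1 > 0: the boundary lands near the middle
```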


Least Mean Square Iterative Algorithm

- Recall the LMS gradient algorithm, with squared error E = ½ Σl (Y^l – W·X^l)²
- Partial of the error w.r.t. weight i: ∂E/∂wi = –Σl Xi^l (Y^l – W·X^l)
- Yielding the weight update rule: wi ← wi + η Σl Xi^l (Y^l – W·X^l)
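For comparison, a sketch of one LMS step on that squared-error objective; the update has the same error-times-feature shape, with the linear output W·X^l in place of the predicted probability (η and the data are illustrative):

```python
def lms_step(X, Y, w, eta=0.05):
    """One gradient-descent step on E = 0.5 * sum_l (Y^l - w.X^l)^2."""
    X = [[1.0] + list(x) for x in X]                     # x0 = 1 for w0
    preds = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    # -dE/dwi = sum_l Xi^l * (Y^l - w.X^l): error times feature value
    grads = [sum(x[i] * (y - p) for x, y, p in zip(X, Y, preds))
             for i in range(len(w))]
    return [wi + eta * gi for wi, gi in zip(w, grads)]

w = [0.0, 0.0]
X = [[0.1], [0.4], [0.6], [0.9]]
Y = [0, 0, 1, 1]
for _ in range(500):
    w = lms_step(X, Y, w)
print(w)   # roughly the least-squares line fit of Y against x
```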


Problems - overfitting

- What if the training data are linearly separable?
- What if the margin shrinks because of a few data points (or just one)?
- What if the data are not quite linearly separable, but only because of a few data points?
- Recall SVMs
- We would like to prefer a large margin
- Prefer less steep slopes
- Even if it means misclassifying some points


Regularization

- Penalize complexity
- What is complexity? The magnitude of W
- Optimization problem becomes: maximize Σl ln P(Y^l | X^l, W) – (λ/2) ||W||²
- Update rule becomes: wi ← wi + η Σl Xi^l (Y^l – P̂^l) – η λ wi (see the sketch below)
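A sketch of the penalized ascent step, assuming the squared-magnitude penalty (λ/2)·||W||² is subtracted from the conditional log-likelihood, so each step also shrinks the weights by η·λ·wi; λ, η, and the data are illustrative (and w0 is penalized here only for simplicity):

```python
import math

def p1(x_row, w):
    """P(Y=1 | x, w) = exp(w.x) / (1 + exp(w.x)); x_row includes x0 = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x_row))
    return math.exp(z) / (1.0 + math.exp(z))

def regularized_step(X, Y, w, eta=0.1, lam=0.1):
    """One ascent step on  sum_l ln P(Y^l | X^l, w)  -  (lam/2) * ||w||^2."""
    X = [[1.0] + list(x) for x in X]
    grads = [sum(x[i] * (y - p1(x, w)) for x, y in zip(X, Y)) - lam * w[i]
             for i in range(len(w))]      # the penalty pulls each wi toward 0
    return [wi + eta * gi for wi, gi in zip(w, grads)]

# Linearly separable (hypothetical) data: unregularized, the weights and the
# steepness of the sigmoid would grow without bound; lam keeps them bounded
X = [[0.1], [0.3], [0.6], [0.9]]
Y = [0, 0, 1, 1]
w = [0.0, 0.0]
for _ in range(2000):
    w = regularized_step(X, Y, w)
print(w)
```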


The Bayes Optimal Classifier
Getting away from generative models; our first ensemble method!

- H is a parameterized hypothesis space
- hML is the maximum likelihood hypothesis given some data.
- Can we do better (higher expected accuracy) than hML?
- Yes! We expect hMAP to outperform hML IF…
- There is an interesting prior P(h) (i.e., not uniform)
- Can we do better (higher expected accuracy) than hMAP?
- Yes! Bayes Optimal will outperform hMAP IF…(some assumptions)


Bayes Optimal Classifier

Getting a second opinion from another doctor, a third opinion from yet another…

One doctor is most confident. He is hML

One doctor is most reliable / accurate. She is hMAP

But she may only be a little more trustworthy than the others.

What if hMAP says “+” but *all* other h ∈ H say “-”?

- If P(hMAP |D) < 0.5, perhaps we should prefer “-”
- Think of each hi as casting a weighted vote
- Weight each hi by how likely it is to be correct given the training data.
- Not just by P(h) which is already reflected in hMAP
- Rather by P(h|D)

- The most reliable joint opinion may contradict hMAP


Bayes Optimal Classifier: Example

- Assume a space of 3 hypotheses with posteriors P(h1 | D) = 0.4 and P(h2 | D) + P(h3 | D) = 0.6 (say 0.3 each)
- Given a new instance x, assume that
  h1(x) = 1, h2(x) = 0, h3(x) = 0

- In this case,
  P(f(x) = 1) = 0.4, P(f(x) = 0) = 0.6, but hMAP(x) = 1

- We want to determine the most probable classification by combining the predictions of all hypotheses
- We can weight each by its posterior probability (there are additional lurking assumptions…)
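A tiny sketch of that weighted vote for this example, using the posteriors assumed above (0.4 for h1, 0.3 each for h2 and h3):

```python
# Posterior P(h | D) for each hypothesis, and its prediction for the new x
posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": 1,   "h2": 0,   "h3": 0}

# Bayes optimal classification: weight each hypothesis's vote by P(h | D)
vote = {0: 0.0, 1: 0.0}
for h, post in posteriors.items():
    vote[predictions[h]] += post

print(vote)                      # {0: 0.6, 1: 0.4}
print(max(vote, key=vote.get))   # 0: the joint opinion contradicts hMAP (= h1), which predicts 1
```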

