
Discriminative Classifiers (CS446-Fall '06)


Presentation Transcript


  1. Discriminative Classifiers • Model the classification / decision surface directly (rather than modeling class membership and deriving the decision) • LTU (linear threshold unit, also “perceptron”) • LMS (least mean square) algorithm • Fisher discriminant • SVMs • And now: Logistic Regression (http://www.cs.cmu.edu/%7Etom/NewChapters.html)

  2. Logistic Regression • Assume binary classification Y with Pr(Y|X) monotonic in the features X • Is the subject likely to: • Suffer a heart attack within 1 yr? Given: number of previous heart attacks • Be over six feet tall? Given: gender and heights of parents • Receive an A in CS446? Given: grade in CS273 • Multivariate and ordinal responses are also possible • Characterize whether Y=0 or Y=1 is more likely given X • The odds ratio P(Y=0 | X) / P(Y=1 | X) characterizes a possible decision surface • Assign Y=0 if the odds ratio exceeds 1; assign Y=1 otherwise
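A minimal sketch of this decision rule in Python, assuming an estimate of P(Y=0 | X) is already available (the function name is illustrative):

```python
def classify_by_odds(p_y0_given_x):
    """Assign Y=0 when the odds ratio P(Y=0|X) / P(Y=1|X) exceeds 1
    (equivalently, when P(Y=0|X) > 0.5); assign Y=1 otherwise."""
    odds = p_y0_given_x / (1.0 - p_y0_given_x)
    return 0 if odds > 1.0 else 1
```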

  3. Logistic Regression • Relative class proportion changes with X • For one-dimensional X, Y might look like: [figure: scatter of binary Y against X] • How to model the decision surface?

  4. Logit Function • Odds ratio: P(Y=0 | X) / P(Y=1 | X) • Model the log of the odds ratio as a linear function of the features • P(Y=1 | X) = 1 – P(Y=0 | X); let P be P(Y=0 | X) • ln(odds) = ln(P / (1 – P)) = logit(P) • Assuming the logit is linear: ln(P / (1 – P)) = w0 + w1x1 + w2x2 + … + wnxn • Exponentiate, multiply by (1 – P), and collect like terms: P = e^(w0 + Σi wixi) / (1 + e^(w0 + Σi wixi)) and (1 – P) = 1 / (1 + e^(w0 + Σi wixi)) • Not quite standard (usually these are reversed) – remember for later…
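A small Python sketch of the logit and its inverse under the slide's convention that P stands for P(Y=0 | X) (function names are illustrative):

```python
import numpy as np

def logit(p):
    """Log odds ln(p / (1 - p)), where p = P(Y=0 | X) on the slide."""
    return np.log(p / (1.0 - p))

def p_from_logit(z):
    """Invert the logit: if logit(P) = z = w0 + w1*x1 + ... + wn*xn,
    then P = e^z / (1 + e^z) and (1 - P) = 1 / (1 + e^z)."""
    return np.exp(z) / (1.0 + np.exp(z))
```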

  5. Form of the Probabilities • Thus, we impute a form to P(Y=0 | X) and P(Y=1 | X) • Consider P(Y=0 | X) = e^(w0 + Σi wixi) / (1 + e^(w0 + Σi wixi)) • At one extreme the exponent approaches –∞, so P(Y=0 | X) approaches 0 • At the other it approaches +∞, so P(Y=0 | X) approaches 1 • The transition happens in the middle, as it does for P(Y=1 | X), which is just 1 – P(Y=0 | X)

  6. Class Membership Probability Functions • We can graph P(Y=0 | X) and P(Y=1 | X) against X [figure: the two sigmoid curves for w0 = –5, w1 = 15] • The classification boundary is where the odds ratio is 1, i.e. where the logit is 0: ln(1) = 0
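The curves can be reproduced numerically; a sketch using the slide's weights w0 = –5, w1 = 15 (the range chosen for X is arbitrary):

```python
import numpy as np

w0, w1 = -5.0, 15.0                    # weights from the slide
x = np.linspace(0.0, 1.0, 11)          # illustrative range for X
z = w0 + w1 * x                        # the linear logit
p_y0 = np.exp(z) / (1.0 + np.exp(z))   # P(Y=0 | X), slide's convention
p_y1 = 1.0 - p_y0                      # P(Y=1 | X)

# The boundary is where the odds ratio is 1, i.e. the logit is 0:
print("classification boundary at x =", -w0 / w1)  # 1/3
```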

  7. Modeling Control / Flexibility • The w‘s determine the behavior of the classifier • wi, i = 1, …, n independently control the steepness along each feature • w0 repositions the classification transition • Choose the best w‘s for the training data

  8. What are the Best w‘s for the Training Data? • The training data are assumed to be independent • So we want W ← argmaxW Πl P(Y^l | X^l, W), where Y^l is the class and X^l are the features of the l’th training example • Equivalently, to expose the underlying linearity of example independence, maximize the log of the product: W ← argmaxW Σl ln P(Y^l | X^l, W) • Thus, we want the maximum likelihood estimate of W for the training data • Now Mitchell changes representation; so will we:

  9. Training • Note that argmax is invariant under the representation change • Consider the sum (the conditional log likelihood): l(W) = Σl [ Y^l ln P(Y=1 | X^l, W) + (1 – Y^l) ln P(Y=0 | X^l, W) ] • We wish to maximize this sum over W • There is no closed-form solution, but we can iterate using the gradient
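A Python sketch of this objective, adopting from here on the standard (“reversed”) convention that slide 4 alluded to, P(Y=1 | X) = e^z / (1 + e^z); X is assumed to carry a leading column of 1s so that w0 is handled uniformly:

```python
import numpy as np

def conditional_log_likelihood(W, X, Y):
    """l(W) = sum over examples l of
    Y^l ln P(Y=1 | X^l, W) + (1 - Y^l) ln P(Y=0 | X^l, W)."""
    z = X @ W                        # linear logit, one entry per example
    p1 = 1.0 / (1.0 + np.exp(-z))    # P(Y=1 | X, W) = e^z / (1 + e^z)
    return np.sum(Y * np.log(p1) + (1 - Y) * np.log(1.0 - p1))
```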

  10. Weight Update Rule • Want to maximize l(W) • Let P̂(X) be the probability of Y=1 given X for the current W, so P̂(X) = P(Y=1 | X, W) • Also, to treat w0 consistently, introduce X0 ≡ 1 • Then the gradient components can be written: ∂l(W)/∂wi = Σl x_i^l ( Y^l – P̂(X^l) )
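The gradient components translate directly into code; a sketch under the same conventions as above:

```python
import numpy as np

def gradient(W, X, Y):
    """dl(W)/dw_i = sum_l x_i^l (Y^l - P_hat(X^l)),
    where P_hat(X) = P(Y=1 | X, W) for the current W."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ W)))  # P(Y=1 | X, W)
    return X.T @ (Y - p_hat)                # one component per weight
```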

  11. Gradient Interpretation • We can view ( Y^l – P̂(X^l) ) as the prediction error • This is multiplied by the feature value x_i^l (this should look familiar) • Weight update rule: wi ← wi + η Σl x_i^l ( Y^l – P̂(X^l) ), where η is a step size / learning rate (this should also look familiar)
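Putting the update rule into a loop gives a minimal batch gradient-ascent sketch (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

def train(X, Y, eta=0.1, iters=1000):
    """Repeated update w_i <- w_i + eta * sum_l x_i^l (Y^l - P_hat(X^l))."""
    W = np.zeros(X.shape[1])                    # X includes the leading 1s column
    for _ in range(iters):
        p_hat = 1.0 / (1.0 + np.exp(-(X @ W)))  # current P(Y=1 | X, W)
        W += eta * (X.T @ (Y - p_hat))          # error times feature value
    return W
```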

  12. Least Mean Square Iterative Algorithm • Recall from the LMS gradient algorithm the squared error E = ½ Σl ( t^l – o^l )² with linear output o = W · X • Partial of the error w.r.t. weight i: ∂E/∂wi = –Σl ( t^l – o^l ) x_i^l • Yielding the weight update rule: wi ← wi + η Σl ( t^l – o^l ) x_i^l
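For comparison, a sketch of a single LMS update, which has the same error-times-feature shape but uses the linear output o = W · x in place of the probability:

```python
import numpy as np

def lms_step(W, x, t, eta=0.05):
    """One LMS step: with E = (1/2)(t - o)^2 and o = W . x,
    dE/dw_i = -(t - o) x_i, so w_i <- w_i + eta (t - o) x_i."""
    o = W @ x
    return W + eta * (t - o) * x
```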

  13. Problems - Overfitting • What if the training data are linearly separable? Maximizing the likelihood then drives the weights (and the steepness of the transition) toward infinity • What if the margin shrinks because of a few data points (or just one)? • What if the data are not quite linearly separable, but only because of a few data points? • Recall SVMs: we would like to prefer a large margin • Prefer less steep slopes • Even if it means misclassifying some points

  14. Regularization • Penalize complexity • What is complexity? The magnitude of W • The optimization problem becomes: W ← argmaxW [ Σl ln P(Y^l | X^l, W) – (λ/2) ||W||² ] • The update rule becomes: wi ← wi + η Σl x_i^l ( Y^l – P̂(X^l) ) – η λ wi
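In code, the penalty appears as a weight-decay term in each step; a sketch (lam is an illustrative regularization strength):

```python
import numpy as np

def regularized_step(W, X, Y, eta=0.1, lam=0.01):
    """One step on sum_l ln P(Y^l | X^l, W) - (lam/2) ||W||^2:
    the usual gradient update minus eta * lam * W."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ W)))   # P(Y=1 | X, W)
    return W + eta * (X.T @ (Y - p_hat)) - eta * lam * W
```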

  15. The Bayes Optimal Classifier • Getting away from generative models • Our first ensemble method! • H is a parameterized hypothesis space • hML is the maximum likelihood hypothesis given some data • Can we do better (higher expected accuracy) than hML? • Yes! We expect hMAP to outperform hML IF… there is an interesting prior P(h) (i.e., not uniform) • Can we do better (higher expected accuracy) than hMAP? • Yes! Bayes Optimal will outperform hMAP IF… (some assumptions)

  16. Bayes Optimal Classifier • Getting another doctor’s second opinion, another’s third opinion… • One doctor is most confident. He is hML • One doctor is most reliable / accurate. She is hMAP • But she may only be a little more trustworthy than the others • What if hMAP says “+” but *all* other h ∈ H say “-”? • If P(hMAP | D) < 0.5, perhaps we should prefer “-” • Think of each hi as casting a weighted vote • Weight each hi by how likely it is to be correct given the training data • Not just by P(h), which is already reflected in hMAP • Rather by P(h | D) • The most reliable joint opinion may contradict hMAP

  17. Bayes Optimal Classifier: Example • Assume a space of 3 hypotheses with posteriors P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3 • Given a new instance x, assume that h1(x) = 1, h2(x) = 0, h3(x) = 0 • In this case, P(f(x) = 1) = 0.4 and P(f(x) = 0) = 0.6, but hMAP(x) = 1 • We want to determine the most probable classification by combining the predictions of all hypotheses • We can weight each by its posterior probability (there are additional lurking assumptions…)
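The example works out as a posterior-weighted vote; a sketch with the slide's numbers:

```python
import numpy as np

posteriors = np.array([0.4, 0.3, 0.3])   # P(h1|D), P(h2|D), P(h3|D)
predictions = np.array([1, 0, 0])        # h1(x), h2(x), h3(x)

p1 = posteriors[predictions == 1].sum()  # P(f(x) = 1) = 0.4
p0 = posteriors[predictions == 0].sum()  # P(f(x) = 0) = 0.6
print("Bayes optimal class:", 1 if p1 > p0 else 0)  # prints 0, though hMAP(x) = 1
```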
