Discriminative classifiers
Presentation Transcript
Discriminative Classifiers

  • Model the classification / decision surface directly (rather than modeling class membership and deriving the decision)

  • LTU (linear threshold unit, also “perceptron”)

  • LMS (least mean square) algorithm

  • Fisher discriminant

  • SVMs

  • And now: Logistic Regression (http://www.cs.cmu.edu/%7Etom/NewChapters.html)

Logistic Regression

  • Assume binary classification Y with Pr(Y | X) monotonic in the features X. Is the subject likely to:

    • Suffer a heart attack within 1 yr? Given: Number of previous heart attacks

    • Be over six feet tall? Given: Gender and heights of parents

    • Receive an A in CS446? Given: Grade in CS273

  • Multivariate and ordinal responses are also possible

  • Characterize whether Y=0 or Y=1 is more likely given X

  • Odds ratio characterizes a possible decision surface

  • Assign Y = 0 if P(Y=0 | X) / P(Y=1 | X) > 1; assign Y = 1 otherwise
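A minimal sketch of this decision rule in Python (the class-membership probabilities are taken as given; the function name is illustrative, not from the slides):

```python
def classify_by_odds(p_y0, p_y1):
    """Assign Y = 0 when the odds ratio P(Y=0|X) / P(Y=1|X) exceeds 1, otherwise Y = 1."""
    return 0 if p_y0 / p_y1 > 1 else 1

# Example: P(Y=0|X) = 0.7 and P(Y=1|X) = 0.3 give an odds ratio of about 2.33, so predict Y = 0.
print(classify_by_odds(0.7, 0.3))  # -> 0
print(classify_by_odds(0.2, 0.8))  # -> 1
```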



    Logistic Regression

    • Relative class proportion changes with X

    • For one-dimensional X, Y might look like: [figure: binary labels Y (0 and 1) plotted against a one-dimensional feature X]

    • How to model the decision surface?

    Logit Function

    • Odds ratio: P(Y=0 | X) / P(Y=1 | X)

    • Model the log of the odds ratio as a linear function of the features

    • P(Y=1 | X) = 1 – P(Y=0 | X); Let P be P(Y=0 | X)

    • ln(odds) = ln(P / (1 – P)) = logit(P)

    • Assuming the logit is linear: ln(P / (1 – P)) = w0 + w1x1 + w2x2 +…+ wnxn

    • Exponentiate, multiply by (1 – P), and collect like terms:

      P = P(Y=0 | X) = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn))

      and (1 – P) = P(Y=1 | X) = 1 / (1 + exp(w0 + w1x1 + … + wnxn))

    Not quite standard (usually these are reversed) – remember for later…
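As a sketch, these two forms (in the slide's orientation, with the exponential attached to Y = 0) can be coded directly; the helper names below are illustrative:

```python
import math

def p_y0_given_x(w, x):
    """P(Y=0 | X) = exp(w0 + sum_i wi*xi) / (1 + exp(w0 + sum_i wi*xi)).
    Here w = [w0, w1, ..., wn] and x = [x1, ..., xn]."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return math.exp(z) / (1.0 + math.exp(z))

def p_y1_given_x(w, x):
    """P(Y=1 | X) = 1 - P(Y=0 | X) = 1 / (1 + exp(w0 + sum_i wi*xi))."""
    return 1.0 - p_y0_given_x(w, x)

# The log-odds (logit) recovers the linear function of the features:
w, x = [-5.0, 15.0], [0.5]
p0 = p_y0_given_x(w, x)
print(math.log(p0 / (1.0 - p0)))  # -> 2.5, i.e. w0 + w1*x = -5 + 15*0.5
```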

    Form of the Probabilities

    • Thus, we impute a form to P(Y=0 | X) and P(Y=1 | X)

    • Consider P(Y=0 | X) = exp(w0 + w1x1 + … + wnxn) / (1 + exp(w0 + w1x1 + … + wnxn)):

    • At one extreme the exponent w0 + w1x1 + … + wnxn approaches -∞

    • P(Y=0 | X) approaches 0

    • At the other it approaches +∞

    • P(Y=0 | X) approaches 1

    • Transitions in the middle, as does P(Y=1 | X), which is just 1 – P(Y=0 | X)



    Class Membership Probability Functions

    • We can graph P(Y=0 | X) and P(Y=1 | X)

      [Figure: P(Y=0 | X) and P(Y=1 | X) plotted against X, with the binary data points, for w0 = -5, w1 = 15. The classification boundary is where the odds ratio is 1, i.e. the logit is 0, since ln(1) = 0.]
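A quick numeric check of that boundary, assuming the same probability form as above (illustrative code, not from the slides):

```python
import math

def p_y0(w0, w1, x):
    """P(Y=0 | X=x) = exp(w0 + w1*x) / (1 + exp(w0 + w1*x)) for a single feature."""
    z = w0 + w1 * x
    return math.exp(z) / (1.0 + math.exp(z))

w0, w1 = -5.0, 15.0
boundary = -w0 / w1            # the logit w0 + w1*x is 0 here, so the odds ratio is 1
print(boundary)                # -> 0.333...
print(p_y0(w0, w1, boundary))  # -> 0.5: both classes equally likely at the boundary
print(p_y0(w0, w1, 0.0))       # -> ~0.007: well inside the Y=1 region
print(p_y0(w0, w1, 1.0))       # -> ~0.99995: well inside the Y=0 region
```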

    Modeling Control / Flexibility

    • The w‘s determine the behavior of the classifier

    • wi, i = 1, …, n independently control the steepness for each feature

    • w0 repositions the classification transition

    • Choose the best w‘s for the training data
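A small illustration, under the same assumed probability form, of how w1 sets the steepness and w0 (through -w0/w1) positions the transition:

```python
import math

def p_y0(w0, w1, x):
    """P(Y=0 | X=x) under the slide's form: exp(w0 + w1*x) / (1 + exp(w0 + w1*x))."""
    z = w0 + w1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Same boundary x = -w0/w1 = 0.5, but a larger w1 makes the transition steeper:
for w0, w1 in [(-1.0, 2.0), (-5.0, 10.0)]:
    print([round(p_y0(w0, w1, x), 3) for x in (0.4, 0.5, 0.6)])
# Changing w0 alone (w1 fixed at 2) repositions the transition: boundaries at 0.5 and 1.0.
print(-(-1.0) / 2.0, -(-2.0) / 2.0)
```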

    What are the Best w‘s for the Training Data?

    • The training data are assumed to be independent

    • So we want W ← argmaxW ∏l P(Yl | Xl, W), where Yl is the class and Xl are the features of the l-th training example

    • Equivalently, to expose the underlying linearity coming from example independence: W ← argmaxW Σl ln P(Yl | Xl, W)

    • Thus, we want the maximum likelihood estimation of w for the training data

    • Now Mitchell changes representation; so will we:

    Training

    • Note that the argmax is invariant under the representation change

    • Consider the sum l(W) = Σl [ Yl ln P(Yl=1 | Xl, W) + (1 – Yl) ln P(Yl=0 | Xl, W) ]

    • We wish to maximize this sum over W

    • There is no closed-form solution, but we can iterate using the gradient
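A minimal sketch of evaluating this sum for a candidate W. The conventional orientation P(Y=1 | X) = 1 / (1 + exp(-(w0 + Σi wixi))) is assumed here, which is the reverse of the earlier slide's form; the data and weight vectors are illustrative:

```python
import math

def log_likelihood(w, X, Y):
    """l(W) = sum_l [ Yl * ln p1 + (1 - Yl) * ln (1 - p1) ],
    where p1 = P(Y=1 | x, w) = 1 / (1 + exp(-(w0 + sum_i wi * xi)))."""
    total = 0.0
    for x, y in zip(X, Y):
        z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return total

# There is no closed-form maximizer, but candidate weight vectors can be compared directly.
X, Y = [[0.1], [0.3], [0.7], [0.9]], [0, 0, 1, 1]
print(log_likelihood([-5.0, 10.0], X, Y))  # ~ -0.29 (good fit)
print(log_likelihood([0.0, 0.0], X, Y))    # 4 * ln(0.5) ~ -2.77 (chance level)
```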

    Weight Update Rule

    • Want to maximize l(W), the conditional log likelihood from the previous slide

    • Let P̂(Y=1 | X, W) be the probability of Y=1 given X for the current W, so P̂(Y=0 | X, W) = 1 – P̂(Y=1 | X, W)

    • Also, to treat w0 consistently, introduce X0 = 1

    • Then the gradient components can be written ∂l(W)/∂wi = Σl Xil (Yl – P̂(Yl=1 | Xl, W))
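A sketch of batch gradient ascent built from these components, again assuming the conventional orientation for P̂(Y=1 | X, W) so that the update ascends l(W); the step size and iteration count are illustrative:

```python
import math

def p1(w, x):
    """P-hat(Y=1 | x, w) = 1 / (1 + exp(-(w0 + sum_i wi * xi))), with X0 = 1 folded into w[0]."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(X, Y, eta=0.5, iters=1000):
    """Batch ascent: wi <- wi + eta * sum_l Xil * (Yl - P-hat(Yl=1 | Xl, W))."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, y in zip(X, Y):
            err = y - p1(w, x)              # the prediction error (Y - P-hat)
            grad[0] += err                  # component for the constant feature X0 = 1
            for i, xi in enumerate(x, start=1):
                grad[i] += xi * err
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w

# Tiny one-feature example; eta and iters are illustrative choices.
X, Y = [[0.1], [0.3], [0.7], [0.9]], [0, 0, 1, 1]
w = gradient_ascent(X, Y)
print(w, [round(p1(w, x), 3) for x in X])
```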

    Gradient Interpretation

    • We can view (Yl – P̂(Yl=1 | Xl, W)) as the prediction error

    • This is multiplied by the feature value Xil (this should look familiar)

    • Weight update rule: wi ← wi + η Σl Xil (Yl – P̂(Yl=1 | Xl, W)), where η is a step size / learning rate (this should also look familiar)

    Least Mean Square Iterative Algorithm

    • Recall the squared error from the LMS gradient algorithm: Err(W) = ½ Σl (Yl – W · Xl)²

    • Partial of the error w.r.t. weight i: ∂Err/∂wi = –Σl Xil (Yl – W · Xl)

    • Yielding the weight update rule: wi ← wi + η Σl Xil (Yl – W · Xl)
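For comparison, a minimal sketch of the LMS update, where the predictor is the linear score W · X rather than a probability (data and step size are illustrative):

```python
def lms_step(w, X, Y, eta=0.1):
    """One batch LMS step: wi <- wi + eta * sum_l Xil * (Yl - W . Xl), with X0 = 1."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        pred = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        err = y - pred                      # same (target - prediction) error shape as above
        grad[0] += err
        for i, xi in enumerate(x, start=1):
            grad[i] += xi * err
    return [wi + eta * gi for wi, gi in zip(w, grad)]

X, Y = [[0.1], [0.3], [0.7], [0.9]], [0, 0, 1, 1]
w = [0.0, 0.0]
for _ in range(200):
    w = lms_step(w, X, Y)
print(w)  # approaches the least-squares fit, roughly [-0.25, 1.5]
```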

    Problems: Overfitting

    • What if the training data are linearly separable?

    • What if the margin shrinks due to just a few data points (or just one)?

    • What if the data are not quite linearly separable, but only because of a few data points?

    • Recall SVMs

    • We would like to prefer a large margin

    • Prefer less steep slopes

    • Even if it means misclassifying some points

    Regularization

    • Penalize complexity

    • What is complexity? The magnitude of W

    • Optimization problem becomes W ← argmaxW Σl ln P(Yl | Xl, W) – (λ/2) ‖W‖²

    • Update rule becomes wi ← wi + η Σl Xil (Yl – P̂(Yl=1 | Xl, W)) – η λ wi
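A sketch of the regularized update, assuming an L2 penalty (λ/2)‖W‖² subtracted from the log likelihood; λ and η below are illustrative:

```python
import math

def p1(w, x):
    """P-hat(Y=1 | x, w) = 1 / (1 + exp(-(w0 + sum_i wi * xi)))."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def regularized_step(w, X, Y, eta=0.5, lam=0.1):
    """wi <- wi + eta * sum_l Xil * (Yl - P-hat) - eta * lam * wi (penalize large weights)."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = y - p1(w, x)
        grad[0] += err
        for i, xi in enumerate(x, start=1):
            grad[i] += xi * err
    return [wi + eta * gi - eta * lam * wi for wi, gi in zip(w, grad)]

# On linearly separable data the unregularized weights can grow without bound;
# the penalty keeps them finite (a flatter, less overconfident fit).
X, Y = [[0.1], [0.3], [0.7], [0.9]], [0, 0, 1, 1]
w = [0.0, 0.0]
for _ in range(2000):
    w = regularized_step(w, X, Y)
print(w)
```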

    The Bayes Optimal Classifier
    Getting away from generative models
    Our first ensemble method!

    • H is a parameterized hypothesis space

    • hML is the maximum likelihood hypothesis given some data.

    • Can we do better (higher expected accuracy) than hML?

    • Yes! We expect hMAP to outperform hML IF…

    • There is an interesting prior P(h) (i.e., not uniform)

    • Can we do better (higher expected accuracy) than hMAP?

    • Yes! Bayes Optimal will outperform hMAP IF…(some assumptions)

    Bayes Optimal Classifier

    Getting a second opinion from another doctor, then a third opinion from yet another…

    One doctor is most confident. He is hML

    One doctor is most reliable / accurate. She is hMAP

    But she may only be a little more trustworthy than the others.

    What if hMAP says "+" but *all* other h ∈ H say "-"?

    • If P(hMAP |D) < 0.5, perhaps we should prefer “-”

    • Think of each hi as casting a weighted vote

    • Weight each hi by how likely it is to be correct given the training data.

      • Not just by P(h) which is already reflected in hMAP

      • Rather by P(h|D)

    • The most reliable joint opinion may contradict hMAP

    Bayes Optimal Classifier: Example

    • Assume a space of 3 hypotheses with posterior probabilities P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3

    • Given a new instance, assume that

      h1(x) = 1, h2(x) = 0, h3(x) = 0

    • In this case,

      P(f(x) = 1) = 0.4, P(f(x) = 0) = 0.6, but hMAP(x) = 1

    • We want to determine the most probable classification by combining the prediction of all hypotheses

    • We can weight each by its posterior probability (there are additional lurking assumptions…)
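A sketch of the weighted vote for this example, using the posteriors assumed above (0.4, 0.3, 0.3):

```python
def bayes_optimal_vote(predictions, posteriors, labels=(0, 1)):
    """Return the label maximizing the sum over hypotheses of P(h | D) for h predicting that label."""
    scores = {label: 0.0 for label in labels}
    for pred, post in zip(predictions, posteriors):
        scores[pred] += post
    return max(scores, key=scores.get), scores

# h1(x) = 1, h2(x) = 0, h3(x) = 0 with posteriors 0.4, 0.3, 0.3:
label, scores = bayes_optimal_vote([1, 0, 0], [0.4, 0.3, 0.3])
print(scores)  # {0: 0.6, 1: 0.4}
print(label)   # 0, even though h_MAP (= h1) predicts 1
```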
