Classification: Linear Models


Presentation Transcript

  1. Classification: Linear Models Oliver Schulte Machine Learning 726

  2. Linear Classification Models • General Idea: • Learn a continuous linear function y of continuous features x. • Classify as positive if y crosses a threshold, typically 0. • As in linear regression, we can use more complicated features defined by basis functions ϕ.

  3. Example: Classifying Digits • Classify input vector as “4” vs. “not 4”. • Represent the input image as a vector x with 28 × 28 = 784 numbers. • Target t = 1 for “positive”, -1 for “negative”. • Given a training set $(x_1, t_1), \ldots, (x_N, t_N)$, the problem is to find a good linear function y(x). • $y: \mathbb{R}^{784} \to \mathbb{R}$. • Classify x as positive if y(x) > 0, negative otherwise.

  4. Other Examples • Will the person vote conservative, given age, income, previous votes? • Is the patient at risk of diabetes given body mass, age, blood test measurements? • Predict earthquake vs. nuclear explosion given body wave magnitude and surface wave magnitude. [Figure labels: age, income, votes Conservative; surface wave magnitude, body wave magnitude, disaster type.]

  5. Linear Separation • x1 = surface wave magnitude, x2 = body wave magnitude. • White = earthquake, black = nuclear explosion. • Figure: Russell and Norvig, Figure 18.15.

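  6. Linear Discriminants • Simple linear model: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$. • Can drop the explicit $w_0$ if we assume a fixed dummy bias feature ($x_0 = 1$). • The decision surface is a line (in general, a hyperplane) orthogonal to w. • In 2-D, just try a line between the classes!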
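
A minimal sketch of the linear discriminant on this slide, assuming the standard form y(x) = w · x + w0 and the rule from the digit example (positive if y(x) > 0, negative otherwise). The function names and the weight values are hypothetical, chosen only for illustration.

```python
def linear_output(w, w0, x):
    """Return y(x) = w . x + w0 for a weight list w, bias w0, and feature list x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, w0, x):
    """Return +1 if y(x) > 0, otherwise -1."""
    return 1 if linear_output(w, w0, x) > 0 else -1


# Hypothetical weights: the decision boundary is the line x1 + x2 = 3,
# which is orthogonal to w = (1, 1).
w, w0 = [1.0, 1.0], -3.0

print(classify(w, w0, [2.5, 2.0]))  # y = 1.5, so the label is 1
print(classify(w, w0, [0.5, 1.0]))  # y = -1.5, so the label is -1
```

Points with y(x) = 0 lie exactly on the boundary; this sketch assigns them to the negative class, matching the "negative otherwise" rule.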

  7. Perceptron Learning

  8. Defining an Error Function • General idea: • Encode class label using a real number t. • e.g., “positive” = 1, “negative” = 0 or “negative” = -1. • Measure error by comparing continuous linear output y and class label code t.

  9. The Error Function for Linear Discriminants • Could use squared error as in linear regression. • Various problems (see book), basically because 1 and -1 are class codes, not real target values. • A different criterion was developed for learning perceptrons. • Perceptrons are a precursor to neural nets. • Analog implementation by Rosenblatt in the 1950s, see Figure 4.8.

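  10. The Perceptron Criterion • An example $(\mathbf{x}_n, t_n)$ is misclassified if $\mathbf{w}^T \mathbf{x}_n t_n < 0$. • (Take a moment to verify this.) • Perceptron Error: $E_P(\mathbf{w}) = -\sum_{n \in M} \mathbf{w}^T \mathbf{x}_n t_n$, where M is the set of misclassified inputs, the mistakes. • Exercise: find the gradient of the error function for a single input $\mathbf{x}_n$.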
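
A small sketch of the perceptron error, assuming the standard criterion E_P(w) = -Σ_{n ∈ M} w · x_n t_n with targets coded as +1 / -1 and the bias folded in as a leading dummy feature of 1. The helper names and the toy data are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_error(w, data):
    """Sum of -(w . x_n) * t_n over the misclassified examples only."""
    return sum(-dot(w, x) * t for x, t in data if dot(w, x) * t < 0)


# Each x starts with the dummy feature 1.0; targets are +1 or -1.
data = [
    ([1.0, 2.0, 1.0], 1),     # w . x = 3, correctly classified
    ([1.0, -1.0, -1.0], -1),  # w . x = -2, correctly classified
    ([1.0, 1.0, -2.0], 1),    # w . x = -1, misclassified
]
w = [0.0, 1.0, 1.0]

print(perceptron_error(w, data))  # 1.0, contributed by the third example
```

Correctly classified examples contribute nothing, so the error is zero exactly when no example satisfies w · x_n t_n < 0.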

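  11. Perceptron Learning Algorithm • Use stochastic gradient descent: • gradient descent for one example at a time, cycling through the examples. • Update Equation: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \mathbf{x}_n t_n$ for a misclassified example, where we set η = 1 (without loss of generality in this case). • Excel Demo.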
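
The algorithm can be sketched as follows, assuming the standard update w ← w + η x_n t_n with η = 1, targets in {+1, -1}, and a leading dummy feature of 1 for the bias. Two choices here are assumptions of this sketch, not part of the slides: weights start at zero, and an example with w · x_n t_n = 0 is treated as a mistake so that the zero start makes progress. The function name, the epoch cap, and the AND-style data set are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(data, max_epochs=100):
    """Cycle through the data, updating on each mistake.

    Returns (w, converged); converged is True once a full pass makes no mistakes.
    """
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            if dot(w, x) * t <= 0:  # misclassified, or exactly on the boundary
                w = [wi + xi * t for wi, xi in zip(w, x)]  # w <- w + x_n * t_n
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False


# Linearly separable toy data: positive only when both inputs are 1.
and_data = [
    ([1.0, 0.0, 0.0], -1),
    ([1.0, 0.0, 1.0], -1),
    ([1.0, 1.0, 0.0], -1),
    ([1.0, 1.0, 1.0], 1),
]

w, converged = train_perceptron(and_data)
print(converged)  # True: the data are separable, so a mistake-free pass is reached
print(all(dot(w, x) * t > 0 for x, t in and_data))  # True
```

The epoch cap matters in practice: on data that are not linearly separable the loop would otherwise never terminate.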

  12. Perceptron Demo

  13. Perceptron Learning Analysis • Theorem: If the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them. • Convergence can be slow. • The result is sensitive to initialization.

  14. Nonseparability • Linear discriminants can solve problems only if the classes can be separated by a line (hyperplane). • The canonical example of a non-separable problem is XOR. • On non-separable data, the perceptron algorithm typically does not converge.
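
To make the XOR claim concrete, here is a brute-force sketch that tries every linear discriminant y = w0 + w1 x1 + w2 x2 on a coarse grid of weights and records the best number of XOR points classified correctly. The grid range and step are arbitrary assumptions; a grid search only illustrates the claim and does not prove it.

```python
from itertools import product

# XOR with targets coded as +1 (inputs differ) and -1 (inputs equal).
xor_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]


def num_correct(w0, w1, w2, data):
    """Count examples on the correct side of the line w0 + w1*x1 + w2*x2 = 0."""
    return sum(1 for (x1, x2), t in data if (w0 + w1 * x1 + w2 * x2) * t > 0)


# Weights from -5.0 to 5.0 in steps of 0.5.
grid = [k / 2 for k in range(-10, 11)]
best = max(num_correct(w0, w1, w2, xor_data)
           for w0, w1, w2 in product(grid, repeat=3))

print(best)  # 3: no line on the grid gets all four points right
```

For example, w0 = -0.5, w1 = w2 = 1 classifies three of the four points correctly and misses (1, 1).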

  15. Nonseparability: Real-World Example • x1 = surface wave magnitude, x2 = body wave magnitude. • White = earthquake, black = nuclear explosion. • Figure: Russell and Norvig, Figure 18.15 b.

  16. Responses to Nonseparability • Problem: classes cannot be separated by a linear discriminant. • Responses (diagram labels): use non-linear activation function; finds approximate solution; separate classes not completely but “well”; add hidden features. • Methods (diagram labels): Fisher discriminant (not covered), logistic regression, neural network, support vector machine.

  17. Logistic Regression

  18. From Values to Probabilities • Key idea: instead of predicting a class label, predict the probability of a class label. • E.g., p+ = P(class is positive | features), p- = P(class is negative | features). • Naturally a continuous quantity. • How to turn a real number y into a probability p+?

  19. The Logistic Sigmoid Function • Definition: $\sigma(y) = \frac{1}{1 + \exp(-y)}$. • Squeezes the real line into the interval (0, 1). • Differentiable: $\frac{d\sigma}{dy} = \sigma(1 - \sigma)$ (nice exercise).
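
A quick numerical sketch of the sigmoid and its derivative, assuming the standard definition σ(y) = 1 / (1 + exp(-y)) and the identity dσ/dy = σ(1 - σ). The finite-difference check, the step size, and the test point are illustrative choices, not part of the slides.

```python
import math


def sigmoid(y):
    """Logistic sigmoid; fine for moderate y (math.exp overflows for very negative y)."""
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_grad(y):
    """Derivative of the sigmoid, using sigma * (1 - sigma)."""
    s = sigmoid(y)
    return s * (1.0 - s)


print(sigmoid(0.0))  # 0.5

# Compare the analytic derivative with a central finite difference.
y, h = 0.7, 1e-6
numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
print(abs(numeric - sigmoid_grad(y)) < 1e-6)  # True
```

The derivative peaks at y = 0, where it equals 0.25, and shrinks toward zero as |y| grows, which is the soft-threshold shape.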

  20. Soft Threshold Interpretation • As y increases above 0, σ(y) goes to 1 very quickly. • As y decreases below 0, σ(y) goes to 0 very quickly. • Figure: Russell and Norvig, Figure 18.17.

  21. Probabilistic Interpretation • The sigmoid can be interpreted in terms of the class odds p+/(1-p+). • Exercise: Show the following implication for the class odds: if $p_+ = \sigma(y)$, then $p_+/(1 - p_+) = \exp(y)$. • Therefore y is the log class odds: $y = \ln\big(p_+/(1 - p_+)\big)$.
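
One way to work the class-odds exercise, assuming the standard sigmoid definition σ(y) = 1 / (1 + exp(-y)) and p+ = σ(y):

```latex
\begin{aligned}
1 - p_+ &= 1 - \frac{1}{1 + e^{-y}} = \frac{e^{-y}}{1 + e^{-y}}, \\
\frac{p_+}{1 - p_+} &= \frac{1}{1 + e^{-y}} \cdot \frac{1 + e^{-y}}{e^{-y}} = e^{y}, \\
\ln \frac{p_+}{1 - p_+} &= y .
\end{aligned}
```

So the real-valued output y is exactly the log of the class odds, and y = 0 corresponds to even odds, p+ = 1/2.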

  22. Logistic Regression • In logistic regression, the log class odds are a linear function of the input features: $\ln\big(p_+/(1 - p_+)\big) = \mathbf{w}^T \mathbf{x} + w_0$. • Recall that we got the same kind of expression for the naive Bayes classifier. • Learning logistic regression is conceptually similar to linear regression.

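  23. Logistic Regression: Maximum Likelihood • Notation: the probability that the n-th input example is positive = $p_n = \sigma(\mathbf{w}^T \mathbf{x}_n)$, which depends on a weight vector w. • Positive example has $t_n = 1$, negative $t_n = 0$. • Then the likelihood assigned to N independent training data is $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} p_n^{t_n} (1 - p_n)^{1 - t_n}$. • The cross-entropy error is the negative log-likelihood: $E(\mathbf{w}) = -\sum_{n=1}^{N} \big[ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \big]$. • Minimizing it is equivalent to minimizing the KL divergence between the predicted class probabilities and the observed class frequencies.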
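
A minimal sketch of the cross-entropy error, assuming the standard negative log-likelihood E(w) = -Σ [t_n ln p_n + (1 - t_n) ln(1 - p_n)] with p_n = σ(w · x_n), targets in {0, 1}, and a leading dummy feature of 1 for the bias. The symbol p_n, the function names, and the toy data are assumptions of this sketch.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def cross_entropy(w, data):
    """Negative log-likelihood of the 0/1 targets under p_n = sigmoid(w . x_n)."""
    total = 0.0
    for x, t in data:
        p = sigmoid(dot(w, x))
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


# Each x starts with the dummy feature 1.0; targets are 0 or 1.
data = [([1.0, -1.0], 0), ([1.0, 0.5], 1), ([1.0, 1.0], 0), ([1.0, 2.0], 1)]

# With all-zero weights every p_n is 0.5, so the error is N * ln 2.
zero_error = cross_entropy([0.0, 0.0], data)
print(abs(zero_error - len(data) * math.log(2)) < 1e-12)  # True

# A positive slope fits this data better than the all-zero weights.
print(cross_entropy([0.0, 1.0], data) < zero_error)  # True
```

The sketch does not guard against p_n reaching exactly 0 or 1, where the logarithm is undefined; that only happens for very large |w · x_n|.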

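  24. Gradient Search • Exercise (on assignment): Using the cross-entropy error, show that $\nabla E(\mathbf{w}) = \sum_{n=1}^{N} \big(\sigma(\mathbf{w}^T \mathbf{x}_n) - t_n\big) \mathbf{x}_n$. • Hint: recall that $\frac{d\sigma}{dy} = \sigma(1 - \sigma)$. • No closed-form minimum, since $\sigma(\mathbf{w}^T \mathbf{x}_n)$ is a non-linear function of the input features. • Can use gradient descent. • Better approach: use Iteratively Reweighted Least Squares (IRLS). See assignment.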
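
A sketch of plain batch gradient descent for logistic regression, assuming the standard gradient ∇E(w) = Σ (σ(w · x_n) - t_n) x_n, targets in {0, 1}, and a leading dummy feature of 1. The step size, iteration count, zero initialization, and toy data are arbitrary assumptions; this is not the IRLS method the slide recommends.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def cross_entropy(w, data):
    """Negative log-likelihood of the 0/1 targets under sigmoid(w . x_n)."""
    return -sum(t * math.log(sigmoid(dot(w, x)))
                + (1 - t) * math.log(1.0 - sigmoid(dot(w, x)))
                for x, t in data)


def gradient(w, data):
    """Sum over examples of (sigmoid(w . x_n) - t_n) * x_n."""
    grad = [0.0] * len(w)
    for x, t in data:
        err = sigmoid(dot(w, x)) - t
        grad = [g + err * xi for g, xi in zip(grad, x)]
    return grad


def fit(data, eta=0.1, steps=1000):
    """Batch gradient descent from zero weights with a fixed step size eta."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        w = [wi - eta * gi for wi, gi in zip(w, gradient(w, data))]
    return w


# Non-separable 1-D data: the classes overlap around x = 0.5 .. 1.0.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 0.5], 1),
        ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]

w_fit = fit(data)
print(cross_entropy(w_fit, data) < cross_entropy([0.0, 0.0], data))  # True
```

Because the classes overlap, the fitted weights stay finite; on separable data the error can be pushed toward zero only by letting the weights grow without bound.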

  25. Example: logistic regression model learned on non-separable data. Figure: Russell and Norvig, Figure 18.17.

  26. Logistic Regression With Basis Functions • Figure: Bishop, Figure 4.12.

  27. Multi-Class Example • Logistic regression can be extended to multiple classes. • Here’s a picture of what decision boundaries can look like.
