
Part IV Perceptron Learning



  1. Part IV Perceptron Learning NASA Space Program

  2. Outline • Linear separability • Linear classifiers and linear decision boundaries • The perceptron classifier • Perceptron learning algorithm • basic principles • pseudocode

  3. Decision Boundaries • What is a Classifier? • A classifier is a mapping from feature space (a d-dimensional vector) to the class labels {1, 2, … m} • Thus, a classifier partitions the feature space into m decision regions • The line or curve separating the 2 classes is the decision boundary • in more than 2 dimensions this is a surface (e.g., a hyperplane) • Linear Classifiers • a linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane) • it is one of the simplest classifiers we can imagine • “separate the two classes using a straight line in feature space” • in 2 dimensions the decision boundary is a straight line

  4. 2-Class Data with a Linear Decision Boundary

  5. Non-Linearly Separable Data, with Decision Boundary

  6. Convex Hull of a Set of Points • Convex Hull of a set of Q points: • Intuitively • think of each point in Q as a nail sticking out from a 2d board • the convex hull = the shape formed by a tight rubber band that surrounds all the nails • Formally: “the convex hull is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior” • (p.898, Cormen, Leiserson, and Rivest, Introduction to Algorithms) • can be found (for n points) in O(n log n) time

  7. Convex Hull of a Set of Points • Relation to Class Overlap • define convex hulls of data points D1 and D2 as P1 and P2 • if P1 and P2 intersect, then we have overlap • If P1 and P2 intersect then D1 and D2 are not linearly separable
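
One way to check this numerically is sketched below in MATLAB. The two point sets are made up purely for illustration, and the polyshape class used for the polygon intersection is available in recent MATLAB releases.

D1 = [1 2; 2 3; 3 1; 2 2];            % class 1 points, one row per point
D2 = [5 5; 6 7; 7 5; 6 6];            % class 2 points
k1 = convhull(D1(:,1), D1(:,2));      % vertex indices of hull P1 (first index repeated at the end)
k2 = convhull(D2(:,1), D2(:,2));      % vertex indices of hull P2
P1 = polyshape(D1(k1(1:end-1),1), D1(k1(1:end-1),2));   % hull polygons
P2 = polyshape(D2(k2(1:end-1),1), D2(k2(1:end-1),2));
ovl = intersect(P1, P2);              % polygon intersection of the two hulls
if ovl.NumRegions > 0
    disp('hulls intersect => not linearly separable')
else
    disp('hulls do not intersect => linearly separable')
end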

  8. Convex Hull Example

  9. Convex Hull Example [scatter plot of data points in Feature 1 / Feature 2 space]

  10. Convex Hull Example [the same scatter plot with Convex Hull P1 drawn around the points]

  11. Data from 2 Classes: Linearly Separable? [scatter plot of two classes of points in Feature 1 / Feature 2 space]

  12. Data from 2 Classes: Linearly Separable? [the same plot with Convex Hull P1 drawn around the first class]

  13. Data from 2 Classes: Linearly Separable? [plot showing Convex Hull P1 and Convex Hull P2] The 2 hulls intersect => data from each class are not linearly separable

  14. Different data => linearly separable [plot showing Convex Hull P1 and Convex Hull P2, which do not intersect]

  15. Some Theory Let N be the number of data points. Let d be the dimension of the data points. Let F(N, d) = the fraction of dichotomies of N points in d dimensions that are linearly separable. Then, it can be shown that:
F(N, d) = 1, if N < d + 2
F(N, d) = (1 / 2^(N-1)) * Σ_{i=0}^{d} (N-1)! / [ (N-1-i)! i! ], otherwise
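
As a rough check of the formula, the following MATLAB sketch evaluates F(N, d) directly (the helper name frac_separable is just for illustration; it would be saved as frac_separable.m):

function frac = frac_separable(N, d)
% Fraction of dichotomies of N points in d dimensions that are
% linearly separable, using the formula above.
if N < d + 2
    frac = 1;
else
    frac = 2^(1 - N) * sum(arrayfun(@(i) nchoosek(N - 1, i), 0:d));
end
end

For example, frac_separable(3, 2) returns 1 (any labeling of 3 points in the plane is linearly separable), while frac_separable(4, 2) returns (1 + 3 + 3)/8 = 0.875.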

  16. Fraction of Points that are Linearly Separable [plot of F(N,d) versus N/(d+1), with curves for d = 1, d = 2, and d = infinity; the vertical axis runs from 0 to 1 and the horizontal axis from 1 to 3]

  17. A Linear Classifier in 2 Dimensions Let Feature 1 be called X Let Feature 2 be called Y A linear classifier is a linear function of X and Y, i.e., it computes f(X,Y) = aX + bY + c Here a, b, and c are the “weights” of the classifier Define the output of the linear classifier to be T(f) = -1, if f <= 0 T(f) = +1, if f > 0 if f(X,Y) <= 0, the classifier produces a “-1” (Decision Region 1) if f(X,Y) > 0, the classifier produces a “+1” (Decision Region 2)
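
In MATLAB this classifier is only a couple of lines; the weight values below are made up purely for illustration:

a = 1; b = -1; c = 0.5;        % example weights
f = @(X, Y) a*X + b*Y + c;     % the linear function of the two features
T = @(z) 2*(z > 0) - 1;        % threshold: +1 if z > 0, -1 if z <= 0
T(f(2, 1))                     % classify the point (X, Y) = (2, 1): f = 1.5, so the output is +1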

  18. Decision Boundaries for a 2d Linear Classifier Depending on whether f(X,Y) is > or < 0, the features (X,Y) get classified into class 1 or class 2 Thus, f(X,Y) = 0 must define the decision boundary between class 1 and 2, i.e., f(X,Y) = aX + bY + c = 0 Thus, defining a, b, and c automatically locates the decision boundary in X,Y space In summary: - a classifier defines a decision boundary between classes - for a linear classifier, this boundary is a line or a plane - the equation of the plane is defined by the parameters of the classifier

  19. An Example of a Linear Decision Boundary

  20. A Better Linear Decision Boundary

  21. The Perceptron Classifier (for 2 features) [diagram: the inputs X, Y, and a constant input 1 are multiplied by weights w1, w2, w3 and combined into the weighted sum of the inputs, f = w1 X + w2 Y + w3; f then passes through the threshold function T(f), whose output in {-1, +1} is the class decision]

  22. Perceptrons • Perceptron = a linear classifier • The w’s are the weights (denoted a, b, and c earlier) • real-valued constants (can be positive or negative) • Define an additional constant input “1” (allows an intercept in the decision boundary)

  23. Perceptrons • A perceptron calculates 2 quantities: • 1. A weighted sum of the input features • 2. This sum is then thresholded by the T function • A simple artificial model of human neurons • weights = “synapses” • threshold = “neuron firing”

  24. Notation • Inputs: • x0, x1, x2, …, xd-1, xd • x0 = 1 (a constant input) • x1, x2, …, xd-1, xd are the values of the d features • x = ( x0, x1, x2, …, xd-1, xd ) • Weights: • w0, w1, w2, …, wd-1, wd • we have d+1 weights • one for each feature, plus one for the constant • w = ( w0, w1, w2, …, wd-1, wd )

  25. Perceptron Operation • Equations of operation: • o(x1, x2, …, xd) = 1 if w0 x0 + w1 x1 + … + wd xd > 0, and -1 otherwise • Note that: • w = (w0, …, wd) is the “weight vector” (row vector, 1 x (d+1)) • and x = (x0, …, xd) is the “feature vector” (row vector, 1 x (d+1)) • => w0 x0 + w1 x1 + … + wd xd = w . x’ • and w . x’ is the “vector inner product” (w*x’ in MATLAB)
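
For a single example, the weighted sum and thresholded output can be written directly as this inner product; the numbers below are made up for illustration:

w = [0.5 1.0 -2.0];     % 1 x (d+1) weight vector (w0 first)
x = [1.0 3.4 -1.2];     % 1 x (d+1) feature vector (x0 = 1 first)
f = w * x';             % the inner product w . x'
o = 2*(f > 0) - 1;      % thresholded output: +1 here, since f = 6.3 > 0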

  26. Perceptron Decision Boundary • Equations of operation (in vector form): • o(x1, x2, …, xd) = 1 if w . x’ > 0, and -1 otherwise • The perceptron represents a hyperplane decision surface in d-dimensional space, e.g., a line in 2d, a plane in 3d, etc. • The equation of the hyperplane is w . x’ = 0 • This is the equation for points in x-space that are on the boundary

  27. Example of Perceptron Decision Boundary w = (w0, w1, w2) = (0, 1, -1) w . x’ = 0 => (0)(1) + (1)(x1) + (-1)(x2) = 0 => x1 - x2 = 0 => x1 = x2 [plot in (x1, x2) space]

  28. Example of Perceptron Decision Boundary w = (w0, w1, w2) = (0, 1, -1) w . x’ = 0 => (0)(1) + (1)(x1) + (-1)(x2) = 0 => x1 - x2 = 0 => x1 = x2 This is the equation for the decision boundary [plot of the line x1 = x2 in (x1, x2) space]

  29. Example of Perceptron Decision Boundary w = (w0, w1, w2) = (0, 1, -1) w . x’ < 0 => x1 - x2 < 0 => x1 < x2 (this is the equation for decision region -1) [plot showing the boundary w . x’ = 0 and the region where x1 < x2]

  30. Example of Perceptron Decision Boundary w = (w0, w1, w2) = (0, 1, -1) w . x’ > 0 => x1 - x2 > 0 => x1 > x2 (this is the equation for decision region +1) [plot showing the boundary w . x’ = 0 and the region where x1 > x2]

  31. Representational Power of Perceptrons • What mappings can a perceptron represent perfectly? • A perceptron is a linear classifier • thus it can represent any mapping that is linearly separable • some Boolean functions, like AND (on the left) • but not Boolean functions like XOR (on the right) [plots of the four Boolean input points: for AND the two classes are linearly separable, for XOR they are not]
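
For instance, the hand-picked weights below implement AND over inputs in {0, 1} (a minimal sketch; these weight values are one choice among many):

w = [-1.5 1 1];                       % w0 = -1.5, w1 = w2 = 1
X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];     % rows are (x0 = 1, x1, x2)
o = 2*((X * w') > 0) - 1;             % outputs: -1 -1 -1 +1, i.e. AND
% No single weight vector gives XOR: (0,1) and (1,0) cannot be separated
% from (0,0) and (1,1) by one straight line.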

  32. Learning the Classifier Parameters • Where do the parameters (weights) of the classifier come from? • If we know a lot about the problem, we could “design” them • Typically we don’t know ahead of time what the values should be

  33. Learning the Classifier Parameters • Learning from Training Data: • training data = labeled feature vectors • i.e., a set of N feature vectors each with a class label • we can use the training data to try to find good parameters • “good” parameters are ones which provide low error on the training data • Statement of the Learning Problem: • given a classifier, and some training data, find the values for the classifier’s parameters which maximize accuracy

  34. Learning the Weights from Data An Example of a Training Data Set

Example   x1    x2   ...  xd    True Class Label (“target” t)
x(1)      3.4  -1.2  ...  7.1    1
x(2)      4.1  -3.1  ...  4.6   -1
x(3)      5.7  -1.0  ...  6.2   -1
x(4)      2.2   4.1  ...  5.0    1
...
x(n)      1.2   4.3  ...  6.1    1
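
In the N x (d+1) layout used by the assignment functions, and assuming the constant input x0 = 1 is stored as the first column (one common convention, not prescribed by the slides), the first two features of the first four examples above would look like this in MATLAB:

data    = [1  3.4 -1.2;
           1  4.1 -3.1;
           1  5.7 -1.0;
           1  2.2  4.1];      % N x (d+1) matrix, one example per row
targets = [1; -1; -1; 1];     % N x 1 vector of class labels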

  35. Learning as a Search Problem • The objective function: • the accuracy of the classifier (for a given set of weights w and labeled data) • Problem: • maximize this objective function

  36. Learning as a Search Problem • Equivalent to an optimization or search problem • i.e., think of the vector (w1, w2, w3) • this defines a 3-dimensional “parameter space” • we want to find the value of (w1, w2, w3) which maximizes the objective function • So we could use hill-climbing, systematic search, etc., to search this parameter space • many learning algorithms = hill-climbing with random restarts

  37. Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly

  38. Classification Accuracy • Procedure: • accuracy = 0 • For each of the N feature vectors: • calculate the output of the classifier for this vector • if the classifier’s output agrees with the true label • ++ accuracy • continue • Percentage Accuracy = (accuracy/N) * 100%
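
With the data stored as an N x (d+1) matrix (constant column first, as in the earlier sketch) and w as a 1 x (d+1) row vector, this procedure vectorizes to a few lines:

f = data * w';                        % weighted sums, one per example (N x 1)
o = 2*(f > 0) - 1;                    % thresholded outputs in {-1, +1}
accuracy = 100 * mean(o == targets);  % percentage of examples classified correctly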

  39. Training Data and Learning • We have N examples • an example consists of a feature vector and a class label (target) • the ith feature vector is denoted x(i) • Learning on the Training Data • Let o(i) = o( x(i) ) be the output of a perceptron for the ith feature vector x(i) • Let t(i) be the target value • goal is to find perceptron weights such that • o(i) is close to t(i) for as many examples as possible • i.e., the output matches the desired target as often as possible • TrainingAccuracy(w) = (1/N) Σi I( o(i), t(i) ), where I( o(i), t(i) ) = 1 if o(i) = t(i), and 0 otherwise

  40. Mean-Squared Error as a Measure of Accuracy • Consider training an “unthresholded” perceptron • consider the sum without the threshold, f(x) = w . x’ • this is a real number • say we want this real number to match the target as closely as possible, for each input x • Error for the ith example: e_i(w) = ( f(i) - t(i) )^2 • Mean-squared error: E(w) = (1/N) Σi ( f(i) - t(i) )^2 = (1/N) Σi ( w . x’(i) - t(i) )^2 • We try to choose the weights which minimize E • note that f(i) is a function of the weights w • f(i) = w . x’(i) • thus, E is also a function of the weights, i.e., E(w)
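
Under the same data and weight conventions as the sketches above, the mean-squared error of the unthresholded perceptron is:

f   = data * w';               % unthresholded outputs f(i) = w . x'(i)
mse = mean((f - targets).^2);  % E(w) = (1/N) * sum_i ( f(i) - t(i) )^2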

  41. Gradient Descent Learning Rule • Weight update rule: • wj <- wj + h ( t(i) - f(i) ) xj(i) • t(i) and f(i) are the target and weighted sum (respectively) for the ith example • wj is the jth input weight • xj(i) is the jth input feature value, for the ith example • h is called the learning rate: a small positive number, 0 < h < 1 • An example of how this works: • Say wj and xj(i) are both positive: • say t(i) > f(i) => we increase the value of the weight • say t(i) < f(i) => we decrease the value of the weight • h controls how quickly we increase or decrease the weight
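
In vector form, one update for a single example i touches all d+1 weights at once. A minimal sketch, with an illustrative learning rate and the data/weight conventions used earlier:

h  = 0.01;                                    % learning rate
i  = 1;                                       % index of the example being visited
fi = w * data(i, :)';                         % unthresholded output f(i) for example i
w  = w + h * (targets(i) - fi) * data(i, :);  % wj <- wj + h ( t(i) - f(i) ) xj(i), for all j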

  42. Perceptron Training Algorithm • Algorithm outline • initialize the weights (e.g., randomly) • loop through all N examples (this is 1 iteration) • for each example, calculate the output • calculate the difference between the output and the target • update each of the d+1 weights using the gradient update rule • after all N examples have been gone through • check whether the overall error (MSE) has stopped decreasing (i.e., has hardly changed) since the previous iteration • if not, then perform another iteration through all N examples • if so, then stop: the weights have converged

  43. Pseudo-code for Perceptron Training • Inputs: N training examples (each with d features), N targets (class labels), learning rate h • Outputs: a set of learned weights

Initialize each wj (e.g., to a small random value)
While (termination condition not satisfied)
    for i = 1:N
        f(i) = w . x'(i)
        for j = 0:d
            deltawj = h * ( t(i) - f(i) ) * xj(i)
            wj = wj + deltawj
        end
    end
    calculate termination condition
end
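
Put together as runnable MATLAB, the loop might look like the sketch below. The initialization scale, learning rate, and stopping threshold are illustrative choices rather than values prescribed by the slides; data and targets follow the conventions of the earlier sketches.

h = 0.01;                                  % learning rate
w = 0.01 * randn(1, size(data, 2));        % small random initial weights (1 x (d+1))
prev_mse = Inf;
done = false;
while ~done
    for i = 1:size(data, 1)                % one pass over the N examples = one iteration
        fi = w * data(i, :)';              % unthresholded output for example i
        w  = w + h * (targets(i) - fi) * data(i, :);   % gradient update of all weights
    end
    mse = mean((data * w' - targets).^2);  % MSE after this iteration
    done = abs(prev_mse - mse) < 1e-4;     % stop when the MSE hardly changes
    prev_mse = mse;
end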

  44. Comments on Incremental Learning • will converge to a minimum of E under certain assumptions • if the learning rate is small enough • converges even if the data are not linearly separable • not guaranteed to maximize classification accuracy (but usually does well in practice) • typical learning rates: from 0.01 to 0.00001 • can depend on the particular data set • can be adapted (adjusted) automatically during learning • termination conditions: • e.g., when either Accuracy or Mean Squared Error hardly change from one iteration to the next (one iteration of the while loop) • e.g., change in mean squared error < 0.0001

  45. Summary of Perceptron Learning Algorithms • “Error Correcting” Perceptron Training rule: • “error correction”, works with thresholded output • guaranteed to converge to zero error if examples are linearly separable • otherwise no guarantee of convergence • “Gradient Descent” Perceptron Training • Incremental version (this is the version for Assignment 3): • updates the weights after visiting each example: cycles through the examples • one iteration is defined as one pass through all N examples (N weight updates) • Batch version (we are not using this) • updates the weights after going through all the examples

  46. MATLAB Demo of Perceptron Learning • sampledata1: linearly separable • sampledata2: not linearly separable • Illustration of • different learning rates • different initialization methods • different convergence criteria • Visualization of • decision boundaries • MSE and Accuracy as a function of iteration number

  47. Assignment

function [outputs] = perceptron(weights,data)
% function [outputs] = perceptron(weights,data)
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data:    N x (d+1) matrix of training data
%
% Outputs
%   outputs: N x 1 vector of perceptron outputs

-------- Your code goes here -------

  48. Assignment

function [cerror, mse] = perceptron_error(weights,data,targets)
% function [cerror, mse] = perceptron_error(weights,data,targets)
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data:    N x (d+1) matrix of training data
%   targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
%   cerror:  the percentage of examples misclassified (between 0 and 100)
%   mse:     the mean-squared error (sum of squared errors divided by N)

-------- Your code goes here -------
