Download Presentation
## Part IV Perceptron Learning

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Part IVPerceptron Learning**NASA Space Program**Outline**• Linear separability • Linear classifiers and linear decision boundaries • The perceptron classifier • Perceptron learning algorithhm • basic principles • pseudocode**Decision Boundaries**• What is a Classifier? • A classifier is a mapping from feature space (a d-dimensional vector) to the class labels {1, 2, … m} • Thus, a classifier partitions the feature space into m decision regions • The line or curve separating the 2 classes is the decision boundary • in more than 2 dimensions this is a surface (e.g., a hyperplane) • Linear Classifiers • a linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane) • it is one of the simplest classifiers we can imagine • “separate the two classes using a straight line in feature space” • in 2 dimensions the decision boundary is a straight line**Convex Hull of a Set of Points**• Convex Hull of a set of Q points: • Intuitively • think of each point in Q as a nail sticking out from a 2d board • the convex hull = the shape formed by a tight rubber band that surrounds all the nails • Formally: “the convex hull is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior” • (p.898, Cormen, Leiserson, and Rivest, Introduction to Algorithms) • can be found (for n points) in time n log n**Convex Hull of a Set of Points**• Relation to Class Overlap • define convex hulls of data points D1 and D2 as P1 and P2 • if P1 and P2 intersect, then we have overlap • If P1 and P2 intersect then D1 and D2 are not linearly seperable**Convex Hull Example**Feature 2 Feature 1**Convex Hull Example**Convex Hull P1 Feature 2 Feature 1**Data from 2 Classes: Linearly Separable?**Feature 2 x x x x x x x Feature 1**Data from 2 Classes: Linearly Separable?**Convex Hull P1 Feature 2 x x x x x x x Feature 1**Data from 2 Classes: Linearly Separable?**Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x The 2 Hulls intersect => data from each class are not linearly separable x Feature 1**Different data => linearly separable**Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x x Feature 1**Some Theory**Let N be the number of data points Let d be the dimension of the data points Let F(N, d) = the fraction of dichotomies of N points in d dimensions that are linearly separable Then, it can be shown that: = 1 if N < d+2 F(N, d) = (1 / 2 N-1 )Sdi=0 (N-1)! / [ (N-1-i)! i! ] if N > d**Fraction of Points that are Linearly Separable**F(N,d) d = infinity 1 d = 2 0.5 d = 1 0 N/(d+1) 1 2 3**A Linear Classifier in 2 Dimensions**Let Feature 1 be called X Let Feature 2 be called Y A linear classifier is a linear function of X and Y, i.e., it computes f(X,Y) = aX + bY + c Here a, b, and c are the “weights” of the classifier Define the output of the linear classifier to be T(f) = -1, if f <= 0 T(f) = +1, if f > 0 if f(X,Y) <= 0, the classifier produces a “-1” (Decision Region 1) if f(X,Y) > 0, the classifier produces a “+1” (Decision Region 2)**Decision Boundaries for a 2d Linear Classifier**Depending on whether f(X,Y) is > or < 0, the features (X,Y) get classified into class 1 or class 2 Thus, f(X,Y) = 0 must define the decision boundary between class 1 and 2, i.e., f(X,Y) = aX + bY + c = 0 Thus, defining a, b, and c automatically locates the decision boundary in X,Y space In summary: - a classifier defines a decision boundaries between classes - for a linear classifier, this boundary is a line or a plane - the equation of the plane is defined by the parameters of the classifier**The Perceptron Classifier (for 2 features)**X w1 w2 f = w1 X + w2 Y + w3 {-1, +1} Y T(f) w3 Threshold Function Output = class decision Weighted Sum of the inputs 1**Perceptrons**• Perceptron = a linear classifier • The w’s are the weights (denoted as a, b,c, earlier) • real-valued constants (can be positive or negative) • Define an additional constant input “1” (allows an intercept in decision boundary)**Perceptrons**• A perceptron calculates 2 quantities: • 1. A weighted sum of the input features • 2. This sum is then thresholded by the T function • A simple artificial model of human neurons • weights = “synapses” • threshold = “neuron firing”**Notation**• Inputs: • x0, x1, x2, …………, xd-1, xd • x0 = 1 (a constant input) • x1, x2, …………, xd-1, xd are the values of the d features • x = ( x0, x1, x2, …………, xd-1, xd ) • Weights: • w0, w1, w2, …………, wd-1, wd • we have d+1 weights • one for each feature, + one for the constant • w = ( w0, w1, w2, …………, wd-1, wd )**Perceptron Operation**• Equations of operation: • = 1 (if w0x0 + w1x1 +… wd xd > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • Note that: • w = (w0,….. wd), the “weight vector” (row vector, 1 x d+1) • and x = (x0,…… xd), the “feature vector” (row vector, 1 x d+1) • => w0x0 + w1x1 +… wd xd = w . x’ • and w . x’ is the “vector inner product” (w*x’ in MATLAB)**Perceptron Decision Boundary**• Equations of operation (in vector form): • = 1 (if w . x’ > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • The perceptron represents a hyperplane decision surface in d-dimensional space, e.g., a line in 2d, a plane in 3d, etc. • The equation of the hyperplane is w . x’ = 0 • This is the equation for points in x-space that are on the boundary**Example of Perceptron Decision Boundary**w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 x2 x1**Example of Perceptron Decision Boundary**w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 This is the equation for the decision boundary x2 x1**Example of Perceptron Decision Boundary**w = (w0, w1,w2) = (0, 1, -1) w . x’ < 0 => x1 - x2 < 0 => x1 < x2 (this is the equation for decision region -1) w . x’ = 0 x2 x1**Example of Perceptron Decision Boundary**w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 x2 w . X’ < 0 w . x’ > 0 => x1 - x2 > 0 => x1 > x2 (this is the equation for decision region +1) x1**Representational Power of Perceptrons**• What mappings can a perceptron represent perfectly? • A perceptron is a linear classifier • thus it can represent any mapping that is linearly separable • some Boolean functions like AND (on left) • but not Boolean functions like XOR (on right) 0 x 0 x 0 0 x 0**Learning the Classifier Parameters**• Where do the parameters (weights) of the classifier come from? • If we know a lot about the problem, we could “design” them • Typically we don’t know ahead of time what the values should be**Learning the Classifier Parameters**• Learning from Training Data: • training data = labeled feature vectors • i.e., a set of N feature vectors each with a class label • we can use the training data to try to find good parameters • “good” parameters are ones which provide low error on the training data • Statement of the Learning Problem: • given a classifier, and some training data, find the values for the classifier’s parameters which maximize accuracy**Learning the Weights from Data**An Example of a Training Data Set Example x1 x2 …. xd True Class Label (“target” t) x(1)3.4 -1.2 ….. 7.1 1 x(2)4.1 -3.1 ….. 4.6 -1 x(3)5.7 -1.0 ….. 6.2 -1 x(4)2.2 4.1 ….. 5.0 1 x(n)1.2 4.3 ….. 6.1 1**Learning as a Search Problem**• The objective function: • the accuracy of the classifier (for a given set of weights w and labeled data) • Problem: • maximize this objective function**Learning as a Search Problem**• Equivalent to an optimization or search problem • i.e., think of the vector (w1, w2, w3) • this defines a 3-dimensional “parameter space” • we want to find the value of (w1, w2, w3) which maximizes the objective function • So we could use hill-climbing, systematic search, etc., to search this parameter space • many learning algorithms = hill-climbing with random restarts**Classification Accuracy**• Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly**Classification Accuracy**• Procedure: • accuracy = 0 • For each of the N feature vectors: • calculate the output of the classifier for this vector • if the class label agrees with the true label • ++ accuracy • continue • Percentage Accuracy = (accuracy/N) * 100%**Training Data and Learning**• We have N examples • an example consists of a feature vector and a class label (target) • the ith feature vector is denoted x(i) • Learning on the Training Data • Let o(i) = o( x(i) ) be the output of a perceptron for the ith feature vector x(i) • Let t(i) be the target value • goal is to find perceptron weights such that • o(i) is close to t(i) for as many examples as possible • i.e., the output matches the desired target as often as possible TrainingAccuracy(w) = 1/N S I( o(i), t(i) ) where I( o(i), t(i) ) = 1 if o(i) = t(i), and 0 otherwise**Mean-Squared Error as a Measure of Accuracy**• Consider training an “unthresholded” perceptron, • consider the sum without the threshold, f(x) = w . x’ • this is a real number • saw we want this real number to match the target as closely as possible, for each input x • Error for ith example = ei (w) = ( f(i) - t(i) )2 • Mean-squared error, E(w) = (1/N) Si ( f(i) - t(i) )2= (1/N) Si (w .x’(i) - t(i) )2 • We try to choose the weights which minimize E • note that f(i) is a function of the weights w • f(i) = w . x’(i) • thus, E is also a function the weights, i.e., E(w)**Gradient Descent Learning Rule**• Weight update rule: • wj <- wj + h ( t(i) - f(i) ) xj(i) • t(i) and f(i) are the target and weighted sum (respectively) for the ith example • wj is the jth input weight • xj(i) is the jth input feature value, for the ith example • h is called the learning rate: a small positive number, 0 < h < 1 • An example of how this works: • Say wj and xj(i) are both positive: • say t(i) > f(i) => we increase the value of the weight • say t(i) < f(i) => we decrease the value of the weight • h controls how quickly we increase or decrease the weight**Perceptron Training Algorithm**• Algorithm outline • initialize the weights (e.g., randomly) • loop through all N examples (this is 1 iteration) • for each example calculate the output • calculate the difference between the output and the target • update each of the d+1 weights using the gradient update rule • after all N examples are gone through • check if the overall error (MSE) has decreased significantly since the previous iteration • if not, then perform another iteration through all N examples • if so, then**Pseudo-code for Perceptron Training**• Inputs: d features, N targets (class labels), learning rate h • Outputs: a set of learned weights • Initialize each wj (e.g., to a small random value) • While (termination condition not satisfied) • for i = 1: N • for j=0: d • deltaw = h ( t(i) - f(i) ) xj(i) • wj = wj + deltaw • end • calculate termination condition • end**Comments on Incremental Learning**• will converge to a minimum of E under certain assumptions • if the learning rate is small enough • converges even if the data are not linearly separable • not guaranteed to minimize classification accuracy (but usually will in practice) • typical learning rates: from 0.01 to 0.00001 • can depend on the particular data set • can be adapted (adjusted) automatically during learning • termination conditions: • e.g., when either Accuracy or Mean Squared Error hardly change from one iteration to the next (one iteration of the while loop) • e.g., change in mean square error < 0.0001**Summary of Perceptron Learning Algorithms**• “Error Correcting” Perceptron Training rule: • “error correction”, works with thresholded output • guaranteed to converge to zero error if examples are linearly separable • otherwise no guarantee of convergence • “Gradient Descent” Perceptron Training • Incremental version (this is the version for Assignment 3): • updates the weights after visiting each example: cycles through the examples • one iteration is defined as one pass through all N examples (N weight updates) • Batch version (we are not using this) • updates the weights after going through all the examples**MATLAB Demo of Perceptron Learning**• sampledata1: linearly separable • sampledata2: not linearly separable • Illustration of • different learning rates • different initialization methods • different convergence criteria • Visualization of • decision boundaries • MSE and Accuracy as a function of iteration number**Assignment**function [outputs] = perceptron(weights,data) % function [outputs] = perceptron(weights,data) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % % Outputs % outputs: N x 1 vector of perceptron outputs -------- Your code goes here -------**Assignment**function [cerror, mse] = perceptron_error(weights,data,targets) % function [cerror, mse] = perceptron_error(weights,data,targets) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % targets: N x 1 vector of target values (+1 or -1) % % Outputs % cerror: the percentage of examples misclassified (between 0 and 100) % mse: the mean-squared error (sum of squared errors divided by N) -------- Your code goes here -------