Part IV Perceptron Learning

Part IVPerceptron Learning NASA Space Program

Outline • Linear separability • Linear classifiers and linear decision boundaries • The perceptron classifier • Perceptron learning algorithhm • basic principles • pseudocode

Decision Boundaries • What is a Classifier? • A classifier is a mapping from feature space (a d-dimensional vector) to the class labels {1, 2, … m} • Thus, a classifier partitions the feature space into m decision regions • The line or curve separating the 2 classes is the decision boundary • in more than 2 dimensions this is a surface (e.g., a hyperplane) • Linear Classifiers • a linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane) • it is one of the simplest classifiers we can imagine • “separate the two classes using a straight line in feature space” • in 2 dimensions the decision boundary is a straight line

2-Class Data with a Linear Decision Boundary

Non-Linearly Separable Data, with Decision Boundary

Convex Hull of a Set of Points • Convex Hull of a set of Q points: • Intuitively • think of each point in Q as a nail sticking out from a 2d board • the convex hull = the shape formed by a tight rubber band that surrounds all the nails • Formally: “the convex hull is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior” • (p.898, Cormen, Leiserson, and Rivest, Introduction to Algorithms) • can be found (for n points) in time n log n

Convex Hull of a Set of Points • Relation to Class Overlap • define convex hulls of data points D1 and D2 as P1 and P2 • if P1 and P2 intersect, then we have overlap • If P1 and P2 intersect then D1 and D2 are not linearly seperable

Convex Hull Example

Convex Hull Example Feature 2 Feature 1

Convex Hull Example Convex Hull P1 Feature 2 Feature 1

Data from 2 Classes: Linearly Separable? Feature 2 x x x x x x x Feature 1

Data from 2 Classes: Linearly Separable? Convex Hull P1 Feature 2 x x x x x x x Feature 1

Data from 2 Classes: Linearly Separable? Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x The 2 Hulls intersect => data from each class are not linearly separable x Feature 1

Different data => linearly separable Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x x Feature 1

Some Theory Let N be the number of data points Let d be the dimension of the data points Let F(N, d) = the fraction of dichotomies of N points in d dimensions that are linearly separable Then, it can be shown that: = 1 if N < d+2 F(N, d) = (1 / 2 N-1 )Sdi=0 (N-1)! / [ (N-1-i)! i! ] if N > d

Fraction of Points that are Linearly Separable F(N,d) d = infinity 1 d = 2 0.5 d = 1 0 N/(d+1) 1 2 3

A Linear Classifier in 2 Dimensions Let Feature 1 be called X Let Feature 2 be called Y A linear classifier is a linear function of X and Y, i.e., it computes f(X,Y) = aX + bY + c Here a, b, and c are the “weights” of the classifier Define the output of the linear classifier to be T(f) = -1, if f <= 0 T(f) = +1, if f > 0 if f(X,Y) <= 0, the classifier produces a “-1” (Decision Region 1) if f(X,Y) > 0, the classifier produces a “+1” (Decision Region 2)

Decision Boundaries for a 2d Linear Classifier Depending on whether f(X,Y) is > or < 0, the features (X,Y) get classified into class 1 or class 2 Thus, f(X,Y) = 0 must define the decision boundary between class 1 and 2, i.e., f(X,Y) = aX + bY + c = 0 Thus, defining a, b, and c automatically locates the decision boundary in X,Y space In summary: - a classifier defines a decision boundaries between classes - for a linear classifier, this boundary is a line or a plane - the equation of the plane is defined by the parameters of the classifier

An Example of a Linear Decision Boundary

A Better Linear Decision Boundary

The Perceptron Classifier (for 2 features) X w1 w2 f = w1 X + w2 Y + w3 {-1, +1} Y T(f) w3 Threshold Function Output = class decision Weighted Sum of the inputs 1

Perceptrons • Perceptron = a linear classifier • The w’s are the weights (denoted as a, b,c, earlier) • real-valued constants (can be positive or negative) • Define an additional constant input “1” (allows an intercept in decision boundary)

Perceptrons • A perceptron calculates 2 quantities: • 1. A weighted sum of the input features • 2. This sum is then thresholded by the T function • A simple artificial model of human neurons • weights = “synapses” • threshold = “neuron firing”

Notation • Inputs: • x0, x1, x2, …………, xd-1, xd • x0 = 1 (a constant input) • x1, x2, …………, xd-1, xd are the values of the d features • x = ( x0, x1, x2, …………, xd-1, xd ) • Weights: • w0, w1, w2, …………, wd-1, wd • we have d+1 weights • one for each feature, + one for the constant • w = ( w0, w1, w2, …………, wd-1, wd )

Perceptron Operation • Equations of operation: • = 1 (if w0x0 + w1x1 +… wd xd > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • Note that: • w = (w0,….. wd), the “weight vector” (row vector, 1 x d+1) • and x = (x0,…… xd), the “feature vector” (row vector, 1 x d+1) • => w0x0 + w1x1 +… wd xd = w . x’ • and w . x’ is the “vector inner product” (w*x’ in MATLAB)

Perceptron Decision Boundary • Equations of operation (in vector form): • = 1 (if w . x’ > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • The perceptron represents a hyperplane decision surface in d-dimensional space, e.g., a line in 2d, a plane in 3d, etc. • The equation of the hyperplane is w . x’ = 0 • This is the equation for points in x-space that are on the boundary

Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 x2 x1

Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 This is the equation for the decision boundary x2 x1

Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ < 0 => x1 - x2 < 0 => x1 < x2 (this is the equation for decision region -1) w . x’ = 0 x2 x1

Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 x2 w . X’ < 0 w . x’ > 0 => x1 - x2 > 0 => x1 > x2 (this is the equation for decision region +1) x1

Representational Power of Perceptrons • What mappings can a perceptron represent perfectly? • A perceptron is a linear classifier • thus it can represent any mapping that is linearly separable • some Boolean functions like AND (on left) • but not Boolean functions like XOR (on right) 0 x 0 x 0 0 x 0

Learning the Classifier Parameters • Where do the parameters (weights) of the classifier come from? • If we know a lot about the problem, we could “design” them • Typically we don’t know ahead of time what the values should be

Learning the Classifier Parameters • Learning from Training Data: • training data = labeled feature vectors • i.e., a set of N feature vectors each with a class label • we can use the training data to try to find good parameters • “good” parameters are ones which provide low error on the training data • Statement of the Learning Problem: • given a classifier, and some training data, find the values for the classifier’s parameters which maximize accuracy

Learning the Weights from Data An Example of a Training Data Set Example x1 x2 …. xd True Class Label (“target” t) x(1)3.4 -1.2 ….. 7.1 1 x(2)4.1 -3.1 ….. 4.6 -1 x(3)5.7 -1.0 ….. 6.2 -1 x(4)2.2 4.1 ….. 5.0 1 x(n)1.2 4.3 ….. 6.1 1

Learning as a Search Problem • The objective function: • the accuracy of the classifier (for a given set of weights w and labeled data) • Problem: • maximize this objective function

Learning as a Search Problem • Equivalent to an optimization or search problem • i.e., think of the vector (w1, w2, w3) • this defines a 3-dimensional “parameter space” • we want to find the value of (w1, w2, w3) which maximizes the objective function • So we could use hill-climbing, systematic search, etc., to search this parameter space • many learning algorithms = hill-climbing with random restarts

Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly

Classification Accuracy • Procedure: • accuracy = 0 • For each of the N feature vectors: • calculate the output of the classifier for this vector • if the class label agrees with the true label • ++ accuracy • continue • Percentage Accuracy = (accuracy/N) * 100%

Training Data and Learning • We have N examples • an example consists of a feature vector and a class label (target) • the ith feature vector is denoted x(i) • Learning on the Training Data • Let o(i) = o( x(i) ) be the output of a perceptron for the ith feature vector x(i) • Let t(i) be the target value • goal is to find perceptron weights such that • o(i) is close to t(i) for as many examples as possible • i.e., the output matches the desired target as often as possible TrainingAccuracy(w) = 1/N S I( o(i), t(i) ) where I( o(i), t(i) ) = 1 if o(i) = t(i), and 0 otherwise

Mean-Squared Error as a Measure of Accuracy • Consider training an “unthresholded” perceptron, • consider the sum without the threshold, f(x) = w . x’ • this is a real number • saw we want this real number to match the target as closely as possible, for each input x • Error for ith example = ei (w) = ( f(i) - t(i) )2 • Mean-squared error, E(w) = (1/N) Si ( f(i) - t(i) )2= (1/N) Si (w .x’(i) - t(i) )2 • We try to choose the weights which minimize E • note that f(i) is a function of the weights w • f(i) = w . x’(i) • thus, E is also a function the weights, i.e., E(w)

Gradient Descent Learning Rule • Weight update rule: • wj <- wj + h ( t(i) - f(i) ) xj(i) • t(i) and f(i) are the target and weighted sum (respectively) for the ith example • wj is the jth input weight • xj(i) is the jth input feature value, for the ith example • h is called the learning rate: a small positive number, 0 < h < 1 • An example of how this works: • Say wj and xj(i) are both positive: • say t(i) > f(i) => we increase the value of the weight • say t(i) < f(i) => we decrease the value of the weight • h controls how quickly we increase or decrease the weight

Perceptron Training Algorithm • Algorithm outline • initialize the weights (e.g., randomly) • loop through all N examples (this is 1 iteration) • for each example calculate the output • calculate the difference between the output and the target • update each of the d+1 weights using the gradient update rule • after all N examples are gone through • check if the overall error (MSE) has decreased significantly since the previous iteration • if not, then perform another iteration through all N examples • if so, then

Pseudo-code for Perceptron Training • Inputs: d features, N targets (class labels), learning rate h • Outputs: a set of learned weights • Initialize each wj (e.g., to a small random value) • While (termination condition not satisfied) • for i = 1: N • for j=0: d • deltaw = h ( t(i) - f(i) ) xj(i) • wj = wj + deltaw • end • calculate termination condition • end

Comments on Incremental Learning • will converge to a minimum of E under certain assumptions • if the learning rate is small enough • converges even if the data are not linearly separable • not guaranteed to minimize classification accuracy (but usually will in practice) • typical learning rates: from 0.01 to 0.00001 • can depend on the particular data set • can be adapted (adjusted) automatically during learning • termination conditions: • e.g., when either Accuracy or Mean Squared Error hardly change from one iteration to the next (one iteration of the while loop) • e.g., change in mean square error < 0.0001

Summary of Perceptron Learning Algorithms • “Error Correcting” Perceptron Training rule: • “error correction”, works with thresholded output • guaranteed to converge to zero error if examples are linearly separable • otherwise no guarantee of convergence • “Gradient Descent” Perceptron Training • Incremental version (this is the version for Assignment 3): • updates the weights after visiting each example: cycles through the examples • one iteration is defined as one pass through all N examples (N weight updates) • Batch version (we are not using this) • updates the weights after going through all the examples

MATLAB Demo of Perceptron Learning • sampledata1: linearly separable • sampledata2: not linearly separable • Illustration of • different learning rates • different initialization methods • different convergence criteria • Visualization of • decision boundaries • MSE and Accuracy as a function of iteration number

Assignment function [outputs] = perceptron(weights,data) % function [outputs] = perceptron(weights,data) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % % Outputs % outputs: N x 1 vector of perceptron outputs -------- Your code goes here -------

Assignment function [cerror, mse] = perceptron_error(weights,data,targets) % function [cerror, mse] = perceptron_error(weights,data,targets) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % targets: N x 1 vector of target values (+1 or -1) % % Outputs % cerror: the percentage of examples misclassified (between 0 and 100) % mse: the mean-squared error (sum of squared errors divided by N) -------- Your code goes here -------

Part IV Perceptron Learning

Part IV Perceptron Learning

Presentation Transcript

Supervised Learning: Linear Perceptron NN

Perceptron Learning Rule

Part IV

PART IV

Part IV

PART IV

GEOMETRY IN PERCEPTRON LEARNING

Part IV

Part IV

Part iv

Advanced Perceptron Learning

Quadratic Perceptron Learning with Applications

Part IV

Part IV

Perceptron Learning

Lecture 8. Learning (V): Perceptron Learning

Perceptron Learning Demonstration

Lecture 8. Learning (V): Perceptron Learning

Perceptron Learning Rule