Lectures 3&4: Linear Machine Learning Algorithms

Dr Martin Brown

Room: E1k

Email: [email protected]

Telephone: 0161 306 4672

http://www.csc.umist.ac.uk/msc/intranet/EE-M016


Lectures 3&4: Outline

  • Linear classification using the Perceptron

    • Classification problem

    • Linear classifier and decision boundary

    • Perceptron learning rule

    • Proof of convergence

  • Recursive linear regression using LMS

    • Modelling and recursive parameter estimation

    • Linear models and quadratic performance function

    • LMS and NLMS learning rules

    • Proof of convergence


Lectures 3 4 learning objectives l.jpg
Lectures 3&4: Learning Objectives

  • Understand what classification and regression machine learning techniques are, and how they differ

  • Describe how linear models can be used for both classification and regression problems

  • Prove convergence of the learning algorithms for linear relationships, subject to restrictive conditions

  • Understand the restrictions of these basic proofs

  • Develop the basic framework that will be expanded on in subsequent lectures


Lecture 3&4: Resources

  • Classification/Perceptron

    • An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, N. Cristianini & J. Shawe-Taylor, CUP, 2000

  • Regression/LMS

    • Adaptive Signal Processing, B. Widrow & S. D. Stearns, Prentice Hall, 1985

  • Many other sources are available (online).


What is Classification?

  • Classification is also known as (statistical) pattern recognition

  • The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data, using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task.

  • Example: Face recognition

[Diagram: prior knowledge and the training data D = {X, y} are used to design/learn a classifier m(q, x); given a new pattern x, the classifier predicts a class label ŷ]


Classification Training Data

  • To supply training data for a classifier, examples must be collected that contain both positive (examples of the class) and negative (examples of other classes) instances. These qualitative target class values are stored as +1 and -1 for the positive and negative instances respectively, and are generated by an expert or by observation.

  • The quantitative input features should be informative

  • The training set should contain enough examples to be able to build statistically significant decisions

How to encode qualitative target and input features?


Bayes Class Priors

  • Classification is all about decision making using the concept of “minimum risk”

  • Imagine that the training data contains 100 examples, 70 of them class 1 (c1) and 30 of them class 2 (c2)

    • If I have to decide which class an unknown example belongs to, which decision is optimal?

    • Probability of error if the decision is always class 1: 1 - p(c1) = 0.3

    • Probability of error if the decision is always class 2: 1 - p(c2) = 0.7

    • The minimum risk decision is therefore class 1

  • p(c1) and p(c2) are known as the Bayes priors; they represent the baseline performance for any classifier and are derived from the training data as simple percentages
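As a concrete illustration, the priors can be computed directly from the training labels. The Matlab sketch below does this for the 70/30 example above; the variable names are mine, not taken from the course code.

  % Sketch: estimating the Bayes class priors from a vector of +1/-1 training labels
  y = [ones(70,1); -ones(30,1)];    % 70 class 1 (+1) and 30 class 2 (-1) examples
  p_c1 = mean(y == +1);             % p(c1) = 0.7
  p_c2 = mean(y == -1);             % p(c2) = 0.3
  baselineError = min(p_c1, p_c2);  % always predicting the majority class errs 30% of the time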


Structure of a Linear Classifier

  • Given a set of quantitative features x, a linear classifier has the form ŷ = sgn(xTq)

  • The sgn() function is used to produce the qualitative class label (+/-1)

  • The class/decision boundary is determined when xTq = 0

  • This is an (n-1)D hyperplane in feature space.

  • In 2-dimensional feature space the boundary is the line q0 + q1x1 + q2x2 = 0

  • How does the sign and magnitude of q affect the decision boundary?

[Figure: 2D feature space (x1, x2) with training points on either side of a linear decision boundary]
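A minimal Matlab sketch of this structure, using made-up parameter and feature values, is given below; the bias is folded in as the extra feature x0 = 1.

  % Sketch: evaluating a 2D linear classifier yhat = sgn(x'*q), with made-up numbers
  q = [-1; 2; 1];            % [q0; q1; q2]
  x = [1; 0.5; 0.3];         % [x0; x1; x2], where x0 = 1 carries the bias
  yhat = sign(x' * q);       % +1 or -1 class label (sign() returns 0 exactly on the boundary)
  % The decision boundary is q0 + q1*x1 + q2*x2 = 0, i.e. x2 = -(q(1) + q(2)*x1)/q(3)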


Simple Example: Fisher’s Iris Data

  • Famous example of building classifiers for a problem with 3 types of Iris flowers and 4 measurements about the flower:

  • Sepal length and width

  • Petal length and width

  • 150 examples were collected, 50 from each class

  • Build 3 separate classifiers, one for recognizing examples of each class

  • The data is shown plotted against the last two features, together with two linear classifiers for the Setosa and Virginica classes

Calculate q in lab 3&4 …
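A rough Matlab sketch of such a plot is given below. It assumes iris.mat provides a 150x4 feature matrix X (sepal length/width, petal length/width) and a 150x1 numeric label vector; these variable names and the classifier parameters are assumptions, not necessarily those used in irisClassifier.m.

  % Sketch: Iris data against the last two features, plus a (made-up) linear boundary
  load iris.mat                               % assumed to contain X (150x4) and labels (150x1)
  figure; hold on
  scatter(X(:,3), X(:,4), 25, labels)         % petal length vs petal width, coloured by class
  q = [-1; 0.5; 1];                           % made-up classifier parameters [q0; q1; q2]
  x1 = linspace(min(X(:,3)), max(X(:,3)), 100);
  plot(x1, -(q(1) + q(2)*x1)/q(3), 'k-')      % decision boundary q0 + q1*x1 + q2*x2 = 0
  xlabel('petal length'); ylabel('petal width')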


Perceptron Linear Classifier

  • The Perceptron linear classifier was devised by Rosenblatt in 1956

  • It comprises a linear classifier (as just discussed) and a simple parameter update rule of the form:

    • Cyclically present each training pattern {xk, yk} to the linear classifier

      • When an error (misclassification) is made, update the parameters: q̂k+1 = q̂k + h yk xk

      • where h>0 is the learning rate.

      • The bias term can be included as q0 with an extra feature x0 = 1

    • Continue until there are no prediction errors

  • Perceptron convergence theorem: If the data set is linearly separable, the perceptron learning algorithm will converge to an optimal separator in a finite time
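A minimal Matlab sketch of one training cycle is shown below; it is illustrative only (not the course's perceptron.m) and the function name is mine. It assumes the patterns are stored row-wise in X with a leading column of ones for the bias, and the +1/-1 labels in y.

  % Sketch: one cycle of the Perceptron learning rule over the training set
  % X is l-by-(n+1) with a leading column of ones (bias feature x0 = 1); y holds the +/-1 labels
  function q = perceptronCycle(X, y, q, eta)
      for k = 1:size(X, 1)
          if sign(X(k,:) * q) ~= y(k)          % misclassification, so update the parameters
              q = q + eta * y(k) * X(k,:)';    % move q towards the sign-corrected pattern
          end
      end
  end
  % Repeat such cycles until a full pass produces no prediction errors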


Instantaneous Parameter Update

  • What does this look like?

  • The parameters are updated to make them more like the misclassified feature vector (scaled by its correct sign).

  • After updating, the parameters are closer to a correct decision.

[Figure: the error-driven parameter update in (x1, q1), (x2, q2) space, and the outputs y, ŷ (+1/-1) plotted against xTq̂]


Perceptron Convergence Proof Preamble …

  • The basic aim is to minimise the number of mis-classifications:

    • This is generally an NP-complete problem

    • We’ve assumed that there is an optimal solution with 0 errors

  • This is similar to Least Squares recursive estimation:

    • Performance = Σi (yi - ŷi)² = 4 × numberOfErrors

    • Except that the sgn() makes it a non-quadratic optimization problem

  • Updating only when there are errors is the same as applying q̂k+1 = q̂k + (h/2)(yk - ŷk)xk with or without errors, since yk - ŷk is zero when the prediction is correct and ±2 when it is wrong


  • Sometimes drawn as a network:

[Network diagram: repeatedly cycle through the data set D, drawing out each sample {xk, yk}; the prediction ŷk is compared with yk and the resulting error drives the “error driven” parameter estimation]


Convergence Analysis of the Perceptron (i)

  • If a linearly separable data set D is repeatedly presented to a Perceptron, then the learning procedure is guaranteed to converge (no errors) in a finite time

  • If the data set is linearly separable, there exist optimal parameters q such that yi xiTq > 0 for all i = 1, …, l

    • Note that positive scalar multiples of q are also optimal parameter vectors

  • Consider the positive quantity g = mini yi xiTq, defined with the optimal parameters scaled such that ||q|| = 1

    • This is the concept known as the “classification margin”

  • Assume also that the feature vectors are bounded: ||xi|| ≤ R for all i


Convergence Analysis of the Perceptron (ii)

  • To show convergence, we need to establish that at the kth iteration, when an error has occurred, the distance between the estimated parameters q̂k and a (suitably scaled) optimal parameter vector decreases by at least a fixed amount

  • Using the update formula q̂k+1 = q̂k + h yk xk:

[Diagram: in parameter space (q1, q2), the update moves q̂k to q̂k+1, closer to the optimal parameter vector q]

To finish the proof, select the length of the optimal parameter vector (any positive multiple is also optimal) so that this decrease is a fixed amount a > 0 at every update.


Convergence Analysis of the Perceptron (iii)

  • To show that this terminates in a finite number of iterations, simply note that the quantity a is independent of the current training sample, so the parameter error must decrease by at least this amount at each update iteration. As the initial error is finite (taking q̂0 = 0, say), there must exist a finite number of steps before the parameter error is reduced to zero.

  • Note also that a is proportional to the size of the feature vectors (R²) and inversely proportional to the size of the margin (g). Both of these influence the number of update iterations when the Perceptron is learning.
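For reference, a standard version of this argument (the particular scaling below is one common choice; the constants selected on the slides may differ) compares the estimate with a scaled optimal vector cq, where ||q|| = 1 and c = hR²/g:

\[
\begin{aligned}
\|\hat{q}_{k+1} - cq\|^2 &= \|\hat{q}_k + h y_k x_k - cq\|^2 \\
&= \|\hat{q}_k - cq\|^2 + 2h\, y_k x_k^T \hat{q}_k - 2hc\, y_k x_k^T q + h^2 \|x_k\|^2 \\
&\le \|\hat{q}_k - cq\|^2 - 2hcg + h^2 R^2 && \text{(an error means } y_k x_k^T \hat{q}_k \le 0\text{)} \\
&= \|\hat{q}_k - cq\|^2 - h^2 R^2 && \text{(with } c = hR^2/g\text{)}
\end{aligned}
\]

Starting from q̂0 = 0, the initial squared error is c² = h²R⁴/g², so at most R²/g² updates can occur before the error is exhausted.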



Example of Perceptron (i)

  • Consider modelling the logical AND data using a Perceptron

Is the data linearly separable?

k = 0: q̂ = [0.01, 0.1, 0.006]

k = 5: q̂ = [-0.98, 1.11, 1.01]

k = 18: q̂ = [-2.98, 2.11, 1.01]

[Plots: the decision boundary in (x1, x2) feature space at k = 0, k = 5 and k = 18]
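A rough Matlab sketch of this experiment is given below; the learning rate and number of cycles are assumptions, since the slides only give the parameter snapshots above.

  % Sketch: Perceptron learning on the logical AND data
  X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];    % bias feature x0 = 1, then x1, x2
  y = [-1; -1; -1; 1];                 % AND is +1 only for the (1, 1) pattern
  q = [0.01; 0.1; 0.006];              % initial parameters (the k = 0 values above)
  eta = 1;                             % learning rate (assumed value)
  for cycle = 1:10                     % repeatedly cycle through the four patterns
      for k = 1:4
          if sign(X(k,:) * q) ~= y(k)  % misclassification, so update
              q = q + eta * y(k) * X(k,:)';
          end
      end
  end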


Example: Parameter Trajectory (ii)

[Plot: the estimated parameters q̂i,k (the bias q̂0,k and the weights q̂1,k, q̂2,k) against the data presentation index k]

Lab exercise:

Calculate by hand the first 4 iterations of the learning scheme


Classification Margin

  • In this proof, we assumed that there exists a single, optimal parameter vector.

  • In practice, when the data is linearly separable, there are infinitely many such vectors; simply requiring correct classification results in an ill-posed problem

  • The classification margin can be defined as the minimum distance of the decision boundary to a point in that class

    • Used in deriving Support Vector Machines

[Figures: alternative separating boundaries in (x1, x2) feature space, illustrating the classification margin]


Classification Summary

  • Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups

  • A linear classifier has a linear decision boundary

  • The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable

  • The final boundary is determined by the initial values and the order of presentation of the data


Definition of Regression

  • Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.

  • Examples:

  • Sales of a product can be predicted by using the relationship between sales volume and amount of advertising

  • The performance of an employee can be predicted by using the relationship between performance and aptitude tests

  • The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.


Regression Problem Visualisation

  • Data is generated by an underlying relationship plus noise, y = f(x) + e

  • Estimate the model parameters from the data

  • Predict a real value (fit a curve to the data)

  • Predictive performance is measured by the average error (rmse = s)

[Plot: noisy training points (x, y) and the fitted curve ŷ(x)]


Probabilistic Prediction Output

[Plot: data points (x, y) near a query point, with the local mean m(y|x) = 12 and local standard deviation s(e) = 1.5 marked]

  • An output of 12 with rmse/standard deviation = 1.5: Within a small region close to the query point, the average target value was 12 and the standard deviation within that region was 1.5 (variance = 2.25)

With m(y|x) = 12 and 2s(e) = 3, 95% of the data lies in the range m ± 2s = 12 ± 2×1.5 = [9, 15].


Structure of a Linear Regression Model

Given a set of features x, a linear predictor has the form ŷ = xTq

The output is a real-valued, quantitative variable

The bias term can be included as an extra feature x0 = 1. This renames the bias parameter as q0.

Most linear control system models do not explicitly include a bias term. Why is this?

Similar to the Toluca example in week 1.

[Plot: target values y and the linear prediction ŷ against x]


Least Mean Squares Learning

  • Least Mean Squares (LMS) was proposed by Widrow in 1962

  • This is a (non-optimal) sequential parameter estimation procedure for a linear model: q̂k+1 = q̂k + h (yk - xkTq̂k) xk

    • NB, compared to classification, both yk and ŷk are quantitative variables, so the error/noise signal (yk - ŷk) is generally non-zero. The update is similar to the Perceptron's, but there is no threshold on xTq̂; h is again the positive learning rate.

  • Widely used in filtering/signal processing and adaptive control applications

  • A “cheap” version of sequential/recursive parameter estimation

  • The normalised version (NLMS) was developed by Kaczmarz in 1937



Proof of LMS Convergence (i)

  • If a noise-free data set containing a linear relationship x->y is repeatedly presented to a linear model, then the LMS algorithm is guaranteed to update the parameters so that they converge to their optimal values, assuming the learning rate is sufficiently small.

  • Note:

  • Assume there is no measurement noise in the target data

  • Assume the data is generated from a linear relationship

  • Parameter estimation will take an infinite time to converge to the optimal values

  • Rate of convergence and stability depend on the learning rate


Proof of Convergence (ii)

[Diagram: in parameter space (q1, q2), the LMS update moves q̂k to q̂k+1, closer to the optimal parameter vector q]

  • To show convergence, we need to establish that at the kth iteration, when an error has occurred, the parameter error ||q̂k - q|| decreases

  • Using the update formula q̂k+1 = q̂k + h (yk - xkTq̂k) xk, this holds when 0 < h < 2/||xk||²
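A short version of this calculation for the noise-free case yk = xkTq (standard algebra; the slides' own derivation is not reproduced in this transcript): writing the parameter error as ek = q̂k - q,

\[
\begin{aligned}
e_{k+1} &= e_k + h\,(y_k - x_k^T \hat{q}_k)\,x_k = \left(I - h\,x_k x_k^T\right) e_k \\
\|e_{k+1}\|^2 &= \|e_k\|^2 - h\left(2 - h\,\|x_k\|^2\right)\left(x_k^T e_k\right)^2
\end{aligned}
\]

so the parameter error never grows, and strictly decreases whenever xkTek ≠ 0, provided 0 < h < 2/||xk||².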


Example: LMS Learning

  • Consider the “target” linear model y = 1 - 2*x, where the inputs are drawn from a normal distribution with zero mean, unit variance

  • Data set consisted of 25 data points, and involved 10 cycles through the data set

  • h=0.1

[Plots: left, the data and the fitted line ŷ(x) at k = 0, k = 5 and k = 100; right, the parameter estimates q̂0 and q̂1 converging towards q against the presentation index k]
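A rough Matlab reconstruction of this experiment is shown below; the zero initial parameters and the noise-free targets are assumptions consistent with the convergence conditions above, but are not stated explicitly on the slide.

  % Sketch: LMS learning of y = 1 - 2*x with 25 points, 10 cycles, h = 0.1
  x = randn(25, 1);                  % inputs drawn from N(0, 1)
  X = [ones(25, 1) x];               % bias feature x0 = 1, then x
  y = X * [1; -2];                   % noise-free targets from the "target" model
  q = [0; 0];                        % initial estimate (assumed)
  eta = 0.1;                         % learning rate h
  qHistory = zeros(2, 250);
  for cycle = 1:10                   % 10 cycles through the 25-point data set
      for k = 1:25
          err = y(k) - X(k,:) * q;   % prediction error for sample k
          q = q + eta * err * X(k,:)';
          qHistory(:, (cycle - 1)*25 + k) = q;
      end
  end
  plot(qHistory')                    % parameter trajectories against presentation index k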


Stability and NLMS

  • To normalise the LMS algorithm and remove the dependency of h on the input vector size, consider the update q̂k+1 = q̂k + h (yk - xkTq̂k) xk / (xkTxk)

    • This learning algorithm is stable for 0<h< 2 (exercise).

  • When h=1, the NLMS algorithm has the property that the error on that datum after adaptation is zero, i.e. yk - xkTq̂k+1 = 0

    • Exercise: prove this.

    • Is this desirable when the target contains (measurement) noise?
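A minimal Matlab sketch of a single NLMS update for one pattern {x, y} is given below; the function name is mine, and the small constant added in the denominator is my own guard against a zero input vector rather than part of the rule as stated.

  % Sketch: one NLMS update for a single pattern {x, y}
  function q = nlmsUpdate(q, x, y, eta)
      err = y - x' * q;                        % prediction error before adaptation
      q = q + eta * err * x / (x' * x + eps);  % normalised update; eps only guards against x = 0
  end

With eta = 1 the prediction on this datum afterwards equals the old prediction plus err, i.e. exactly y, so the post-adaptation error is zero.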


Regression Summary

  • Regression is a (statistical) technique for predicting real-valued outputs, given a quantitative feature vector

  • Typically, it is assumed that the dependent, target variable is corrupted by Gaussian noise, and this is unpredictable.

  • The aim is then to fit the underlying linear/non-linear signal.

  • The LMS algorithm is a simple, cheap gradient descent technique for updating the linear parameter estimates

  • The parameters will converge to their correct values when the target does not contain any noise, otherwise they will oscillate in a zone around the optimum.

  • Stability of the algorithm depends on the learning rate


Lecture 3&4: Summary

  • This lecture has looked at basic (linear) classification and regression techniques

    • Investigated basic linear model structure

    • Proposed simple, “on-line” learning rules

    • Proved convergence for simple environments

    • Discussed the practicality of the machine learning algorithms

  • While these algorithms are rarely used in this form, their structure has strongly influenced the development of more advanced techniques

    • Support vector machines

    • Multi-layer perceptrons

  • which will be studied in the coming weeks


Laboratory 3&4: Perceptron/LMS

  • Download the irisClassifier.m & iris.mat Matlab files that contain a simple GUI for displaying the Iris data and entering decision boundaries

    • Enter parameters that create suitable decision boundaries for both the Setosa and Virginica classes

    • Which of the three classes are linearly separable?

    • Make sure you can translate between the classifiers’ parameters, q, and the gradient/intercept coordinate systems. Also ensure that the output is +1 (rather than -1) in the appropriate region

  • Download the irisPerceptron.m and perceptron.m Matlab files that contain the Perceptron algorithm for the Iris data

    • Run the algorithm and note how the decision boundary changes when a point is correctly/incorrectly classified

    • Modify the learning rate and note the effect it has on the convergence rate and final values


Laboratory 3&4: Perceptron/LMS (ii)

  • Copy and modify the irisPerceptron.m Matlab file so that it runs on the logical AND and OR classification functions (see slides 16 & 17). Each should contain 2 features and four training patterns. Make sure you can calculate the updates by hand, as required on Slide 17.

  • Create a Matlab implementation of the example given in Slide 27 for the LMS algorithm with a simple, single-input linear model

    • What values of h cause the LMS algorithm to become unstable?

    • Can this ever happen with the Perceptron algorithm?

  • Modify this implementation to use the NLMS training rule

    • Verify that learning is always stable for 0 < h < 2.

  • Complete the two (pen and paper) exercises on Slide 28.

    • How might this insight be used with the Perceptron algorithm to implement a dynamic learning rate?

