
CS 9633 Machine Learning

Neural Networks. Adapted from notes by Tom Mitchell (http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html). Neural networks are a practical method for learning real-valued, discrete-valued, and vector-valued functions, and they are robust in the presence of noise.


Presentation Transcript


  1. CS 9633 Machine Learning: Neural Networks. Adapted from notes by Tom Mitchell, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html. Computer Science Department.

  2. Neural networks • Practical method for learning • Real-valued functions • Discrete-valued functions • Vector-valued functions • Robust in the presence of noise • Loosely based on a biological model of learning

  3. Backpropagation Neural Networks • Assume a fixed structure of the network • Directed graph (usually acyclic) • Learning consists of choosing weights for the edges

  4. Characteristics of Backpropagation Problems • Instances represented by many attribute-value pairs • Target functions • Discrete-valued • Real-valued • Vector-valued • Instances may contain errors • Long training times are acceptable • Fast evaluation of the learned function may be required • Not important that people understand the learned function

  5. Perceptrons • Basic unit of many neural networks • Basic operation • Input: vector of real values • Calculates a linear combination of the inputs • Output • 1 if the result is greater than some threshold • -1 otherwise

  6. A perceptron [Diagram: inputs x1 … xn, plus a fixed input x0 = 1, are multiplied by weights w0 … wn and fed to a summation processor Σ; the result passes through a threshold processor to give the output]

  7. Notation • Perceptron function • Vector form of the perceptron function (see the equations below)
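The two equations referenced on this slide did not survive the transcript; following the perceptron definition in Mitchell's chapter 4, they are

o(x_1,\dots,x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}

and, in vector form with the constant input x_0 = 1,

o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x}), \qquad \operatorname{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases}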

  8. Learning a perceptron • Learning consists of choosing values for the n + 1 weights w0, …, wn • The space H of candidate hypotheses is the set of all possible real-valued weight vectors

  9. Representational Power of Perceptrons • A perceptron represents a hyperplane decision surface in the n-dimensional space of instances • It outputs 1 for instances on one side of the hyperplane and -1 for instances on the other side • Equation for the decision hyperplane (below) • Sets of instances that can be separated by a hyperplane are said to be linearly separable
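With x_0 = 1 absorbing the threshold weight w_0, the decision hyperplane referenced above is the set of points satisfying

\vec{w} \cdot \vec{x} = 0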

  10. Linearly Separable Pattern Classification

  11. Non-Linearly Separable Pattern Classification

  12. The Kiss of Death • 1969: Marvin Minsky and Seymour Papert proved that the perceptron had computational limits. Statement: “The perceptron has many features which attract attention: its linearity, its intriguing learning theorem...there is no reason to believe that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile”

  13. Boolean functions • Perceptron can be used to represent the following Boolean functions • AND • OR • Any m-of-n function • NOT • NAND (NOT AND) • NOR (NOT OR) • Every Boolean function can be represented by a network of interconnected units based on these primitives • Two levels is enough
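As a concrete illustration of the AND and OR cases, the sketch below hard-codes single perceptrons with example weights matching those in Mitchell's discussion; the helper name perceptron and the 0/1 input encoding are our own choices, not from the slides.

```python
# Two-input AND and OR realized as single perceptrons (illustrative weights).
def perceptron(w0, w1, w2, x1, x2):
    """Return 1 if the weighted sum exceeds 0, otherwise -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

def AND(x1, x2):
    return perceptron(-0.8, 0.5, 0.5, x1, x2)   # sum exceeds 0 only when both inputs are 1

def OR(x1, x2):
    return perceptron(-0.3, 0.5, 0.5, x1, x2)   # sum exceeds 0 when at least one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
```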

  14. Revival • 1982: John Hopfield responsible for revival • 1987: First IEEE conference on neural networks. Over 2000 attended. • And the rest is history!

  15. Perceptron Training • Initialize the weight vector with random weights • Apply the perceptron to each training example • Modify the perceptron weights whenever an example is misclassified, using the perceptron training rule • Repeat
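A minimal sketch of this training loop, assuming ±1 targets, a NumPy example matrix X, and an illustrative learning rate eta (these names and defaults are our own assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i on misclassified examples.
    X is an (m, n) array of instances; t is a length-m vector of +1/-1 targets."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1 for the threshold weight w0
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(max_epochs):
        misclassified = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1       # perceptron output
            if o != target:                         # update only when the example is misclassified
                w += eta * (target - o) * x
                misclassified += 1
        if misclassified == 0:                      # converged (guaranteed if linearly separable)
            break
    return w
```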

  16. Characteristics of Perceptron Training Rule • Guaranteed to converge within a finite number of applications of the rule to a weight vector that correctly classifies all training examples if: • Training examples are linearly separable • The learning rate is acceptably small

  17. Gradient Descent and the Delta Rule • Designed to converge toward the best-fit approximation of the target concept even when the instances are not linearly separable • Searches the hypothesis space of possible weight vectors to find the weights that best fit the training data • Serves as a basis for backpropagation neural networks

  18. Training task • Task: train a linear unit without a threshold • Training error E (minimization task; equation below) • E is a function of the weight vector w
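The training error referred to here is the usual squared-error measure over the training set D:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where t_d is the target output for training example d and o_d is the output of the linear unit for d.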

  19. Hypothesis Space

  20. Derivation of the Gradient Descent Learning Rule • Derivation is on pages 91-92 of the text • The gradient of the error gives the direction of steepest ascent; its negative gives the direction of steepest descent • The derivative yields a very nice, intuitive learning rule

  21. Gradient-Descent(training_examples, η)
     Initialize each wi to some small random value
     Until the termination condition is met, Do
       Initialize each Δwi to zero
       For each <x, t> in training_examples, Do
         Input the instance x to the unit and compute the output o
         For each linear unit weight wi, Do
           Δwi ← Δwi + η (t − o) xi
       For each linear unit weight wi, Do
         wi ← wi + Δwi
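A minimal NumPy sketch of this batch procedure for an unthresholded linear unit; the function and variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=1000):
    """Batch gradient descent for a linear unit o = w . x (no threshold)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                        # linear output
            delta_w += eta * (target - o) * x       # accumulate the update over all examples
        w += delta_w                                # one weight update per pass through the data
    return w
```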

  22. Gradient Descent as a General Strategy • Useful for very large or infinite hypothesis spaces • Can be applied if • The hypothesis space contains continuously parameterized hypotheses (e.g., weights) • The error can be differentiated with respect to the hypothesis parameters

  23. Practical Difficulties with Gradient Descent • Converging to a local minimum can sometimes be quite slow • If there are multiple local minima, there is no guarantee the procedure will find the global minimum

  24. Stochastic Gradient Descent • Also called incremental gradient descent • Tries to address practical problems with gradient descent • In gradient descent, the error is computed for all of the training examples and the weights are updated after all training examples have been presented • Stochastic gradient descent updates the weights incrementally based on the error with each example

  25. Stochastic-Gradient-Descent(training_examples, η)
     Initialize each wi to some small random value
     Until the termination condition is met, Do
       For each <x, t> in training_examples, Do
         Input the instance x to the unit and compute the output o
         For each linear unit weight wi, Do
           wi ← wi + η (t − o) xi
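The incremental version differs only in where the weight update happens; a sketch under the same assumptions as the batch version above:

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.01, epochs=1000):
    """Stochastic (incremental) gradient descent: update after every example."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)                        # linear output
            w += eta * (target - o) * x             # immediate per-example update
    return w
```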

  26. Standard versus Stochastic Gradient Descent

  27. Comparison of Learning Rules

  28. Multilayer Networks and Backpropagation [Diagram: a feed-forward network with input layer I0 through I3, hidden layer H0 through H2, and output layer O1 and O2]

  29. Multilayer design • Need a unit whose • Output is a non-linear function of its inputs • Output is a differentiable function of its inputs • Choice: a unit that, like the perceptron, computes a linear combination of its inputs, but applies a smoothed, differentiable threshold to the result

  30. Sigmoid Threshold Unit [Diagram: same structure as the perceptron above, with inputs x0 = 1, x1 … xn and weights w0 … wn feeding a summation processor Σ, but the hard threshold processor is replaced by the smooth sigmoid function]
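The sigmoid unit computes

o = \sigma(\vec{w} \cdot \vec{x}), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}

and its derivative has the convenient form

\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))

which is what makes the error terms in the backpropagation algorithm on the next slide easy to compute.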

  31. BACKPROPAGATION(training_examples, η, nin, nout, nhidden)
     Create a feed-forward network with nin input units, nhidden hidden units, and nout output units
     Initialize each network weight to some small random value
     Until the termination condition is met, Do
       For each <x, t> in training_examples, Do
         Propagate the input forward through the network:
         1. Input the instance x to the network and compute the output ou of every unit u in the network
         Propagate the errors backward through the network:
         2. For each network output unit k, calculate its error term δk
         3. For each hidden unit h, calculate its error term δh
         4. Update each network weight wji
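The error-term and weight-update equations for steps 2 through 4, as given in Mitchell's backpropagation table, are

\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)

\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\,\delta_k

w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji}

A minimal NumPy sketch of one stochastic pass over the training data for a two-layer sigmoid network; the array shapes, bias handling, and function names below are our own assumptions, not the course's code.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_epoch(X, T, W_hidden, W_out, eta=0.05):
    """One stochastic pass of backpropagation.
    X: (m, n_in) inputs; T: (m, n_out) targets in (0, 1).
    W_hidden: (n_hidden, n_in + 1) and W_out: (n_out, n_hidden + 1); column 0 holds bias weights."""
    for x, t in zip(X, T):
        # forward pass
        x = np.concatenate(([1.0], x))                        # x0 = 1
        h = sigmoid(W_hidden @ x)                             # hidden-unit outputs
        h1 = np.concatenate(([1.0], h))
        o = sigmoid(W_out @ h1)                               # output-unit outputs
        # backward pass: error terms
        delta_o = o * (1 - o) * (t - o)                       # delta_k for each output unit
        delta_h = h * (1 - h) * (W_out[:, 1:].T @ delta_o)    # delta_h for each hidden unit
        # weight updates: w_ji <- w_ji + eta * delta_j * x_ji
        W_out += eta * np.outer(delta_o, h1)
        W_hidden += eta * np.outer(delta_h, x)
    return W_hidden, W_out
```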

  32. Termination Conditions • Fixed number of iterations • Error on training examples falls below threshold • Error on validation set meets some criteria

  33. Adding Momentum • A variation on backpropagation • Makes the weight update on one iteration dependent on the update on the previous iteration • Keeps movement going in the “right” direction. • Can sometimes solve problems with local minima and enable faster convergence
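With momentum, the weight update at iteration n becomes

\Delta w_{ji}(n) = \eta\,\delta_j\,x_{ji} + \alpha\,\Delta w_{ji}(n-1), \qquad 0 \le \alpha < 1

where \alpha is the momentum constant; the second term keeps the update rolling in the direction of the previous step.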

  34. General Acyclic Network Structure [Diagram: an acyclic feed-forward network with input units I1 through I3, hidden units H1 through H3, and output units O1 and O2]

  35. Derivation of Backpropagation Rule • See section 4.5.3 in the text

  36. Convergence and Local Minima • Error surface may contain many local minima • Algorithm is only guaranteed to converge toward some local minimum in E • In practice, it is a very effective function approximation method. • Problem with local minima is often not encountered • Local minimum with respect to one weight is often counter-balanced by other weights • Initially, with weights near 0, the function represented is nearly linear in its inputs

  37. Methods for Avoiding Local Minima • Add a momentum term • Use stochastic gradient descent • Train multiple networks • Select best • Use committee machine

  38. Representational Power of Feed Forward NNs • Boolean functions • Any Boolean function can be represented with a 2-layer neural network • Scheme for an arbitrary Boolean function • For each possible input vector, create a distinct hidden unit and set its weights so that it activates iff this specific vector is input • OR all of these together
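A worked instance of this construction for XOR over two Boolean inputs is sketched below; the hand-chosen weights and the 0/1 step outputs are illustrative simplifications, not part of the slides.

```python
# XOR via the "one hidden unit per true input vector, then OR the detectors" scheme.
def step(y):
    return 1 if y > 0 else 0

def xor_net(x1, x2):
    h1 = step(-0.5 + x1 - x2)        # hidden detector: activates only for input (1, 0)
    h2 = step(-0.5 - x1 + x2)        # hidden detector: activates only for input (0, 1)
    return step(-0.5 + h1 + h2)      # output unit ORs the two detectors

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```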

  39. Representational Power of Feed Forward NNs • Continuous Functions • Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units • Sigmoid units at the hidden layer • Unthresholded linear units at the output layer • Number of hidden units depends on the function to be approximated

  40. Representational Power of Feed Forward NNs • Arbitrary Functions • Any function can be approximated to arbitrary accuracy by a network with three layers of units • Two hidden layers of sigmoid units, with unthresholded linear units at the output layer • Number of units needed at each layer is not known in general

  41. Hypothesis Search Space and Inductive Bias • Every set of network weights is a different hypothesis • The hypothesis space is continuous • Because the space is continuous and E is differentiable with respect to the weights, gradient descent gives a useful organization of the search • Inductive bias is • Defined by the interaction of the gradient descent search and the weight space • Roughly characterized as smooth interpolation between data points

  42. Hidden Layer Representations • Backprop can learn useful intermediate representations at the hidden layer • It defines new hidden layer features that are not explicit in the input representation but that capture relevant properties of the input instances

  43. Generalization, Overfitting, and Stopping Criterion • Using error on the training examples as the stopping criterion is a bad idea • Backprop is prone to overfitting • Why does overfitting occur in later iterations, but not earlier?

  44. Avoiding overfitting • Weight decay • Decrease each weight by a small factor during each iteration • Biases learning away from overly complex decision surfaces • Validation data • Train with the training set • Measure error on the validation set • Keep the best weights seen so far on the validation data • Cross-validation to determine the best number of iterations
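One common way to express the weight-decay idea (a standard formulation, not stated explicitly on the slide) is to minimize a penalized error

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

so that each gradient step also shrinks every weight slightly; the constant \gamma controls how strongly large weights, and hence complex decision surfaces, are penalized.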
