IE 585 Backpropagation Networks
BP Basics • supervised • can use multiple layers of “hidden” neurons • most common training procedure • generally uses a fully connected perceptron architecture • also called the generalized delta rule, error backpropagation or back error propagation • networks trained this way are often called Multilayer Perceptrons (MLPs) • good for continuous and binary inputs and outputs • a theoretical universal approximator when nonlinear transfer functions are used
More BP Basics • learns through gradient descent down the error surface • the error signal (generally squared error) is propagated back through the weights to adjust them in the direction of greatest error decrease • must have a continuous, differentiable transfer function (NO step functions!) • iterative training that usually requires many passes (epochs) through the training set • generally uses a sigmoid transfer function with inputs/outputs normalized to the range 0 to 1
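As a rough sketch of the gradient-descent idea above (not from the course materials; the learning rate value and variable names are illustrative assumptions), here is the squared-error update for a single sigmoid unit in Python:

    import numpy as np

    def sigmoid(net):
        # binary sigmoid transfer function, output between 0 and 1
        return 1.0 / (1.0 + np.exp(-net))

    # one training pattern: inputs x, target t, current weights w
    x = np.array([0.2, 0.7, 1.0])
    w = np.array([0.1, -0.3, 0.05])
    t = 1.0
    eta = 0.5                        # learning rate (step size)

    y = sigmoid(w @ x)               # forward pass
    error = 0.5 * (t - y) ** 2       # squared error for this pattern
    delta = (t - y) * y * (1 - y)    # error signal; y*(1-y) is the sigmoid derivative
    w = w + eta * delta * x          # step in the direction of greatest error decrease

The same kind of error signal, propagated back through the hidden-layer weights as well, is what the full procedure on the later slides computes.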
Origins • Extension of LMS (Widrow and Hoff) • Difficulty - how to adjust weights in the hidden layer? • Paul Werbos, Ph.D. dissertation, Harvard, 1974 • Parker, Internal Report, MIT, 1985 • Popularized by McClelland and Rumelhart (PDP Group), 1986
What is a universal approximator? • can theoretically approximate any relationship to any given degree of precision • in practice, this capability is limited by: • finite and imperfect samples • imperfect network training • missing input (independent) variables
Advantages of BP • works in a wide variety of applications • good theoretical approximation properties • very reliable in both training and operation (testing) • fairly straightforward to understand • lots of software available
Applications of BP • Function approximation • Pattern Classification
Drawbacks of BP • training can be very slow - requiring a large number of iterations • training can stall in a local minimum or saddle point area • must choose number of hidden neurons and number of hidden layers (practically, this is 1 or 2) • can be sensitive to both overparameterization (overfitting) and overtraining
Training Idea - Gradient Descent • (figure: error plotted against weight values, showing the gradient and the learning rate/step size, a saddle point, a local minimum and the global minimum)
Goal of BP • Balance the ability to respond correctly to the input patterns that are used for training (memorization) and the ability to give reasonable (good) responses to input that is similar, but not identical, to that used in training (generalization).
Overview of Training • select architecture, transfer function, learning rate and momentum rate • randomize weights to small +/- values • normalize the data and randomize the order of the training set • present a training pattern • calculate the output error and propagate it back through the output weight matrix • propagate back through the hidden layer(s) weight matrix • repeat until the stopping criterion is reached
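The sketch below is an assumption-laden illustration of one training pass, not the course code: the layer sizes, the learning rate and the absence of bias weights are all simplifications. It uses x, z, y for the input, hidden and output signals and v, w for the two weight matrices, matching the notation slide below.

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 3, 4, 2
    v = rng.uniform(-0.5, 0.5, (n_hid, n_in))     # input-to-hidden weights, small random values
    w = rng.uniform(-0.5, 0.5, (n_out, n_hid))    # hidden-to-output weights
    eta = 0.25                                    # learning rate

    def train_pattern(x, t):
        global v, w
        z = sigmoid(v @ x)                            # forward pass: hidden layer
        y = sigmoid(w @ z)                            # forward pass: output layer
        delta_out = (t - y) * y * (1 - y)             # output error signal
        delta_hid = (w.T @ delta_out) * z * (1 - z)   # error propagated back through w
        w += eta * np.outer(delta_out, z)             # adjust output weight matrix
        v += eta * np.outer(delta_hid, x)             # adjust hidden weight matrix
        return 0.5 * np.sum((t - y) ** 2)             # squared error for this pattern

Training repeats this for every pattern in the (shuffled) training set, epoch after epoch, until one of the stopping criteria on the next slide is met.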
Possible Stopping Criteria • Total number of epochs • Total squared error less than a threshold • Weights are stable (∆w’s are small)
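A minimal sketch of how the three criteria above might be checked together (all threshold values here are illustrative assumptions, not course-specified numbers):

    def keep_training(epoch, total_sq_error, max_delta_w,
                      max_epochs=1000, error_tol=0.01, weight_tol=1e-4):
        # stop on any of: epoch budget reached, total squared error small enough,
        # or weights stable (largest weight change is tiny)
        if epoch >= max_epochs:
            return False
        if total_sq_error < error_tol:
            return False
        if max_delta_w < weight_tol:
            return False
        return True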
Notation • x = input vector (input layer) • z = hidden layer outputs • y = network outputs (output layer) • t = target output • v = input-to-hidden weight matrix • w = hidden-to-output weight matrix
Sigmoid Transfer Function • Binary transfer function: y = 1/(1 + exp(-(wx))), output ranges from 0 to 1 • Bipolar transfer function: y = (1 - exp(-(wx))) / (1 + exp(-(wx))), output ranges from -1 to 1
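Both transfer functions above are easy to implement directly; here is a small sketch (function names are my own) together with their derivatives, which the backward pass needs:

    import numpy as np

    def binary_sigmoid(net):
        # output ranges from 0 to 1
        return 1.0 / (1.0 + np.exp(-net))

    def bipolar_sigmoid(net):
        # output ranges from -1 to 1 (this is tanh(net / 2))
        return (1.0 - np.exp(-net)) / (1.0 + np.exp(-net))

    def binary_sigmoid_deriv(y):
        # derivative written in terms of the output y
        return y * (1.0 - y)

    def bipolar_sigmoid_deriv(y):
        return 0.5 * (1.0 + y) * (1.0 - y)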
(figure: network architecture - input x feeds hidden neurons z1 … zk through weight matrix v; the hidden layer feeds output y through the w’s, the hidden-to-output weight matrix)
Momentum Momentum smooths descent down the error surface and helps prevent weights from getting “stuck”.
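A hedged sketch of the usual momentum update (the coefficient name mu and the values 0.25 and 0.9 are assumptions, not course-prescribed):

    def momentum_step(w, neg_gradient, prev_delta_w, eta=0.25, mu=0.9):
        # delta_w(t) = eta * (-dE/dw) + mu * delta_w(t-1)
        # the mu term carries part of the previous step forward, smoothing the descent
        delta_w = eta * neg_gradient + mu * prev_delta_w
        return w + delta_w, delta_w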
Other BP Variations • Learning rate - dynamic • Weight updating - batch (1/epoch), continuous (1/training vector) • Use 2nd derivative info (Hessian matrix) • Change gradient descent - genetic algorithms or other optimization methods • Training vector presentation - random, weighted • Pruning unneeded connections
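To make the batch-versus-continuous bullet concrete, a small sketch for a single sigmoid unit (the data and helper names are made-up assumptions):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def neg_gradient(w, x, t):
        # negative gradient of the squared error for one sigmoid unit
        y = sigmoid(w @ x)
        return (t - y) * y * (1 - y) * x

    training_set = [(np.array([0.0, 1.0]), 1.0), (np.array([1.0, 0.0]), 0.0)]
    eta = 0.5
    w = np.zeros(2)

    # continuous updating: weights change after every training vector
    for x, t in training_set:
        w = w + eta * neg_gradient(w, x, t)

    # batch updating: gradients are accumulated and applied once per epoch
    grad_sum = np.zeros_like(w)
    for x, t in training_set:
        grad_sum += neg_gradient(w, x, t)
    w = w + eta * grad_sum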
More BP Alterations • Fully connected vs partially connected (“articulated”) • Connections spanning more than 1 layer • Functional links - input functions of independent variables • Hierarchies of nets - nets feeding nets • Committee nets - multiple nets working together either for consensus or by partitioning the domain space to act as experts
Main BP Issues • Choosing number of hidden neurons • Overtraining and overfitting - poor generalization • Covering domain entirely and equally - training set adequacy • Validation - testing set adequacy • Identifying redundant and/or misleading inputs • Black box aspect
Overtraining • (figure: error versus epochs - training set error keeps decreasing while testing set error eventually begins to rise)
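The plot above is usually handled by watching the testing (or validation) set error during training; below is a sketch of one simple rule for deciding when to stop (the patience idea and its value are assumptions, not something the slides prescribe):

    def should_stop(test_errors, patience=5):
        # stop once the testing-set error has not improved for `patience` epochs
        best_epoch = test_errors.index(min(test_errors))
        return len(test_errors) - 1 - best_epoch >= patience

    # example: testing error falls, then starts rising again
    errors = [0.9, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45, 0.5]
    print(should_stop(errors))   # True -> further epochs would likely just overtrain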
BP Web Sites • http://www.shef.ac.uk/psychology/gurney/notes/l4/l4.html - some basic notes with drawings • http://neuron.eng.wayne.edu/ - java applets, cool, interactive site • http://www2.psy.uq.edu.au/~brainwav/Manual/BackProp.html - more notes and drawings