

  1. IE 585 Backpropagation Networks

  2. BP Basics
  • supervised
  • can use multiple layers of “hidden” neurons
  • most common training procedure
  • generally uses a fully connected perceptron architecture
  • also called the generalized delta rule, error backpropagation, or back error propagation
  • networks trained this way are often called Multilayer Perceptrons (MLPs)
  • good for continuous and binary inputs and outputs
  • a theoretical universal approximator when nonlinear transfer functions are used
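
As an illustration of the fully connected, multilayer architecture described above, here is a minimal NumPy forward-pass sketch. The layer sizes, the [-0.5, 0.5] weight range, and the specific numbers are illustrative choices, though the names x, z, y, v, and w follow the notation slide later in the deck.

```python
# Minimal sketch of a fully connected two-layer perceptron (MLP) forward pass.
# Layer sizes and data are illustrative, not taken from the slides.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1                # one hidden layer (most common in practice)

v = rng.uniform(-0.5, 0.5, (n_in, n_hidden))   # input-to-hidden weight matrix
w = rng.uniform(-0.5, 0.5, (n_hidden, n_out))  # hidden-to-output weight matrix

x = np.array([0.2, 0.7, 0.1])                  # one input pattern, normalized to [0, 1]
z = sigmoid(x @ v)                             # hidden layer activations
y = sigmoid(z @ w)                             # network output
```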

  3. More BP Basics
  • learns through gradient descent down the error surface
  • the error signal (generally squared error) is propagated back through the weights to adjust them in the direction of greatest error decrease
  • must have a continuous, differentiable transfer function (NO step functions!)
  • iterative training that usually requires many passes (epochs) through the training set
  • generally uses a sigmoid transfer function and normalizes inputs/outputs to the range 0 to 1
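
As a concrete illustration of the 0-to-1 normalization mentioned in the last bullet, here is a minimal min-max scaling sketch; the data values are made up.

```python
# Sketch of 0-to-1 (min-max) normalization of a small data set.
# The numbers are illustrative only.
import numpy as np

data = np.array([[2.0, 150.0],
                 [5.0, 300.0],
                 [3.5, 225.0]])

lo, hi = data.min(axis=0), data.max(axis=0)
normalized = (data - lo) / (hi - lo)   # each column scaled to [0, 1]

# at recall/testing time, reuse the SAME lo/hi computed from the training set
```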

  4. Origins
  • Extension of LMS (Widrow and Hoff)
  • Difficulty - how to adjust weights in the hidden layer?
  • Paul Werbos, Ph.D. dissertation, Harvard, 1974
  • Parker, Internal Report, MIT, 1985
  • Popularized by McClelland and Rumelhart (PDP Group), 1986

  5. What is a universal approximator?
  • can theoretically approximate any relationship to any given degree of precision
  • in practice, this capability is limited by:
    • finite and imperfect samples
    • imperfect network training
    • missing input (independent) variables

  6. Advantages of BP
  • works in a wide variety of applications
  • good theoretical approximation properties
  • very reliable in both training and operation (testing)
  • fairly straightforward to understand
  • lots of software available

  7. Applications of BP
  • Function approximation
  • Pattern classification

  8. Drawbacks of BP
  • training can be very slow - requiring a large number of iterations
  • training can stall in a local minimum or saddle point area
  • must choose number of hidden neurons and number of hidden layers (practically, this is 1 or 2)
  • can be sensitive to both overparameterization (overfitting) and overtraining

  9. Training Idea - Gradient Descent
  [Figure: error plotted against weight values, showing the gradient, the learning rate (step size), a saddle point, a local minimum, and the global minimum]
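
A bare-bones sketch of the gradient descent idea in the figure, on a one-dimensional toy error surface; the error function, starting weight, and learning rate are all illustrative.

```python
# Gradient descent on a toy 1-D error surface.
def error(w):            # a toy error surface with a single minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):         # its derivative dE/dw
    return 2.0 * (w - 3.0)

w = 0.0                  # initial weight value
eta = 0.1                # learning rate (step size)
for epoch in range(50):
    w -= eta * gradient(w)   # step downhill along the negative gradient

print(w, error(w))       # w approaches 3.0, where the error is minimal
```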

  10. Goal of BP
  • Balance the ability to respond correctly to the input patterns used for training (memorization) with the ability to give reasonable (good) responses to input that is similar, but not identical, to that used in training (generalization).

  11. Generalization

  12. Overview of Training
  • select the architecture, transfer function, learning rate, and momentum rate
  • randomize the weights to small +/- values
  • randomize the order of the training set; normalize
  • present a training pattern
  • calculate the output error and propagate it back through the output weight matrix
  • propagate it back through the hidden layer(s) weight matrix
  • repeat until the stopping criterion is reached (a training-loop skeleton follows this list)
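
A skeleton of that training loop with per-pattern (continuous) weight updates. Here `net.forward` and `net.backward` stand in for the forward pass and the backpropagation step; they are placeholders for illustration, not a real library API.

```python
# Skeleton of the training procedure outlined above.
import random

def train(patterns, targets, net, eta=0.5, max_epochs=1000, tol=0.01):
    for epoch in range(max_epochs):
        order = list(range(len(patterns)))
        random.shuffle(order)                         # randomize presentation order
        total_error = 0.0
        for i in order:
            y = net.forward(patterns[i])              # present a training pattern
            err = [t - o for t, o in zip(targets[i], y)]
            total_error += sum(e * e for e in err)    # accumulate squared error
            net.backward(err, eta)                    # propagate error back through w, then v
        if total_error < tol:                         # stopping criterion (see next slide)
            break
    return net
```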

  13. Possible Stopping Criteria
  • Total number of epochs
  • Total squared error less than a threshold
  • Weights are stable (∆w’s are small)
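
The three criteria translate directly into simple checks inside the training loop; the thresholds below are illustrative placeholders.

```python
# Sketch of the three stopping criteria as a single check.
def should_stop(epoch, total_squared_error, weight_changes,
                max_epochs=1000, error_tol=0.01, delta_w_tol=1e-4):
    if epoch >= max_epochs:                                  # total number of epochs
        return True
    if total_squared_error < error_tol:                      # error below threshold
        return True
    if all(abs(dw) < delta_w_tol for dw in weight_changes):  # weights are stable
        return True
    return False
```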

  14. Notation (shown on the slide as labels on a three-layer network diagram)
  • x = input vector (Input Layer)
  • v = weight matrix into the hidden layer
  • z = hidden layer activations (Hidden Layer)
  • w = weight matrix into the output layer
  • y = network output (Output Layer)
  • t = target output

  15. Sigmoid Transfer Function
  Binary transfer function: y = 1/(1+exp(-(wx))), with outputs ranging from 0 to 1
  Bipolar transfer function: y = (1-exp(-(wx))) / (1+exp(-(wx))), with outputs ranging from -1 to 1
  [Figure: both functions plotted against the net input (wx) - the binary sigmoid runs from 0 to 1, the bipolar sigmoid from -1 to 1]
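
The same two functions in Python, together with their derivatives; the derivative identities are standard results used by the weight-update rules, not shown on this slide.

```python
# Binary and bipolar sigmoids from the slide, plus their derivatives.
import numpy as np

def binary_sigmoid(net):          # output in (0, 1)
    return 1.0 / (1.0 + np.exp(-net))

def binary_sigmoid_prime(net):    # f'(net) = f(net) * (1 - f(net))
    f = binary_sigmoid(net)
    return f * (1.0 - f)

def bipolar_sigmoid(net):         # output in (-1, 1)
    return (1.0 - np.exp(-net)) / (1.0 + np.exp(-net))

def bipolar_sigmoid_prime(net):   # f'(net) = 0.5 * (1 + f(net)) * (1 - f(net))
    f = bipolar_sigmoid(net)
    return 0.5 * (1.0 + f) * (1.0 - f)
```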

  16. Recall LMS Rule
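
The LMS equation on this slide was an image and is not in the transcript; the standard Widrow-Hoff (delta) rule for a single linear unit, which backpropagation generalizes, is

\[ w_i^{\text{new}} = w_i^{\text{old}} + \eta\,(t - y)\,x_i, \qquad y = \sum_i w_i x_i \]

where \(\eta\) is the learning rate and \(t\) is the target output.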

  17. Derivation of BP Algorithm
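
The derivation steps were on image slides and did not survive the transcript; the update rules it arrives at (the generalized delta rule), written in this deck's x/z/y and v/w notation with transfer function \(f\), are

\[ \mathrm{net}^{z}_{j} = \sum_i x_i\, v_{ij}, \qquad \mathrm{net}^{y}_{k} = \sum_j z_j\, w_{jk} \]
\[ \delta_k = (t_k - y_k)\, f'(\mathrm{net}^{y}_{k}), \qquad \Delta w_{jk} = \eta\, \delta_k\, z_j \]
\[ \delta_j = f'(\mathrm{net}^{z}_{j}) \sum_k \delta_k\, w_{jk}, \qquad \Delta v_{ij} = \eta\, \delta_j\, x_i \]

The hidden-layer term \(\delta_j\) is where the output error is propagated back through the weight matrix w.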

  18. [Figure: network diagram for the derivation, with inputs x, hidden units z, output y, weight matrix v into the hidden layer, and the w’s into the output layer; units are numbered 1, 2, ..., k]

  19. Momentum
  Momentum smooths descent down the error surface and helps prevent weights from getting “stuck”.
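
A common way to write the momentum term (not taken from the slide itself), with \(\mu\) the momentum rate and \(\tau\) the update index, is

\[ \Delta w_{jk}(\tau+1) = \eta\, \delta_k\, z_j + \mu\, \Delta w_{jk}(\tau) \]

so each step carries along a fraction \(\mu\) of the previous step, which smooths the descent and helps push the weights across small bumps in the error surface.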

  20. BP Example
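
The worked example on this slide was an image and is not in the transcript. As a stand-in, here is a small self-contained example in the same spirit: one hidden layer trained on XOR with the per-pattern generalized delta rule and a binary sigmoid. The sizes, learning rate, epoch count, and seed are illustrative, and training can stall in a local minimum for some seeds (one of the drawbacks listed earlier).

```python
# Small backpropagation example: one hidden layer learning XOR.
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs, already in [0, 1]
T = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

def f(a):                                       # binary sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hid, n_out = 2, 4, 1
v = rng.uniform(-0.5, 0.5, (n_in + 1, n_hid))   # input-to-hidden weights (+1 row = bias)
w = rng.uniform(-0.5, 0.5, (n_hid + 1, n_out))  # hidden-to-output weights (+1 row = bias)
eta = 0.5                                       # learning rate

for epoch in range(20000):
    for i in rng.permutation(len(X)):           # randomize presentation order
        x = np.append(X[i], 1.0)                # input pattern plus bias input
        z = np.append(f(x @ v), 1.0)            # hidden activations plus bias
        y = f(z @ w)                            # network output

        delta_out = (T[i] - y) * y * (1.0 - y)                      # output error term
        delta_hid = (w[:-1] @ delta_out) * z[:-1] * (1.0 - z[:-1])  # hidden error term

        w += eta * np.outer(z, delta_out)       # update hidden-to-output weights
        v += eta * np.outer(x, delta_hid)       # update input-to-hidden weights

for x_row, t in zip(X, T):                      # recall after training
    z = np.append(f(np.append(x_row, 1.0) @ v), 1.0)
    print(x_row, "->", f(z @ w).round(3), "target", t)
```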

  21. Other BP Variations
  • Learning rate - dynamic
  • Weight updating - batch (one update per epoch) or continuous (one update per training vector)
  • Use 2nd derivative information (the Hessian matrix)
  • Replace plain gradient descent with genetic algorithms or other optimization methods
  • Training vector presentation - random or weighted
  • Pruning unneeded connections

  22. More BP Alterations
  • Fully connected vs. partially connected (“articulated”)
  • Connections spanning more than 1 layer
  • Functional links - inputs that are functions of the independent variables
  • Hierarchies of nets - nets feeding nets
  • Committee nets - multiple nets working together, either for consensus or by partitioning the domain space to act as experts

  23. Main BP Issues
  • Choosing number of hidden neurons
  • Overtraining and overfitting - poor generalization
  • Covering domain entirely and equally - training set adequacy
  • Validation - testing set adequacy
  • Identifying redundant and/or misleading inputs
  • Black box aspect

  24. Overfitting

  25. Overtraining
  [Figure: error plotted against epochs for the training set and the testing set - training-set error keeps falling while testing-set error eventually turns upward, marking the onset of overtraining]
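
One standard response to the curve above is early stopping on the testing (validation) error. A sketch follows; `net.train_one_epoch`, `net.evaluate`, `net.get_weights`, and `net.set_weights` are hypothetical placeholders, not a real API.

```python
# Sketch of early stopping against the testing-set error curve.
def train_with_early_stopping(net, train_set, test_set, max_epochs=1000, patience=10):
    best_error = float("inf")
    best_weights = net.get_weights()
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)            # one pass through the training set
        test_error = net.evaluate(test_set)       # squared error on the testing set
        if test_error < best_error:
            best_error = test_error
            best_weights = net.get_weights()      # remember the best network so far
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:  # testing error has turned upward
            break
    net.set_weights(best_weights)                 # roll back to the best point found
    return net
```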

  26. BP Web Sites
  • http://www.shef.ac.uk/psychology/gurney/notes/l4/l4.html - some basic notes with drawings
  • http://neuron.eng.wayne.edu/ - java applets, cool, interactive site
  • http://www2.psy.uq.edu.au/~brainwav/Manual/BackProp.html - more notes and drawings
