
Error Backpropagation

All learning algorithms for (layered) feed-forward networks are based on a technique called error backpropagation. Slides by Rudolf Mak, TU/e Computer Science.


Presentation Transcript


  1. Error Backpropagation • All learning algorithms for (layered) feed-forward networks are based on a technique called error backpropagation • This is a form of corrective supervised learning which consists of two phases. In the first (forward) phase the output of each neuron is computed; in the second (backward) phase the partial derivatives of the error function with respect to the weights are computed, after which the weights are updated

  2. Approach • The approach we take • is a minor variation of the one in R. Rojas, Neural Networks, Springer, 1996 • applies to general feed-forward networks • allows distinct activation functions for each of the neurons • uses a graphical method called B-diagrams to illustrate how partial derivatives of the error function can be computed

  3. General Feed-forward Networks • A general feed-forward network consists of • n input nodes (numbered 1, …, n) • l hidden neurons (numbered n+1, …, n+l) • m output neurons (numbered n+l+1, …, n+l+m) • a set of connections such that the network does not contain cycles. Hence the hidden neurons can be topologically sorted, i.e. numbered such that (i, j) is a connection iff i < j and n < j and i < n+l+1.

  4. (figure)

  5. B-diagrams • A B-diagram is a directed acyclic network containing four types of nodes • Fan-in nodes • Fan-out nodes • Product nodes • Function nodes • The forward phase computes function composition, the backward phase computes partial derivatives.

  6. B-diagram (fan-in node) • Forward phase: the node outputs the sum of its inputs. Backward phase: the incoming derivative is passed unchanged to each input.

  7. B-diagram (fan-out node) • Forward phase: the input x is copied to each of the outputs y_1, …, y_n. Backward phase: the derivatives arriving at the outputs are summed, δ(·|x) = Σ_i δ(·|y_i).

  8. B-diagram (product node) • Forward phase: the input x is multiplied by the stored weight w, y = w·x. Backward phase: the incoming derivative is multiplied by the same weight, δ(·|x) = w·δ(·|y).

  9. B-diagram (function node) • Forward phase: the node applies its function f to the input, y = f(x), and stores the derivative f′(x). Backward phase: the incoming derivative is multiplied by the stored derivative, δ(·|x) = f′(x)·δ(·|y).
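A minimal sketch (not part of the original slides) of these node rules in Python; the function names and the convention that delta is the derivative propagated backward are illustrative choices.

```python
def product_forward(x, w):
    return w * x                    # forward: multiply the input by the stored weight

def product_backward(delta_y, w):
    return w * delta_y              # backward: scale the incoming derivative by the weight

def function_forward(x, f, fprime):
    return f(x), fprime(x)          # forward: apply f and store f'(x) for the backward phase

def function_backward(delta_y, stored_fprime):
    return stored_fprime * delta_y  # backward: multiply by the stored derivative

def fanin_forward(xs):
    return sum(xs)                  # forward: a fan-in node sums its inputs

def fanin_backward(delta_y, n_inputs):
    return [delta_y] * n_inputs     # backward: pass the derivative unchanged to each input

def fanout_forward(x, n_outputs):
    return [x] * n_outputs          # forward: a fan-out node copies its input to each output

def fanout_backward(deltas):
    return sum(deltas)              # backward: sum the derivatives arriving at the outputs

# usage: an edge with weight 2.0 feeding a squaring function node
y, fp = function_forward(product_forward(3.0, 2.0), lambda v: v * v, lambda v: 2 * v)
dx = product_backward(function_backward(1.0, fp), 2.0)   # d/dx of (2x)^2 at x = 3 is 8x = 24
```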

  10. Chain rule • Composing a function node for f with a function node for g gives a B-diagram that computes (g ∘ f)(x) = g(f(x)) in the forward phase. Presenting the value 1 at the output, the backward phase computes (g ∘ f)′(x) = g′(f(x))·f′(x).
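A quick numerical check of the chain-rule diagram; tanh and the squaring function are arbitrary example choices, not taken from the slides.

```python
import numpy as np

# Compose two function nodes, f followed by g, and check the chain rule numerically.
f, fprime = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2
g, gprime = lambda y: y ** 2, lambda y: 2.0 * y

x = 0.7
forward = g(f(x))                       # forward phase: (g o f)(x)
backward = gprime(f(x)) * fprime(x)     # backward phase: g'(f(x)) * f'(x)

eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)   # central difference
print(forward, backward, numeric)       # backward and numeric should agree closely
```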

  11. Remark • Note that the product node, the fan-in node, and the function node are all special cases of a more general node for functions f(x_1, x_2, …) with an arbitrary number of arguments that stores all partial derivatives.

  12. (figure)

  13. Translation scheme • As a first step in the development of the error backpropagation algorithm we show how to translate a general feed-forward net into a B-diagram • Replace each input node by a fan-out node • Replace each edge by a product node • Replace each neuron by a fan-in node, followed by a function node, followed by a fan-out node

  14. Translation of a neuron • Note that this translation only captures the activation function and connection pattern of a neuron. The weights are modeled by separate product nodes.

  15. Simplifications • The B-diagram of a general feed-forward net can be simplified as follows: • Neurons with a single output do not require a fan-out node • Neurons with a single input do not require a fan-in node • Neurons with activation function f(z) = z do not require a function node • Edges with weight 1 do not require a product node

  16. Backpropagation theorem • Let B be the B-diagram of a general feed-forward net N that computes a function F : ℝⁿ → ℝ. Presenting value x_i at input node i of B and performing the forward phase of each node (in the order indicated by the numbering of the nodes of N) will result in the value F(x) at the output of B. Subsequently presenting the value 1 at the output node and performing the backward phase will result in the partial derivative ∂F(x)/∂x_i at input i.

  17. Error function • Consider a general FFN that computes the function y = F(x; w), with training set {(x^q, t^q) : 1 ≤ q ≤ P}. Then the error of training pair q is defined as the squared distance E_q(w) = ½·‖F(x^q; w) − t^q‖².
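A small illustration of the pair error under the sum-of-squares definition reconstructed above; the linear stand-in network F is hypothetical.

```python
import numpy as np

# Error of training pair q, assuming E_q(w) = 1/2 * ||F(x_q; w) - t_q||^2.
def pair_error(F, x_q, t_q):
    return 0.5 * np.sum((F(x_q) - t_q) ** 2)

# hypothetical usage with a fixed linear "network" as a stand-in for a trained FFN
F = lambda x: np.array([[0.2, -0.1], [0.4, 0.3]]) @ x
print(pair_error(F, np.array([1.0, 2.0]), np.array([0.5, -0.5])))
```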

  18. FFNs that compute Error Functions (figure showing the hidden neurons)

  19. Error Dependence on Weight w_ij (figure; annotation: cut here to create an extra input)

  20. E(rror)B(ack)P(ropagation) Learning

  21. EBP learning (forward phase) • Visit the neurons in topological order and compute, for j = n+1, …, n+l+m, the output o_j = f_j(Σ_i w_ij·o_i), where the sum ranges over the connections (i, j).

  22. EBP learning (backward phase) • Visit the neurons in reverse topological order and compute the error terms δ_j: for an output neuron, δ_j = f_j′(z_j)·(o_j − t_j); for a hidden neuron, δ_j = f_j′(z_j)·Σ_k w_jk·δ_k, where the sum ranges over the successors k of j.

  23. EBP learning (update phase) • Update each weight by w_ij ← w_ij − η·δ_j·o_i. Beware: a weight update can only be performed after all errors that depend on that weight have been computed. A separate phase trivially guarantees this requirement.
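The formulas of slides 21–23 were rendered as images in the original deck; the sketch below is one possible reconstruction of the three phases for a general feed-forward net with topologically sorted nodes, assuming sigmoid activations, no biases, zero-based node indices, and the sum-of-squares error of slide 17. All names and the data layout are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ebp_step(x, t, n_in, n_out, edges, w, eta=0.1):
    """One EBP step on a general feed-forward net with topologically sorted nodes.

    Nodes 0 .. n_in-1 are input nodes, the last n_out nodes are output neurons,
    edges is a list of connections (i, j) with i < j, and w maps edges to weights.
    """
    n_total = max(j for _, j in edges) + 1
    o = np.zeros(n_total)
    o[:n_in] = x

    # forward phase: visit the neurons in topological order
    for j in range(n_in, n_total):
        z = sum(w[(i, jj)] * o[i] for (i, jj) in edges if jj == j)
        o[j] = sigmoid(z)

    # backward phase: error terms delta_j, visited in reverse topological order
    delta = np.zeros(n_total)
    for j in range(n_total - 1, n_in - 1, -1):
        if j >= n_total - n_out:
            upstream = o[j] - t[j - (n_total - n_out)]                          # output neuron
        else:
            upstream = sum(w[(jj, k)] * delta[k] for (jj, k) in edges if jj == j)  # hidden neuron
        delta[j] = o[j] * (1.0 - o[j]) * upstream                               # sigmoid'(z) = o(1-o)

    # update phase: only after all error terms have been computed
    for (i, j) in edges:
        w[(i, j)] -= eta * delta[j] * o[i]
    return o[n_total - n_out:], w

# hypothetical 2-2-1 net written as a general FFN: inputs 0,1; hidden 2,3; output 4
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 4), (3, 4)]
w = {e: 0.1 for e in edges}
y, w = ebp_step(np.array([1.0, 0.0]), np.array([1.0]), 2, 1, edges, w)
```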

  24. Layered version of EBP • To obtain a version of the error backpropagation algorithm for layered feed-forward networks, i.e. multi-layer perceptrons, we • introduce a layer-oriented node numbering • visit the nodes on a layer-by-layer basis • introduce vector notation for quantities pertaining to a single layer

  25. Layer-oriented Node Numbers • Assume that the nodes of the network can be organized in r+1 layers, numbered 0, …, r • For 0 ≤ s ≤ r+1, let n_s denote the number of nodes in layers 0, …, (s−1). Hence node i lies in layer s iff n_s < i ≤ n_{s+1} • Renumber the nodes according to this layer-oriented scheme

  26. Weight Matrix of Layer s • Let W_s be the (n_s × n_{s−1})-matrix of the weights on the connections from layer s−1 to layer s • Note that for the sake of simplicity we have added zero weights such that there exists a connection between any pair of nodes in successive layers • For convenience we write w^s_ij instead of (W_s)_ij

  27. EBP (forward phase, layered)

  28. EBP (backward phase, layered)

  29. EBP (update phase, layered)

  30. Vector notation • For a continuous and differentiable function f : ℝ → ℝ and a vector z ∈ ℝⁿ of arbitrary dimension n, define the n-dimensional vector F(z) componentwise by (F(z))_i = f(z_i), and the diagonal matrix D(z) by D(z) = diag(f′(z_1), …, f′(z_n)).

  31. EBP (layered and vectorized) • Forward phase: o_0 = x and, for s = 1, …, r, z_s = W_s·o_{s−1}, o_s = F(z_s) • Backward phase: δ_r = D(z_r)·(o_r − t) and, for s = r, …, 2, δ_{s−1} = D(z_{s−1})·W_sᵀ·δ_s
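A vectorized sketch of the layered algorithm of slides 27–31, assuming one activation function f for all layers (the slides allow distinct activation functions per neuron), no biases, and the sum-of-squares error; multiplying elementwise by f′(z) implements the diagonal matrix D(z) of slide 30. All names are illustrative.

```python
import numpy as np

def mlp_ebp_step(x, t, Ws, f, fprime, eta=0.1):
    """One vectorized EBP step for a layered net with weight matrices Ws[0..r-1]."""
    # forward phase: o_0 = x, z_s = W_s o_{s-1}, o_s = f(z_s)
    outs, zs = [x], []
    for W in Ws:
        z = W @ outs[-1]
        zs.append(z)
        outs.append(f(z))

    # backward phase: delta_r = D(z_r)(o_r - t), delta_{s-1} = D(z_{s-1}) W_s^T delta_s,
    # where multiplying elementwise by fprime(z) plays the role of D(z)
    deltas = [None] * len(Ws)
    deltas[-1] = fprime(zs[-1]) * (outs[-1] - t)
    for s in range(len(Ws) - 1, 0, -1):
        deltas[s - 1] = fprime(zs[s - 1]) * (Ws[s].T @ deltas[s])

    # update phase: W_s <- W_s - eta * delta_s o_{s-1}^T
    for s, W in enumerate(Ws):
        Ws[s] = W - eta * np.outer(deltas[s], outs[s])
    return outs[-1], Ws

# hypothetical 2-3-1 network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
f, fp = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2
y, Ws = mlp_ebp_step(np.array([0.5, -1.0]), np.array([1.0]), Ws, f, fp)
```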

  32. Practical Aspects • Convergence improvements • Elementary improvements • Advanced first-order methods • Second-order methods • Generalization • Overtraining • Training with cross-validation

  33. Elementary Improvements • Momentum term • Resilient backpropagation (Rprop) • the gradient determines only the sign of the weight updates • the learning rate increases for a stable gradient • the learning rate decreases for an alternating gradient
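Hedged sketches of the two elementary improvements; the Rprop variant below omits the usual weight-backtracking step, and all hyperparameter values are illustrative.

```python
import numpy as np

# Momentum: part of the previous update is reused in the current one.
def momentum_update(w, grad, velocity, eta=0.1, mu=0.9):
    velocity = mu * velocity - eta * grad
    return w + velocity, velocity

# Resilient backpropagation (Rprop): only the sign of the gradient is used;
# the per-weight step size grows while the sign is stable and shrinks when it flips.
def rprop_update(w, grad, prev_grad, step, up=1.2, down=0.5,
                 step_min=1e-6, step_max=50.0):
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * up, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * down, step_min), step)
    w = w - np.sign(grad) * step
    return w, step

# usage on a toy weight vector with a hypothetical gradient
w = np.array([0.5, -0.3])
vel, step, prev = np.zeros_like(w), np.full_like(w, 0.1), np.zeros_like(w)
grad = np.array([0.2, -0.1])
w, vel = momentum_update(w, grad, vel)
w, step = rprop_update(w, grad, prev, step)
```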

  34. First-order Methods • Steepest descent: w_{k+1} = w_k + α_k·d_k with d_k = −∇E(w_k), where α_k is chosen such that E(w_k + α_k·d_k) is minimal • Conjugate gradient methods: the search directions are given by d_k = −∇E(w_k) + β_k·d_{k−1}, with β_k suitably chosen.
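A sketch of the two first-order schemes on a toy quadratic error; the coarse grid line search stands in for an exact minimization of E along the search direction, and the Fletcher–Reeves choice of β_k is one of several possibilities.

```python
import numpy as np

def line_search(E, w, d, alphas=np.linspace(0.0, 1.0, 101)):
    # coarse stand-in for choosing alpha such that E(w + alpha * d) is minimal
    return min(alphas, key=lambda a: E(w + a * d))

def steepest_descent_step(E, gradE, w):
    d = -gradE(w)                              # steepest-descent direction
    return w + line_search(E, w, d) * d

def fletcher_reeves_step(E, gradE, w, d_prev, g_prev):
    # conjugate gradient direction d_k = -g_k + beta_k * d_{k-1}
    g = gradE(w)
    beta = (g @ g) / (g_prev @ g_prev)
    d = -g + beta * d_prev
    return w + line_search(E, w, d) * d, d, g

# toy quadratic error E(w) = 1/2 w^T A w
A = np.array([[3.0, 0.2], [0.2, 1.0]])
E = lambda w: 0.5 * w @ A @ w
gradE = lambda w: A @ w
w = np.array([1.0, -2.0])
w = steepest_descent_step(E, gradE, w)
g = gradE(w); d = -g
w, d, g = fletcher_reeves_step(E, gradE, w, d, g)
```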

  35. Second-order Methods (derivation) • Consider the Taylor expansion of the error function around w_0: E(w) ≈ E(w_0) + ∇E(w_0)ᵀ(w − w_0) + ½(w − w_0)ᵀH(w − w_0), where H is the Hessian of E at w_0 • Ignore third- and higher-order terms and choose w such that this expression is minimal, i.e. w = w_0 − H⁻¹·∇E(w_0)

  36. (Quasi-)Newton methods • Quasi-Newton methods use the update rule w_{k+1} = w_k − H_k⁻¹·∇E(w_k), with H_k (an approximation of) the Hessian at w_k • Fast convergence (Newton's method requires 1 iteration for a quadratic error function) • Solving the above equation is time consuming • The Hessian matrix H can be very large

  37. Levenberg-Marquardt Methods • LM methods use the update rule w_{k+1} = w_k − (H + λI)⁻¹·∇E(w_k) • This is a combination of gradient descent and Newton's method • If λ is small, then the step approaches the Newton step −H⁻¹·∇E(w_k) • If λ is large, then the step approaches a (small) gradient-descent step −(1/λ)·∇E(w_k)
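A sketch of the (quasi-)Newton and Levenberg-Marquardt updates as reconstructed above, on a toy quadratic error function; on a quadratic, the Newton step indeed reaches the minimum in one iteration.

```python
import numpy as np

def newton_step(grad, H):
    # Delta w = -H^{-1} grad E(w): exact for a quadratic error function
    return -np.linalg.solve(H, grad)

def levenberg_marquardt_step(grad, H, lam):
    # Delta w = -(H + lambda I)^{-1} grad E(w):
    # small lambda -> Newton step, large lambda -> short gradient-descent step
    return -np.linalg.solve(H + lam * np.eye(len(grad)), grad)

# toy quadratic error E(w) = 1/2 w^T A w - b^T w, minimum where A w = b
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = np.zeros(2)
grad, H = A @ w - b, A
print(w + newton_step(grad, H))                      # reaches the minimum in one step
print(w + levenberg_marquardt_step(grad, H, 10.0))   # a more cautious, damped step
```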

  38. Generalization • Generalization addresses the question of how well a net performs on fresh samples from the population, i.e. samples that are not part of the training set. • Generalization is influenced by three factors: • The architecture of the network • The size of the training set • The complexity of the problem

  39. Overtraining • Overtraining is the situation in which the network memorizes the data of the training set, but generalizes poorly. • The size of the training set must be related to the amount of data the network can memorize (i.e. the number of weights). • Conversely, to prevent overtraining the number of weights must be kept in proportion to the size of the training set

  40. Cross-Validation • To protect against overtraining a technique called cross-validation can be used. It involves • an additional data set called the validation set • computing the error made by the net on this validation set while training with the training set • stopping the training when the error on the validation set starts increasing • Usually the size of the validation set is chosen to be roughly half the size of the training set.
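A sketch of training with cross-validation as an early-stopping loop; train_step and error are hypothetical stand-ins for one EBP pass over the training set and for the error of a weight vector on a data set.

```python
def train_with_early_stopping(w, train_step, error, train_set, val_set,
                              max_epochs=1000, patience=10):
    """Stop training when the validation error starts increasing.

    train_step(w, data) and error(w, data) are hypothetical stand-ins.
    """
    best_w, best_val, bad_epochs = w, float("inf"), 0
    for epoch in range(max_epochs):
        w = train_step(w, train_set)          # one EBP pass over the training set
        val = error(w, val_set)               # monitor the error on the validation set
        if val < best_val:
            best_w, best_val, bad_epochs = w, val, 0
        else:
            bad_epochs += 1                   # validation error no longer improving
            if bad_epochs >= patience:
                break                         # stop before the net starts memorizing
    return best_w
```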

  41. Practical Aspects • Preprocessing • Normalization • Decorrelation • Network pruning • Magnitude-based • Optimal brain damage • Optimal brain surgeon

  42. Preprocessing • Normalization: shift and scale each input component to zero mean and unit variance, x̂_i = (x_i − μ_i)/σ_i • Decorrelation: transform the inputs so that their components are uncorrelated, e.g. by projecting onto the eigenvectors of the covariance matrix (PCA)
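A sketch of the two preprocessing steps, assuming the usual zero-mean/unit-variance normalization and PCA-based decorrelation; the slide's own formulas were not in the transcript.

```python
import numpy as np

def normalize(X):
    # shift each input component to zero mean and scale it to unit variance
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sigma > 0, sigma, 1.0)

def decorrelate(X):
    # rotate onto the eigenvectors of the covariance matrix (PCA) so that
    # the transformed input components are uncorrelated
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs

# toy data with very different scales per component
X = np.random.default_rng(1).normal(size=(100, 3)) * [1.0, 5.0, 0.2]
Xd = decorrelate(normalize(X))
```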

  43. Pruning • Pruning is a technique to increase network performance by elimination (pruning in the strict sense) or addition (pruning in the broad sense) of neurons and/or connections.

  44. Pruning connections • Optimal Brain Damage • Optimal Brain Surgeon
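A sketch of magnitude-based pruning and of the Optimal Brain Damage saliency s_i = ½·h_ii·w_i², assuming a diagonal approximation of the Hessian; Optimal Brain Surgeon additionally uses the full inverse Hessian and is not sketched here.

```python
import numpy as np

def prune_by_magnitude(w, fraction=0.1):
    # remove (zero out) the given fraction of weights with the smallest magnitude
    k = int(len(w) * fraction)
    idx = np.argsort(np.abs(w))[:k]
    w = w.copy()
    w[idx] = 0.0
    return w

def obd_saliency(w, h_diag):
    # Optimal Brain Damage saliency s_i = 1/2 * h_ii * w_i^2, using a diagonal
    # approximation h_diag of the Hessian of the error; prune low-saliency weights
    return 0.5 * h_diag * w ** 2

w = np.array([0.8, -0.05, 0.3, 0.01, -1.2])
print(prune_by_magnitude(w, fraction=0.4))
print(obd_saliency(w, h_diag=np.ones_like(w)))
```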
