
CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Presentation Transcript


  1. CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya Training The Feedforward Network; Backpropagation Algorithm

  2. Multilayer Feedforward Network - Needed for solving problems that are not linearly separable. - Hidden layer neurons assist the computation.

  3. [Figure: layered network with an input layer, a hidden layer, and an output layer; forward connections only, no feedback connections.]
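As a concrete illustration of the layered structure above, here is a minimal sketch of one forward pass through such a network in Python/NumPy. The layer sizes, the weight matrices W_hidden and W_output, and the use of the sigmoid at both layers are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    # Characteristic function used later in the lecture: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_output):
    """One forward pass: input layer -> hidden layer -> output layer.
    Connections are feedforward only; there is no feedback path."""
    h = sigmoid(W_hidden @ x)      # hidden-layer activations
    o = sigmoid(W_output @ h)      # output-layer activations
    return o

# Hypothetical sizes: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))
W_output = rng.normal(size=(2, 4))
x = np.array([0.5, -1.0, 0.25])
print(forward(x, W_hidden, W_output))
```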

  4. Gradient Descent Rule: neuron j is fed by neuron i through weight Wji. ΔWji ∝ -∂E/∂Wji, where E = error = ½ Σ(p=1 to P) Σ(m=1 to M) (tm - om)², the TOTAL SUM SQUARE ERROR (TSS).

  5. Gradient Descent for a Single Neuron. [Figure: neuron with output y, inputs X0, …, Xn and weights W0, …, Wn, where X0 = -1.] Net input = Σ(i=0 to n) WiXi.

  6. y = f(net). Characteristic function f = sigmoid = 1 / (1 + e^(-net)); df/dnet = f(1 - f). [Figure: plot of y vs. net.]
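A minimal sketch of the sigmoid and the derivative identity df/dnet = f(1 - f) from this slide, with a numerical check; the function names and the finite-difference step size are illustrative assumptions.

```python
import numpy as np

def sigmoid(net):
    # f(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_derivative(net):
    # df/dnet = f(1 - f), the identity from the slide
    f = sigmoid(net)
    return f * (1.0 - f)

# Numerical check of the identity at a sample point
net = 0.7
eps = 1e-6
numeric = (sigmoid(net + eps) - sigmoid(net - eps)) / (2 * eps)
print(sigmoid_derivative(net), numeric)   # the two values should agree closely
```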

  7. [Figure: single neuron with observed output o and target t, weights W0, …, Wn and inputs X0, …, Xn.] ΔWi ∝ -∂E/∂Wi, with E = ½(t - o)².

  8. W = &lt;Wn, …, W0&gt; is randomly initialized. ΔWi ∝ -∂E/∂Wi, i.e. ΔWi = -η ∂E/∂Wi, where η is the learning rate, 0 ≤ η ≤ 1.

  9. ΔWi = -η ∂E/∂Wi. ∂E/∂Wi = ∂(½(t - o)²)/∂Wi = (∂E/∂o)(∂o/∂Wi) by the chain rule = -(t - o) (∂o/∂net) (∂net/∂Wi).

  10. ∂o/∂net = ∂f(net)/∂net = f'(net) = f(1 - f) = o(1 - o). [Figure: plot of o vs. net.]

  11. net = Σ(i=0 to n) WiXi, so ∂net/∂Wi = Xi. [Figure: single neuron with output y, weights W0, …, Wi, …, Wn and inputs X0, …, Xi, …, Xn.]

  12. E = ½(t - o)². Combining ∂E/∂o, ∂f/∂net and ∂net/∂Wi gives ΔWi = η (t - o) o (1 - o) Xi. [Figure: single neuron with weights W0, …, Wi, …, Wn and inputs X0, …, Xi, …, Xn.]

  13. E = ½(t - o)², ΔWi = η (t - o) o (1 - o) Xi. Observation: if Xi = 0, then ΔWi = 0; the larger Xi is, the larger ΔWi is. This is BLAME/CREDIT ASSIGNMENT.
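The derivation in slides 9-13 yields the update ΔWi = η (t - o) o (1 - o) Xi for each weight of a single sigmoid neuron. Here is a minimal sketch of that update in Python/NumPy; the training sample, the learning rate, and the choice of X0 = -1 as the bias input are illustrative assumptions.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_step(W, x, t, eta):
    """One gradient-descent step for a single sigmoid neuron:
    delta_W_i = eta * (t - o) * o * (1 - o) * x_i, vectorised over i."""
    o = sigmoid(W @ x)                       # observed output
    delta_W = eta * (t - o) * o * (1 - o) * x
    return W + delta_W, o

# Hypothetical example: two real inputs plus x0 = -1 as the bias input
x = np.array([-1.0, 0.4, 0.9])
W = np.array([0.1, -0.2, 0.05])
t = 1.0                                      # target output
for _ in range(5):
    W, o = delta_rule_step(W, x, t, eta=0.5)
print(W, o)                                  # o moves toward the target t
```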

  14. The larger the difference (t - o), the larger ΔW. If (t - o) is positive, so is ΔW; if (t - o) is negative, so is ΔW.

  15. If o is 0 or 1, ΔW = 0. o reaches 0 or 1 when net = -∞ or +∞. ΔW → 0 because o → 0 or 1. This is called "saturation" or "paralysis" of the network; it happens because of the sigmoid. [Figure: sigmoid curve, o vs. net, saturating at 0 and 1.]

  16. Solutions to network saturation: 1. y = k / (1 + e^(-x)) 2. y = tanh(x). [Figure: plots of the two curves against x, with saturation levels marked k and -k.]

  17. Solutions to network saturation (contd.): 3. Scale the inputs, i.e. reduce their values. Problem: floating/fixed-point number representation error.
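One hedged way to realise point 3 (scaling the inputs so the net input stays out of the sigmoid's saturated region) is to normalise each input dimension before training; the mean/standard-deviation scheme below is one common choice and is an assumption, not something prescribed by the slides.

```python
import numpy as np

def scale_inputs(X):
    """Scale each input column to zero mean and unit standard deviation,
    reducing the magnitude of the net input and so the risk of saturation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0          # avoid division by zero for constant inputs
    return (X - mean) / std

# Hypothetical raw inputs with large values that would saturate the sigmoid
X_raw = np.array([[1000.0, 0.002],
                  [1200.0, 0.004],
                  [ 800.0, 0.001]])
print(scale_inputs(X_raw))
```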

  18. ΔWi = η (t - o) o (1 - o) Xi. The smaller η is, the smaller ΔW.

  19. [Figure: error E vs. weight Wi, showing the operating point and the global minimum.] Start with a large η and gradually decrease it.
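The advice on this slide (start with a large η and gradually decrease it) can be sketched as a simple decay schedule; the inverse-time form and the constants below are illustrative assumptions, not part of the lecture.

```python
def learning_rate(iteration, eta0=0.5, decay=0.01):
    """Inverse-time decay: the learning rate is large at the start and
    gradually shrinks as training approaches the operating point."""
    return eta0 / (1.0 + decay * iteration)

for n in (0, 10, 100, 1000):
    print(n, learning_rate(n))
```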

  20. Gradient descent training is typically slow. First parameter: η, the learning rate. Second parameter: β, the momentum factor, 0 ≤ β ≤ 1.

  21. Momentum Factor: include a part of the previous weight change in the current weight change: (ΔWi)n = η (t - o) o (1 - o) Xi + β (ΔWi)n-1, where n indexes the iteration.
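A minimal sketch of the momentum update (ΔWi)n = η (t - o) o (1 - o) Xi + β (ΔWi)n-1 for the single sigmoid neuron; the sample data, η, and β are illustrative assumptions.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def momentum_step(W, prev_delta_W, x, t, eta, beta):
    """Gradient-descent step with momentum:
    (delta_W)_n = eta*(t - o)*o*(1 - o)*x + beta*(delta_W)_{n-1}."""
    o = sigmoid(W @ x)
    delta_W = eta * (t - o) * o * (1 - o) * x + beta * prev_delta_W
    return W + delta_W, delta_W

# Hypothetical run; beta is kept small relative to eta (see slide 27)
x = np.array([-1.0, 0.4, 0.9])
W = np.array([0.1, -0.2, 0.05])
prev = np.zeros_like(W)
for _ in range(5):
    W, prev = momentum_step(W, prev, x, t=1.0, eta=0.5, beta=0.05)
print(W)
```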

  22. Effect of β: If (ΔWi)n and (ΔWi)n-1 have the same sign, (ΔWi)n is enhanced. If they have opposite signs, the effective (ΔWi)n is reduced.

  23. [Figure: error E vs. weight W, with points A, P, Q, R, S along the curve and the operating point marked.] Momentum 1) accelerates movement at A, and 2) dampens oscillation near the global minimum.

  24. (ΔWi)n = η (t - o) o (1 - o) Xi + β (ΔWi)n-1; the first term is pure gradient descent, the second is the momentum term. What is the relation between η and β?

  25. Relation between η and β: what if η &gt;&gt; β? What if η &lt;&lt; β? (ΔWi)n = η (t - o) o (1 - o) Xi + β (ΔWi)n-1.

  26. Relation between η and β (contd.): If η &lt;&lt; β, then (ΔWi)n ≈ β (ΔWi)n-1, a recurrence relation: (ΔWi)n = β (ΔWi)n-1 = β[β (ΔWi)n-2] = β² (ΔWi)n-2 = … = βⁿ (ΔWi)0.
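To see why this recurrence makes the weight change die out, here is a quick numeric illustration of βⁿ (ΔWi)0 for a fractional β; the values β = 0.1 and (ΔWi)0 = 1.0 are arbitrary assumptions.

```python
beta = 0.1           # a fraction, as assumed on the slide
delta_w0 = 1.0       # hypothetical initial weight change
for n in range(5):
    # (delta_W)_n = beta^n * (delta_W)_0, shrinking toward zero each iteration
    print(n, (beta ** n) * delta_w0)
```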

  27. Relation between η and β (contd.): Empirical practice: β is typically 1/10th of η. If β is very large compared to η, the effect of the output error, the input, and the neuron characteristic is not felt. Also, ΔW goes on decreasing, since β is a fraction.
