
Backpropagation

Presentation Transcript


  1. Backpropagation Introduction to Artificial Intelligence COS302 Michael L. Littman Fall 2001

  2. Administration Questions, concerns?

  3. Classification Percept. [Diagram: a perceptron unit. Inputs x1, x2, x3, …, xD and a constant input 1 feed through weights w1, w2, w3, …, wD and bias weight w0 into a summing node (net); the sum is passed through the squashing function g to produce out.]

  4. Perceptrons Recall that the squashing function makes the output look more like bits: 0 or 1 decisions. What if we give it inputs that are also bits?
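
As a concrete illustration of the unit from the classification diagram, here is a minimal sketch (mine, not from the lecture; Python, with placeholder weights) of a perceptron that sums weighted bit inputs and squashes the result:

    import math

    def squash(net):
        """Sigmoid squashing function g: pushes the output toward 0 or 1."""
        return 1.0 / (1.0 + math.exp(-net))

    def perceptron_out(x, w, w0):
        """out = g(w0 + sum_k w_k * x_k) for bit (or real-valued) inputs x."""
        net = w0 + sum(wk * xk for wk, xk in zip(w, x))
        return squash(net)

    # Bit inputs with arbitrary illustrative weights; the output lands near 0 or 1.
    print(perceptron_out([1, 0, 1], w=[10, 10, -10], w0=-15))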

  5. A Boolean Function
     A B C D E F G | out
     1 0 1 0 1 0 1 |  0
     0 1 1 0 0 0 1 |  0
     0 0 1 0 0 1 0 |  0
     1 0 0 0 1 0 0 |  1
     0 0 1 1 0 0 0 |  1
     1 1 1 0 1 0 1 |  0
     0 1 0 1 0 0 1 |  1
     1 1 1 1 1 0 1 |  1
     1 1 1 1 1 1 1 |  1
     1 1 1 0 0 1 1 |  0

  6. Think Graphically [Diagram: a 2×2 grid with C on the horizontal axis and D on the vertical axis; the output is 1 in every cell except C = 1, D = 0, where it is 0.] Can a perceptron learn this?
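
The grid above is linearly separable (only the C = 1, D = 0 corner is 0), so a single unit can represent it. A quick check in Python, using illustrative weights of my own choosing in the style of the next slide:

    import math

    def g(net):
        return 1.0 / (1.0 + math.exp(-net))

    # The target is 1 except when C = 1 and D = 0, i.e. out = (~C) + D.
    # Illustrative weights (not from the slides): wC = -10, wD = 10, w0 = 5.
    for C in (0, 1):
        for D in (0, 1):
            out = g(-10 * C + 10 * D + 5)
            print(C, D, round(out))   # prints 1, 1, 0, 1 over the four corners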

  7. Ands and Ors out(x) = g(sum_k w_k x_k). How can we set the weights to represent (v1)(v2)(~v7)? AND: w_i = 0, except w1 = 10, w2 = 10, w7 = -10, w0 = -15 (net is 5 when the conjunction holds, at most -5 otherwise). How about ~v3 + v4 + ~v8? OR: w_i = 0, except w3 = -10, w4 = 10, w8 = -10, w0 = 15 (net is -5 when the disjunction fails, at least 5 otherwise).
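
A short sketch (my code) that verifies the slide's AND and OR weight settings over every input combination:

    import itertools
    import math

    def g(net):
        return 1.0 / (1.0 + math.exp(-net))

    # AND of v1, v2, ~v7: w1 = 10, w2 = 10, w7 = -10, w0 = -15 (all others 0).
    for v1, v2, v7 in itertools.product((0, 1), repeat=3):
        out = g(10 * v1 + 10 * v2 - 10 * v7 - 15)
        assert round(out) == (v1 and v2 and not v7)

    # OR of ~v3, v4, ~v8: w3 = -10, w4 = 10, w8 = -10, w0 = 15 (all others 0).
    for v3, v4, v8 in itertools.product((0, 1), repeat=3):
        out = g(-10 * v3 + 10 * v4 - 10 * v8 + 15)
        assert round(out) == (v3 == 0 or v4 == 1 or v8 == 0)

    print("AND and OR weight settings check out")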

  8. Majority Are at least half the bits on? Set all the weights to 1 and w0 to -n/2.
     A B C D E F G | out
     1 0 1 0 1 0 1 |  1
     0 1 1 0 0 0 1 |  0
     0 0 1 0 0 1 0 |  0
     1 0 0 0 1 0 0 |  0
     1 1 1 0 1 0 1 |  1
     0 1 0 1 0 0 1 |  0
     1 1 1 1 1 0 1 |  1
     1 1 1 1 1 1 1 |  1
  What is the representation size using a decision tree?
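
A hedged check (my code) of the majority construction for n = 7 bits, with all weights 1 and w0 = -n/2; rounding the sigmoid output recovers the majority vote:

    import itertools
    import math

    def g(net):
        return 1.0 / (1.0 + math.exp(-net))

    n = 7
    for bits in itertools.product((0, 1), repeat=n):
        out = g(sum(bits) - n / 2)                 # all weights 1, w0 = -n/2
        assert round(out) == (sum(bits) >= n / 2)  # at least half the bits on
    print("majority of", n, "bits represented with", n + 1, "weights")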

  9. Sweet Sixteen? The sixteen Boolean functions of two inputs: ab, (~a)+(~b), a(~b), (~a)+b, (~a)b, a+(~b), (~a)(~b), a+b, a, ~a, b, ~b, 1, 0, a = b, and a exclusive-or b (a xor b).

  10. XOR Constraints
     A B | out | constraint
     0 0 |  0  | g(w0) < 1/2
     0 1 |  1  | g(wB + w0) > 1/2
     1 0 |  1  | g(wA + w0) > 1/2
     1 1 |  0  | g(wA + wB + w0) < 1/2
  So w0 < 0, wA + w0 > 0, and wB + w0 > 0. Adding the middle two gives wA + wB + 2w0 > 0, hence wA + wB + w0 > -w0 > 0; but the last row requires wA + wB + w0 < 0. That would need 0 < wA + wB + w0 < 0, a contradiction, so no single unit can represent XOR.
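
The same contradiction can be seen by brute force. This small search (mine, not from the lecture) scans a grid of integer weights and finds none that satisfy all four constraints at the net level (comparing g(net) with 1/2 is equivalent to comparing net with 0):

    import itertools

    found = False
    for wA, wB, w0 in itertools.product(range(-10, 11), repeat=3):
        if w0 < 0 and wA + w0 > 0 and wB + w0 > 0 and wA + wB + w0 < 0:
            found = True
    print("single-unit weights for XOR found:", found)   # False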

  11. Linearly Separable [Diagram: the four input points plotted on axes C ∈ {0, 1} and D ∈ {0, 1}.] Is XOR problematic?

  12. How Represent XOR? A xor B = (A+B)(~A+~B). [Diagram: a two-layer threshold network. The inputs 1, A, B feed hidden unit c1 through weights -5, 10, 10 (computing A+B) and hidden unit c2 through weights 15, -10, -10 (computing ~A+~B); the output unit combines 1, c1, c2 through weights -15, 10, 10, computing the AND of the two hidden units.]
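
A sketch (my own code) of the network in the diagram: hidden unit c1 computes A+B, c2 computes ~A+~B, and the output unit ANDs them, realizing A xor B = (A+B)(~A+~B):

    import math

    def g(net):
        return 1.0 / (1.0 + math.exp(-net))

    def xor_net(A, B):
        c1 = g(10 * A + 10 * B - 5)       # OR:   A + B
        c2 = g(-10 * A - 10 * B + 15)     # NAND: ~A + ~B
        return g(10 * c1 + 10 * c2 - 15)  # AND of the two hidden units

    for A in (0, 1):
        for B in (0, 1):
            print(A, B, round(xor_net(A, B)))   # 0, 1, 1, 0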

  13. Requiem for a Perceptron Rosenblatt proved that a perceptron will learn any linearly separable function. Minsky and Papert (1969) in Perceptrons: “there is no reason to suppose that any of the virtues carry over to the many-layered version.”

  14. Backpropagation Bryson and Ho (1969, same year) described a training procedure for multilayer networks. Went unnoticed. Multiply rediscovered in the 1980s.

  15. Multilayer Net [Diagram: inputs x1, x2, x3, …, xD and a constant 1 feed through weights W11, W12, W13, … into hidden sums net1, net2, …, netH; each hidden unit hid1, hid2, … applies g to its sum, and the hidden outputs feed through weights U0, U1, … into the output sum net_i, which g squashes to produce out.]

  16. Multiple Outputs Makes no difference for the perceptron. Add more outputs off the hidden layer in the multilayer case.

  17. Output Function out_i(x) = g(sum_j U_ji g(sum_k W_kj x_k)), where H is the number of “hidden” nodes. Also: • Use more than one hidden layer • Use direct input-output weights
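
A vectorized sketch of this forward pass (my notation; numpy assumed), with D inputs, H hidden nodes, and the bias handled by appending a constant 1:

    import numpy as np

    def g(net):
        """Sigmoid squashing function, applied elementwise."""
        return 1.0 / (1.0 + np.exp(-net))

    def forward(x, W, U):
        """out_i = g(sum_j U_ji * g(sum_k W_kj * x_k)), biases folded into W and U."""
        x1 = np.append(x, 1.0)       # inputs plus constant 1
        hid = g(W @ x1)              # H hidden activations
        hid1 = np.append(hid, 1.0)   # hidden activations plus constant 1
        return g(U @ hid1)           # one value per output unit

    # Tiny example: D = 3 inputs, H = 2 hidden nodes, 1 output, random weights.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 4))      # H x (D + 1)
    U = rng.normal(size=(1, 3))      # outputs x (H + 1)
    print(forward(np.array([1.0, 0.0, 1.0]), W, U))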

  18. How Train? Find a set of weights U, W that minimizes sum_(x,y) sum_i (y_i - out_i(x))^2, using gradient descent. Incremental version (vs. batch): move the weights a small amount for each training example.
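
For a single training example, the chain rule gives the gradients that the update rules on the next slide implement. A sketch of the standard derivation (mine; the factor 1/2 is a convenience that only rescales the learning rate):

    E = \tfrac{1}{2} \sum_i (y_i - \mathrm{out}_i)^2, \quad
    \mathrm{out}_i = g(\mathrm{net}_i), \quad
    \mathrm{net}_i = \sum_j U_{ji} \, \mathrm{hid}_j

    -\frac{\partial E}{\partial U_{ji}}
      = (y_i - \mathrm{out}_i) \, g'(\mathrm{net}_i) \, \mathrm{hid}_j
      = \Delta_i \, \mathrm{hid}_j

    -\frac{\partial E}{\partial W_{kj}}
      = \Big( \sum_i U_{ji} \, \Delta_i \Big) \, g'(\mathrm{net}_j) \, x_k
      = \Delta_j \, x_k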

  19. Updating Weights (α is the learning rate)
  1. Feed-forward to hidden: net_j = sum_k W_kj x_k; hid_j = g(net_j)
  2. Feed-forward to output: net_i = sum_j U_ji hid_j; out_i = g(net_i)
  3. Update output weights: Δ_i = g'(net_i) (y_i - out_i); U_ji += α hid_j Δ_i
  4. Update hidden weights: Δ_j = g'(net_j) sum_i U_ji Δ_i; W_kj += α x_k Δ_j
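
A minimal incremental-update sketch of these four steps (my code; numpy assumed, biases omitted for brevity):

    import numpy as np

    def g(net):
        return 1.0 / (1.0 + np.exp(-net))

    def backprop_step(x, y, W, U, alpha=0.1):
        """One incremental update. W: (H, D) input->hidden, U: (n_out, H) hidden->output."""
        # 1. Feed-forward to hidden: net_j = sum_k W_kj x_k; hid_j = g(net_j)
        hid = g(W @ x)
        # 2. Feed-forward to output: net_i = sum_j U_ji hid_j; out_i = g(net_i)
        out = g(U @ hid)
        # 3. Output deltas: D_i = g'(net_i)(y_i - out_i), with g'(net) = g(net)(1 - g(net))
        delta_out = out * (1 - out) * (y - out)
        # 4. Hidden deltas: D_j = g'(net_j) sum_i U_ji D_i (using U before it changes)
        delta_hid = hid * (1 - hid) * (U.T @ delta_out)
        # Weight updates: U_ji += alpha hid_j D_i; W_kj += alpha x_k D_j
        U += alpha * np.outer(delta_out, hid)
        W += alpha * np.outer(delta_hid, x)
        return W, U

    # Example: a few incremental steps on one (x, y) pair with random initial weights.
    rng = np.random.default_rng(1)
    W, U = rng.normal(size=(2, 3)), rng.normal(size=(1, 2))
    for _ in range(10):
        W, U = backprop_step(np.array([1.0, 0.0, 1.0]), np.array([1.0]), W, U)
    print(g(U @ g(W @ np.array([1.0, 0.0, 1.0]))))   # output moves toward y = 1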

  20. Multilayer Net (schema) [Diagram: x_k feeds through weight W_kj into hidden sum net_j (with error term Δ_j) and hidden output hid_j; hid_j feeds through weight U_ji into output sum net_i (with error term Δ_i) and output out_i, which is compared against the target y_i.]

  21. Does it Work? Sort of: lots of practical applications, lots of people play with it. Fun. However, it can fall prey to the standard problems with local search… and it is NP-hard to train even a 3-node net.

  22. Step Size Issues What happens if the gradient-descent step size (learning rate) is too small? Too big?

  23. Representation Issues Any continuous function can be represented by a one hidden layer net with sufficient hidden nodes. Any function at all can be represented by a two hidden layer net with a sufficient number of hidden nodes. What’s the downside for learning?

  24. Generalization Issues • Pruning weights: “optimal brain damage” • Cross validation • Much, much more to this: take a class on machine learning.

  25. What to Learn • Representing logical functions using sigmoid units • Majority (net vs. decision tree) • XOR is not linearly separable • Adding layers adds expressibility • Backprop is gradient descent

  26. Homework 10 (due 12/12) • Describe a procedure for converting a Boolean formula in CNF (n variables, m clauses) into an equivalent network. How many hidden units does it have? • More soon
