
20.5 Neural Networks






Presentation Transcript


  1. 20.5 Neural Networks Thanks: Professors Frank Hoffmann and Jiawei Han, and Russell and Norvig

  2. Biological Neural Systems • Neuron switching time: > 10^-3 secs • Number of neurons in the human brain: ~10^10 • Connections (synapses) per neuron: ~10^4–10^5 • Face recognition: 0.1 secs • High degree of distributed and parallel computation • Highly fault tolerant • Highly efficient • Learning is key

  3. Excerpt from Russell and Norvig

  4. A Neuron • Computation: input signals → input function (linear) → activation function (nonlinear) → output signal • Input links carry activations ak, weighted by Wkj • Input function (linear): inj = Σ_k Wkj ak • Activation function (nonlinear) produces the output aj = output(inj), sent along the output links

  5. Part 1. Perceptrons: Simple NN • Inputs x1, x2, …, xn with weights w1, w2, …, wn (xi's range: [0, 1]) • Activation: a = Σ_{i=1..n} wi xi • Output: y = 1 if a ≥ θ, y = 0 if a < θ
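
A minimal sketch of this unit in Python (the function name perceptron_output and the example weights are my own, not from the slides): it computes the activation a = Σ wi xi and compares it against the threshold θ.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Threshold unit from the slide: a = sum_i w_i * x_i, y = 1 if a >= theta else 0."""
    a = np.dot(w, x)               # activation
    return 1 if a >= theta else 0

# Two inputs in [0, 1], arbitrary illustrative weights and threshold
print(perceptron_output(np.array([1, 0]), np.array([0.8, 0.3]), 0.5))  # a = 0.8 >= 0.5 -> 1
print(perceptron_output(np.array([0, 1]), np.array([0.8, 0.3]), 0.5))  # a = 0.3 <  0.5 -> 0
```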

  6. Decision Surface of a Perceptron • Decision line: w1 x1 + w2 x2 = θ • [Figure: points labeled 1 and 0 in the (x1, x2) plane; the weight vector w is normal to the decision line, with the 1s on one side and the 0s on the other]

  7. Linear Separability • Logical AND: separable with w1 = 1, w2 = 1, θ = 1.5 • Logical XOR: w1 = ?, w2 = ?, θ = ? — no single line separates the two classes • [Figure: the four input points in the (x1, x2) plane for AND and for XOR]
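
A small illustration of the separability claim (the helper name classifies and the search grid are my own): the slide's AND weights work, while a coarse brute-force search finds no threshold unit that reproduces XOR. The grid search only illustrates the point; it is not a proof of non-separability.

```python
import itertools
import numpy as np

def classifies(points, w, theta):
    """True if the threshold unit y = [w.x >= theta] reproduces every (x, t) pair."""
    return all((np.dot(w, x) >= theta) == bool(t) for x, t in points)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# The AND weights from the slide work:
print(classifies(AND, np.array([1.0, 1.0]), 1.5))   # True

# A coarse grid search finds no (w1, w2, theta) for XOR, consistent with
# XOR not being linearly separable.
grid = np.linspace(-2, 2, 9)
found = any(classifies(XOR, np.array([w1, w2]), th)
            for w1, w2, th in itertools.product(grid, grid, grid))
print(found)  # False
```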

  8. Threshold as Weight: w0 • Treat the threshold as an extra weight: w0 = θ on a fixed input x0 = −1 • Activation: a = Σ_{i=0..n} wi xi • Output: y = 1 if a ≥ 0, y = 0 if a < 0
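
A quick check of the threshold-as-weight trick (again a sketch with made-up names): prepending x0 = −1 with weight w0 = θ and thresholding at 0 gives the same decisions as thresholding the plain weighted sum at θ.

```python
import numpy as np

def output_with_bias(x, w, theta):
    """Fold the threshold into the weights: x0 = -1, w0 = theta, then compare against 0."""
    x_aug = np.concatenate(([-1.0], x))        # prepend x0 = -1
    w_aug = np.concatenate(([theta], w))       # prepend w0 = theta
    a = np.dot(w_aug, x_aug)                   # a = -theta + sum_i w_i * x_i
    return 1 if a >= 0 else 0

# Same decisions as thresholding the plain weighted sum at theta:
w, theta = np.array([1.0, 1.0]), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert output_with_bias(np.array(x), w, theta) == (1 if np.dot(w, x) >= theta else 0)
```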

  9. Training the Perceptron (p. 742) • Training set S of examples {x, t} • x is an input vector and • t the desired target output • Example: Logical AND S = {((0,0), 0), ((0,1), 0), ((1,0), 0), ((1,1), 1)} • Iterative process: present a training example x, compute network output y, compare output y with target t, adjust weights and thresholds • Learning rule: specifies how to change the weights w and thresholds θ of the network as a function of the inputs x, output y and target t

  10. Perceptron Learning Rule • w' = w + α (t − y) x, i.e. wi := wi + Δwi = wi + α (t − y) xi (i = 1..n) • The parameter α is called the learning rate (in Han's book it is lower-case l) • It determines the magnitude of the weight updates Δwi • If the output is correct (t = y) the weights are not changed (Δwi = 0) • If the output is incorrect (t ≠ y) the weights wi are changed so that the new weight vector w' moves closer to the input x (when t = 1) or further from it (when t = 0), pushing the output toward the target

  11. Perceptron Training Algorithm
  Repeat
    for each training vector pair (x, t)
      evaluate the output y when x is the input
      if y ≠ t then
        form a new weight vector w' according to w' = w + α (t − y) x
      else
        do nothing
      end if
    end for
  Until y = t for all training vector pairs or # iterations > k
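
A runnable version of this training loop, assuming the x0 = −1 convention from slide 8; the function name, learning rate, and iteration cap are my own choices rather than part of the slides.

```python
import numpy as np

def train_perceptron(samples, alpha=0.1, max_iters=100):
    """Repeat w' = w + alpha * (t - y) * x over the training pairs until all are correct."""
    n = len(samples[0][0])
    w = np.zeros(n + 1)                              # w[0] plays the role of the threshold
    for _ in range(max_iters):
        errors = 0
        for x, t in samples:
            x_aug = np.concatenate(([-1.0], x))      # x0 = -1, as on the earlier slide
            y = 1 if np.dot(w, x_aug) >= 0 else 0
            if y != t:
                w += alpha * (t - y) * x_aug         # perceptron learning rule
                errors += 1
        if errors == 0:                              # y = t for all training vector pairs
            return w
    return w                                         # gave up after max_iters sweeps

AND = [(np.array([0, 0]), 0), (np.array([0, 1]), 0),
       (np.array([1, 0]), 0), (np.array([1, 1]), 1)]
print(train_perceptron(AND))   # weights realizing logical AND
```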

  12. Perceptron Convergence Theorem • The algorithm converges to the correct classification if the training data is linearly separable and the learning rate is sufficiently small • If two classes of vectors X1 and X2 are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0 such that w0 defines a Perceptron whose decision hyper-plane separates X1 and X2 (Rosenblatt 1962) • The solution w0 is not unique, since if w0 · x = 0 defines a hyper-plane, so does w'0 = k w0

  13. Experiments

  14. Perceptron Learning from Patterns • [Figure: input pattern → fixed association units → trained weights w1 … wn → summation → threshold] • Association units (A-units) can be assigned arbitrary Boolean functions of the input pattern

  15. Part 2. Multi-Layer Networks • [Figure: input vector → input nodes → hidden nodes → output nodes → output vector]

  16. Gradient Descent Learning Rule • Consider a linear unit without threshold and with continuous output o (not just −1, 1) • Output: oj = −w0 + w1 x1 + … + wn xn • Train the wi's such that they minimize the squared error • Error[w1, …, wn] = ½ Σ_{j∈D} (Tj − oj)², where D is the set of training examples
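
A sketch of batch gradient descent on this squared error for a linear unit (the loop structure, step size, and toy data are my own): since ∂E/∂wi = −Σ_j (Tj − oj) xji, each step moves the weights a little way down the error surface.

```python
import numpy as np

def gradient_descent_linear(X, T, alpha=0.05, steps=200):
    """Minimize E(w) = 0.5 * sum_j (T_j - o_j)^2 for a linear unit o = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        o = X @ w                      # continuous outputs, no threshold
        grad = -(X.T @ (T - o))        # dE/dw_i = -sum_j (T_j - o_j) * x_ji
        w -= alpha * grad              # step against the gradient
    return w

# Toy data: targets generated by a known linear rule, recovered by descent
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
T = X @ np.array([2.0, -1.0])
print(gradient_descent_linear(X, T))   # approaches [2, -1]
```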

  17. Neuron with Sigmoid Function • Inputs x1, x2, …, xn with weights w1, w2, …, wn • Activation: a = Σ_{i=1..n} wi xi • Output: o = σ(a) = 1/(1 + e^−a)

  18. Sigmoid Unit • Extra input x0 = −1 with weight w0; activation a = Σ_{i=0..n} wi xi • Output: o = σ(a) = 1/(1 + e^−a), where σ(x) is the sigmoid function 1/(1 + e^−x) • dσ(x)/dx = σ(x) (1 − σ(x)) • Derive gradient descent rules to train one sigmoid unit: • ∂E/∂wi = −Σ_j (Tj − oj) oj (1 − oj) xij • derivation: see next page
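
A small numeric check of these formulas (all names and data here are illustrative): the sigmoid, its derivative σ(x)(1 − σ(x)), and the gradient ∂E/∂wi = −Σ_j (Tj − oj) oj (1 − oj) xij for a single sigmoid unit, compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error_gradient(X, T, w):
    """dE/dw_i = -sum_j (T_j - o_j) * o_j * (1 - o_j) * x_ji for o_j = sigmoid(x_j . w)."""
    o = sigmoid(X @ w)
    return -(X.T @ ((T - o) * o * (1 - o)))

# Compare against a finite-difference estimate of E(w) = 0.5 * sum_j (T_j - o_j)^2
X = np.array([[-1.0, 0.3, 0.9], [-1.0, 0.6, 0.1]])   # first column is x0 = -1
T = np.array([1.0, 0.0])
w = np.array([0.2, -0.4, 0.7])

def E(w):
    o = sigmoid(X @ w)
    return 0.5 * np.sum((T - o) ** 2)

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(error_gradient(X, T, w), numeric)   # the two should agree closely
```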

  19. Explanation: Gradient Descent Learning Rule • Δwji = α oj^p (1 − oj^p) (Tj^p − oj^p) xi^p • α: learning rate • xi^p: activation of the pre-synaptic neuron • (Tj^p − oj^p): error δj of the post-synaptic neuron • oj^p (1 − oj^p): derivative of the activation function

  20. Gradient Descent: Graphical • Training set D = {<(1,1), 1>, <(−1,−1), 1>, <(1,−1), −1>, <(−1,1), −1>} • [Figure: error surface over the weight space; one gradient step moves the weights from (w1, w2) to (w1 + Δw1, w2 + Δw2)]

  21. Perceptron vs. Gradient Descent Rule • Perceptron rule w'i = wi + α (t − y) xi, derived from manipulation of the decision surface. • Gradient descent rule w'i = wi + α y (1 − y) (t − y) xi, derived from minimization of the error function E[w1, …, wn] = ½ Σ_p (t − y)² by means of gradient descent.
