Artificial Neural Networks



Outline

  • Biological Motivation

  • Perceptron

  • Gradient Descent

    • Least Mean Square Error

  • Multi-layer networks

    • Sigmoid node

    • Backpropagation



Biological Neural Systems

  • Neuron switching time: > 10^-3 secs

  • Number of neurons in the human brain: ~10^10

  • Connections (synapses) per neuron: ~10^4–10^5

  • Face recognition : 0.1 secs

  • High degree of parallel computation

  • Distributed representations



Artificial Neural Networks

  • Many simple neuron-like threshold units

  • Many weighted interconnections

  • Multiple outputs

  • Highly parallel and distributed processing

  • Learning by tuning the connection weights


Perceptron: Linear threshold unit

[Figure: inputs x0=1, x1, …, xn with weights w0, w1, …, wn feed a summing unit Σ, whose result is thresholded to produce the output o.]

  • Net input: Σi=0..n wi xi

  • Output:

    o(x) = 1 if Σi=0..n wi xi > 0, -1 otherwise


Decision Surface of a Perceptron

[Figure: left, + and - points in the (x1, x2) plane separated by a line (linearly separable); right, the XOR labelling, which no single line can separate.]

  • A perceptron can represent only linearly separable functions; XOR is not one of them

  • Theorem: VC-dim = n+1


Perceptron Learning Rule

  • S: training sample

  • xi: input vector

  • t = c(x): the target value

  • o: the perceptron output

  • η: learning rate (a small constant), assume η = 1

wi = wi + Δwi

Δwi = η (t - o) xi


Perceptron Algo.

  • Correct output (t = o)

    • Weights are unchanged

  • Incorrect output (t ≠ o)

    • Change weights!

  • False negative (t = 1 and o = -1)

    • Add x to w

  • False positive (t = -1 and o = 1)

    • Subtract x from w


Perceptron Learning Rule

[Figure: training points with targets t = 1 and t = -1 in the (x1, x2) plane; the decision line moves after each weight update.]

  • Initial weights w = [0.25, -0.1, 0.5], i.e. the decision line x2 = 0.2 x1 - 0.5

  • Each misclassified example triggers the update Δwi = η (t - o) xi (here η = 0.1):

    (x,t) = ([-1,-1], 1):  o = sgn(0.25 + 0.1 - 0.5) = -1  →  Δw = [0.2, -0.2, -0.2]

    (x,t) = ([2,1], -1):   o = sgn(0.45 - 0.6 + 0.3) = 1   →  Δw = [-0.2, -0.4, -0.2]

    (x,t) = ([1,1], 1):    o = sgn(0.25 - 0.7 + 0.1) = -1  →  Δw = [0.2, 0.2, 0.2]
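A minimal Python sketch (not part of the original slides) that reproduces the trace above, using the rule Δwi = η (t - o) xi with η = 0.1:

    import numpy as np

    def sgn(z):
        # linear threshold unit: +1 if z > 0, else -1
        return 1 if z > 0 else -1

    w = np.array([0.25, -0.1, 0.5])                 # [w0, w1, w2]
    eta = 0.1                                       # learning rate inferred from the example
    samples = [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]

    for x, t in samples:
        x = np.array([1.0] + list(x))               # prepend x0 = 1 for the bias weight w0
        o = sgn(w @ x)                              # perceptron output
        dw = eta * (t - o) * x                      # perceptron learning rule
        w = w + dw
        print("o =", o, " dw =", dw, " w =", w)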



Perceptron Algorithm: Analysis

  • Theorem: The number of errors of the Perceptron Algorithm is bounded

  • Proof:

  • Make all examples positive

    • change <xi, bi> to <bi·xi, +1>

  • Margin of the hyperplane w



Perceptron Algorithm: Analysis II

  • Let mi be the number of errors on xi

    • M = Σi mi

  • From the algorithm: w = Σi mi xi

  • Let w* be a separating hyperplane



Perceptron Algorithm: Analysis III

  • Change in weights:

  • Since w errs on xi, we have w·xi < 0

  • Total weight:



Perceptron Algorithm: Analysis IV

  • Consider the angle between w and w*

  • Putting it all together
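The formulas on the analysis slides above were embedded as images; a standard reconstruction of the argument they outline, assuming ‖xi‖ ≤ R for all examples, ‖w*‖ = 1, and margin γ = mini (w*·xi) > 0:

    Change in weights: after M = Σi mi total errors, w = Σi mi xi, so
        w·w* = Σi mi (xi·w*) ≥ M γ

    Total weight: each error on xi changes ‖w‖² by
        ‖w + xi‖² = ‖w‖² + 2 w·xi + ‖xi‖² ≤ ‖w‖² + R²   (since w·xi < 0 on an error)
    so ‖w‖² ≤ M R²

    Angle between w and w*: since cos(w, w*) ≤ 1,
        M γ ≤ w·w* ≤ ‖w‖ ‖w*‖ = ‖w‖ ≤ R √M

    Putting it all together: M ≤ R²/γ², so the number of errors is bounded.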



Gradient Descent Learning Rule

  • Consider linear unit without threshold and continuous output o (not just –1,1)

    • o=w0 + w1 x1 + … + wn xn

  • Train the wi’s such that they minimize the squared error

    • E[w1,…,wn] = ½ Σd∈S (td - od)²

      where S is the set of training examples


Gradient Descent

[Figure: the error surface E over the weight space (w1, w2), with an arrow from (w1, w2) toward (w1+Δw1, w2+Δw2) pointing downhill.]

S = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}

Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]

Δw = -η ∇E[w]

Δwi = -η ∂E/∂wi

∂E/∂wi = ∂/∂wi ½ Σd (td - od)²
       = ∂/∂wi ½ Σd (td - Σi wi xi)²
       = Σd (td - od)(-xi)

so Δwi = η Σd (td - od) xi


Gradient Descent

Gradient-Descent(S: training_examples, η)

Until TERMINATION Do

  • Initialize each Δwi to zero

  • For each <x,t> in S Do

    • Compute o = <x,w>

    • For each weight wi Do

      • Δwi = Δwi + η (t - o) xi

  • For each weight wi Do

    • wi = wi + Δwi
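A runnable sketch of this batch procedure in Python (the learning rate, epoch count, and function name are illustrative assumptions; termination here is simply a fixed number of passes):

    import numpy as np

    def gradient_descent(S, eta=0.05, epochs=100):
        """Batch gradient descent for a linear unit o = <x, w>; each x includes x0 = 1."""
        w = np.zeros(len(S[0][0]))
        for _ in range(epochs):
            dw = np.zeros_like(w)              # initialize each Δwi to zero
            for x, t in S:
                x = np.asarray(x, dtype=float)
                o = w @ x                      # linear unit output
                dw += eta * (t - o) * x        # Δwi = Δwi + η (t - o) xi
            w += dw                            # wi = wi + Δwi
        return w

    # Training set from the Gradient Descent slide, with x0 = 1 prepended.
    # It is not linearly separable, and the minimum-squared-error weights are all zero.
    S = [((1, 1, 1), 1), ((1, -1, -1), 1), ((1, 1, -1), -1), ((1, -1, 1), -1)]
    print(gradient_descent(S))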


Incremental Stochastic Gradient Descent

  • Batch mode: gradient descent

    w = w - η ∇ES[w] over the entire data S

    ES[w] = ½ Σd∈S (td - od)²

  • Incremental mode: gradient descent

    w = w - η ∇Ed[w] over individual training examples d

    Ed[w] = ½ (td - od)²

    Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough
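For contrast with the batch sketch above, a minimal incremental (stochastic) variant, updating w after every example (same illustrative parameter choices):

    import numpy as np

    def incremental_gradient_descent(S, eta=0.05, epochs=100):
        """Incremental gradient descent for a linear unit: one update per example."""
        w = np.zeros(len(S[0][0]))
        for _ in range(epochs):
            for x, t in S:
                x = np.asarray(x, dtype=float)
                o = w @ x
                w += eta * (t - o) * x        # w = w - η ∇Ed[w] for this single example d
        return w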



Comparison Perceptron and Gradient Descent Rule

Perceptron learning rule guaranteed to succeed if

  • Training examples are linearly separable

  • No guarantee otherwise

    Linear unit using Gradient Descent

  • Converges to hypothesis with minimum squared error.

  • Given a sufficiently small learning rate η

  • Even when training data contains noise

  • Even when training data not linearly separable



Multi-Layer Networks

[Figure: a feed-forward network with an input layer, one or more hidden layers, and an output layer.]


Sigmoid Unit

[Figure: inputs x0=1, x1, …, xn with weights w0, w1, …, wn feed a summing unit Σ followed by a sigmoid, producing the output o.]

z = Σi=0..n wi xi

o = σ(z) = 1/(1 + e^-z)

σ is the sigmoid function.


Sigmoid Function

σ(z) = 1/(1 + e^-z)

dσ(z)/dz = σ(z) (1 - σ(z))

  • Gradient descent rule for a single sigmoid unit:

    ∂E/∂wi = -Σd (td - od) od (1 - od) xi

  • Multilayer networks of sigmoid units:

    backpropagation
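A small Python sketch of the sigmoid and the derivative identity above, with a numerical spot-check (illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)              # dσ/dz = σ(z) (1 - σ(z))

    z, eps = 0.3, 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(sigmoid_prime(z), numeric)      # the two values agree closely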



Backpropagation: overview

  • Make threshold units differentiable

    • Use sigmoid functions

  • Given a sample compute:

    • The error

    • The Gradient

  • Use the chain rule to compute the Gradient



Backpropagation Motivation

  • Consider the squared error

    • ES[w] = ½ Σd∈S Σk∈outputs (td,k - od,k)²

  • Gradient: ∇ES[w]

  • Update: w = w - η ∇ES[w]

  • How do we compute the gradient?



Backpropagation: Algorithm

  • Forward phase:

    • Given input x, compute the output of each unit

  • Backward phase:

    • For each output unit k compute its error term δk = ok (1 - ok) (tk - ok)



Backpropagation: Algorithm

  • Backward phase

    • For each hidden unit h compute: δh = oh (1 - oh) Σk∈outputs wh,k δk

  • Update weights:

    • wi,j = wi,j + Δwi,j where Δwi,j = η δj xi
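A compact Python sketch of the forward and backward phases for a single hidden layer of sigmoid units; the network size, learning rate, epoch count, and XOR-style training data are illustrative assumptions, not from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, W2):
        """Forward phase: compute the output of each unit (bias input x0 = 1 is prepended)."""
        x1 = np.concatenate(([1.0], x))
        h = sigmoid(x1 @ W1)                   # hidden-layer outputs
        h1 = np.concatenate(([1.0], h))
        o = sigmoid(h1 @ W2)                   # output-layer outputs
        return x1, h, h1, o

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(3, 3))    # 2 inputs + bias -> 3 hidden units
    W2 = rng.normal(scale=0.5, size=(4, 1))    # 3 hidden + bias -> 1 output unit
    eta = 0.5

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

    for _ in range(5000):
        for x, t in zip(X, T):
            x1, h, h1, o = forward(x, W1, W2)
            delta_o = o * (1 - o) * (t - o)                # δk = ok (1 - ok) (tk - ok)
            delta_h = h * (1 - h) * (W2[1:] @ delta_o)     # δh = oh (1 - oh) Σk wh,k δk
            W2 += eta * np.outer(h1, delta_o)              # Δwi,j = η δj xi
            W1 += eta * np.outer(x1, delta_h)

    for x in X:
        print(x, forward(x, W1, W2)[3])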



Backpropagation: output node
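The derivation on this slide was an image; a standard reconstruction for a sigmoid output unit k with net input zk = Σh wh,k oh, output ok = σ(zk), and squared error E = ½ Σk (tk - ok)²:

    ∂E/∂wh,k = ∂E/∂ok · dok/dzk · ∂zk/∂wh,k
             = -(tk - ok) · ok (1 - ok) · oh

so the gradient step is Δwh,k = -η ∂E/∂wh,k = η δk oh with δk = ok (1 - ok) (tk - ok), as used in the algorithm above.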





Backpropagation: inner node
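Likewise for an inner (hidden) unit h, whose error must be propagated through every output unit k it feeds; with δj defined as -∂E/∂zj, a standard reconstruction via the chain rule is:

    δh = -∂E/∂zh = Σk (-∂E/∂zk) · ∂zk/∂zh
       = Σk δk · wh,k · oh (1 - oh)
       = oh (1 - oh) Σk∈outputs wh,k δk

and the update for an input xi into unit h is Δwi,h = η δh xi.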





Backpropagation: Summary

  • Gradient descent over entire network weight vector

  • Easily generalized to arbitrary directed graphs

  • Finds a local, not necessarily global error minimum

    • in practice often works well

    • requires multiple invocations with different initial weights

  • A variation is to include a momentum term

    Δwi,j(n) = η δj xi + α Δwi,j(n-1)

  • Minimizes error over the training examples

  • Training is fairly slow, yet prediction is fast
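A one-step illustration of the momentum update in Python (all numeric values are made up for the example):

    eta, alpha = 0.5, 0.9          # learning rate η and momentum coefficient α
    delta_j, x_i = 0.12, 1.0       # error term of unit j and its input from node i
    w_ij, dw_prev = 0.3, 0.05      # current weight and previous update Δwi,j(n-1)

    dw = eta * delta_j * x_i + alpha * dw_prev   # Δwi,j(n) = η δj xi + α Δwi,j(n-1)
    w_ij += dw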



Expressive Capabilities of ANN

Boolean functions

  • Every boolean function can be represented by a network with a single hidden layer

  • But this might require a number of hidden units exponential in the number of inputs

    Continuous functions

  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]

  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]



VC-dim of ANN

  • A more general bound.

  • Concept class F(C,G):

  • G: directed acyclic graph

  • C: concept class, d = VC-dim(C)

  • n: number of input nodes

  • s: number of inner nodes (each of degree r)

    Theorem: VC-dim(F(C,G)) < 2ds log (es)



Proof:

  • Bound |F(C,G)(m)|

  • Find the smallest m s.t. |F(C,G)(m)| < 2^m

  • Let S={x1, … , xm}

  • For each fixed G we define a matrix U

    • U[i,j] = ci(xj), where ci is the concept computed at inner node i

    • U describes the computations on S in G

  • TF(C,G) = the number of different matrices U



Proof (continue)

  • Clearly |F(C,G)(m)| ≤ TF(C,G)

  • Let G' be G without the root.

  • |F(C,G)(m)| ≤ TF(C,G) ≤ TF(C,G') · |C(m)|

  • Inductively, |F(C,G)(m)| ≤ |C(m)|^s

  • Recall the VC bound: |C(m)| ≤ (em/d)^d

  • Combined bound: |F(C,G)(m)| ≤ (em/d)^(ds)



Proof (cont.)

  • Solve for: (em/d)^(ds) ≤ 2^m

  • Holds for m ≥ 2ds log(es)

  • QED

  • Back to ANN:

  • VC-dim(C) = n+1

  • VC-dim(ANN) ≤ 2(n+1)s log(es)
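A quick numerical sanity check of the "solve for" step in Python (my own check, assuming the logarithm in the bound is base 2; the chosen d and s values are arbitrary):

    import math

    def check(d, s):
        m = 2 * d * s * math.log2(math.e * s)      # m = 2ds log(es)
        lhs = d * s * math.log2(math.e * m / d)    # log2 of (em/d)^(ds)
        return lhs <= m                            # True iff (em/d)^(ds) <= 2^m

    for d, s in [(2, 2), (3, 5), (11, 20)]:
        print(d, s, check(d, s))                   # prints True for these values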

