
Artificial Neural Networks


Outline

  • Biological Motivation

  • Perceptron

  • Gradient Descent

    • Least Mean Square Error

  • Multi-layer networks

    • Sigmoid node

    • Backpropagation


Biological Neural Systems

  • Neuron switching time: > 10^-3 secs

  • Number of neurons in the human brain: ~10^10

  • Connections (synapses) per neuron: ~10^4–10^5

  • Face recognition : 0.1 secs

  • High degree of parallel computation

  • Distributed representations


Artificial Neural Networks

  • Many simple neuron-like threshold units

  • Many weighted interconnections

  • Multiple outputs

  • Highly parallel and distributed processing

  • Learning by tuning the connection weights


Perceptron: Linear threshold unit

[Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a fixed bias input x_0 = 1 with weight w_0, feeding a summation unit Σ that produces the output o]

Net input: Σ_{i=0..n} w_i x_i

Output: o(x) = 1 if Σ_{i=0..n} w_i x_i > 0, -1 otherwise
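A minimal Python sketch of this unit; the function name and NumPy usage are illustrative, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """Linear threshold unit: o(x) = 1 if Σ_i w_i x_i > 0, else -1.
    Both w and x include the bias term (x[0] = 1 with weight w[0])."""
    return 1 if np.dot(w, x) > 0 else -1
```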


Decision Surface of a Perceptron

[Diagram: + and - points in the (x_1, x_2) plane; one panel shows a linearly separable labeling split by a straight line, the other shows the XOR labeling, which no single line separates]

Theorem: VC-dim = n + 1


Perceptron Learning Rule

S: training sample

x_i: input vector

t = c(x): the target value

o: the perceptron output

η: learning rate (a small constant); assume η = 1

Update rule:

w_i = w_i + Δw_i, where Δw_i = η (t - o) x_i


Perceptron Algo.

  • Correct Output (t=o)

    • Weights are unchanged

  • Incorrect Output (t ≠ o)

    • Change weights! (see the training-loop sketch below)

  • False negative (t = 1 and o = -1)

    • Add x to w

  • False positive (t = -1 and o = 1)

    • Subtract x from w
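A minimal Python sketch of this training loop, implementing Δw_i = η (t - o) x_i; the function name, termination check, and defaults are illustrative assumptions:

```python
import numpy as np

def train_perceptron(samples, eta=1.0, epochs=100):
    """Perceptron training. samples: list of (x, t) pairs where x already
    includes the bias input x_0 = 1 and t is in {-1, +1}."""
    w = np.zeros(len(samples[0][0]))
    for _ in range(epochs):
        errors = 0
        for x, t in samples:
            x = np.asarray(x, dtype=float)
            o = 1 if np.dot(w, x) > 0 else -1   # linear threshold unit
            if o != t:                          # incorrect output: change weights
                w += eta * (t - o) * x          # Δw_i = η (t - o) x_i
                errors += 1
        if errors == 0:                         # every example classified correctly
            break
    return w
```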


Perceptron Learning Rule: Example

[Diagram: a worked trace of the update rule in the (x_1, x_2) plane, with the weight line separating the o = 1 and o = -1 regions. The trace starts from w = [0.25, -0.1, 0.5], whose decision boundary is x_2 = 0.2 x_1 - 0.5, and shows misclassified examples such as (x, t) = ([2, 1], -1) with o = sgn(0.45 - 0.6 + 0.3) = 1, (x, t) = ([-1, -1], 1) with o = sgn(0.25 + 0.1 - 0.5) = -1, and (x, t) = ([1, 1], 1) with o = sgn(0.25 - 0.7 + 0.1) = -1, together with the intermediate weight vectors [0.2, -0.2, -0.2], [-0.2, -0.4, -0.2], and [0.2, 0.2, 0.2]]


Perceptron Algorithm: Analysis

  • Theorem: The number of errors of the Perceptron Algorithm is bounded

  • Proof:

  • Make all examples positive

    • change <x_i, b_i> to <b_i x_i, +1>

  • Margin of the hyperplane w


Perceptron Algorithm: Analysis II

  • Let m_i be the number of errors on x_i

    • M = Σ_i m_i

  • From the algorithm: w = Σ_i m_i x_i

  • Let w* be a separating hyperplane


Perceptron Algorithm: Analysis III

  • Change in weights:

  • Since w errs on x_i, we have w·x_i < 0

  • Total weight:


Perceptron Algorithm: Analysis IV

  • Consider the angle between w and w*

  • Putting it all together (the full chain of inequalities is sketched below)
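The formulas on these four analysis slides appear only as images in the original. A hedged sketch of the standard Novikoff-style argument they outline, assuming a unit-norm separating hyperplane w* with margin γ = min_i x_i · w* > 0 and R = max_i ||x_i||:

```latex
\begin{aligned}
&\text{After } M = \textstyle\sum_i m_i \text{ errors the algorithm holds } w = \textstyle\sum_i m_i x_i,
  \text{ so } w \cdot w^* = \textstyle\sum_i m_i\,(x_i \cdot w^*) \ge M\gamma .\\
&\text{Each error adds } 2\,w \cdot x_i + \|x_i\|^2 \le R^2 \text{ to } \|w\|^2
  \text{ (since } w \cdot x_i < 0\text{)}, \text{ so } \|w\|^2 \le M R^2 .\\
&\text{The angle between } w \text{ and } w^*:\quad
  1 \;\ge\; \frac{w \cdot w^*}{\|w\|\,\|w^*\|} \;\ge\; \frac{M\gamma}{\sqrt{M}\,R}
  \;\Longrightarrow\; M \le \frac{R^2}{\gamma^2}.
\end{aligned}
```

So the number of errors is bounded by (R/γ)², independent of the number of examples.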


Gradient Descent Learning Rule

  • Consider linear unit without threshold and continuous output o (not just –1,1)

    • o=w0 + w1 x1 + … + wn xn

  • Train the wi’s such that they minimize the squared error

    • E[w_1, …, w_n] = ½ Σ_{d∈S} (t_d - o_d)²

      where S is the set of training examples


Gradient Descent

[Diagram: the error surface E over the weight space, with an arrow from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2) pointing along the negative gradient]

Training set: S = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}

Gradient: ∇E[w] = [∂E/∂w_0, …, ∂E/∂w_n]

Δw = -η ∇E[w]

Δw_i = -η ∂E/∂w_i

∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d - o_d)²

        = ∂/∂w_i ½ Σ_d (t_d - Σ_i w_i x_{i,d})²

        = Σ_d (t_d - o_d)(-x_{i,d})


Gradient Descent

Gradient-Descent(S: training_examples, η)

Until TERMINATION Do

  • Initialize each Δw_i to zero

  • For each <x, t> in S Do

    • Compute o = <x, w>

    • For each weight w_i Do

      • Δw_i = Δw_i + η (t - o) x_i

  • For each weight w_i Do

    • w_i = w_i + Δw_i
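A runnable sketch of this procedure for a linear unit; a fixed number of epochs stands in for TERMINATION, and the names and sample data are illustrative:

```python
import numpy as np

def gradient_descent(samples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = <w, x>.
    samples: list of (x, t) pairs, each x prefixed with the bias input x_0 = 1."""
    w = np.zeros(len(samples[0][0]))
    for _ in range(epochs):                    # stands in for "Until TERMINATION"
        delta_w = np.zeros_like(w)             # initialize each Δw_i to zero
        for x, t in samples:
            x = np.asarray(x, dtype=float)
            o = np.dot(w, x)                   # compute o = <x, w>
            delta_w += eta * (t - o) * x       # Δw_i = Δw_i + η (t - o) x_i
        w += delta_w                           # w_i = w_i + Δw_i
    return w

# Hypothetical linearly separable data: t agrees with sign(x_1 + x_2 - 0.5)
S = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
print(gradient_descent(S))
```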


Incremental Stochastic Gradient Descent

  • Batch mode: Gradient Descent

    w = w - η ∇E_S[w] over the entire data S

    E_S[w] = ½ Σ_{d∈S} (t_d - o_d)²

  • Incremental mode: gradient descent

    w = w - η ∇E_d[w] over individual training examples d

    E_d[w] = ½ (t_d - o_d)²

    Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is small enough
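A sketch of the incremental (per-example) variant for the same linear unit; names and defaults are illustrative:

```python
import numpy as np

def incremental_gradient_descent(samples, eta=0.05, epochs=500):
    """Incremental (stochastic) mode: update w after each individual example d,
    following the gradient of E_d[w] = 1/2 (t_d - o_d)^2."""
    w = np.zeros(len(samples[0][0]))
    for _ in range(epochs):
        for x, t in samples:
            x = np.asarray(x, dtype=float)
            o = np.dot(w, x)
            w += eta * (t - o) * x             # per-example update, no Δw accumulation
    return w
```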


Comparison Perceptron and Gradient Descent Rule

Perceptron learning rule guaranteed to succeed if

  • Training examples are linearly separable

  • No guarantee otherwise

    Linear unit using Gradient Descent

  • Converges to hypothesis with minimum squared error.

  • Given sufficiently small learning rate η

  • Even when training data contains noise

  • Even when training data not linearly separable


Multi-Layer Networks

[Diagram: a feed-forward network with an input layer, one or more hidden layers, and an output layer]


Sigmoid Unit

[Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n and a bias input x_0 = 1 with weight w_0, feeding a summation unit followed by a sigmoid, producing the output o]

z = Σ_{i=0..n} w_i x_i

o = σ(z) = 1/(1 + e^(-z)), where σ is the sigmoid function


Sigmoid Function

σ(z) = 1/(1 + e^(-z))

dσ(z)/dz = σ(z) (1 - σ(z))

  • Gradient Descent Rule for one sigmoid unit:

  • ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d} (a sketch follows below)

  • Multilayer networks of sigmoid units:

  • backpropagation
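A small Python sketch of the sigmoid, its derivative, and the single-unit gradient above; the function names and data layout are illustrative:

```python
import numpy as np

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """dσ(z)/dz = σ(z) (1 - σ(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)

def squared_error_gradient(w, samples):
    """∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d} for one sigmoid unit.
    samples: list of (x, t) pairs with the bias input included in x."""
    grad = np.zeros(len(w))
    for x, t in samples:
        x = np.asarray(x, dtype=float)
        o = sigmoid(np.dot(w, x))
        grad += -(t - o) * o * (1.0 - o) * x
    return grad
```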


Backpropagation: overview

  • Make threshold units differentiable

    • Use sigmoid functions

  • Given a sample compute:

    • The error

    • The Gradient

  • Use the chain rule to compute the Gradient


Backpropagation Motivation

  • Consider the squared error

    • E_S[w] = ½ Σ_{d∈S} Σ_{k∈outputs} (t_{d,k} - o_{d,k})²

  • Gradient: ∇E_S[w]

  • Update: w = w - η ∇E_S[w]

  • How do we compute the Gradient?


Backpropagation: Algorithm

  • Forward phase:

    • Given input x, compute the output of each unit

  • Backward phase:

    • For each output unit k compute: δ_k = o_k (1 - o_k) (t_k - o_k)


Backpropagation: Algorithm

  • Backward phase

    • For each hidden unit h compute: δ_h = o_h (1 - o_h) Σ_{k∈outputs} w_{h,k} δ_k

  • Update weights:

    • w_{i,j} = w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_i
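A compact Python sketch of one forward/backward pass for a network with a single hidden layer, using the standard δ terms for sigmoid units; array shapes, names, and the default η are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_output, eta=0.1):
    """One backpropagation step for a two-layer sigmoid network.
    x: input vector (bias included), t: target vector,
    W_hidden: (n_hidden, n_in) array, W_output: (n_out, n_hidden) array.
    The weight arrays are updated in place."""
    # Forward phase: compute the output of each unit
    h = sigmoid(W_hidden @ x)                    # hidden-unit outputs
    o = sigmoid(W_output @ h)                    # output-unit outputs

    # Backward phase: for each output unit k, δ_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = o * (1.0 - o) * (t - o)

    # For each hidden unit h, δ_h = o_h (1 - o_h) Σ_k w_{h,k} δ_k
    delta_hidden = h * (1.0 - h) * (W_output.T @ delta_out)

    # Update weights: w_{i,j} = w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_i
    W_output += eta * np.outer(delta_out, h)
    W_hidden += eta * np.outer(delta_hidden, x)
    return o
```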


Backpropagation: output node


Backpropagation: inner node

Backpropagation: Summary

  • Gradient descent over entire network weight vector

  • Easily generalized to arbitrary directed graphs

  • Finds a local, not necessarily global error minimum

    • in practice often works well

    • requires multiple invocations with different initial weights

  • A variation is to include a momentum term (see the sketch below)

    Δw_{i,j}(n) = η δ_j x_i + α Δw_{i,j}(n-1)

  • Minimizes error over the training examples

  • Training is fairly slow, yet prediction is fast
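A hedged sketch of the momentum update; α, the array shapes, and the function name are illustrative:

```python
import numpy as np

def update_with_momentum(W, delta, x, prev_dW, eta=0.1, alpha=0.9):
    """Δw_{i,j}(n) = η δ_j x_i + α Δw_{i,j}(n-1); returns (new weights, ΔW(n))."""
    dW = eta * np.outer(delta, x) + alpha * prev_dW   # current gradient step plus momentum
    return W + dW, dW
```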


Expressive Capabilities of ANN

Boolean functions

  • Every boolean function can be represented by a network with a single hidden layer

  • But it might require a number of hidden units exponential in the number of inputs

    Continuous functions

  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]

  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]


VC-dim of ANN

  • A more general bound.

  • Concept class F(C,G):

  • G : Directed acyclic graph

  • C: concept class, d=VC-dim(C)

  • n: input nodes

  • s : inner nodes (of degree r)

    Theorem: VC-dim(F(C,G)) < 2ds log (es)


Proof:

  • Bound |F(C,G)(m)|

  • Find the smallest m s.t. |F(C,G)(m)| < 2^m

  • Let S={x1, … , xm}

  • For each fixed G we define a matrix U

    • U[i, j] = c_i(x_j), where c_i is the concept computed at the i-th inner node

    • U describes the computation of G on S

  • TF(C,G) = the number of different matrices U


Proof (continued)

  • Clearly |F(C,G)(m)|  TF(C,G)

  • Let G’ be G without the root.

  • |F(C,G)(m)|  TF(C,G)  TF(C,G’) |C(m)|

  • Inductively, |F(C,G)(m)|  |C(m)|s

  • Recall VC Bound: |C(m)|  (em/d)d

  • Combined bound |F(C,G)(m)| (em/d)ds


Proof (cont.)

  • Solve for: (em/d)^(ds) ≤ 2^m

  • Holds for m ≥ 2ds log(es)

  • QED

  • Back to ANN:

  • VC-dim(C)=n+1

  • VC-dim(ANN) ≤ 2(n+1) log(es)
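A brief check of the final step, filling in the algebra with base-2 logarithms (assuming es ≥ 4):

```latex
\begin{aligned}
\Big(\tfrac{em}{d}\Big)^{ds} \le 2^m
  &\iff ds \log_2\!\Big(\tfrac{em}{d}\Big) \le m .\\
\text{For } m = 2ds\log_2(es):\qquad
ds \log_2\!\Big(\tfrac{em}{d}\Big)
  &= ds \log_2\!\big(2es\log_2(es)\big)\\
  &\le ds \log_2\!\big((es)^2\big)
     \qquad\text{since } 2\log_2(es)\le es\\
  &= 2ds\log_2(es) = m .
\end{aligned}
```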

