Artificial Neural Networks
Outline
  • Biological Motivation
  • Perceptron
  • Gradient Descent
    • Least Mean Square Error
  • Multi-layer networks
    • Sigmoid node
    • Backpropagation
Biological Neural Systems
  • Neuron switching time: > 10^-3 secs
  • Number of neurons in the human brain: ~10^10
  • Connections (synapses) per neuron: ~10^4–10^5
  • Face recognition: 0.1 secs
  • High degree of parallel computation
  • Distributed representations
Artificial Neural Networks
  • Many simple neuron-like threshold units
  • Many weighted interconnections
  • Multiple outputs
  • Highly parallel and distributed processing
  • Learning by tuning the connection weights
Perceptron: Linear Threshold Unit

[Figure: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feed a summation unit Σ_{i=0}^{n} wi xi, whose sign gives the output o.]

o(x) = 1 if Σ_{i=0}^{n} wi xi > 0, and -1 otherwise
Decision Surface of a Perceptron

[Figure: two plots in the (x1, x2) plane. Left: positive (+) and negative (-) examples separated by a straight line, i.e. a linearly separable set. Right: the XOR labeling, which no single line can separate.]

Theorem: the VC-dim of a perceptron over n inputs is n+1

Perceptron Learning Rule

S: the training sample

xi: input vector

t = c(x): the target value

o: the perceptron output

η: the learning rate (a small constant; the worked example below uses η = 0.1)

wi ← wi + Δwi, where Δwi = η (t - o) xi
Perceptron Algorithm
  • Correct output (t = o)
    • Weights are unchanged
  • Incorrect output (t ≠ o)
    • Change the weights!
  • False negative (t = 1 and o = -1)
    • Add x to w
  • False positive (t = -1 and o = 1)
    • Subtract x from w
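A minimal training-loop sketch of the rule above (added for illustration); eta, the epoch count, and the toy AND data are assumptions, not taken from the slides:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=10):
    """Perceptron learning: on each mistake, w <- w + eta * (t - o) * x.

    X is (m, n+1) with X[:, 0] == 1 (the bias input x0); t holds +1/-1 targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != target:              # false negative adds x, false positive subtracts it
                w += eta * (target - o) * x
    return w

# Toy linearly separable problem: AND of two +/-1 inputs.
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))            # e.g. [-0.2  0.2  0.2], which separates the data
```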
Perceptron Learning Rule (example)

[Figure: the initial decision line x2 = 0.2 x1 - 0.5 for w = [0.25, -0.1, 0.5], with the o = 1 region on one side, the o = -1 region on the other, and training points labeled t = 1 or t = -1.]

Starting from w = [0.25, -0.1, 0.5] and using η = 0.1:

  • (x, t) = ([-1, -1], 1): o = sgn(0.25 + 0.1 - 0.5) = -1. Misclassified, so Δw = η (t - o) x = [0.2, -0.2, -0.2].
  • (x, t) = ([2, 1], -1): o = sgn(0.45 - 0.6 + 0.3) = 1. Misclassified, so Δw = [-0.2, -0.4, -0.2].
  • (x, t) = ([1, 1], 1): o = sgn(0.25 - 0.7 + 0.1) = -1. Misclassified, so Δw = [0.2, 0.2, 0.2].
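The short script below (an addition, not from the slides) reproduces the trace above, assuming η = 0.1 and the presentation order shown:

```python
import numpy as np

w = np.array([0.25, -0.1, 0.5])            # [w0, w1, w2]; x0 = 1 is the bias input
eta = 0.1

for (x1, x2), t in [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]:
    x = np.array([1, x1, x2], dtype=float)
    o = 1 if np.dot(w, x) > 0 else -1
    dw = eta * (t - o) * x                 # zero whenever the example is classified correctly
    w = w + dw
    print(f"t={t:+d}  o={o:+d}  dw={dw}  w={w}")
```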
Perceptron Algorithm: Analysis
  • Theorem: the number of errors made by the Perceptron Algorithm is bounded
  • Proof:
  • Make all examples positive
    • change <xi, bi> to <bi xi, +1>
  • Margin of the hyperplane w
Perceptron Algorithm: Analysis II
  • Let mi be the number of errors made on xi
    • M = Σi mi
  • From the algorithm: w = Σi mi xi
  • Let w* be a separating hyperplane
Perceptron Algorithm: Analysis III
  • Change in weights: each error on xi adds xi to w
  • Since w errs on xi, we have w · xi < 0
  • Total weight: ||w||² grows by at most ||xi||² per error
Perceptron Algorithm: Analysis IV
  • Consider the angle between w and w*
  • Putting it all together gives the mistake bound (see the sketch below)
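The equations on the three Analysis slides did not survive extraction. The following is a standard reconstruction of the mistake-bound argument; the symbols R and γ are introduced here and are not taken verbatim from the slides.

```latex
Assume all examples have been made positive, $\|x_i\| \le R$, and some unit-norm
$w^*$ separates them with margin $\gamma = \min_i\, w^* \cdot x_i > 0$. Start from
$w = 0$ and add $x_i$ to $w$ on every error, so after $M = \sum_i m_i$ errors
$w = \sum_i m_i x_i$. Then
\[
  w \cdot w^* = \sum_i m_i\,(x_i \cdot w^*) \;\ge\; M\gamma,
  \qquad
  \|w\|^2 \;\le\; M R^2,
\]
where the second inequality holds because each error adds
$2\,w \cdot x_i + \|x_i\|^2 \le \|x_i\|^2 \le R^2$ to $\|w\|^2$
(recall $w \cdot x_i < 0$ at an error). Since $w \cdot w^* \le \|w\|$,
combining the two bounds gives $M\gamma \le \sqrt{M}\,R$, i.e. $M \le (R/\gamma)^2$.
```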
Gradient Descent Learning Rule
  • Consider a linear unit without a threshold, with continuous output o (not just -1, 1)
    • o = w0 + w1 x1 + ... + wn xn
  • Train the wi's so that they minimize the squared error
    • E[w1, ..., wn] = ½ Σ_{d ∈ S} (td - od)²

where S is the set of training examples
Gradient Descent

[Figure: the error surface E[w] plotted over (w1, w2), with gradient descent stepping from (w1, w2) to (w1 + Δw1, w2 + Δw2); the training set shown is S = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}.]

Gradient: ∇E[w] = [∂E/∂w0, ..., ∂E/∂wn]

Δw = -η ∇E[w], i.e. Δwi = -η ∂E/∂wi

∂E/∂wi = ∂/∂wi ½ Σd (td - od)²
       = ∂/∂wi ½ Σd (td - Σi wi xi,d)²
       = Σd (td - od)(-xi,d)
Gradient Descent

Gradient-Descent(S: training_examples, η)

  • Initialize each wi to zero
  • Until TERMINATION Do
    • Initialize each Δwi to zero
    • For each <x, t> in S Do
      • Compute o = <x, w>
      • For each weight wi Do
        • Δwi ← Δwi + η (t - o) xi
    • For each weight wi Do
      • wi ← wi + Δwi
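A runnable counterpart to the pseudocode above (added for illustration); the termination test, η, and the toy regression data are assumptions rather than part of the slides:

```python
import numpy as np

def gradient_descent(S, eta=0.05, max_iters=1000, tol=1e-8):
    """Batch gradient descent for a linear unit o = <w, x> minimizing E = 1/2 * sum (t - o)^2."""
    w = np.zeros(len(S[0][0]))
    for _ in range(max_iters):
        dw = np.zeros_like(w)
        for x, t in S:                        # accumulate Delta w over the whole sample
            x = np.asarray(x, dtype=float)
            o = np.dot(w, x)
            dw += eta * (t - o) * x
        w += dw
        if np.linalg.norm(dw) < tol:          # TERMINATION: the update has become negligible
            break
    return w

# Each x is (x0=1, x1); targets are roughly 1 + 2*x1.
S = [((1, 0.0), 1.0), ((1, 1.0), 3.1), ((1, 2.0), 4.9), ((1, 3.0), 7.2)]
print(gradient_descent(S))                    # close to the least-squares fit [0.99, 2.04]
```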
Incremental (Stochastic) Gradient Descent

  • Batch mode: gradient descent
    • w ← w - η ∇E_S[w] over the entire data S
    • E_S[w] = ½ Σ_{d ∈ S} (td - od)²
  • Incremental mode: gradient descent
    • w ← w - η ∇E_d[w] over individual training examples d
    • E_d[w] = ½ (td - od)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
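For contrast with the batch loop above, a minimal sketch of the incremental (per-example) mode; the step size, epoch count, and data are again illustrative:

```python
import numpy as np

def incremental_gd(S, eta=0.02, epochs=300):
    """Incremental/stochastic mode: update w after every individual example d."""
    w = np.zeros(len(S[0][0]))
    for _ in range(epochs):
        for x, t in S:
            x = np.asarray(x, dtype=float)
            o = np.dot(w, x)
            w += eta * (t - o) * x            # one step on E_d[w] = 1/2 * (t - o)^2 alone
    return w

S = [((1, 0.0), 1.0), ((1, 1.0), 3.1), ((1, 2.0), 4.9), ((1, 3.0), 7.2)]
print(incremental_gd(S))                      # approaches the batch solution for small eta
```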

Comparison: Perceptron and Gradient Descent Rules

Perceptron learning rule: guaranteed to succeed if

  • the training examples are linearly separable
  • no guarantee otherwise

Linear unit trained with gradient descent:

  • converges to the hypothesis with minimum squared error
  • given a sufficiently small learning rate η
  • even when the training data contain noise
  • even when the training data are not linearly separable
Multi-Layer Networks

[Figure: a feed-forward network with an input layer, one or more hidden layers, and an output layer.]
Sigmoid Unit

[Figure: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feed a summation unit; the result z is passed through the sigmoid to produce the output o.]

z = Σ_{i=0}^{n} wi xi

o = σ(z) = 1 / (1 + e^(-z)), the sigmoid function
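A minimal sketch of the sigmoid unit (added for illustration):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid (logistic) function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit(w, x):
    """o = sigma(sum_i w_i x_i), with x[0] = 1 acting as the bias input."""
    return sigmoid(np.dot(w, x))

print(sigmoid_unit(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.3, 0.7])))  # about 0.83
```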

Sigmoid Function

σ(z) = 1 / (1 + e^(-z))

dσ(z)/dz = σ(z) (1 - σ(z))

  • Gradient descent rule for a single sigmoid unit:
    • ∂E/∂wi = -Σd (td - od) od (1 - od) xi,d
  • Multilayer networks of sigmoid units:
    • backpropagation
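A quick numerical check of the derivative identity and of the single-unit gradient above; the data and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check d(sigma)/dz = sigma(z) * (1 - sigma(z)) against a finite difference.
z, h = 0.7, 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(abs(analytic - numeric) < 1e-8)                 # True

def error_gradient(w, X, t):
    """dE/dw_i = -sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d} for E = 1/2 * sum_d (t_d - o_d)^2."""
    o = sigmoid(X @ w)
    return -((t - o) * o * (1 - o)) @ X

X = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.5]])   # x0 = 1 is the bias input
t = np.array([1.0, 0.0, 1.0])
print(error_gradient(np.zeros(2), X, t))
```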
Backpropagation: Overview
  • Make threshold units differentiable
    • Use sigmoid functions
  • Given a sample compute:
    • The error
    • The Gradient
  • Use the chain rule to compute the Gradient
Backpropagation: Motivation
  • Consider the squared error over all output units
    • E_S[w] = ½ Σ_{d ∈ S} Σ_{k ∈ outputs} (td,k - od,k)²
  • Gradient: ∇E_S[w]
  • Update: w ← w - η ∇E_S[w]
  • How do we compute the gradient?
Backpropagation: Algorithm
  • Forward phase:
    • Given input x, compute the output of each unit
  • Backward phase:
    • For each output unit k compute
      • δk = ok (1 - ok) (tk - ok)

Backpropagation: Algorithm (continued)
  • Backward phase (continued):
    • For each hidden unit h compute:
      • δh = oh (1 - oh) Σ_{k ∈ outputs} wh,k δk
  • Update weights:
    • wi,j ← wi,j + Δwi,j, where Δwi,j = η δj xi
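To make the two phases concrete, here is a self-contained sketch of one backpropagation step for a network with a single hidden layer of sigmoid units; the layer sizes, η, and variable names are assumptions rather than anything prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eta=0.1):
    """One forward + backward pass for a one-hidden-layer network of sigmoid units."""
    # Forward phase: compute the output of each unit.
    h = sigmoid(W_hid @ x)                              # hidden outputs o_h
    o = sigmoid(W_out @ h)                              # network outputs o_k

    # Backward phase: output deltas first, then hidden deltas via the chain rule.
    delta_out = o * (1 - o) * (t - o)                   # delta_k = o_k (1-o_k) (t_k - o_k)
    delta_hid = h * (1 - h) * (W_out.T @ delta_out)     # delta_h = o_h (1-o_h) sum_k w_{h,k} delta_k

    # Update weights: Delta w_{i,j} = eta * delta_j * x_i.
    W_out = W_out + eta * np.outer(delta_out, h)
    W_hid = W_hid + eta * np.outer(delta_hid, x)
    return W_hid, W_out

rng = np.random.default_rng(0)
W_hid = rng.normal(0.0, 0.1, size=(3, 2))               # 2 inputs (x[0] = 1), 3 hidden units
W_out = rng.normal(0.0, 0.1, size=(1, 3))               # 1 output unit
W_hid, W_out = backprop_step(W_hid, W_out, x=np.array([1.0, 0.5]), t=np.array([1.0]))
```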
Backpropagation: Summary
  • Gradient descent over the entire network weight vector
  • Easily generalized to arbitrary directed graphs
  • Finds a local, not necessarily global, error minimum
    • in practice it often works well
    • may require multiple invocations with different initial weights
  • A variation is to include a momentum term:

    Δwi,j(n) = η δj xi + α Δwi,j(n-1)

  • Minimizes error over the training examples
  • Training is fairly slow, yet prediction is fast
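A small sketch of the momentum variation (α, η, and the constant-gradient toy loop are illustrative); here the slide's η δj xi term is written as -η times the error gradient:

```python
import numpy as np

def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w(n) = -eta * grad E(n) + alpha * Delta w(n-1): reuse part of the previous step."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

w, dw = np.zeros(3), np.zeros(3)
for _ in range(5):                                   # pretend the gradient stays constant
    w, dw = momentum_update(w, np.array([1.0, -2.0, 0.5]), dw)
print(w)                                             # steps grow as momentum accumulates
```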
Expressive Capabilities of ANNs

Boolean functions

  • Every Boolean function can be represented by a network with a single hidden layer
  • but this might require a number of hidden units that is exponential in the number of inputs

Continuous functions

  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
VC-dim of ANNs
  • A more general bound.
  • Concept class F(C, G):
    • G: a directed acyclic graph
    • C: a concept class with d = VC-dim(C)
    • n: number of input nodes
    • s: number of inner nodes (each of in-degree r)

Theorem: VC-dim(F(C, G)) < 2ds log(es)

Proof:
  • Bound |F(C,G)(m)|, the number of labelings of m points realizable by F(C, G)
  • Find the smallest m such that |F(C,G)(m)| < 2^m
  • Let S = {x1, ..., xm}
  • For each fixed G we define a matrix U
    • U[i,j] = ci(xj), where ci is the concept (from C) computed at the i-th inner node
    • U describes the computation of G on S
  • Let T_{F(C,G)} be the number of different such matrices
Proof (continued):
  • Clearly |F(C,G)(m)| ≤ T_{F(C,G)}
  • Let G' be G without the root.
  • |F(C,G)(m)| ≤ T_{F(C,G)} ≤ T_{F(C,G')} · |C(m)|
  • Inductively, |F(C,G)(m)| ≤ |C(m)|^s
  • Recall the VC bound (Sauer's lemma): |C(m)| ≤ (em/d)^d
  • Combined bound: |F(C,G)(m)| ≤ (em/d)^{ds}
Proof (continued):
  • Solve for: (em/d)^{ds} ≤ 2^m
  • This holds for m ≥ 2ds log(es)
  • QED

Back to ANNs:
  • VC-dim(C) = n+1
  • VC-dim(ANN) < 2(n+1)s log(es)
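The algebra behind the final step did not survive extraction; one way to verify it, assuming base-2 logarithms and s ≥ 2:

```latex
\[
  \Bigl(\frac{em}{d}\Bigr)^{ds} \le 2^m
  \;\iff\;
  ds \log_2\frac{em}{d} \le m .
\]
Substituting $m = 2ds\log_2(es)$ gives $\frac{em}{d} = 2es\log_2(es)$, hence
\[
  ds \log_2\frac{em}{d}
  = ds\bigl(1 + \log_2(es) + \log_2\log_2(es)\bigr)
  \le 2ds\log_2(es) = m ,
\]
using $1 + \log_2 u \le u$ for $u = \log_2(es) \ge 2$, i.e. for $s \ge 2$.
```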