Elementary Concepts of Neural Networks

Preliminaries of artificial neural network computation

Learning

Behavioral improvement through increased information about the environment.

An experiment in learning
  • Pigeons as art experts (Watanabe et al. 1995)
  • Experiment:
    • Pigeon in Skinner box
    • Present paintings of two different artists (e.g. Chagall / Van Gogh)
    • Reward when paintings by a particular artist (e.g. Van Gogh) are presented
Pigeons as art experts
  • Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on)
  • Discrimination still 85% successful for previously unseen paintings of the artists
  • Pigeons do not simply memorise the pictures
  • They can extract and recognise patterns (the ‘style’)
  • They generalise from the already seen to make predictions
  • This is what neural networks (biological and artificial) are good at (unlike conventional computers)
What are Neural Networks?
  • Models of the brain and nervous system
  • Highly parallel
  • Learning
  • Very simple principles
  • Very complex behaviours
  • Applications
    • as biological models
    • as powerful problem solvers
Goals of neural computation
  • To understand how the brain actually works
  • To understand a new style of computation inspired by neurons and their adaptive connections
    • Very different style from sequential computation
      • should be good at things the brain is good at
      • should be bad at things the brain is bad at
  • To solve practical problems by using novel learning algorithms
  • Learning algorithms can be very useful even if they have nothing to do with how the brain works
A typical cortical neuron

Gross physical structure:
  • There is one axon that branches
  • There is a dendritic tree that collects input from other neurons
  • Axons typically contact dendritic trees at synapses
  • A spike of activity in the axon causes charge to be injected into the post-synaptic neuron

Spike generation:
  • There is an axon that generates outgoing spikes whenever enough charge has flowed

[Figure: a cortical neuron, with its axon and dendritic tree labeled]
Brain vs. Network

[Figure: a biological brain neuron shown side by side with an artificial neural network]
Synapses
  • When a spike travels along an axon and arrives at a synapse, it causes vesicles of transmitter chemical to be released
  • The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron, thus changing their shape
  • The effectiveness of the synapse can be changed
  • Synapses are slow, but they have advantages over RAM
    • They are massively parallel, and they adapt using locally available signals (but how?)
How the brain works
  • Each neuron receives inputs from other neurons
    • Some neurons also connect to receptors
    • Neurons use spikes to communicate
    • The timing of spikes is important
  • The effect of each input line on the neuron is controlled by a synaptic weight
    • The weights can be positive or negative
  • The synaptic weights adapt so that the whole network learns to perform useful computations
    • Recognizing objects, understanding language, making plans, controlling the body
Idealized neurons
  • To model things we have to idealize them (e.g. atoms)
    • Idealization removes complicated details that are not essential for understanding the main principles
    • Allows us to apply mathematics and to make analogies to other, familiar systems.
    • Once we understand the basic principles, it's easy to add complexity to make the model more faithful
  • It is often worth understanding models that are known to be wrong (but we mustn’t forget that they are wrong!)
    • E.g. neurons that communicate real values rather than discrete spikes of activity.
Binary threshold neurons
  • McCulloch-Pitts (1943): influenced Von Neumann!
    • First compute a weighted sum of the inputs from other neurons
    • Then send out a fixed size spike of activity if the weighted sum exceeds a threshold.
    • Maybe each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

y = 1 if z >= threshold, 0 otherwise

[Figure: step activation; the output y jumps from 0 to 1 where the total input z crosses the threshold]
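
To make the arithmetic concrete, here is a minimal Python sketch of such a unit (my own illustration; the weights, inputs, and threshold are arbitrary example values):

def binary_threshold(inputs, weights, threshold):
    # McCulloch-Pitts unit: fire (output 1) iff the weighted sum reaches the threshold
    z = sum(x * w for x, w in zip(inputs, weights))
    return 1 if z >= threshold else 0

# Example: with these weights and threshold the unit computes logical AND
print(binary_threshold([1, 1], [1.0, 1.0], threshold=2.0))  # 1
print(binary_threshold([1, 0], [1.0, 1.0], threshold=2.0))  # 0

The AND example echoes the propositional reading above: the unit outputs the truth value of a conjunction of its inputs.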

Linear neurons
  • These are simple but computationally limited
    • If we can make them learn we may get insight into more complicated neurons

y = b + sum_i (x_i * w_i)

where y is the output, b is the bias, x_i is the i-th input, w_i is the weight on the i-th input, and the index i runs over the input connections.

Linear threshold neurons

These have a confusing name.

They compute a linear weighted sum of their inputs.

The output is a non-linear function of the total input:

y = z if z >= threshold, 0 otherwise

[Figure: the output y is 0 below the threshold and grows linearly with the total input z above it]

Sigmoid neurons
  • These give a real-valued output that is a smooth and bounded function of their total input.
    • Typically they use the logistic function
    • They have nice derivatives which make learning easy (see lecture 4).
  • Local basis functions (radial) are also used

y = 1 / (1 + e^(-z))

[Figure: the logistic curve; y rises smoothly from 0, through 0.5 at z = 0, toward 1]
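
A small sketch (mine, not from the slides) of the logistic function and the derivative property that makes learning easy:

import math

def logistic(z):
    # squashes any real z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logistic_slope(z):
    # dy/dz = y * (1 - y): cheap to compute once y is known
    y = logistic(z)
    return y * (1.0 - y)

print(logistic(0.0))        # 0.5, the midpoint of the curve
print(logistic_slope(0.0))  # 0.25, where the slope is steepest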

Non-linear neurons with smooth derivatives
  • For backpropagation, we need neurons that have well-behaved derivatives
  • Typically they use the logistic function
  • The output is a smooth function of the inputs and the weights

[Figure: the same logistic curve as above, rising from 0 through 0.5 to 1]

Types of connectivity
  • Feedforward networks
    • These compute a series of transformations
    • Typically, the first layer is the input and the last layer is the output
  • Recurrent networks
    • These have directed cycles in their connection graph. They can have complicated dynamics
    • More biologically realistic

[Figure: a feedforward network drawn as input units -> hidden units -> output units]

Types of learning task
  • Supervised learning
    • Learn to predict output when given input vector
      • Who provides the correct answer?
  • Reinforcement learning
    • Learn action to maximize payoff
      • Not much information in a payoff signal
      • Payoff is often delayed
  • Unsupervised learning
    • Create an internal representation of the input e.g. form clusters; extract features
      • How do we know if a representation is good?
Single Layer Feed-forward

[Figure: an input layer of source nodes connected directly to an output layer of neurons]

Multi-layer feed-forward

[Figure: a 3-4-2 network, i.e. an input layer of 3 nodes, a hidden layer of 4 neurons, and an output layer of 2 neurons]

Recurrent networks

[Figure: a recurrent network with a hidden neuron system, input -> hidden -> output, with z^-1 unit-delay elements feeding signals back]

The Neuron

[Figure: input values x1, x2, ..., xm are scaled by weights w1, w2, ..., wm and combined by a summing function, together with the bias b, into the local field v; the activation function then produces the output y]

The Neuron
  • The neuron is the basic information processing unit of a NN. It consists of:
    • A set of links, describing the neuron inputs, with weights w1, w2, …, wm
    • An adder function (linear combiner) computing the weighted sum of the (real-valued) inputs: u = w1 x1 + w2 x2 + … + wm xm
    • An activation function (squashing function) for limiting the amplitude of the neuron output
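
A minimal sketch of this unit in Python (the function names and example values are mine, for illustration):

def neuron(x, w, b, activation):
    # adder: weighted sum of the inputs plus bias gives the local field v
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    # activation (squashing) function limits the amplitude of the output
    return activation(v)

step = lambda v: 1 if v >= 0 else 0
print(neuron(x=[0.5, -1.0], w=[2.0, 1.0], b=0.1, activation=step))  # 1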
Bias as extra input

The bias can be treated as one more synaptic weight, w0, attached to a fixed input signal x0 = +1.

[Figure: the same neuron as before, with input signals x0 = +1, x1, ..., xm, synaptic weights w0, w1, ..., wm, a summing function giving the local field v, and an activation function producing the output y]

Neuron Models
  • The choice of the activation function f determines the neuron model, e.g.:
    • Step function
    • Ramp function
    • Sigmoid function
    • Gaussian function (Radial Basis Functions)
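
For concreteness, here are standard textbook forms of these choices in Python (the exact parameterizations on the original slides are not reproduced here, so treat these as representative):

import math

def step(v, threshold=0.0):
    # hard 0/1 decision at the threshold
    return 1.0 if v >= threshold else 0.0

def ramp(v, lo=-1.0, hi=1.0):
    # linear between the saturation limits, clipped outside them
    return max(lo, min(hi, v))

def sigmoid(v):
    # smooth and bounded in (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def gaussian(v, center=0.0, width=1.0):
    # local (radial) response peaked at the center
    return math.exp(-((v - center) ** 2) / (2.0 * width ** 2))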
Perceptron: Single Neuron Model

[Figure: inputs x1, x2, ..., xn with weights w1, w2, ..., wn and bias b are summed into v and passed through the activation φ(v) to give the output y]
  • The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function
Perceptron's geometric view
  • The equation below describes a (hyper-)plane in the input space consisting of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class:

w1 x1 + w2 x2 + w0 = 0

[Figure: the decision boundary in the (x1, x2) plane; points with w1 x1 + w2 x2 + w0 >= 0 fall in class C1, the others in class C2]
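
A sketch of the resulting decision rule (the weights below are my own example, defining the boundary x1 + x2 - 1 = 0):

def classify(x1, x2, w1=1.0, w2=1.0, w0=-1.0):
    # points on the positive side of the hyperplane go to class C1, the rest to C2
    return "C1" if w1 * x1 + w2 * x2 + w0 >= 0 else "C2"

print(classify(1.0, 1.0))  # C1
print(classify(0.0, 0.0))  # C2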

Learning with hidden units
  • Networks without hidden units are very limited in the input-output mappings they can model
    • More layers of linear units are still linear
  • We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
    • We need an efficient way of adapting all the weights, not just those of the last layer, and this is hard
Learning by perturbing weights
  • Randomly perturb one weight and see if it improves performance. If so, save the change
    • Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight
    • Towards the end of learning, large weight perturbations will nearly always make things worse
  • We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes
    • Not any better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others
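
A toy sketch of the single-weight perturbation idea (the linear model, squared-error measure, and data below are my own illustration):

import random

data = [((1.0, 2.0), 5.0), ((2.0, 1.0), 4.0), ((0.0, 3.0), 6.0)]
weights = [0.0, 0.0]

def error(w):
    # total squared error of the linear model over the training set
    return sum((w[0] * x1 + w[1] * x2 - t) ** 2 for (x1, x2), t in data)

for _ in range(1000):
    i = random.randrange(len(weights))   # perturb one randomly chosen weight
    delta = random.uniform(-0.1, 0.1)
    before = error(weights)              # a full pass over the data per change
    weights[i] += delta
    if error(weights) >= before:         # keep the change only if it helped
        weights[i] -= delta

print(weights)  # drifts toward the solution w = [1, 2]

Note the cost: every candidate change requires a fresh pass over the training data, which is exactly the inefficiency the slide points out.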
The idea behind backpropagation
  • We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity
    • Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
    • Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
  • We can compute error derivatives for all the hidden units efficiently
    • Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit
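
A minimal sketch of this derivative bookkeeping on a tiny 2-1-1 network (sizes, data, and learning rate are my own toy choices, not the lecture's):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
w_h = [random.uniform(-1, 1) for _ in range(2)]  # input -> hidden weights
w_o = random.uniform(-1, 1)                      # hidden -> output weight
x, target, rate = [1.0, 0.5], 0.8, 0.5

for _ in range(2000):
    # forward pass
    h = sigmoid(sum(w * xi for w, xi in zip(w_h, x)))
    y = sigmoid(w_o * h)
    # backward pass: error derivatives flow from the output back to the hidden unit
    dE_dy = y - target              # derivative of 0.5 * (y - target)^2
    dE_dzo = dE_dy * y * (1 - y)    # through the output sigmoid
    dE_dh = dE_dzo * w_o            # derivative w.r.t. the hidden ACTIVITY
    dE_dzh = dE_dh * h * (1 - h)    # through the hidden sigmoid
    # weight derivatives follow directly from the activity derivatives
    w_o -= rate * dE_dzo * h
    w_h = [w - rate * dE_dzh * xi for w, xi in zip(w_h, x)]

print(round(y, 3))  # approaches the target 0.8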
Ways to use weight derivatives
  • How often to update?
    • after each training case?
    • after a full sweep through the training data?
  • How much to update?
    • use a fixed learning rate?
    • adapt the learning rate?
    • don't use steepest descent?
Overfitting
  • The training data contains information about the regularities in the mapping from input to output. But it also contains noise
    • The target values may be unreliable.
    • There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
  • When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
    • So it fits both kinds of regularity.
    • If the model is very flexible it can model the sampling error really well. This is a disaster.
Simple overfitting example
  • Which model do you believe?
    • The complicated model fits the data better
    • But it is not realistic!
  • A model is convincing when it fits a lot of data surprisingly well
    • It is not surprising that a complicated model can fit a small amount of data
  • Ockham's Razor
Key characteristics
  • NNs are versatile and “general” models
  • Require little, if any, insight
  • Usually impossible to interpret
    • is this yet another multivariate parameter estimation approach?
  • Well, it depends on how they are used:
    • The basic concept behind NN modeling is to identify complex emergent behavior by combining simple elements
    • NNs should not be viewed as exercises in optimization
  • People either love or hate NNs!
Some thoughts ...
  • How do we interpret (artificial) NNs?
  • Nature shows 3 key characteristics:
    • highly robust (recover memory with partial knowledge ... see next page)
    • highly adaptable (connections created and/or bypassed)
    • complexity emerging from simplicity
  • These could be the results of MASSIVE PARALLELISM:
    • on the order of 10^12 neurons
    • on the order of 10^14 synapses
  • Can we really build models like that?
Applications
  • Too many ... Anytime you look for an I/O relation and you lack fundamental understanding and first-principles (or even "gray") models
    • optimization: Hopfield Networks
    • classification: FFNN/BP
    • dimensionality reduction: Autoassociative NNs
    • visualization: SOM
    • modeling: Recurrent NNs
    • cognitive sciences: a very legitimate domain ...
Dimensionality reduction
  • PCA, nonlinear PCA ...
Recurrent Networks
  • A recurrent network with 5 nodes

[Figure: five interconnected nodes (1-5) with external inputs x1, ..., x5 and outputs z4 and z5]
Memories as Attractors
  • Attractor Network [Hopfield, 1982]
    • store memories as dynamical attractors
    • recover a memory using partial information

[Figure: a partial pattern evolves toward the stored memory it lies closest to]


Stochastic Optimization

Basic preliminaries of Simulated Annealing and Genetic Algorithms

Why stochastic algorithms
  • Like poker ... 15 minutes to learn, a lifetime to master ...
  • Deceptively easy to grasp and implement ... which means that the implementations can become tricky ...
  • It's straightforward to incorporate domain-specific knowledge
  • Will always produce something
    • insensitive to minor details such as differentiability, scaling, bad modeling, etc.
  • They have physical analogues making them attractive to physical scientists
Why NOT stochastic algorithms
  • Convergence properties are ONLY asymptotic
  • Incorporating constraints is HIGHLY non-trivial
    • corollary: maintaining feasibility is HIGHLY non-trivial unless appropriate heuristics are used ... but then again, is this stochastic or biased?
  • Can become EXTREMELY expensive (computationally) since functions are evaluated without any specific goal in mind (use with caution ...)
  • Rule of thumb:
    • if the problem has a special structure, use a specialized algorithm
    • if reasonable algorithms exist, use them
    • if nothing is known, then ... improvise
Simulated Annealing

Let’s talk about annealing

The algorithm

The basic Metropolis loop, rendered here as a runnable Python sketch (the energy function, move generator, temperature T, and chain length M are problem-specific placeholders):

import math, random

def anneal_at_temperature(x, energy, random_move, T, M):
    E = energy(x)
    for m in range(M):
        x_new = random_move(x)                     # generate random move
        dE = energy(x_new) - E                     # evaluate the energy change
        if dE < 0:                                 # downhill: accept
            x, E = x_new, E + dE                   # update configuration
        elif random.random() < math.exp(-dE / T):  # uphill: accept with P = exp(-dE/T)
            x, E = x_new, E + dE                   # update configuration if accepted
    return x

The main issues
  • The move sets
    • how to create new configurations
    • random and/or heuristics
  • The cooling schedule
    • the length of the Markov chain (M)
    • the rate at which the temperature is lowered
  • Convergence
    • asymptotic and probabilistic
Some thoughts ...
  • Direct methods have shown advantages when
    • the combinatorial complexity is overwhelming
    • the model is implicit, noisy and/or non-differentiable
  • Extensions to continuous problems are not trivial
  • SA is a framework rather than a specific algorithm
    • multiple variants
Genetic Algorithms

Let’s talk about the survival of the fittest

Genetic Algorithms (GA) OVERVIEW
  • A class of probabilistic optimization algorithms
  • Inspired by the biological evolution process
  • Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859)
  • Originally developed by John Holland (1975)
  • Particularly well suited for hard problems where little is known about the underlying search space
  • Widely used in business, science and engineering
What is a GA
  • A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators
Stochastic operators
  • Selection replicates the most successful solutions found in a population at a rate proportional to their relative quality
  • Recombination decomposes two distinct solutions and then randomly mixes their parts to form novel solutions
  • Mutation randomly perturbs a candidate solution
The Metaphor

Genetic Algorithm                       Nature
-----------------                       ------
Optimization problem                    Environment
Feasible solutions                      Individuals living in that environment
Solution quality (fitness function)     Individual's degree of adaptation to its surrounding environment
The Metaphor (cont.)

Genetic Algorithm                                                 Nature
-----------------                                                 ------
A set of feasible solutions                                       A population of organisms (species)
Stochastic operators                                              Selection, recombination and mutation in nature's evolutionary process
Iteratively applying stochastic operators to feasible solutions   Evolution of populations to suit their environment
Simple Genetic Algorithm

produce an initial population of individuals
evaluate the fitness of all individuals
while termination conditions not met do
    select fitter individuals for reproduction
    recombine between individuals
    mutate individuals
    evaluate the fitness of the modified individuals
    generate a new population
end while
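
A runnable Python sketch of this loop on 10-bit strings, with the number of 1-bits as fitness (it mirrors the worked example that follows; population size, rates, and generation count are my own choices):

import random

BITS, POP, GENS = 10, 6, 50

def fitness(s):
    return sum(s)  # count of 1-bits

def select(pop):
    # fitness-proportionate selection (assumes total fitness > 0)
    return random.choices(pop, weights=[fitness(s) for s in pop], k=1)[0]

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for gen in range(GENS):
    new_pop = []
    while len(new_pop) < POP:
        a, b = select(pop), select(pop)
        if random.random() < 0.6:              # crossover probability
            point = random.randrange(1, BITS)
            a = a[:point] + b[point:]          # one-point crossover
        a = [bit ^ (random.random() < 0.1) for bit in a]  # bit-flip mutation
        new_pop.append(a)
    pop = new_pop                              # generate a new population

print(max(map(fitness, pop)))  # best fitness found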

The Evolutionary Cycle

[Figure: the cycle runs "initiate & evaluate" -> population -> selection of parents -> modification -> modified offspring -> evaluation -> evaluated offspring rejoin the population, while deleted members are discarded]

Example (initialization)
  • We toss a fair coin 60 times (6 strings of 10 bits each) and get the following initial population, where the fitness f counts the number of 1-bits:

s1 = 1111010101    f(s1) = 7
s2 = 0111000101    f(s2) = 5
s3 = 1110110101    f(s3) = 7
s4 = 0100010011    f(s4) = 4
s5 = 1110111101    f(s5) = 8
s6 = 0100110000    f(s6) = 3

Example (selection 1)

Next we apply fitness-proportionate selection with the roulette wheel method: individual i is chosen with probability

p(i) = f(i) / sum_j f(j)

[Figure: a roulette wheel with one sector per individual (1, 2, 3, 4, ..., n); each sector's area is proportional to the individual's fitness value]

We repeat the extraction as many times as needed to keep the parent population at the same size (6 in our case).
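
A sketch of one roulette-wheel extraction over the six fitness values above:

import random

fitness = {"s1": 7, "s2": 5, "s3": 7, "s4": 4, "s5": 8, "s6": 3}

def roulette(fitness):
    # one spin: land in a sector with probability proportional to its fitness
    r = random.uniform(0, sum(fitness.values()))
    acc = 0.0
    for name, f in fitness.items():
        acc += f
        if r <= acc:
            return name
    return name  # guard against floating-point rounding at the wheel's edge

print([roulette(fitness) for _ in range(6)])  # six extractions keep the population size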

Example (selection 2)

Suppose that, after performing selection, we get the following population:

s1` = 1111010101 (s1)
s2` = 1110110101 (s3)
s3` = 1110111101 (s5)
s4` = 0111000101 (s2)
s5` = 0100010011 (s4)
s6` = 1110111101 (s5)
Example (crossover1)
  • Next we mate strings for crossover. For each couple we decide according to crossover probability (for instance 0.6) whether to actually perform crossover or not
  • Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`). For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second
Example (crossover 2)

Before crossover:

s1` = 1111010101    s2` = 1110110101
s5` = 0100010011    s6` = 1110111101

After crossover:

s1`` = 1110110101   s2`` = 1111010101
s5`` = 0100011101   s6`` = 1110110011
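
A sketch reproducing this step exactly (point k keeps the first k bits of one parent and appends the tail of the other):

def one_point_crossover(a, b, point):
    # swap the tails of the two parents after the crossover point
    return a[:point] + b[point:], b[:point] + a[point:]

print(one_point_crossover("1111010101", "1110110101", point=2))
# ('1110110101', '1111010101')
print(one_point_crossover("0100010011", "1110111101", point=5))
# ('0100011101', '1110110011')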
Example (mutation 1)

The final step is to apply random mutation: for each bit that we copy to the new population we allow a small probability of error (for instance 0.1).

Before applying mutation:

s1`` = 1110110101
s2`` = 1111010101
s3`` = 1110111101
s4`` = 0111000101
s5`` = 0100011101
s6`` = 1110110011
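
A sketch of per-bit mutation with error probability 0.1 (the flips are random, so the output will generally differ from the slide's):

import random

def mutate(s, p=0.1):
    # flip each bit independently with probability p while copying the string
    return "".join(b if random.random() >= p else "10"[int(b)] for b in s)

print(mutate("1110110101"))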

Example (mutation 2)

After applying mutation:

s1``` = 1110100101   f(s1```) = 6
s2``` = 1111110100   f(s2```) = 7
s3``` = 1110101111   f(s3```) = 8
s4``` = 0111000101   f(s4```) = 5
s5``` = 0100011101   f(s5```) = 5
s6``` = 1110110001   f(s6```) = 6

Components of a GA
  • A problem definition as input, and
    • Encoding principles (gene, chromosome)
    • Initialization procedure (creation)
    • Selection of parents (reproduction)
    • Genetic operators (mutation, recombination)
    • Evaluation function (environment)
    • Termination condition
The Traveling Salesman Problem (TSP)
  • The traveling salesman must visit every city in his territory exactly once and then return to the starting point; given the cost of travel between all cities, how should he plan his itinerary for minimum total cost of the entire tour?
  • TSP is NP-complete
TSP (Representation, Evaluation, Initialization and Selection)
  • A vector v = (i1, i2, …, in) represents a tour (v is a permutation of {1, 2, …, n})
  • The fitness f of a solution is the inverse cost of the corresponding tour
  • Initialization: use either some heuristics, or a random sample of permutations of {1, 2, …, n}
  • We shall use fitness-proportionate selection
TSP Heuristic (Inversion)
  • The sub-string between two randomly selected points in the path is reversed
  • Example:
    • (1 2 3 4 5 6 7 8 9) -> (1 2 7 6 5 4 3 8 9)
  • Such simple inversion guarantees that the resulting offspring is a legal tour
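
A sketch of the inversion move (cut points are normally random; here it is shown with the slide's example):

import random

def invert(tour, i=None, j=None):
    # reverse the sub-string between two cut points; the result is always a legal tour
    if i is None or j is None:
        i, j = sorted(random.sample(range(len(tour) + 1), 2))
    return tour[:i] + tour[i:j][::-1] + tour[j:]

print(invert([1, 2, 3, 4, 5, 6, 7, 8, 9], i=2, j=7))  # [1, 2, 7, 6, 5, 4, 3, 8, 9]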
Notation (schema)
  • {0,1,#} is the symbol alphabet, where # is a special wild card symbol
  • A schema is a template consisting of a string composed of these three symbols
  • Example: the schema [01#1#] matches the strings: [01010], [01011], [01110] and [01111]
Notation (order)
  • The order of the schema S (denoted by o(S)) is the number of fixed positions (0 or 1) present in the schema
  • Example:
    • for S1 = [01#1#], o(S1) = 3
    • for S2 = [##1#1010], o(S2) = 5
  • The order of a schema is useful for calculating the schema's survival probability under mutation
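
A sketch of schema matching and order, using the examples above:

def matches(schema, s):
    # a string matches a schema if it agrees on every fixed (non-#) position
    return all(c == "#" or c == b for c, b in zip(schema, s))

def order(schema):
    # o(S): the number of fixed positions in the schema
    return sum(c != "#" for c in schema)

print(matches("01#1#", "01011"))           # True
print(order("01#1#"), order("##1#1010"))   # 3 5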
Schema Theorem
  • Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm
  • Result: GAs explore the search space by short, low-order schemata which, subsequently, are used for information exchange during crossover
Building Block Hypothesis
  • A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks
  • The building block hypothesis has been found to apply in many cases but it depends on the representation and genetic operators used
Some thoughts ...
  • GAs are simple to implement but share similar advantages and disadvantages with other stochastic optimization methods
  • An additional critical point is the space representation
    • operators are defined on strings and therefore an appropriate mapping of the search space needs to be defined
What should you do?
  • Use with caution and when appropriate
    • phenomenal opportunities for modeling complex systems and studying emergence and nonlinear phenomena
  • Rule of thumb
    • if the problem has a special structure, use a specialized algorithm
    • if reasonable algorithms exist, use them
    • if nothing is known, then ... improvise
  • A lot of room for improvement based on systems approaches as taught in this course