By **oshin**

### Elementary Concepts of Neural Networks

### Stochastic Optimization

Preliminaries of artificial neural network computation

Learning

Behavioral improvement through increased information about the environment.

An experiment in learning

- Pigeons as art experts (Watanabe et al. 1995)
- Experiment:
- Pigeon in Skinner box
- Present paintings of two different artists (e.g. Chagall / Van Gogh)
- Reward when presented a particular artist (e.g. Van Gogh)

Pigeons as art experts

- Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on)
- Discrimination still 85% successful for previously unseen paintings of the artists
- Pigeons do not simply memorise the pictures
- They can extract and recognise patterns (the ‘style’)
- They generalise from the already seen to make predictions
- This is what neural networks (biological and artificial) are good at (unlike conventional computers)

What are Neural Networks?

- Models of the brain and nervous system
- Highly parallel
- Learning
- Very simple principles
- Very complex behaviours
- Applications
- as biological models
- as powerful problem solvers

Goals of neural computation

- To understand how the brain actually works
- To understand a new style of computation inspired by neurons and their adaptive connections
- Very different style from sequential computation
- should be good at things that the brain is good at
- should be bad at things that the brain is bad at
- To solve practical problems by using novel learning algorithms
- Learning algorithms can be very useful even if they have nothing to do with how the brain works

A typical cortical neuron

- Gross physical structure:
- There is one axon that branches
- There is a dendritic tree that collects input from other neurons
- Axons typically contact dendritic trees at synapses
- A spike of activity in the axon causes charge to be injected into the post-synaptic neuron
- Spike generation:
- There is an axon that generates outgoing spikes whenever enough charge has flowed

[Figure: a cortical neuron with its axon and dendritic tree labelled]

Synapses

- When a spike travels along an axon and arrives at a synapse it causes vesicles of transmitter chemical to be released
- The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron, thus changing their shape.
- The effectiveness of the synapse can be changed
- Synapses are slow, but they have advantages over RAM
- Massively parallel; they adapt using locally available signals (but how?)

How the brain works

- Each neuron receives inputs from other neurons
- Some neurons also connect to receptors
- Neurons use spikes to communicate
- The timing of spikes is important
- The effect of each input line on the neuron is controlled by a synaptic weight
- The weights can be positive or negative
- The synaptic weights adapt so that the whole network learns to perform useful computations
- Recognizing objects, understanding language, making plans, controlling the body

Idealized neurons

- To model things we have to idealize them (e.g. atoms)
- Idealization removes complicated details that are not essential for understanding the main principles
- Allows us to apply mathematics and to make analogies to other, familiar systems.
- Once we understand the basic principles, it's easy to add complexity to make the model more faithful
- It is often worth understanding models that are known to be wrong (but we mustn’t forget that they are wrong!)
- E.g. neurons that communicate real values rather than discrete spikes of activity.

Binary threshold neurons

- McCulloch-Pitts (1943): influenced Von Neumann!
- First compute a weighted sum of the inputs from other neurons
- Then send out a fixed size spike of activity if the weighted sum exceeds a threshold.
- Maybe each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

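$$z = \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

[Figure: step plot of the output y against the weighted sum z, jumping from 0 to 1 at the threshold θ]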

Linear neurons

- These are simple but computationally limited
- If we can make them learn we may get insight into more complicated neurons

$$y = b + \sum_i x_i w_i$$

where y is the output, b is the bias, i is the index over input connections, x_i is the i-th input, and w_i is the weight on the i-th input.

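Linear threshold neurons

- These have a confusing name.
- They compute a linear weighted sum of their inputs
- The output is a non-linear function of the total input

$$z = b + \sum_i x_i w_i, \qquad y = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

[Figure: plot of the output y against the total input z, with y = 0 below the threshold]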
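The three idealized neurons described so far differ only in what they do with the weighted sum. The following is a minimal Python sketch, not taken from the slides: the function names, the example inputs and weights, and the choice to fire exactly at the threshold (`>=`) are illustrative assumptions.

```python
def total_input(inputs, weights, bias=0.0):
    """Weighted sum z = b + sum_i x_i * w_i."""
    return bias + sum(x * w for x, w in zip(inputs, weights))


def binary_threshold(inputs, weights, threshold):
    """Emit a fixed-size spike (1) when the weighted sum reaches the threshold.

    Firing exactly at the threshold is an assumed boundary convention.
    """
    return 1 if total_input(inputs, weights) >= threshold else 0


def linear(inputs, weights, bias):
    """Linear neuron: the output is the biased weighted sum itself."""
    return total_input(inputs, weights, bias)


def linear_threshold(inputs, weights, bias):
    """Linear threshold neuron: pass z through when positive, else output 0."""
    z = total_input(inputs, weights, bias)
    return z if z > 0 else 0.0


# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 0.0, 1.0]
w = [0.5, -1.0, 0.25]

print(binary_threshold(x, w, threshold=0.7))  # weighted sum is 0.75
print(linear(x, w, bias=-1.0))
print(linear_threshold(x, w, bias=-1.0))
```

With a bias of -1.0 the total input is negative, so the linear neuron reports it unchanged while the linear threshold neuron clips it to zero.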

Sigmoid neurons

- These give a real-valued output that is a smooth and bounded function of their total input.
- Typically they use the logistic function
- They have nice derivatives which make learning easy (see lecture 4).
- Local basis functions (radial) are also used

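[Figure: logistic curve of the output against the total input, rising from 0 through 0.5 towards 1]

Non-linear neurons with smooth derivatives

- For backpropagation, we need neurons that have well-behaved derivatives.
- Typically they use the logistic function
- The output is a smooth function of the inputs and the weights.

[Figure: logistic curve, as on the previous slide]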
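The "nice derivative" of the logistic function $y = 1/(1 + e^{-z})$ is $dy/dz = y(1 - y)$, so it can be computed from the output alone. The sketch below, with helper names that are assumptions rather than anything from the slides, checks this identity against a central finite difference.

```python
import math


def logistic(z):
    """Logistic function: smooth, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_derivative(z):
    """Analytic derivative dy/dz = y * (1 - y)."""
    y = logistic(z)
    return y * (1.0 - y)


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    analytic = logistic_derivative(z)
    numeric = numerical_derivative(logistic, z)
    print(f"z={z:+.1f}  y={logistic(z):.4f}  dy/dz={analytic:.6f}  numeric={numeric:.6f}")
```

The derivative peaks at z = 0, where y = 0.5, and shrinks towards zero in both tails, which is why saturated sigmoid units learn slowly.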

Types of connectivity

- Feedforward networks
- These compute a series of transformations
- Typically, the first layer is the input and the last layer is the output.
- Recurrent networks
- These have directed cycles in their connection graph. They can have complicated dynamics.
- More biologically realistic.

[Figure: layered network showing input units, hidden units, and output units]

Types of learning task

- Supervised learning
- Learn to predict output when given input vector
- Who provides the correct answer?
- Reinforcement learning
- Learn action to maximize payoff
- Not much information in a payoff signal
- Payoff is often delayed
- Unsupervised learning
- Create an internal representation of the input e.g. form clusters; extract features
- How do we know if a representation is good?

The Neuron

- The neuron is the basic information processing unit of a NN. It consists of:
- A set of links, describing the neuron inputs, with weights w1, w2, …, wm
- An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers): $u = \sum_{j=1}^{m} w_j x_j$
- An activation function (squashing function) for limiting the amplitude of the neuron output.

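Bias as extra input

[Figure: neuron diagram. The input signals x1, x2, …, xm and a fixed extra input x0 = +1 are multiplied by the synaptic weights w0, w1, w2, …, wm; the summing function adds them into the local field v, and the activation function maps v to the output y.]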

Neuron Models

- The choice of activation function φ determines the neuron model
- Step function
- Ramp function
- Sigmoid function
- Gaussian function (Radial Basis Functions)
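The formulas for these four activation functions did not survive in the transcript, so the sketch below uses common textbook forms as assumptions: a unit step at zero, a ramp clipped to [0, 1], the logistic sigmoid, and an unnormalized Gaussian exp(-v²). It also treats the bias as an extra input x0 = +1 with weight w0; the function names are illustrative.

```python
import math


def step(v):
    """Step function: 1 for a non-negative local field, else 0 (assumed form)."""
    return 1.0 if v >= 0 else 0.0


def ramp(v):
    """Ramp function: linear between 0 and 1, clipped outside (assumed form)."""
    return max(0.0, min(1.0, v))


def sigmoid(v):
    """Logistic sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-v))


def gaussian(v):
    """Unnormalized Gaussian bump centred at 0 (assumed form)."""
    return math.exp(-v * v)


def neuron(x, w, phi):
    """Compute y = phi(v), with v = sum_j w_j * x_j and x_0 = +1.

    `w` has one more entry than `x`: w[0] is the bias weight.
    """
    extended = [1.0] + list(x)
    v = sum(wj * xj for wj, xj in zip(w, extended))
    return phi(v)


# Hypothetical inputs and weights: local field v = -1 + 0.5*2 + 0.5*1 = 0.5
x = [2.0, 1.0]
w = [-1.0, 0.5, 0.5]
for phi in (step, ramp, sigmoid, gaussian):
    print(phi.__name__, neuron(x, w, phi))
```

The same local field gives four different outputs, which is the sense in which the choice of φ determines the neuron model.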

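Perceptron: Single Neuron Model

- The (McCulloch-Pitts) perceptron is a single layer NN with a non-linear activation function φ, the sign function

[Figure: perceptron diagram. Inputs x1, x2, …, xn with weights w1, w2, …, wn are summed into v, and the output is y = φ(v).]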

Perceptron’s geometric view

- The equation below describes a (hyper-)plane in the input space consisting of real valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class.

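$$\sum_{i=1}^{m} w_i x_i + w_0 = 0$$

[Figure: the (x1, x2) plane for m = 2. The decision boundary w1x1 + w2x2 + w0 = 0 separates region C1, where w1x1 + w2x2 + w0 >= 0, from region C2.]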
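The geometric view can be checked numerically: a point is assigned to C1 when it lies on or above the hyperplane and to C2 otherwise. This is a small sketch with hypothetical weights (the boundary x1 + x2 - 1 = 0); the function names are assumptions.

```python
def sign(v):
    """Sign function; v = 0 is mapped to +1 so the boundary belongs to C1."""
    return 1 if v >= 0 else -1


def perceptron(x, w, w0):
    """Return +1 (class C1) or -1 (class C2) for input vector x."""
    v = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sign(v)


# Hypothetical weights: decision boundary x1 + x2 - 1 = 0
w, w0 = [1.0, 1.0], -1.0

for point in [(2.0, 2.0), (0.0, 0.0), (0.5, 0.5)]:
    label = "C1" if perceptron(point, w, w0) == 1 else "C2"
    print(point, "->", label)
```

The point (0.5, 0.5) lies exactly on the boundary and falls into C1 under the `>=` convention shown in the figure.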

Learning with hidden units

- Networks without hidden units are very limited in the input-output mappings they can model.
- More layers of linear units are still linear.
- We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
- We need an efficient way of adapting all the weights. This is hard.

Learning by perturbing weights

- Randomly perturb one weight and see if it improves performance. If so, save the change.
- Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight.
- Towards the end of learning, large weight perturbations will nearly always make things worse.
- We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
- Not any better because we need lots of trials to “see” the effect of changing one weight through the noise created by all the others.

The idea behind backpropagation

- We don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
- Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
- Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
- We can compute error derivatives for all the hidden units efficiently.
- Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.

Sketch of backpropagation (δ-rule)

let’s derive it ....
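The derivation itself is not in the transcript, so what follows is only one possible sketch of it under stated assumptions: a two-layer network of logistic units without biases, squared error E = ½ Σ (y_k − t_k)², and hypothetical weights and data. The backpropagated gradients are compared with central finite differences, which need two extra forward passes per weight.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """W1[j][i]: input i -> hidden j.  W2[k][j]: hidden j -> output k."""
    h = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [logistic(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, y


def error(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, W1, W2):
    h, y = forward(x, W1, W2)
    # Error derivative w.r.t. each output unit's total input.
    delta_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, t)]
    grad_W2 = [[dk * hj for hj in h] for dk in delta_out]
    # Each hidden activity affects every output: combine the separate effects.
    dE_dh = [sum(delta_out[k] * W2[k][j] for k in range(len(W2)))
             for j in range(len(h))]
    delta_hid = [d * hj * (1.0 - hj) for d, hj in zip(dE_dh, h)]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hid]
    return grad_W1, grad_W2


def numerical_grad(x, t, W1, W2, W, eps=1e-6):
    """Central differences for every entry of W (W is W1 or W2)."""
    grads = []
    for row in W:
        g_row = []
        for i in range(len(row)):
            saved = row[i]
            row[i] = saved + eps
            e_plus = error(x, t, W1, W2)
            row[i] = saved - eps
            e_minus = error(x, t, W1, W2)
            row[i] = saved  # restore the weight
            g_row.append((e_plus - e_minus) / (2.0 * eps))
        grads.append(g_row)
    return grads


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical training case and weights.
x, t = [0.5, -1.0], [1.0, 0.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.2, -0.5], [-0.3, 0.6]]

g1, g2 = backprop(x, t, W1, W2)
max_diff = max(max_abs_diff(g1, numerical_grad(x, t, W1, W2, W1)),
               max_abs_diff(g2, numerical_grad(x, t, W1, W2, W2)))
print("largest gap between backprop and finite differences:", max_diff)
```

Backpropagation gets every derivative from one forward and one backward pass, whereas the finite-difference check repeats the forward pass for each weight, which is the inefficiency of learning by perturbation.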

Ways to use weight derivatives

- How often to update
- after each training case?
- after a full sweep through the training data?
- How much to update
- use a fixed learning rate?
- adapt the learning rate?
- don’t use steepest descent?

Overfitting

- The training data contains information about the regularities in the mapping from input to output. But it also contains noise
- The target values may be unreliable.
- There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
- When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
- So it fits both kinds of regularity.
- If the model is very flexible it can model the sampling error really well. This is a disaster.

Simple overfitting example

- Which model do you believe?
- The complicated model fits the data better.
- But it is not realistic!
- A model is convincing when it fits a lot of data surprisingly well.
- It is not surprising that a complicated model can fit a small amount of data.
- Occam’s Razor

Key characteristics

- NNs are versatile and “general” models
- Require little, if any, insight
- Usually impossible to interpret
- Is this yet another multivariate parameter estimation approach?
- Well ... it depends on how they are used.
- The basic concept behind NN modeling is to identify complex emergent behavior by combining simple elements
- NNs should not be viewed as exercises in optimization
- People either love or hate NNs!

Some thoughts ...

- How do we interpret (artificial) NNs?
- Nature shows 3 key characteristics
- highly robust (recover memory w/ partial knowledge ... see next page)
- highly adaptable (connections created and/or bypassed)
- complexity emerging from simplicity
- These could be the results of MASSIVE PARALLELISM
- $10^{12}$ neurons
- $10^{14}$ synapses
- Can we really build models like that?

Applications

- Too many ... Anytime you look for an I/O relation and you lack fundamental understanding and first principles (or even “gray”) models
- optimization: Hopfield Networks
- classification: FFNN/BP
- dimensionality reduction: Autoassociative NNs
- visualization: SOM
- modeling: Recurrent NNs
- cognitive sciences
- a very legitimate domain ...

Dimensionality reduction

- PCA, nonlinear PCA ...

Memories as Attractors

- Attractor Network [Hopfield, 1982]
- store memories as dynamical attractors

- Recover memory using partial information

Basic preliminaries of Simulated Annealing and Genetic Algorithms

Optimization, dynamic systems and iterative maps

let’s talk about that ...

Why stochastic algorithms

- Like poker ... 15 min to learn, a lifetime to master ...
- Deceptively easy to grasp and implement ... which means that the implementations can become tricky..
- It’s straightforward to incorporate domain specific knowledge
- Will always be producing something
- insensitive to minor details such as differentiability, scaling, bad modeling, etc.
- They have physical analogues making them attractive to physical scientists

Why NOT stochastic algorithms

- Convergence properties are ONLY asymptotic
- Incorporating constraints is HIGHLY non-trivial
- corollary: maintaining feasibility is HIGHLY non-trivial unless appropriate heuristics are used ... But then again is this stochastic or biased?
- Can become EXTREMELY expensive (computationally) since functions are evaluated without any specific goal in mind (use with caution ...)
- Rule of thumb
- if the problem has a special structure use specialized algorithm
- if reasonable algorithms exist, use them
- if nothing is known then ... improvise

Simulated Annealing

Let’s talk about annealing

The algorithm

    for m = 1 to M {
        generate random move
        evaluate ΔE
        if (ΔE < 0) {        /* downhill; accept */
            accept move; update configuration
        }
        else {               /* uphill; accept (?) */
            accept with P = exp(-ΔE/T)
            update configuration if accepted
        }
    }
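The loop above covers a single temperature. A runnable version needs a cooling schedule and a move set, neither of which the slide specifies, so the sketch below assumes geometric cooling (T ← αT), a toy integer problem with energy (x − 3)², and moves of ±1. All names and parameter values are illustrative.

```python
import math
import random


def simulated_annealing(energy, neighbor, x0, t0=10.0, alpha=0.9,
                        n_temps=50, chain_len=100, rng=None):
    """Minimize `energy`, returning the best configuration seen and its energy."""
    rng = rng or random.Random()
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(n_temps):
        for _ in range(chain_len):  # Markov chain of length M at fixed T
            candidate = neighbor(x, rng)
            delta_e = energy(candidate) - e
            # Downhill moves are always accepted; uphill with P = exp(-dE/T).
            if delta_e < 0 or rng.random() < math.exp(-delta_e / t):
                x, e = candidate, e + delta_e
                if e < best_e:
                    best_x, best_e = x, e
        t *= alpha  # assumed geometric cooling schedule
    return best_x, best_e


def energy(x):
    """Toy objective with its minimum at x = 3 (an assumption for the demo)."""
    return (x - 3) ** 2


def neighbor(x, rng):
    """Move set: step one unit left or right."""
    return x + rng.choice((-1, 1))


best_x, best_e = simulated_annealing(energy, neighbor, x0=50,
                                     rng=random.Random(0))
print("best configuration:", best_x, "energy:", best_e)
```

Tracking the best configuration separately from the current one is a common practical addition, since the chain may wander uphill again after visiting the minimum.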

The main issues

- The move sets
- how to create new configurations
- random and/or heuristics
- The cooling schedule
- the length of the Markov Chain (M)
- cooling schedule
- Convergence
- asymptotic and probabilistic

Some thoughts ...

- Direct methods have shown advantages when
- the combinatorial complexity is overwhelming
- the model is implicit, noisy and/or non-differentiable
- Extensions to continuous problems are not trivial
- SA is a framework rather than a specific algorithm
- multiple variants

Genetic Algorithms

Let’s talk about the survival of the fittest

Genetic Algorithms (GA) OVERVIEW

- A class of probabilistic optimization algorithms
- Inspired by the biological evolution process
- Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859)
- Originally developed by John Holland (1975)
- Particularly well suited for hard problems where little is known about the underlying search space
- Widely used in business, science and engineering

What is a GA

- A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators

Stochastic operators

- Selection replicates the most successful solutions found in a population at a rate proportional to their relative quality
- Recombination decomposes two distinct solutions and then randomly mixes their parts to form novel solutions
- Mutation randomly perturbs a candidate solution

The Metaphor

| Genetic algorithm | Nature |
| --- | --- |
| Optimization problem | Environment |
| Feasible solutions | Individuals living in that environment |
| Solutions quality (fitness function) | Individual’s degree of adaptation to its surrounding environment |
| A set of feasible solutions | A population of organisms (species) |
| Stochastic operators | Selection, recombination and mutation in nature’s evolutionary process |
| Iteratively applying a set of stochastic operators on a set of feasible solutions | Evolution of populations to suit their environment |

Simple Genetic Algorithm

    produce an initial population of individuals
    evaluate the fitness of all individuals
    while termination conditions not met do
        select fitter individuals for reproduction
        recombine between individuals
        mutate individuals
        evaluate the fitness of the modified individuals
        generate a new population
    end while
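The pseudocode above can be made concrete in several ways. The sketch below is one of them, under assumptions that the slides do not fix: bit-string individuals, fitness equal to the number of 1 bits, roulette-wheel selection, one-point crossover, bit-flip mutation, full generational replacement, and a fixed number of generations as the termination condition.

```python
import random


def fitness(bits):
    """Assumed fitness: the number of 1 bits in the string."""
    return sum(bits)


def roulette_select(population, rng):
    """Fitness proportionate selection of a parent pool of the same size."""
    fits = [fitness(ind) for ind in population]
    if sum(fits) == 0:  # degenerate case: every individual is all zeros
        return [ind[:] for ind in population]
    return rng.choices(population, weights=fits, k=len(population))


def crossover(a, b, p_cross, rng):
    """One-point crossover, applied with probability p_cross."""
    if rng.random() < p_cross:
        point = rng.randint(1, len(a) - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]


def mutate(bits, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - b if rng.random() < p_mut else b for b in bits]


def simple_ga(n_bits=10, pop_size=6, p_cross=0.6, p_mut=0.1,
              generations=50, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # assumed termination: fixed generation count
        parents = roulette_select(population, rng)
        offspring = []
        for i in range(0, pop_size - 1, 2):
            c1, c2 = crossover(parents[i], parents[i + 1], p_cross, rng)
            offspring += [c1, c2]
        if pop_size % 2:  # odd population: carry the last parent over
            offspring.append(parents[-1][:])
        population = [mutate(ind, p_mut, rng) for ind in offspring]
        best = max(population + [best], key=fitness)
    return best


best = simple_ga()
print("best individual:", "".join(map(str, best)), "fitness:", fitness(best))
```

The defaults (6 individuals of 10 bits, crossover probability 0.6, mutation probability 0.1) mirror the worked example in this section; keeping the best individual seen so far is an addition for reporting and does not feed back into the population.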

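The Evolutionary Cycle

[Figure: the evolutionary cycle. The population is initiated and evaluated; selection picks parents; modification produces modified offspring; evaluation yields evaluated offspring, which join the population; deleted members are discarded.]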

Example (initialization)

- We toss a fair coin 60 times and get the following initial population:

s1 = 1111010101 f (s1) = 7

s2 = 0111000101 f (s2) = 5

s3 = 1110110101 f (s3) = 7

s4 = 0100010011 f (s4) = 4

s5 = 1110111101 f (s5) = 8

s6 = 0100110000 f (s6) = 3

Example (selection1)

- Next we apply fitness proportionate selection with the roulette wheel method: individual i has probability $f(i) / \sum_j f(j)$ to be chosen
- We repeat the extraction as many times as the number of individuals we need to have the same parent population size (6 in our case)

[Figure: roulette wheel with sectors 1, 2, 3, 4, …, n; the area of each sector is proportional to the fitness value]
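For the initial population of this example, the selection probabilities can be computed directly. The listed fitness values are consistent with f counting the 1 bits of each string, which is an inference from the numbers rather than something the transcript states; the random seed below is arbitrary.

```python
import random

population = {
    "s1": "1111010101",
    "s2": "0111000101",
    "s3": "1110110101",
    "s4": "0100010011",
    "s5": "1110111101",
    "s6": "0100110000",
}


def fitness(s):
    """Assumed fitness: number of 1 bits (matches the values in the example)."""
    return s.count("1")


fits = {name: fitness(s) for name, s in population.items()}
total = sum(fits.values())
probs = {name: f / total for name, f in fits.items()}

for name in population:
    print(f"{name}: f = {fits[name]}, p = {fits[name]}/{total} = {probs[name]:.3f}")

# Spin the wheel six times to obtain a parent population of the same size.
rng = random.Random(1)
names = list(population)
parents = rng.choices(names, weights=[fits[n] for n in names], k=len(names))
print("selected parents:", parents)
```

Fitter strings such as s5 can be drawn more than once and weak ones such as s6 may not be drawn at all, as in the selected population shown next.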

Example (selection2)

Suppose that, after performing selection, we get the following population:

s1` = 1111010101 (s1)

s2` = 1110110101 (s3)

s3` = 1110111101 (s5)

s4` = 0111000101 (s2)

s5` = 0100010011 (s4)

s6` = 1110111101 (s5)

Example (crossover1)

- Next we mate strings for crossover. For each couple we decide according to crossover probability (for instance 0.6) whether to actually perform crossover or not
- Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`). For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second
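One-point crossover swaps the tails of two strings after the chosen point. The short sketch below applies it to the two couples of this example, with crossover points 2 and 5; the function name is an assumption.

```python
def one_point_crossover(a, b, point):
    """Exchange everything after position `point` between strings a and b."""
    return a[:point] + b[point:], b[:point] + a[point:]


# Couple (s1`, s2`) with crossover point 2
child1, child2 = one_point_crossover("1111010101", "1110110101", 2)
print(child1, child2)

# Couple (s5`, s6`) with crossover point 5
child5, child6 = one_point_crossover("0100010011", "1110111101", 5)
print(child5, child6)
```

The strings of the other couple, for which no crossover is performed, are copied unchanged.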

Example (crossover2)

Before crossover:

s1` = 1111010101

s2` = 1110110101

s5` = 0100010011

s6` = 1110111101

After crossover:

s1`` = 1110110101

s2`` = 1111010101

s5`` = 0100011101

s6`` = 1110110011

Example (mutation1)

The final step is to apply random mutation: for each bit that we are to copy to the new population we allow a small probability of error (for instance 0.1)

Before applying mutation:

s1`` = 1110110101

s2`` = 1111010101

s3`` = 1110111101

s4`` = 0111000101

s5`` = 0100011101

s6`` = 1110110011

Example (mutation2)

After applying mutation:

s1``` = 1110100101 f (s1``` ) = 6

s2``` = 1111110100 f (s2``` ) = 7

s3``` = 1110101111 f (s3``` ) = 8

s4``` = 0111000101 f (s4``` ) = 5

s5``` = 0100011101 f (s5``` ) = 5

s6``` = 1110110001 f (s6``` ) = 6

Components of a GA

- A problem definition as input, and
- Encoding principles (gene, chromosome)
- Initialization procedure (creation)
- Selection of parents (reproduction)
- Genetic operators (mutation, recombination)
- Evaluation function (environment)
- Termination condition

The Traveling Salesman Problem (TSP)

- The traveling salesman must visit every city in his territory exactly once and then return to the starting point; given the cost of travel between all cities, how should he plan his itinerary for minimum total cost of the entire tour?
- TSP is NP-hard (its decision version is NP-complete)

TSP (Representation, Evaluation, Initialization and Selection)

- A vector v = (i1 i2… in) represents a tour (v is a permutation of {1,2,…,n})
- Fitness f of a solution is the inverse cost of the corresponding tour
- Initialization: use either some heuristics, or a random sample of permutations of {1,2,…,n}
- We shall use the fitness proportionate selection

TSP Heuristic (Inversion)

- The sub-string between two randomly selected points in the path is reversed
- Example:
- (1 2 3 4 5 6 7 8 9) → (1 2 7 6 5 4 3 8 9)
- Such simple inversion guarantees that the resulting offspring is a legal tour
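The inversion operator is a few lines of code. This sketch, with assumed function names and 0-based cut points, reproduces the example above and checks that a random inversion still visits every city exactly once.

```python
import random


def invert(tour, i, j):
    """Reverse the sub-tour between cut points i and j (0-based, j exclusive)."""
    return tour[:i] + tour[i:j][::-1] + tour[j:]


def random_inversion(tour, rng):
    """Apply inversion between two randomly selected cut points."""
    i, j = sorted(rng.sample(range(len(tour) + 1), 2))
    return invert(tour, i, j)


tour = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(invert(tour, 2, 7))  # the example from the slide

rng = random.Random(0)
mutated = random_inversion(tour, rng)
print(mutated, "legal tour:", sorted(mutated) == sorted(tour))
```

Because the operator only reorders cities, the result is always a permutation, so no repair step is needed.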

Notation (schema)

- {0,1,#} is the symbol alphabet, where # is a special wild card symbol
- A schema is a template consisting of a string composed of these three symbols
- Example: the schema [01#1#] matches the strings: [01010], [01011], [01110] and [01111]

Notation (order)

- The order of the schema S (denoted by o(S)) is the number of fixed positions (0 or 1) present in the schema
- Example
- for S1 = [01#1#], o(S1) = 3
- for S2 = [##1#1010], o(S2) = 5
- The order of a schema is useful to calculate survival probability of the schema for mutations
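Schema matching and order are easy to compute. The sketch below uses assumed function names and also shows why the order matters for mutation: under independent bit-flip mutation with probability p_m per bit, a schema survives only if none of its o(S) fixed positions is flipped, so its survival probability is (1 − p_m)^o(S).

```python
from itertools import product


def matches(schema, string):
    """True if `string` agrees with `schema` at every fixed position."""
    return len(schema) == len(string) and all(
        s == "#" or s == c for s, c in zip(schema, string))


def order(schema):
    """Number of fixed (non-wildcard) positions, o(S)."""
    return sum(1 for s in schema if s != "#")


def expand(schema):
    """List every bit string matched by the schema."""
    slots = [("0", "1") if s == "#" else (s,) for s in schema]
    return ["".join(bits) for bits in product(*slots)]


def mutation_survival(schema, p_m):
    """Probability that bit-flip mutation leaves all fixed positions intact."""
    return (1.0 - p_m) ** order(schema)


print(expand("01#1#"))
print(order("01#1#"), order("##1#1010"))
print(mutation_survival("01#1#", 0.1))
```

A schema with k wildcards matches 2^k strings, so low-order schemata cover large parts of the search space and are also the least likely to be disrupted by mutation.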

Schema Theorem

- Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm
- Result: GAs explore the search space by short, low-order schemata which, subsequently, are used for information exchange during crossover

Building Block Hypothesis

- A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks
- The building block hypothesis has been found to apply in many cases but it depends on the representation and genetic operators used

Some thoughts ...

- GAs are simple to implement but share similar advantages and disadvantages with other stochastic optimization methods
- An additional critical point is the space representation
- operators are defined on strings and therefore an appropriate mapping of the search space needs to be defined

What should you do?

- Use with caution and when appropriate
- phenomenal opportunities for modeling complex systems and studying emergence and nonlinear phenomena
- Rule of thumb
- if the problem has a special structure use specialized algorithm
- if reasonable algorithms exist, use them
- if nothing is known then ... Improvise
- A lot of room for improvement based on systems approaches as taught in this course
