Elementary concepts of neural networks


Preliminaries of artificial neural network computation


Learning

Behavioral improvement through increased information about the environment.


An experiment in learning

  • Pigeons as art experts (Watanabe et al. 1995)

  • Experiment:

    • Pigeon in Skinner box

    • Present paintings of two different artists (e.g. Chagall / Van Gogh)

    • Reward when presented a particular artist (e.g. Van Gogh)


Pigeons as art experts

  • Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on)

  • Discrimination still 85% successful for previously unseen paintings of the artists

  • Pigeons do not simply memorise the pictures

  • They can extract and recognise patterns (the ‘style’)

  • They generalise from the already seen to make predictions

  • This is what neural networks (biological and artificial) are good at, unlike conventional computers


What are Neural Networks?

  • Models of the brain and nervous system

  • Highly parallel

  • Learning

  • Very simple principles

  • Very complex behaviours

  • Applications

    • as biological models

    • as powerful problem solvers


Goals of neural computation

  • To understand how the brain actually works

  • To understand a new style of computation inspired by neurons and their adaptive connections

    • Very different style from sequential computation

      • should be good at the things the brain is good at

      • should be bad at the things the brain is bad at

  • To solve practical problems by using novel learning algorithms

    • Learning algorithms can be very useful even if they have nothing to do with how the brain works


A typical cortical neuron

Gross physical structure:

There is one axon that branches

There is a dendritic tree that collects input from other neurons

Axons typically contact dendritic trees at synapses

A spike of activity in the axon causes charge to be injected into the post-synaptic neuron

Spike generation:

There is an axon that generates outgoing spikes whenever enough charge has flowed

[Figure: a typical cortical neuron, with the axon and the dendritic tree labeled]


Brain vs. Network

[Figure: a brain neuron shown alongside its neural network counterpart]

Synapses

When a spike travels along an axon and arrives at a synapse it causes vesicles of transmitter chemical to be released

The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron thus changing their shape.

The effectiveness of the synapse can be changed

Synapses are slow, but they have advantages over RAM

Massively parallel, they adapt using locally available signals (but how?)



How the brain works

Each neuron receives inputs from other neurons

Some neurons also connect to receptors

Neurons use spikes to communicate

The timing of spikes is important

The effect of each input line on the neuron is controlled by a synaptic weight

The weights can be positive or negative

The synaptic weights adapt so that the whole network learns to perform useful computations

Recognizing objects, understanding language, making plans, controlling the body



Idealized neurons

  • To model things we have to idealize them (e.g. atoms)

    • Idealization removes complicated details that are not essential for understanding the main principles

    • Allows us to apply mathematics and to make analogies to other, familiar systems.

    • Once we understand the basic principles, it’s easy to add complexity to make the model more faithful

  • It is often worth understanding models that are known to be wrong (but we mustn’t forget that they are wrong!)

    • E.g. neurons that communicate real values rather than discrete spikes of activity.


Binary threshold neurons

  • McCulloch-Pitts (1943): influenced Von Neumann!

    • First compute a weighted sum of the inputs from other neurons

    • Then send out a fixed size spike of activity if the weighted sum exceeds a threshold.

    • Maybe each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

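$$z = \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

where z is the weighted sum of the inputs, y is the output, and θ is the threshold.

[Figure: the output y plotted against the total input z, stepping from 0 to 1 at the threshold]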
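The two steps of a binary threshold neuron (weighted sum, then a fixed-size spike) can be sketched in a few lines of Python. This is only an illustration: the function names, the unit weights, and the choice to fire when the sum reaches the threshold (rather than strictly exceeds it) are assumptions, not part of the slides.

```python
from typing import Sequence


def binary_threshold_neuron(inputs: Sequence[float],
                            weights: Sequence[float],
                            threshold: float) -> int:
    """Return 1 (a spike) if the weighted sum reaches the threshold, else 0."""
    z = sum(x * w for x, w in zip(inputs, weights))  # weighted sum of the inputs
    return 1 if z >= threshold else 0


# Spikes as truth values: with unit weights (an illustrative choice),
# the threshold alone decides which proposition the neuron computes.
def logical_and(a: int, b: int) -> int:
    return binary_threshold_neuron([a, b], [1.0, 1.0], threshold=2.0)


def logical_or(a: int, b: int) -> int:
    return binary_threshold_neuron([a, b], [1.0, 1.0], threshold=1.0)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Changing only the threshold turns the same neuron from AND into OR, which is one way to read the remark that each neuron combines truth values into the truth value of another proposition.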


Linear neurons

  • These are simple but computationally limited

    • If we can make them learn we may get insight into more complicated neurons

$$y = b + \sum_i x_i w_i$$

where y is the output, b is the bias, x_i is the i-th input, w_i is the weight on the i-th input, and i is the index over input connections.


Linear threshold neurons

These have a confusing name.

They compute a linear weighted sum of their inputs

The output is a non-linear function of the total input

[Figure: the output y plotted against the total input z; y is 0 below the threshold]


Sigmoid neurons

  • These give a real-valued output that is a smooth and bounded function of their total input.

    • Typically they use the logistic function

    • They have nice derivatives which make learning easy (see lecture 4).

  • Local basis functions (radial) are also used

[Figure: the logistic function, rising smoothly from 0 to 1 and passing through 0.5 at zero total input]


Non-linear neurons with smooth derivatives

For backpropagation, we need neurons that have well-behaved derivatives.

Typically they use the logistic function

The output is a smooth function of the inputs and the weights.

[Figure: the logistic function, rising smoothly from 0 to 1 and passing through 0.5 at zero total input]
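To make "well-behaved derivatives" concrete, here is a small sketch of the logistic function and its derivative, which can be written in terms of the output itself as y(1 - y). The function names and the finite-difference check are illustrative assumptions, not part of the slides.

```python
import math


def logistic(z: float) -> float:
    """Smooth, bounded output in (0, 1) for any real total input z."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_derivative(z: float) -> float:
    """dy/dz expressed through the output itself: y * (1 - y)."""
    y = logistic(z)
    return y * (1.0 - y)


def numerical_derivative(f, z: float, eps: float = 1e-6) -> float:
    """Central-difference estimate, used only to sanity-check the formula."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


if __name__ == "__main__":
    for z in (-2.0, 0.0, 1.3):
        print(z, logistic(z), logistic_derivative(z),
              numerical_derivative(logistic, z))
```

Because the derivative only needs the already-computed output, a learning procedure can reuse the forward pass when it computes gradients.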


Types of connectivity

Feedforward networks

These compute a series of transformations

Typically, the first layer is the input and the last layer is the output.

Recurrent networks

These have directed cycles in their connection graph. They can have complicated dynamics.

More biologically realistic.

[Figure: a feedforward network with layers labeled input units, hidden units, and output units]


Types of learning task

  • Supervised learning

    • Learn to predict output when given input vector

      • Who provides the correct answer?

  • Reinforcement learning

    • Learn action to maximize payoff

      • Not much information in a payoff signal

      • Payoff is often delayed

  • Unsupervised learning

    • Create an internal representation of the input e.g. form clusters; extract features

      • How do we know if a representation is good?


The mathematics of neural modeling


Single Layer Feed-forward

[Figure: an input layer of source nodes connected to an output layer of neurons]


Multi-layer Feed-forward

[Figure: a 3-4-2 network with an input layer, a hidden layer, and an output layer]


Recurrent networks

[Figure: a recurrent network with a hidden neuron system; the input, hidden, and output units are connected through feedback elements labeled $z^{-1}$]


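The Neuron

[Figure: model of a neuron. The input values x1, x2, …, xm are multiplied by the weights w1, w2, …, wm and combined with the bias b by the summing function to give the local field v, which the activation function maps to the output y]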


The Neuron

  • The neuron is the basic information processing unit of an NN. It consists of:

    • A set of links, describing the neuron inputs, with weights w1, w2, …, wm

    • An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers): $\sum_{j=1}^{m} w_j x_j$

    • An activation function (squashing function) for limiting the amplitude of the neuron output.


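Bias as extra input

[Figure: the same neuron model with the bias treated as an extra input. A fixed input signal x0 = +1 with synaptic weight w0 feeds the summing function together with x1, x2, …, xm (weights w1, w2, …, wm), giving the local field v and, through the activation function, the output y]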


Neuron Models

  • The choice of the activation function φ determines the neuron model

  • Step function:

  • Ramp function:

  • Sigmoid function:

  • Gaussian function (Radial Basis Functions)
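The four activation functions listed above can be sketched as follows. The slide's own formulas did not survive in this transcript, so the parameter names (`threshold`, `low`, `high`, `slope`, `center`, `width`) and the normalisation to the range [0, 1] are assumptions chosen for illustration.

```python
import math


def step(v: float, threshold: float = 0.0) -> float:
    """Step function: 0 below the threshold, 1 at or above it."""
    return 1.0 if v >= threshold else 0.0


def ramp(v: float, low: float = 0.0, high: float = 1.0) -> float:
    """Ramp function: 0 below `low`, 1 above `high`, linear in between."""
    if v <= low:
        return 0.0
    if v >= high:
        return 1.0
    return (v - low) / (high - low)


def sigmoid(v: float, slope: float = 1.0) -> float:
    """Logistic sigmoid; `slope` controls how sharp the transition is."""
    return 1.0 / (1.0 + math.exp(-slope * v))


def gaussian(v: float, center: float = 0.0, width: float = 1.0) -> float:
    """Gaussian (radial basis) response, largest when v is at `center`."""
    return math.exp(-((v - center) ** 2) / (2.0 * width ** 2))


def neuron_output(inputs, weights, bias, activation):
    """Local field v = sum_j w_j x_j + b, passed through the activation."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(v)


if __name__ == "__main__":
    for phi in (step, ramp, sigmoid, gaussian):
        print(phi.__name__, neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, phi))
```

Swapping the `activation` argument is all it takes to move between neuron models; the summing part of the neuron stays the same.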


Perceptron: Single Neuron Model

  • The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function φ, the sign function

[Figure: a single neuron with inputs x1, x2, …, xn, weights w1, w2, …, wn, and bias b, producing the local field v and the output y = φ(v)]


Perceptron’s geometric view

  • The equation below describes a (hyper-)plane in the input space consisting of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class.

$$\sum_{i=1}^{m} w_i x_i + w_0 = 0$$

[Figure: for two inputs, the decision boundary $w_1 x_1 + w_2 x_2 + w_0 = 0$ in the $(x_1, x_2)$ plane separates class C1, the region where $w_1 x_1 + w_2 x_2 + w_0 \ge 0$, from class C2]
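As a rough illustration of the geometric view, the sketch below assigns a point to C1 or C2 by the sign of the local field. The weights and the sample points are made up for the example, and the convention that points on the boundary go to C1 follows the ">= 0" region in the figure.

```python
def local_field(x, w, w0):
    """v = w1*x1 + ... + wm*xm + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def sign(v):
    """Sign activation: +1 on or above the boundary, -1 below it."""
    return 1 if v >= 0 else -1


def classify(x, w, w0):
    """Name the side of the hyperplane that the point x falls on."""
    return "C1" if sign(local_field(x, w, w0)) == 1 else "C2"


# Hypothetical weights: the decision boundary is the line x1 + x2 - 1 = 0.
W = (1.0, 1.0)
W0 = -1.0

if __name__ == "__main__":
    for point in [(1.0, 1.0), (0.0, 0.0), (0.5, 0.5)]:
        print(point, classify(point, W, W0))
```

Only the side of the line matters, not the distance from it, which is why a single perceptron can separate two classes only when a (hyper-)plane can.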


Learning with hidden units

  • Networks without hidden units are very limited in the input-output mappings they can model.

    • More layers of linear units are still linear.

  • We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?

    • We need an efficient way of adapting all the weights. This is hard.
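The claim that stacking linear layers gains nothing can be checked numerically: two linear layers applied in sequence give the same output as a single layer whose weight matrix is their product. The matrices below are arbitrary illustrative values, and the helper names are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Arbitrary example weights: 2 inputs -> 3 linear hidden units -> 2 outputs.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
x = [2.0, -1.0]

two_layer_output = matvec(W2, matvec(W1, x))  # hidden layer, then output layer
W_combined = matmul(W2, W1)                   # one equivalent 2x2 weight matrix
one_layer_output = matvec(W_combined, x)

if __name__ == "__main__":
    print(two_layer_output, one_layer_output)
```

The hidden layer here adds parameters but no expressive power, which is why the hidden units have to be non-linear.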


Learning by perturbing weights

Randomly perturb one weight and see if it improves performance. If so, save the change.

Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight.

Towards the end of learning, large weight perturbations will nearly always make things worse.

We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.

Not any better because we need lots of trials to “see” the effect of changing one weight through the noise created by all the others.
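A minimal sketch of learning by perturbing one weight at a time, assuming a single linear neuron and a squared-error measure (both chosen here for illustration; the data, perturbation scale, and number of trials are made up):

```python
import random


def squared_error(weights, data):
    """Total squared error of a linear neuron over all training cases."""
    return sum((target - sum(w * x for w, x in zip(weights, inputs))) ** 2
               for inputs, target in data)


def perturb_one_weight(weights, data, rng, scale=0.1):
    """Randomly change one weight; keep the change only if the error drops."""
    i = rng.randrange(len(weights))
    candidate = list(weights)
    candidate[i] += rng.gauss(0.0, scale)
    # Two full passes over the training data just to judge a single change.
    if squared_error(candidate, data) < squared_error(weights, data):
        return candidate
    return weights


rng = random.Random(0)
# Targets generated by the (hypothetical) true weights (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
weights = [0.0, 0.0]
initial_error = squared_error(weights, data)
for _ in range(2000):
    weights = perturb_one_weight(weights, data, rng)
final_error = squared_error(weights, data)

if __name__ == "__main__":
    print(initial_error, final_error, weights)
```

Even in this two-weight toy problem every trial costs a full sweep through the data, which hints at how badly the approach scales to networks with many weights.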



The idea behind backpropagation

We don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.

Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.

Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.

We can compute error derivatives for all the hidden units efficiently.

Once we have the error derivatives for the hidden activities, it’s easy to get the error derivatives for the weights going into a hidden unit.
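The steps above can be sketched for a tiny network. The architecture (logistic hidden units, linear output units, no biases), the squared-error measure, and all names and numbers are assumptions made for illustration; the finite-difference function is there only to check the result.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_hidden, W_out):
    """Logistic hidden units followed by linear output units."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
    return hidden, outputs


def error(x, targets, W_hidden, W_out):
    _, outputs = forward(x, W_hidden, W_out)
    return 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))


def backprop(x, targets, W_hidden, W_out):
    hidden, outputs = forward(x, W_hidden, W_out)
    dE_dy = [y - t for y, t in zip(outputs, targets)]
    # Each hidden activity affects every output unit: sum the separate effects.
    dE_dh = [sum(dE_dy[k] * W_out[k][j] for k in range(len(W_out)))
             for j in range(len(hidden))]
    # Through the logistic non-linearity: dh/dz = h * (1 - h).
    dE_dz = [d * h * (1.0 - h) for d, h in zip(dE_dh, hidden)]
    # Derivatives for the weights going into each hidden unit.
    dE_dW_hidden = [[dz * xi for xi in x] for dz in dE_dz]
    dE_dW_out = [[dy * h for h in hidden] for dy in dE_dy]
    return dE_dW_hidden, dE_dW_out


def numerical_gradient(x, targets, W_hidden, W_out, i, j, eps=1e-6):
    """Central-difference estimate of dE/dW_hidden[i][j], used as a check."""
    plus = [row[:] for row in W_hidden]
    minus = [row[:] for row in W_hidden]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (error(x, targets, plus, W_out)
            - error(x, targets, minus, W_out)) / (2.0 * eps)


# Made-up example values.
x = [0.5, -1.0]
targets = [1.0, 0.0]
W_hidden = [[0.1, -0.2], [0.4, 0.3]]
W_out = [[0.5, -0.3], [0.2, 0.7]]
grad_hidden, grad_out = backprop(x, targets, W_hidden, W_out)

if __name__ == "__main__":
    print(grad_hidden)
    print(numerical_gradient(x, targets, W_hidden, W_out, 0, 1))
```

One backward pass yields the derivative for every weight at once, whereas the numerical check needs two forward passes per weight.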



Sketch of backpropagation (δ-rule)

let’s derive it ....


Ways to use weight derivatives

How often to update

after each training case?

after a full sweep through the training data?

How much to update

use a fixed learning rate?

adapt the learning rate?

don’t use steepest descent?



Overfitting

  • The training data contains information about the regularities in the mapping from input to output. But it also contains noise

    • The target values may be unreliable.

    • There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.

  • When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.

    • So it fits both kinds of regularity.

    • If the model is very flexible it can model the sampling error really well. This is a disaster.


Simple overfitting example

Which model do you believe?

The complicated model fits the data better.

But it is not realistic!

A model is convincing when it fits a lot of data surprisingly well.

It is not surprising that a complicated model can fit a small amount of data.

Ockham’s Razor
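A small numerical version of this comparison, under made-up data: five noisy samples of a roughly straight-line relationship are fitted with a least-squares line (the simple model) and with the degree-4 polynomial that passes through every point (the complicated model). The data values and function names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial that passes exactly through every data point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def sum_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up training data: roughly y = x plus a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
simple = fit_line(xs, ys)
complicated = fit_interpolating_polynomial(xs, ys)

if __name__ == "__main__":
    print("training error:", sum_squared_error(simple, xs, ys),
          sum_squared_error(complicated, xs, ys))
    print("prediction at x = 5:", simple(5.0), complicated(5.0))
```

The polynomial has zero training error, yet just outside the data, at x = 5, it predicts about 10.1 while the line predicts about 4.99. It has fitted the noise as if it were a regularity.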



Neural Network Training as a Mathematical Programming Problem


Key characteristics

  • NNs are versatile and “general” models

  • Require little, if any, insight

  • Usually impossible to interpret

    • is this yet another multivariate parameter estimation approach?

  • Well ... It depends on how they are used..

    • The basic concept behind NN modeling is to identify complex emergent behavior by combining simple elements

    • NNs should not be viewed as exercises in optimization

  • People either love or hate NNs!


Some thoughts ...

  • How do we interpret (artificial) NNs?

  • Nature shows 3 key characteristics

    • highly robust (recover memory w/ partial knowledge ... see next page)

    • highly adaptable (connections created and/or bypassed)

    • complexity emerging from simplicity

  • These could be the results of MASSIVE PARALLELISM

    • $10^{12}$ neurons

    • $10^{14}$ synapses

  • Can we really build models like that?


    Making associations in fuzzy environments



    Applications

    • Too many ... Anytime you look for an I/O relation and you lack fundamental understanding and first principles (or even “gray”) models

      • optimization: Hopfield networks

      • classification: FFNN/BP

      • dimensionality reduction: autoassociative NNs

      • visualization: SOM

      • modeling: recurrent NNs

      • cognitive sciences

        • a very legitimate domain ...


    Dimensionality reduction

    • PCA, nonlinear PCA ...


    Recurrent Networks

    • A recurrent network with 5 nodes

    [Figure: a recurrent network of five interconnected nodes, numbered 1 to 5, with signals labeled x1 to x5, z4, and z5]


    Memories as Attractors

    • Attractor Network [Hopfield, 1982]

      • store memories as dynamical attractors

    Recover memory using partial information

    [Figure: a partial pattern presented to the network]


    Stochastic Optimization

    Basic preliminaries of Simulated Annealing and Genetic Algorithms


    Optimization, dynamic systems and iterative maps

    let’s talk about that ...


    Why stochastic algorithms?

    • Like poker ... 15 minutes to learn, a lifetime to master ...

    • Deceptively easy to grasp and implement ... which means that the implementations can become tricky.

    • It’s straightforward to incorporate domain specific knowledge

    • Will always be producing something

      • insensitive to minor details such as differentiability, scaling, bad modeling, etc.

    • They have physical analogues making them attractive to physical scientists


    Why NOT stochastic algorithms

    • Convergence properties are ONLY asymptotic

    • Incorporating constraints is HIGHLY non-trivial

      • corollary: maintaining feasibility is HIGHLY non-trivial unless appropriate heuristics are used ... But then again, is this stochastic or biased?

    • Can become EXTREMELY expensive (computationally) since functions are evaluated without any specific goal in mind (use with caution ...)

    • Rule of thumb

      • if the problem has a special structure, use a specialized algorithm

      • if reasonable algorithms exist, use them

      • if nothing is known then ... improvise


    Simulated Annealing

    Let’s talk about annealing


    The algorithm

    for m = 1 to M {
        generate random move
        evaluate ΔE
        if (ΔE < 0) {  /* downhill; accept */
            accept move; update configuration
        }
        else {  /* uphill; accept (?) */
            accept with P = exp(-ΔE/T)
            update configuration if accepted
        }
    }
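The loop above translates fairly directly into Python. The sketch below is one possible reading, not a reference implementation: the geometric cooling rule (`temperature *= cooling`), the toy energy function, the move set, and all parameter values are assumptions added for illustration.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Downhill moves are always accepted; uphill moves with P = exp(-dE/T)."""
    if delta_e < 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def simulated_annealing(energy, random_move, state, temperature, cooling,
                        steps, rng):
    """Run `steps` moves and return the best configuration seen."""
    best = state
    for _ in range(steps):
        candidate = random_move(state, rng)            # generate random move
        delta_e = energy(candidate) - energy(state)    # evaluate the energy change
        if rng.random() < acceptance_probability(delta_e, temperature):
            state = candidate                          # update configuration
            if energy(state) < energy(best):
                best = state
        temperature *= cooling  # assumed geometric cooling schedule
    return best


# Toy problem (hypothetical): minimise (x - 3)^2 over the integers.
def energy(x):
    return (x - 3) ** 2


def random_move(x, rng):
    return x + rng.choice((-1, 1))


best = simulated_annealing(energy, random_move, state=20, temperature=10.0,
                           cooling=0.99, steps=2000, rng=random.Random(1))

if __name__ == "__main__":
    print(best, energy(best))
```

Early on, at high temperature, uphill moves are accepted often and the search wanders; as the temperature falls the acceptance probability for uphill moves shrinks and the search behaves more like plain descent.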


    The main issues

    • The move sets

      • how to create new configurations

        • random and/or heuristics

    • The cooling schedule

      • the length of the Markov Chain (M)

      • cooling schedule

    • Convergence

      • asymptotic and probabilistic


    Some thoughts ...

    • Direct methods have shown advantages when

      • the combinatorial complexity is overwhelming

      • the model is implicit, noisy and/or non-differentiable

    • Extensions to continuous problems are not trivial

    • SA is a framework rather than a specific algorithm

      • multiple variants


    Genetic Algorithms

    Let’s talk about the survival of the fittest


    Genetic Algorithms (GA) OVERVIEW

    • A class of probabilistic optimization algorithms

    • Inspired by the biological evolution process

    • Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859)

    • Originally developed by John Holland (1975)

    • Particularly well suited for hard problems where little is known about the underlying search space

    • Widely used in business, science and engineering


    What is a GA?

    • A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators


    Stochastic operators

    • Selection replicates the most successful solutions found in a population at a rate proportional to their relative quality

    • Recombination decomposes two distinct solutions and then randomly mixes their parts to form novel solutions

    • Mutation randomly perturbs a candidate solution


    The Metaphor

    | Genetic Algorithm | Nature |
    |---|---|
    | Optimization problem | Environment |
    | Feasible solutions | Individuals living in that environment |
    | Solutions quality (fitness function) | Individual’s degree of adaptation to its surrounding environment |


    The Metaphor (cont)

    | Genetic Algorithm | Nature |
    |---|---|
    | A set of feasible solutions | A population of organisms (species) |
    | Stochastic operators | Selection, recombination and mutation in nature’s evolutionary process |
    | Iteratively applying a set of stochastic operators on a set of feasible solutions | Evolution of populations to suit their environment |


    Simple Genetic Algorithm

    produce an initial population of individuals
    evaluate the fitness of all individuals
    while termination conditions not met do
        select fitter individuals for reproduction
        recombine between individuals
        mutate individuals
        evaluate the fitness of the modified individuals
        generate a new population
    end while
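The pseudocode above can be sketched in Python for bit-string individuals. Several details are assumptions made to get a runnable example: fitness is taken to be the number of ones in the string, the termination condition is a fixed number of generations, the population size is even so that parents can be paired, and the probabilities 0.6 and 0.1 are illustrative values.

```python
import random


def fitness(individual):
    """Assumed fitness: the number of 1 bits in the string."""
    return individual.count("1")


def select(population, rng):
    """Fitness-proportionate selection of a same-size parent population."""
    weights = [fitness(ind) for ind in population]
    if sum(weights) == 0:  # all-zero strings: fall back to uniform choice
        return [rng.choice(population) for _ in population]
    return rng.choices(population, weights=weights, k=len(population))


def recombine(a, b, rng, p_crossover=0.6):
    """One-point crossover, applied with probability p_crossover."""
    if rng.random() < p_crossover:
        point = rng.randrange(1, len(a))
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b


def mutate(individual, rng, p_mutation=0.1):
    """Flip each bit independently with probability p_mutation."""
    return "".join(("1" if bit == "0" else "0") if rng.random() < p_mutation
                   else bit
                   for bit in individual)


def evolve(population, generations, rng):
    for _ in range(generations):  # assumed termination: fixed generation count
        parents = select(population, rng)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            offspring.extend(recombine(a, b, rng))
        population = [mutate(ind, rng) for ind in offspring]
    return population


initial = ["1111010101", "0111000101", "1110110101",
           "0100010011", "1110111101", "0100110000"]
final = evolve(initial, generations=20, rng=random.Random(0))

if __name__ == "__main__":
    print(sum(map(fitness, initial)), sum(map(fitness, final)))
    print(final)
```

This version replaces the whole population each generation and keeps no elite, so the best individual can be lost between generations; many practical variants add some form of elitism.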


    The Evolutionary Cycle

    [Figure: the evolutionary cycle. The population is initiated and evaluated; selection picks the parents; modification produces modified offspring; evaluation yields evaluated offspring, which rejoin the population; deleted members are discarded]


    Example (initialization)

    • We toss a fair coin 60 times and get the following initial population:

    s1 = 1111010101    f(s1) = 7

    s2 = 0111000101    f(s2) = 5

    s3 = 1110110101    f(s3) = 7

    s4 = 0100010011    f(s4) = 4

    s5 = 1110111101    f(s5) = 8

    s6 = 0100110000    f(s6) = 3


    Example (selection1)

    Next we apply fitness proportionate selection with the roulette wheel method:

    Individual i will have a probability $f(i) / \sum_{j} f(j)$ to be chosen

    [Figure: a roulette wheel divided into sectors 1, 2, 3, 4, …, n; the area of each sector is proportional to the fitness value]

    We repeat the extraction as many times as the number of individuals we need to have the same parent population size (6 in our case)
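For the six fitness values of the initial population (7, 5, 7, 4, 8, 3, total 34), the selection probabilities and one spin of the wheel can be sketched as follows. The function names are illustrative, and `r` stands for a uniform random number in [0, 1) that would normally come from a random number generator.

```python
def selection_probabilities(fitnesses):
    """Probability of picking individual i: f(i) divided by the total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_pick(fitnesses, r):
    """Return the index whose sector contains the point r (0 <= r < 1)."""
    threshold = r * sum(fitnesses)
    cumulative = 0.0
    for i, f in enumerate(fitnesses):
        cumulative += f
        if threshold < cumulative:
            return i
    return len(fitnesses) - 1  # guard against floating-point round-off


FITNESSES = [7, 5, 7, 4, 8, 3]  # f(s1) ... f(s6) from the example

if __name__ == "__main__":
    print(selection_probabilities(FITNESSES))
    # Six spins give a parent population of the same size.
    print([roulette_pick(FITNESSES, r) for r in (0.0, 0.25, 0.4, 0.6, 0.8, 0.99)])
```

With these numbers s5 is the most likely pick (8/34) and s6 the least likely (3/34), but no individual with non-zero fitness is ever excluded outright.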


    Example (selection2)

    Suppose that, after performing selection, we get the following population:

    s1` = 1111010101 (s1)

    s2` = 1110110101 (s3)

    s3` = 1110111101 (s5)

    s4` = 0111000101 (s2)

    s5` = 0100010011 (s4)

    s6` = 1110111101 (s5)


    Example (crossover1)

    • Next we mate strings for crossover. For each couple we decide according to crossover probability (for instance 0.6) whether to actually perform crossover or not

    • Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`). For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second


    Example (crossover2)

    Before crossover:

    s1` = 1111010101    s2` = 1110110101

    s5` = 0100010011    s6` = 1110111101

    After crossover:

    s1`` = 1110110101    s2`` = 1111010101

    s5`` = 0100011101    s6`` = 1110110011
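The crossover step of this example can be reproduced with a short sketch. It assumes that a crossover point p means the first p bits stay with each string and the remaining bits are swapped, which is the reading that matches the strings shown; the function name is illustrative.

```python
def one_point_crossover(a, b, point):
    """Keep the first `point` bits of each parent and swap the tails."""
    return a[:point] + b[point:], b[:point] + a[point:]


if __name__ == "__main__":
    # Couple (s1`, s2`) with crossover point 2.
    print(one_point_crossover("1111010101", "1110110101", 2))
    # Couple (s5`, s6`) with crossover point 5.
    print(one_point_crossover("0100010011", "1110111101", 5))
```

Crossover only rearranges existing bits, so the total number of ones in a couple is the same before and after; new bit values can only enter through mutation.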


    Example (mutation1)

    The final step is to apply random mutation: for each bit that we are to copy to the new population we allow a small probability of error (for instance 0.1)

    Before applying mutation:

    s1`` = 1110110101

    s2`` = 1111010101

    s3`` = 1110111101

    s4`` = 0111000101

    s5`` = 0100011101

    s6`` = 1110110011


    Example (mutation2)

    After applying mutation:

    s1``` = 1110100101    f(s1```) = 6

    s2``` = 1111110100    f(s2```) = 7

    s3``` = 1110101111    f(s3```) = 8

    s4``` = 0111000101    f(s4```) = 5

    s5``` = 0100011101    f(s5```) = 5

    s6``` = 1110110001    f(s6```) = 6


    Components of a GA

    • A problem definition as input, and

      • Encoding principles (gene, chromosome)

      • Initialization procedure (creation)

      • Selection of parents (reproduction)

      • Genetic operators (mutation, recombination)

      • Evaluation function (environment)

      • Termination condition


    The Traveling Salesman Problem (TSP)

    • The traveling salesman must visit every city in his territory exactly once and then return to the starting point; given the cost of travel between all cities, how should he plan his itinerary for minimum total cost of the entire tour?

    • TSP ∈ NP-Complete (in its decision form; the optimization problem is NP-hard)


    TSP (Representation, Evaluation, Initialization and Selection)

    • A vector v = (i1 i2… in) represents a tour (v is a permutation of {1,2,…,n})

    • Fitness f of a solution is the inverse cost of the corresponding tour

    • Initialization: use either some heuristics, or a random sample of permutations of {1,2,…,n}

    • We shall use the fitness proportionate selection


    TSP Heuristic (Inversion)

    • The sub-string between two randomly selected points in the path is reversed

    • Example:

      • (1 2 3 4 5 6 7 8 9) → (1 2 7 6 5 4 3 8 9)

    • Such simple inversion guarantees that the resulting offspring is a legal tour
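The tour representation, the inverse-cost fitness, and the inversion operator can be sketched together. The 4-city cost matrix is made up, cities are numbered from 0 so that they can index the matrix directly, and the function names are illustrative.

```python
import random


def tour_cost(tour, cost):
    """Total cost of visiting the cities in order and returning to the start."""
    return sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def fitness(tour, cost):
    """Fitness is the inverse of the tour cost."""
    return 1.0 / tour_cost(tour, cost)


def invert(tour, i, j):
    """Reverse the sub-string between positions i and j (inclusive)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def random_inversion(tour, rng):
    """Pick two random positions and reverse the segment between them."""
    i, j = sorted(rng.sample(range(len(tour)), 2))
    return invert(tour, i, j)


# Hypothetical symmetric cost matrix for 4 cities.
COST = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]

if __name__ == "__main__":
    print(invert([1, 2, 3, 4, 5, 6, 7, 8, 9], 2, 6))
    print(tour_cost([0, 1, 3, 2], COST), fitness([0, 1, 3, 2], COST))
    print(random_inversion([0, 1, 2, 3], random.Random(0)))
```

Reversing a segment only reorders cities, so the offspring is always a permutation of the parent and therefore a legal tour, with no repair step needed.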


    Notation (schema)

    • {0,1,#} is the symbol alphabet, where # is a special wild card symbol

    • A schema is a template consisting of a string composed of these three symbols

    • Example: the schema [01#1#] matches the strings: [01010], [01011], [01110] and [01111]


    Notation (order)

    • The order of the schema S (denoted by o(S)) is the number of fixed positions (0 or 1) present in the schema

    • Example

      • for S1 = [01#1#], o(S1) = 3

      • for S2 = [##1#1010], o(S2) = 5

    • The order of a schema is useful for calculating the survival probability of the schema under mutation
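Schema matching and order are easy to check in code. The sketch below also computes a survival probability under mutation, assuming that each bit mutates independently with probability `p_m`, so a schema survives when none of its o(S) fixed positions is changed. The function names and the value 0.1 are illustrative.

```python
import itertools


def matches(schema, string):
    """True if the string agrees with the schema at every fixed position."""
    return len(schema) == len(string) and all(
        s == "#" or s == c for s, c in zip(schema, string))


def order(schema):
    """o(S): the number of fixed (non-wildcard) positions."""
    return sum(1 for s in schema if s != "#")


def expand(schema):
    """All strings matched by the schema."""
    slots = [("0", "1") if s == "#" else (s,) for s in schema]
    return ["".join(bits) for bits in itertools.product(*slots)]


def mutation_survival(schema, p_m):
    """Probability that no fixed position is mutated: (1 - p_m) ** o(S)."""
    return (1.0 - p_m) ** order(schema)


if __name__ == "__main__":
    print(expand("01#1#"))
    print(order("01#1#"), order("##1#1010"))
    print(mutation_survival("01#1#", 0.1))
```

A schema with fewer fixed positions has more ways to survive a mutation step, which is one reason low-order schemata are singled out in the results that build on this notation.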


    Schema Theorem

    • Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm

    • Result: GAs explore the search space by short, low-order schemata which, subsequently, are used for information exchange during crossover


    Building Block Hypothesis

    • A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks

    • The building block hypothesis has been found to apply in many cases but it depends on the representation and genetic operators used


    Some thoughts ...

    • GAs are simple to implement but share similar advantages and disadvantages with other stochastic optimization methods

    • An additional critical point is the space representation

      • operators are defined on strings and therefore an appropriate mapping of the search space needs to be defined


    What should you do?

    • Use with caution and when appropriate

      • phenomenal opportunities for modeling complex systems and studying emergence and nonlinear phenomena

    • Rule of thumb

      • if the problem has a special structure, use a specialized algorithm

      • if reasonable algorithms exist, use them

      • if nothing is known then ... Improvise

    • A lot of room for improvement based on systems approaches as taught in this course

