Introduction to Artificial Neural Networks

Neural networks do not perform miracles. But if used sensibly they can produce some amazing results.
Presented by: Ghayas Ur Rehman
Course Trainer: Dr. Tehseen Jilani
Department of Computer Science, University of Karachi

Back-propagation Network
• It is the most widely used architecture and a very popular technique that is relatively easy to implement. It requires a large amount of training data to condition the network before it can be used to predict outcomes.
• A back-propagation network includes at least one hidden layer.
• The approach is described as “feed-forward / back-propagation”.
• Limitations:
  • NNs do not do well at tasks that are not done well by people.
  • They lack an explanation facility.
  • Training time can be excessive.

Simple BPNN

BPNN in simple words

When not to use BPNN?
• A back-propagation neural network is only practical in certain situations. The following are some guidelines on when you should use another approach:
• Can you write down a flow chart or a formula that accurately describes the problem? If so, then stick with a traditional programming method.
• Is there a simple piece of hardware or software that already does what you want? If so, then the development time for a NN might not be worth it.
• Do you want the functionality to "evolve" in a direction that is not pre-defined?

When not to use BPNN? (Contd.)
• Do you have an easy way to generate a significant number of input/output examples of the desired behavior? If not, then you won't be able to train your NN to do anything.
• Is the problem very "discrete"? Can the correct answer be found in a look-up table of reasonable size? A look-up table is much simpler and more accurate.
• Are precise numeric output values required? NNs are not good at giving precise numeric answers.

When to use BPNN?
• Conversely, here are some situations where a BPNN might be a good idea:
• A large amount of input/output data is available, but you're not sure how to relate the input to the output.
• The problem appears to have overwhelming complexity, but there is clearly a solution.
• It is easy to create a number of examples of the correct behavior.
• The solution to the problem may change over time, within the bounds of the given input and output parameters (i.e., today 2+2=4, but in the future we may find that 2+2=3.8).
• Outputs can be "fuzzy", or non-numeric.

Back-propagation algorithm
• The most popular & successful method.
• Steps to be followed for the training:
  • Select the next training pair from the training set (the input vector and the desired output).
  • Present the input vector to the network.
  • The network calculates its output.
  • The network calculates the error between its output and the desired output.
  • The network back-propagates the error.
  • Adjust the weights of the network in a way that minimizes the error.
  • Repeat the above steps for each vector in the training set until the error over the whole training set is acceptable.

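Back-propagation algorithm
Step 1: Feed the inputs forward through the network:
$a^0 = p$
$a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1})$, where $m = 0, 1, \ldots, M-1$
$a = a^M$
Step 2: Back-propagate the sensitivities (errors):
$s^M = -2\,\dot{F}^M(n^M)\,(t - a)$ at the output layer
$s^m = \dot{F}^m(n^m)\,(W^{m+1})^T s^{m+1}$ at the hidden layers, where $m = M-1, \ldots, 2, 1$
Step 3: Finally, the weights and biases are updated by the following formulas:
$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T$
$b^m(k+1) = b^m(k) - \alpha\, s^m$
Here $n^m = W^m a^{m-1} + b^m$ is the net input of layer $m$, $\dot{F}^m(n^m)$ is the diagonal matrix of activation-function derivatives, $t$ is the target output, and $\alpha$ is the learning rate.
(Details on constructing the algorithm and other related issues can be found in the textbook Neural Network Design.)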
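
The training steps and the three-step formulation above can be sketched in plain Python. This is a minimal illustration, not the textbook's implementation: the logistic hidden layer, the linear output layer, the 1-4-1 layer sizes, the learning rate of 0.05 and the sine-shaped target function are all assumptions chosen for the example, and the helper names (forward, backward, update, make_layers) are hypothetical.

```python
import math
import random

def logsig(n):
    """Logistic sigmoid, assumed here for every hidden layer."""
    return 1.0 / (1.0 + math.exp(-n))

def forward(layers, p):
    """Step 1: return [a^0, a^1, ..., a^M] for the input vector p."""
    activations = [p]
    for m, (W, b) in enumerate(layers):
        a_prev = activations[-1]
        n = [sum(w * a for w, a in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        last = m == len(layers) - 1
        # Assumption: the output layer is linear, hidden layers are logistic.
        activations.append(n if last else [logsig(v) for v in n])
    return activations

def backward(layers, activations, t):
    """Step 2: return the sensitivities [s^1, ..., s^M]."""
    # Linear output layer: the derivative is 1, so s^M = -2 (t - a).
    s = [-2.0 * (t_k - a_k) for t_k, a_k in zip(t, activations[-1])]
    sensitivities = [s]
    for m in range(len(layers) - 1, 0, -1):
        W_next = layers[m][0]      # W^{m+1}
        a_m = activations[m]       # a^m, output of hidden layer m
        back = [sum(W_next[k][j] * s[k] for k in range(len(s)))
                for j in range(len(a_m))]
        # Derivative of the logistic function is a (1 - a).
        s = [a * (1.0 - a) * v for a, v in zip(a_m, back)]
        sensitivities.insert(0, s)
    return sensitivities

def update(layers, activations, sensitivities, alpha):
    """Step 3: W^m -= alpha * s^m (a^{m-1})^T and b^m -= alpha * s^m, in place."""
    for (W, b), s, a_prev in zip(layers, sensitivities, activations):
        for i, s_i in enumerate(s):
            for j, a_j in enumerate(a_prev):
                W[i][j] -= alpha * s_i * a_j
            b[i] -= alpha * s_i

def make_layers(sizes, rng):
    """Small random initial weights and biases for the given layer sizes."""
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [rng.uniform(-0.5, 0.5) for _ in range(n_out)])
            for n_in, n_out in zip(sizes, sizes[1:])]

def mean_squared_error(layers, data):
    return sum((t[0] - forward(layers, p)[-1][0]) ** 2 for p, t in data) / len(data)

rng = random.Random(0)
layers = make_layers([1, 4, 1], rng)
# Illustrative training set: inputs in [-2, 2], targets 1 + sin(pi * p / 4).
data = [([x / 5.0], [1.0 + math.sin(math.pi * x / 20.0)]) for x in range(-10, 11)]

initial_mse = mean_squared_error(layers, data)
for epoch in range(200):
    for p, t in data:                      # one training pair at a time
        activations = forward(layers, p)
        sensitivities = backward(layers, activations, t)
        update(layers, activations, sensitivities, alpha=0.05)
final_mse = mean_squared_error(layers, data)
```

Comparing initial_mse with final_mse shows whether the weight adjustments are reducing the error. A real training run would normally stop once the error is acceptable instead of after a fixed number of epochs.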

Network Training
• Supervised Learning
  • The network is presented with the input and the desired output.
  • Uses a set of inputs for which the desired outputs (results / classes) are known. The difference between the desired and actual output is used to calculate adjustments to the weights of the NN structure.
• Unsupervised Learning
  • The network is not shown the desired output.
  • The concept is similar to clustering.
  • It tries to create a classification in the outcome.

Unsupervised Learning
• Only input stimuli (parameters) are presented to the network. The network is self-organizing: it organizes itself internally so that each hidden processing element and its weights respond appropriately to a different set of input stimuli.
• No knowledge is supplied about the classification of outputs. However, the number of categories into which the network classifies the inputs can be controlled by varying certain parameters in the model. In any case, a human expert must examine the final classifications to assign meaning to the results and judge their usefulness.
Reinforcement Learning
• In between supervised and unsupervised learning.
• The network gets feedback from the environment.

Learning (Training) Algorithms
The training process requires a set of properly selected data in the form of network inputs and target outputs. During training, the weights and biases are iteratively adjusted to minimize the network performance function (error). The default performance function is mean square error. Input data should be independent.
Back-Propagation learning algorithm
There are many variations. The commonly used one is the gradient descent algorithm:
$x_{k+1} = x_k - \alpha_k g_k$
where $x_k$ is the vector of current weights and biases, $g_k$ is the current gradient, and $\alpha_k$ is the chosen learning rate.
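
As a small illustration of the gradient descent update, the sketch below minimizes the mean square error of a one-weight, one-bias linear model. The data set, the constant learning rate of 0.1 and the function name mse_and_gradient are assumptions made for the example.

```python
def mse_and_gradient(x, inputs, targets):
    """Return the mean square error and its gradient for the model w * p + b."""
    w, b = x
    n = len(inputs)
    errors = [(w * p + b) - t for p, t in zip(inputs, targets)]
    mse = sum(e * e for e in errors) / n
    g = [2.0 * sum(e * p for e, p in zip(errors, inputs)) / n,   # d(mse)/dw
         2.0 * sum(errors) / n]                                  # d(mse)/db
    return mse, g

inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]     # generated by t = 2p + 1

x = [0.0, 0.0]                     # current weights and biases, x_k
alpha = 0.1                        # learning rate, kept constant here
for k in range(2000):
    mse, g = mse_and_gradient(x, inputs, targets)
    x = [x_i - alpha * g_i for x_i, g_i in zip(x, g)]   # x_{k+1} = x_k - alpha * g_k
```

After the loop, x should be close to the generating values w = 2 and b = 1. A learning rate that is too large would make the iteration diverge instead.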

Back Propagation Learning Algorithm
• It is the most commonly used generalization of the delta rule. The procedure involves two phases:
  • Forward phase: when the input is presented, it propagates forward through the network to compute output values for each processing element (PE). For each PE, the current outputs are compared with the desired outputs and the error is computed.
  • Backward phase: the calculated error is now fed backward and the weights are adjusted.
• After completing both phases, a new input is presented for further training.
• This technique is slow, can cause instability, and has a tendency to get stuck in local minima, but it is still very popular.

Gradient Descent Algorithm
The idea is to calculate an error each time the network is presented with a training vector (given that we have supervised learning, where there is a target vector) and to perform a gradient descent on the error, considered as a function of the weights. There will be a gradient, or slope, for each weight. Following these slopes downhill, we find the weights that give the minimal error. Typically the error criterion is defined as the square of the difference between the pattern output and the target output (least squared error). The total error E is then just the sum of the squared pattern errors.

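Error function (LMS)
$E = \sum_p \sum_k (t_{pk} - o_{pk})^2$, where $t_{pk}$ is the target output and $o_{pk}$ is the network output of output unit $k$ for pattern $p$.
Note: LMS = least mean square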

This method of weight adjustment is also known as the steepest gradient descent technique or the Widrow-Hoff rule, and it is the most common type. It is also known as the delta rule.
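
For a single pattern, the link between the squared error and the delta rule can be written out as a short derivation. It assumes a linear output unit, and the symbols $x_i$ (input), $w_{ki}$ (weight from input $i$ to output $k$) and $\eta$ (learning rate) are notation introduced for this sketch.

```latex
% Squared error for one training pattern (pattern index omitted)
E_p = \sum_k (t_k - o_k)^2, \qquad o_k = \sum_i w_{ki}\, x_i

% Slope of the error with respect to a single weight
\frac{\partial E_p}{\partial w_{ki}} = -2\,(t_k - o_k)\, x_i

% Step against the slope; the constant 2 is absorbed into the learning rate \eta
\Delta w_{ki} = \eta\,(t_k - o_k)\, x_i
```

Each weight therefore changes in proportion to the output error and to the input that fed it.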

Network Learning Rules
Hebbian Rule
The first and best-known learning rule was introduced by Donald Hebb. The basic rule is: if a neuron receives an input from another neuron, and if both are highly active (mathematically, have the same sign), the weight between the neurons should be strengthened.
$w_{ij}(t+1) = w_{ij}(t) + \eta\, x_i(t)\, y_j(t)$
where $x_i(t)$ and $y_j(t)$ are the outputs at nodes $i$ and $j$, $w_{ij}$ is the weight between nodes $i$ and $j$, and $\eta$ is the learning rate.
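
A minimal sketch of the Hebbian rule follows. The function name hebbian_update, the learning rate eta = 0.1 and the example activity values are assumptions made for illustration.

```python
def hebbian_update(weights, x, y, eta=0.1):
    """Return new weights where weights[i][j] grows by eta * x[i] * y[j]."""
    return [[w_ij + eta * x_i * y_j for w_ij, y_j in zip(row, y)]
            for row, x_i in zip(weights, x)]

weights = [[0.0], [0.0]]       # two sending nodes i, one receiving node j
x = [1.0, -1.0]                # outputs at nodes i
y = [1.0]                      # output at node j
new_weights = hebbian_update(weights, x, y)
```

The weight between nodes whose outputs share a sign increases, while the weight between nodes of opposite sign decreases.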

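Backpropagation: The Math
• General multi-layered neural network
[Figure: a network with an input layer (nodes 0, 1), a hidden layer (nodes 0, 1, …, i) and an output layer (nodes 0 to 9). The connections between the input and hidden layers are labelled W (W0,0, W1,0, …, Wi,0) and those between the hidden and output layers are labelled X (X0,0, X1,0, …, X9,0).]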

Backpropagation: The Math (Contd.)
• Backpropagation
  • Calculation of hidden layer activation values
  • Calculation of output layer activation values
  • Calculation of error: $d_k = f(D_k) - f(O_k)$
  • Gradient descent objective function
  • Gradient descent termination condition
  • Output layer weight recalculation, using the learning rate (e.g. 0.25) and the error at k
  • Hidden layer weight recalculation

Backpropagation Using Gradient Descent
• Advantages
  • Relatively simple implementation
  • Standard method and generally works well
• Disadvantages
  • Slow and inefficient
  • Can get stuck in local minima, resulting in sub-optimal solutions

Local Minima
[Figure: error curve with a local minimum and the global minimum marked]

Alternatives To Gradient Descent
• Simulated Annealing
  • Advantages
    • Can reach the optimal solution (global minimum); this is guaranteed only in theory, with a sufficiently slow cooling schedule
  • Disadvantages
    • May be slower than gradient descent
    • Much more complicated implementation
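
A minimal simulated annealing sketch on a one-dimensional error function with two minima is given below. The error function, the geometric cooling schedule, the step size and all parameter values are assumptions chosen for illustration, not a tuned implementation.

```python
import math
import random

def simulated_annealing(error, x0, temperature=1.0, cooling=0.995,
                        steps=5000, step_size=0.5, seed=0):
    """Random search that sometimes accepts worse points while the temperature is high."""
    rng = random.Random(seed)
    x, e = x0, error(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = error(candidate)
        delta = e_candidate - e
        # Always accept improvements; accept a worse point with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
        temperature *= cooling      # assumed geometric cooling schedule
    return best_x, best_e

def error(x):
    # Two minima: a shallow one near x = 1.13 and a deeper one near x = -1.30.
    return x ** 4 - 3.0 * x ** 2 + x

start = 1.0                          # starts in the basin of the shallow minimum
best_x, best_e = simulated_annealing(error, start)
```

Because worse points can be accepted early on, the search is able to climb out of the shallow basin, which plain gradient descent started at the same point could not do. A finite run like this one offers no guarantee of reaching the deeper minimum.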

Alternatives To Gradient Descent
• Genetic Algorithms / Evolutionary Strategies
  • Advantages
    • Faster than simulated annealing
    • Less likely to get stuck in local minima
  • Disadvantages
    • Slower than gradient descent
    • Memory intensive for large nets
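
One very simple evolutionary strategy can be sketched as follows: keep the better half of a population of weight vectors and refill the rest with mutated copies. The population size, mutation width, error function and the name evolve are assumptions for this example; practical genetic algorithms usually add crossover and more careful selection.

```python
import random

def evolve(error, dim, population_size=20, generations=100, sigma=0.3, seed=0):
    """Return the best weight vector found and the best error per generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[: population_size // 2]     # survivors are kept unchanged
        children = [[w + rng.gauss(0.0, sigma) for w in rng.choice(parents)]
                    for _ in range(population_size - len(parents))]
        population = parents + children
    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history

inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

def error(weights):
    """Summed squared error of the linear model w * p + b on the toy data."""
    w, b = weights
    return sum((w * p + b - t) ** 2 for p, t in zip(inputs, targets))

best, history = evolve(error, dim=2)
```

The whole population of weight vectors is held in memory at once, which is where the memory cost for large networks comes from. No gradient is needed, only evaluations of the error.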

Alternatives To Gradient Descent
• Simplex Algorithm
  • Advantages
    • Similar to gradient descent but faster
    • Easy to implement
  • Disadvantages
    • Does not guarantee a global minimum

Enhancements To Gradient Descent
• Momentum
  • Adds a percentage of the last movement to the current movement
  • Useful to get over small bumps in the error function
  • Often finds a minimum in fewer steps
  • Δw(t) = -n*d*y + a*Δw(t-1)
    • Δw is the change in weight
    • n is the learning rate
    • d is the error
    • y is different depending on which layer we are calculating
    • a is the momentum parameter
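
The momentum update can be sketched for a single weight. In this example the argument gradient stands in for the product d*y, and the quadratic error function, n = 0.1 and a = 0.5 are assumptions chosen for illustration.

```python
def momentum_step(weight, gradient, previous_change, n=0.1, a=0.5):
    """Apply change = -n * gradient + a * previous_change and return both values."""
    change = -n * gradient + a * previous_change
    return weight + change, change

w, change = 0.0, 0.0
for _ in range(200):
    gradient = 2.0 * (w - 3.0)       # slope of the toy error E(w) = (w - 3)^2
    w, change = momentum_step(w, gradient, change)
```

The second return value has to be carried over to the next call, because each step reuses a fraction of the previous change.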

Enhancements To Gradient Descent
• Adaptive Backpropagation Algorithm
  • It assigns each weight its own learning rate
  • That learning rate is determined by comparing the sign of the current gradient of the error function with its sign in the last iteration
  • If the signs are equal, the slope is more likely to be shallow, so the learning rate is increased
  • The signs are more likely to differ on a steep slope, so the learning rate is decreased
  • This speeds up progress on gradual slopes

Enhancements To Gradient Descent
• Adaptive Backpropagation
  • Possible problems:
    • Since we minimize the error for each weight separately, the overall error may increase
  • Solution:
    • Calculate the total output error after each adaptation; if it is greater than the previous error, reject that adaptation and calculate new learning rates
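
The per-weight learning rates and the rejection step can be sketched as below. The increase and decrease factors (1.1 and 0.5), the halving of all rates after a rejected step, the toy error function and the function names are assumptions for this example. The slides do not specify how the new learning rates are calculated.

```python
def adaptive_step(weights, rates, gradient, previous_gradient, up=1.1, down=0.5):
    """Adjust each weight's learning rate by gradient sign agreement, then step."""
    new_rates = []
    for rate, g, g_prev in zip(rates, gradient, previous_gradient):
        if g * g_prev > 0:
            new_rates.append(rate * up)      # same sign: speed up
        elif g * g_prev < 0:
            new_rates.append(rate * down)    # sign flipped: slow down
        else:
            new_rates.append(rate)
    new_weights = [w - r * g for w, r, g in zip(weights, new_rates, gradient)]
    return new_weights, new_rates

def train(error, gradient, weights, rate=0.05, steps=200):
    rates = [rate] * len(weights)
    previous_gradient = [0.0] * len(weights)
    current_error = error(weights)
    for _ in range(steps):
        g = gradient(weights)
        candidate, candidate_rates = adaptive_step(weights, rates, g, previous_gradient)
        candidate_error = error(candidate)
        if candidate_error > current_error:
            # Reject the adaptation; assumed recovery: halve every learning rate.
            rates = [r * 0.5 for r in rates]
            continue
        weights, rates, current_error = candidate, candidate_rates, candidate_error
        previous_gradient = g
    return weights, current_error

def error(w):
    # Toy error surface that is much steeper along w[1] than along w[0].
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2

def gradient(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]

initial_error = error([0.0, 0.0])
weights, final_error = train(error, gradient, [0.0, 0.0])
```

Because every step that raises the total error is rejected, the error never increases from one accepted step to the next.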

Enhancements To Gradient Descent
• SuperSAB (Super Self-Adapting Backpropagation)
  • Combines the momentum and adaptive methods
  • Uses the adaptive method and momentum so long as the sign of the gradient does not change
    • This is an additive effect of both methods, resulting in a faster traversal of gradual slopes
  • When the sign of the gradient does change, the momentum will cancel the drastic drop in learning rate
    • This allows the function to roll up the other side of the minimum, possibly escaping local minima

Enhancements To Gradient Descent
• SuperSAB
  • Experiments show that SuperSAB converges faster than gradient descent
  • Overall this algorithm is less sensitive (and so is less likely to get caught in local minima)

Other Ways To Minimize Error
• Varying training data
  • Cycle through input classes
  • Randomly select from input classes
• Add noise to training data
  • Randomly change value of input node (with low probability)
• Retrain with expected inputs after initial training
  • E.g. speech recognition
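
Adding noise to the training data can be sketched as a small helper that perturbs each input value with a low probability. The Gaussian noise, the 5% probability, the noise scale and the name add_input_noise are assumptions for this example.

```python
import random

def add_input_noise(vector, probability=0.05, scale=0.1, rng=random):
    """Return a copy of vector where each value is perturbed with the given probability."""
    return [v + rng.gauss(0.0, scale) if rng.random() < probability else v
            for v in vector]

rng = random.Random(0)
clean = [0.0, 1.0, 0.5, 0.25]
noisy = add_input_noise(clean, probability=0.05, rng=rng)
```

The helper would typically be applied to each input vector just before it is presented to the network, so the stored training set itself stays unchanged.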

Other Ways To Minimize Error
• Adding and removing neurons from layers
  • Adding neurons speeds up learning but may cause loss in generalization
  • Removing neurons has the opposite effect

Applications of Backpropagation
• Image analysis
• Recognition of text in images
• Finding oil fields
• Source code recognition
• Reproducing similar sounds
• Robotics

Code recognizer

Case study
• A mad scientist wants to make billions of dollars by controlling the stock market. He will do this by controlling the stock purchases of several wealthy people. The scientist controls the information that Wall Street insiders can give out and has a device to control how much different people trust each other. Using his ability to input insider information and control trust between people, he will control the purchases made by wealthy individuals. If purchases can be made that are ideal for the mad scientist, he can gain capital by controlling the market.

Information is planted at the top level to Wall Street insiders. They then relay this information to stock brokers who are their friends. The brokers then relay that information to their favorite wealthy clients who then make trades. The weight for each edge is the amount of trust that person has for the person above them. The more they trust a person, the more likely they are to either pass along information or make a trade based on the information.

Case study (Contd.)
• As a mad scientist, you will need to adjust this social network in order to create optimal actions in the marketplace. You do this using your secret Trust 'o' Vac 2000. With it you can increase or decrease each trust weight as you see fit. You then observe the trades that are made by the rich dudes. If the trades are not to your liking, then we count this as error. The more to your liking the trades are, the less error they contain. Ideally, you want to slowly adjust the network so that it gets closer and closer to what you want and contains less error. In general terms this is referred to as gradient descent.

As you place insider information, you observe the amount of error coming out of your network. If a person is making trades that are rather poor, you need to figure out where they are getting the information to do so. A strong trust (shown by a thick line) indicates where more error is coming from and where larger changes need to be made.

Case study (Contd.)
There are many ways in which we can adjust the trust weights, but we will use a very simple method here. Each time we place some insider information, we watch the trades that come from our rich dudes. If there is a large error coming from one rich dude, then they are getting bad information from someone they trust too much or are not getting good information from someone they should trust more. When the mad scientist sees this, he uses the Trust 'o' Vac 2000 to weaken a strong trust by a little and strengthen a weak trust by a little. Thus, we try to slowly cut off the source of bad information and increase the source of good information going to the rich dudes.

We next have to adjust the trust weights between the CEOs and the brokers. We do this by propagating error backwards: if a strong weight exists between a broker and a rich dude who is making bad purchases on a regular basis, then we can attribute that error to the broker. We can then make the rich dude trust this broker less and also adjust the weights of trust between the broker and the fat cats in a similar way.

The End Thanks for your patience
