
MITM 613 Intelligent System








  1. MITM 613 Intelligent System Chapter 8: Neural Networks

  2. Chapter Eight : Neural Networks • 8.1 Introduction • 8.2 Neural network applications • Nonlinear estimation • Classification • Clustering • Content-addressable memory • 8.3 Nodes and interconnections • 8.4 Single and multilayer perceptrons • Network topology • Perceptrons as classifiers • Training a perceptron • Hierarchical perceptrons • Some practical considerations • 8.5 The Hopfield network • 8.6 MAXNET • 8.7 The Hamming network • 8.8 Adaptive Resonance Theory (ART) networks • 8.9 Kohonen self-organizing networks • 8.10 Radial basis function networks Abdul Rahim Ahmad

  3. Artificial neural networks Abdul Rahim Ahmad

  4. ANN - A family of techniques for numerical learning. ANNs consist of many nonlinear computational elements which form the network nodes, or neurons, linked by weighted interconnections. They are analogous in structure to biological neural systems, but are much simpler and are effective at certain tasks, such as classification. Generally, "neural network" is taken to mean artificial neural network. Artificial Neural Networks (ANN) Abdul Rahim Ahmad

  5. Using neural networks is described as connectionism. Each node in a neural network may have several inputs, each of which has an associated weighting. The node performs a simple computation on its input values, which are single integers or real numbers, to produce a single numerical value as its output. The output from a node can either form an input to other nodes or be part of the output from the network as a whole. Artificial neural networks Abdul Rahim Ahmad

  6. Overall effect -> a pattern of numbers is generated at its outputs in response to a pattern of numbers at its inputs. These patterns of numbers are one-dimensional arrays known as vectors, e.g., (0.1, 1.0, 0.2). Each neuron performs its computation independently. Outputs from some neurons may form the inputs to others. Thus, neural networks have a highly parallel structure, allowing them to explore many competing hypotheses simultaneously. Artificial neural networks Abdul Rahim Ahmad

  7. Parallelism gives ANNs the chance to take advantage of parallel processing computers. ANNs can also run on conventional serial computers, although execution takes longer. ANNs are tolerant of the failure of individual neurons or interconnections, and their performance degrades gracefully if localized failures occur within the network. The weights on the node interconnections, together with the overall topology, define the output vector that is derived by the network from a given input vector. Artificial neural networks Abdul Rahim Ahmad

  8. In supervised learning: • Examples are presented along with the corresponding desired output vectors. • Weights are adjusted with each iteration until the actual output for each input is close to the desired vector. • In unsupervised learning: • Examples are presented without any corresponding desired output vectors. • Weights are adjusted in accordance with naturally occurring patterns in the data using a suitable training algorithm. • The output vector represents the position of the input vector within the discovered patterns of the data. The Weights Abdul Rahim Ahmad

  9. When presented with noisy or incomplete data, an ANN produces an approximate answer rather than an incorrect one. When presented with unfamiliar data within the range of its previously seen examples, an ANN will generally produce a reasonable output interpolated between the example outputs. However, an ANN is unable to extrapolate reliably beyond the range of the previously seen examples. Fuzzy logic can also be used for interpolation; therefore, ANNs and fuzzy logic are alternative solutions to engineering problems and may be combined in a hybrid system. Abdul Rahim Ahmad

  10. ANN can be applied to many tasks. • An ANN associates an input vector (x1, x2, … xn) with an output vector (y1, y2, … ym). • The function linking the input and output may be unknown and can be highly nonlinear. • (A linear function is one that can be represented as f(x) = mx + c, where m and c are constants; a nonlinear one may include higher-order terms in x, or trigonometric or logarithmic functions of x.) ANN applications Abdul Rahim Ahmad

  11. The ANN technique can determine values of variables that cannot be measured easily, but are known to depend on other, more accessible variables. • The measurable variables form the network input vector and the unknown variables constitute the output vector. • In nonlinear estimation, the network is initially trained using a set of examples known as the training data. • Supervised learning is used; • i.e., each example in the training data comprises two vectors: an input vector and its corresponding desired output vector. • This assumes that some values for the less accessible variables have been obtained to form the desired outputs. Application 1: Nonlinear estimation Abdul Rahim Ahmad

  12. During training, the network learns to associate the example input vectors with their desired output vectors. When it is subsequently presented with a previously unseen input vector, the network is able to interpolate between similar examples in the training data to generate an output vector. Application 1: Nonlinear estimation Abdul Rahim Ahmad
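As an illustration of nonlinear estimation by supervised learning (not taken from the slides), the sketch below trains a small MLP to estimate an assumed hard-to-measure quantity, here y = sin(x), from a measurable input x. The use of scikit-learn's MLPRegressor and all parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Training data: the "accessible" variable x and a harder-to-measure y = sin(x),
# playing the role of the unknown nonlinear relationship (assumed example).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 2 * np.pi, size=(200, 1))
y_train = np.sin(X_train).ravel()

# Supervised learning: each training example pairs an input vector with its desired output.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

# A previously unseen input: the network interpolates between nearby training examples.
print(net.predict([[1.234]]), np.sin(1.234))
```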

  13. The output vector classifies the input into one of a set of known possible classes. • Example: a speech recognition system: • Classify the input into 3 different words: yes, no, and maybe. • Input: preprocessed, digitized sound of the words • Output: (0, 0, 1) for yes, (0, 1, 0) for no, (1, 0, 0) for maybe. • During training, the network learns to associate similar input vectors with a particular output vector. • When it is subsequently presented with a previously unseen input vector, the network selects the output vector that offers the closest match. Classification Abdul Rahim Ahmad
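A minimal sketch of the output encoding described above, assuming the one-hot vectors from this slide and the 0.5 winning threshold mentioned later (slide 22); the decode helper is a hypothetical name introduced for illustration.

```python
import numpy as np

# One-hot target encoding from the slide: yes = (0,0,1), no = (0,1,0), maybe = (1,0,0).
classes = ["maybe", "no", "yes"]           # index i corresponds to a 1 in position i

def decode(output_vector, threshold=0.5):
    """Map a network output vector to a class label, or None if no node wins clearly."""
    i = int(np.argmax(output_vector))
    return classes[i] if output_vector[i] > threshold else None

print(decode(np.array([0.1, 0.2, 0.9])))    # -> 'yes'
print(decode(np.array([0.3, 0.4, 0.35])))   # -> None (no node exceeds the threshold)
```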

  14. Unsupervised learning • Input vectors are clustered into N groups (N is an integer that may be prespecified or may be allowed to grow according to the diversity of the data). • Example: in speech recognition • Input: only spoken words • Training: cluster together examples that are similar to each other (e.g., according to different words or voices). • Once the clusters have formed, a second neural network is trained to associate each cluster with a particular desired output. • The overall system then becomes a classifier, where the first network is unsupervised and the second one is supervised. • Clustering is useful for data compression and is an important aspect of data mining, i.e., finding patterns in complex data. Clustering Abdul Rahim Ahmad

  15. A form of unsupervised learning: no desired output vectors are associated with the training data. During training, each example input vector becomes stored in a dispersed form throughout the network. When a previously unseen vector is subsequently presented to the network, it is treated as though it were an incomplete or error-ridden version of one of the stored examples, so the network regenerates the stored example that most closely resembles the presented vector. This can be thought of as a type of classification, where each of the examples in the training data belongs to a separate class, and each represents the ideal vector for that class. Content-addressable memory Abdul Rahim Ahmad

  16. A node, or neuron, is a simple computing element having an input side and an output side. Each node may have directional connections to many other nodes at both its input and output sides. Each input xi is multiplied by its associated weight wi. Typically, the node's role is to sum its weighted inputs and add a bias term w0 to form an intermediate quantity called the activation, a. It then passes the activation through a nonlinear function ft known as the transfer function or activation function. The figure shows the function of a single neuron. Nodes and interconnections Abdul Rahim Ahmad

  17. The behavior of a neural network depends on its topology, the weights, the bias terms, and the transfer function. The weights and biases can be learned, and the learning behavior of a network depends on the chosen training algorithm. Typically a sigmoid function is used as the transfer function. For each neuron, the activation is given by a = w0 + w1x1 + w2x2 + … + wnxn, where n is the number of inputs and the bias term w0 is defined separately for each node. Nodes and interconnections Abdul Rahim Ahmad

  18. Non-linear transfer function: Typical Transfer Functions • Sigmoid function • Step function • Ramp function Abdul Rahim Ahmad
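A minimal sketch of a single node and the three transfer functions listed above, in Python/NumPy; the example input values, weights, and the particular ramp definition (clipped between 0 and 1) are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))         # smooth, output in (0, 1)

def step(a):
    return np.where(a >= 0.0, 1.0, 0.0)     # hard threshold at a = 0

def ramp(a):
    return np.clip(a, 0.0, 1.0)             # linear between 0 and 1, clipped outside

def neuron(x, w, w0, ft=sigmoid):
    """Single node: weighted sum of inputs plus bias, passed through transfer function ft."""
    a = w0 + np.dot(w, x)                   # the activation, a
    return ft(a)

x = np.array([0.1, 1.0, 0.2])               # example input vector (assumed values)
w = np.array([0.4, -0.7, 0.2])              # weights (assumed values)
print(neuron(x, w, w0=0.1))
```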

  19. The neurons are organized in layers. Each neuron is totally connected to the neurons in the layers above and below, but not to the neurons in the same layer. These networks are also called feedforward networks. MLPs can be used either for classification or as nonlinear estimators. The number of nodes in each layer and the number of layers are determined by the network builder, often on a trial-and-error basis. There is always an input layer and an output layer; the number of nodes in each is determined by the number of inputs and outputs being considered. MultiLayer Perceptron (MLP) Abdul Rahim Ahmad

  20. An MLP can have any number of hidden layers between the input and output layers. • The hidden layers have no obvious meaning associated with them. • If there are no hidden layers, the network is a single layer perceptron (SLP). • The network shown has: • three input nodes • two hidden layers with four nodes each • one output layer of two nodes. • Its short-form name is a 3–4–4–2 MLP. MultiLayer Perceptron (MLP) Abdul Rahim Ahmad

  21. Data are fed forward from the input layer, through the hidden layers, to the output layer. • Except for nodes in the input layer, the inputs to a node are the outputs from each node in the previous layer. • At each node except the input layer, the data are weighted, summed, added to the bias, and then passed through the transfer function. • In counting the layers, the input nodes are not included since they do not perform any processing. • The network in the figure is therefore a three-layer MLP. MultiLayer Perceptron (MLP) Abdul Rahim Ahmad
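A sketch of a forward pass through the 3–4–4–2 MLP described above, assuming sigmoid transfer functions and random, untrained weights; it shows only how data are weighted, summed, biased, and passed through the transfer function at each layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, layers):
    """Feed a vector forward through an MLP; each layer is a (weights, biases) pair."""
    y = x
    for W, b in layers:
        y = sigmoid(W @ y + b)       # weight, sum, add bias, apply transfer function
    return y

# A 3-4-4-2 MLP with random (untrained) weights, matching the slide's example topology.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
layers = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

print(forward(np.array([0.1, 1.0, 0.2]), layers))   # two output values in (0, 1)
```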

  22. Normally there is one input node for each element of the input vector and one output node for each element of the output vector. • Each output node would usually represent a particular class. • The typical representation for a class is an output of ~1 on one node and ~0 on the rest. • To allow for inputs that do not fall into any class, the winning node must exceed a predetermined threshold, such as 0.5. • Other representations are possible, such as two output nodes representing four classes: (0,0), (0,1), (1,0), and (1,1). Perceptrons as classifiers Abdul Rahim Ahmad

  23. Example: single layer perceptron • Input: 2 neurons • Output: 3 classes • Each class has 1 dividing line • Linearly separable Linear classifiers • The output of a node, prior to application of the transfer function, is given by a = w1x1 + w2x2 + w0. • The dividing criterion is assumed to be a = 0, corresponding to an output of 0.5 after the application of the sigmoid transfer function. • Thus the hyperplane that separates the two regions is given by w1x1 + w2x2 + w0 = 0. • In the form of a straight line: x2 = −(w1/w2)x1 − w0/w2. Abdul Rahim Ahmad

  24. Multilayer perceptron • one hidden layer • a differentiable, i.e., smooth, transfer function such as the sigmoid function Nonlinear classifiers • The first layer divides the state space with straight lines (or hyperplanes). • The 2nd layer forms multifaceted regions by Boolean combinations (AND, OR, and NOT) of the linearly separated regions. • To perform any nonlinear mapping or classification with an MLP: • with a sigmoid transfer function, one hidden layer is needed; • with a step transfer function, no more than two hidden layers are required. • Learning cannot be guaranteed; the final topology involves trial and error: start small, then expand. Abdul Rahim Ahmad
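To illustrate the Boolean-combination idea, the sketch below uses a 2–2–1 MLP with step transfer functions and hand-chosen (not learned) weights to implement XOR, a mapping that no single-layer perceptron can achieve; the weight values are assumptions for illustration.

```python
import numpy as np

def step(a):
    return (a >= 0.0).astype(float)

# Hidden layer: two linear separations of the plane (an OR-like line and an AND-like line).
W_hidden = np.array([[1.0, 1.0],      # h1 ~ x1 OR  x2
                     [1.0, 1.0]])     # h2 ~ x1 AND x2
b_hidden = np.array([-0.5, -1.5])

# Output layer: Boolean combination of the two regions (h1 AND NOT h2) -> XOR.
w_out = np.array([1.0, -1.0])
b_out = -0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(W_hidden @ np.array(x, dtype=float) + b_hidden)
    y = step(w_out @ h + b_out)
    print(x, "->", int(y))            # prints the XOR truth table: 0, 1, 1, 0
```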

  25. Training separates the regions in state space by adjusting the network's weights and biases. The difference between the generated value and the desired value is the error. The overall error is expressed as the root mean square (RMS) of the individual errors (both −ve and +ve). Training minimizes the RMS error by altering the weights and biases through many passes of the training data. This search for the weights and biases that give the minimum RMS error is an optimization problem with the RMS error as the cost function. When the RMS error settles within a small range, we say that the network has converged. Training a perceptron Abdul Rahim Ahmad
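A minimal sketch of the RMS error used as the cost function; the desired and actual output values are assumed example numbers.

```python
import numpy as np

def rms_error(desired, actual):
    """Root-mean-square of the errors; negative and positive errors contribute equally."""
    errors = np.asarray(desired) - np.asarray(actual)
    return np.sqrt(np.mean(errors ** 2))

# Assumed example values: desired versus generated outputs for four training examples.
print(rms_error([1.0, 0.0, 1.0, 0.0], [0.9, 0.2, 0.7, 0.1]))   # ~0.19
```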

  26. The most common is the back-error propagation (BP) algorithm (or generalized delta rule), a gradient-proportional descent technique that requires a continuous and differentiable transfer function such as the sigmoid. For the sigmoid function y = 1/(1 + e^−a), the derivative is dy/da = y(1 − y). Training Algorithm Abdul Rahim Ahmad

  27. Nomenclature for BP algorithm Abdul Rahim Ahmad

  28. In BP, the bias input is always 1, so the bias term is adjusted like any other weight. The delta rule determines the modifications to the weights connecting the nodes j in layer A to the nodes i in the following layer B (B = A + 1). Neurons in the output and hidden layers each have an error term, δ. When the sigmoid transfer function is used, δAj is computed from the node's output together with either the desired output (for output nodes) or the weighted δ terms of the next layer (for hidden nodes). BP Training Algorithm Abdul Rahim Ahmad

  29. The learning rate, η, is applied to the calculated values of δAj and should be about 0.35. Sometimes a momentum coefficient, α, is included. The momentum term makes each weight change depend partly on the previous weight change. The momentum coefficient must be in the range 0–1. Some suggest setting α to 0.0 for the first few training passes and then increasing it to 0.9. BP Training Algorithm (cont.) Abdul Rahim Ahmad

  30. BP algorithm • 2 stages: • gather the error terms • update the weights • Repeat as many times as required. Abdul Rahim Ahmad
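A minimal back-propagation sketch, assuming sigmoid transfer functions, the squared-error cost, and a constant momentum coefficient (the slides suggest starting α at 0 and raising it later); the XOR training data, network size, and number of passes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class SmallMLP:
    """Minimal 2-input, 3-hidden, 1-output MLP trained with back-error propagation."""

    def __init__(self, n_in=2, n_hidden=3, n_out=1, eta=0.35, alpha=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))   # +1 column for the bias
        self.W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden + 1))
        self.eta, self.alpha = eta, alpha                         # learning rate, momentum
        self.dW1 = np.zeros_like(self.W1)                         # previous weight changes
        self.dW2 = np.zeros_like(self.W2)

    def forward(self, x):
        self.x1 = np.append(x, 1.0)                # bias input is always 1
        self.h = sigmoid(self.W1 @ self.x1)
        self.h1 = np.append(self.h, 1.0)
        self.y = sigmoid(self.W2 @ self.h1)
        return self.y

    def train_step(self, x, d):
        y = self.forward(x)
        # Stage 1: gather the error terms (deltas), output layer first, then hidden layer.
        delta_out = y * (1.0 - y) * (d - y)
        delta_hid = self.h * (1.0 - self.h) * (self.W2[:, :-1].T @ delta_out)
        # Stage 2: update the weights (delta rule plus momentum).
        self.dW2 = self.eta * np.outer(delta_out, self.h1) + self.alpha * self.dW2
        self.dW1 = self.eta * np.outer(delta_hid, self.x1) + self.alpha * self.dW1
        self.W2 += self.dW2
        self.W1 += self.dW1

# Train on XOR (assumed toy data); repeat as many times as required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
net = SmallMLP()
for _ in range(5000):
    for x, d in zip(X, D):
        net.train_step(x, d)
print(np.round([net.forward(x)[0] for x in X], 2))   # typically close to 0, 1, 1, 0
```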

  31. For complex problems it is recommended to divide the MLP into several smaller MLPs arranged in a hierarchy. Each MLP is independent of the others and can be trained separately or in parallel. Hierarchical Perceptrons Abdul Rahim Ahmad

  32. Stop training once the RMS error stops improving, so as not to over-train the network (an over-trained network is expert at giving correct outputs for the training data, but not for new data). • Some causes: • too many cycles of training • an over-complex network (too many hidden layers or neurons) • To avoid it: • divide the data into training, testing, and validation sets • use the leave-one-out method • use scaled data. Some Practical Considerations Abdul Rahim Ahmad
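A small sketch of two of the precautions listed above: splitting the data into training, testing, and validation sets, and scaling the inputs; the split fractions and helper names are assumptions for illustration.

```python
import numpy as np

def split_data(X, y, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle and divide the data into training, testing, and validation subsets."""
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(fractions[0] * len(X))
    n_test = n_train + int(fractions[1] * len(X))
    tr, te, va = idx[:n_train], idx[n_train:n_test], idx[n_test:]
    return (X[tr], y[tr]), (X[te], y[te]), (X[va], y[va])

def scale(X):
    """Scale each input element to zero mean and unit variance (column-wise)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```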

  33. Effects of Over training Abdul Rahim Ahmad

  34. One layer: the nodes are used for both input and output. Used as a content-addressable memory. Input: binary (1 and −1). Output: binary. The transfer function ft is a step nonlinearity. If the network has Nn nodes, then the input and output each comprise Nn binary digits. Weights and biases are set according to the following equations, • where wij is the weighting on the connection from node i to node j, wi0 is the bias on node i, and xik is the ith digit of example k. There are no circular connections from a node to itself, hence wij = 0 where i = j. Hopfield network Abdul Rahim Ahmad

  35. Setting the weights constitutes the learning phase; it results in the examples being stored in a distributed fashion in the network. • A new input is initially the output too (as the nodes are used for both input and output). • The node function is performed on each node in parallel. If this is repeated many times, the output will be progressively modified and will converge on the stored example that most closely resembles the initial input. • The number of examples (Ne) should not exceed 0.15Nn. Learning in Hopfield Abdul Rahim Ahmad
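The slide's weight-setting equations were shown as a figure; the sketch below assumes the standard Hebbian rule (wij = Σk xik xjk for i ≠ j, zero biases) and synchronous updates, which is a common formulation rather than necessarily the slides' exact one.

```python
import numpy as np

def store(examples):
    """Set the weights from bipolar (+1/-1) example vectors using the Hebbian rule."""
    X = np.asarray(examples, dtype=float)
    W = X.T @ X                      # w_ij = sum over examples k of x_ik * x_jk
    np.fill_diagonal(W, 0.0)         # no circular connections: w_ij = 0 where i = j
    return W

def recall(W, x, steps=10):
    """Repeatedly apply the step nonlinearity; the state converges towards a stored example."""
    y = np.asarray(x, dtype=float)
    for _ in range(steps):
        y = np.where(W @ y >= 0.0, 1.0, -1.0)   # all nodes updated in parallel
    return y

examples = [[1, -1, 1, -1, 1, -1, 1, -1],
            [1, 1, 1, 1, -1, -1, -1, -1]]
W = store(examples)
noisy = [1, -1, 1, -1, 1, -1, 1, 1]             # first example with its last digit flipped
print(recall(W, noisy))                          # recovers [1, -1, 1, -1, 1, -1, 1, -1]
```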

  36. Topology: identical to the Hopfield network. Difference: the weights on the circular interconnections, wii, are not always zero. Used to recognize which of its inputs has the highest value, e.g., in conjunction with an MLP to select the output node that generates the highest value. The interconnection weights are set as follows. • Circular connections from a node to itself are allowed in the MAXNET, but are disallowed in the Hopfield network. MAXNET Abdul Rahim Ahmad
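A sketch of the MAXNET iteration, assuming the common weight choice wii = 1 and wij = −ε (with 0 < ε < 1/Nn) for i ≠ j, since the slide's weight equation was shown as a figure.

```python
import numpy as np

def maxnet(scores, eps=None, max_iters=100):
    """Iterate the MAXNET update until only one node remains positive; return its index."""
    y = np.asarray(scores, dtype=float)
    n = len(y)
    eps = eps if eps is not None else 1.0 / (2 * n)    # mutual inhibition, 0 < eps < 1/Nn
    for _ in range(max_iters):
        # Each node keeps its own value (w_ii = 1) and is inhibited by all others (w_ij = -eps).
        y = np.maximum(0.0, y - eps * (y.sum() - y))
        if np.count_nonzero(y) <= 1:
            break
    return int(np.argmax(y))

print(maxnet([0.2, 0.9, 0.4, 0.7]))   # -> 1, the index of the highest initial value
```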

  37. Comparison Abdul Rahim Ahmad

  38. Has two parts: • a two-layer feedforward network: used to compare the input vector with each of the stored examples, awarding a matching score to each example • a MAXNET: used to pick out the example that has attained the highest score • The overall effect is to categorize the input vector. Hamming network Abdul Rahim Ahmad
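A sketch of the Hamming network's feedforward stage, which awards each stored example a score equal to its number of matching bits; in a full Hamming network the MAXNET, rather than argmax, would then pick the winner.

```python
import numpy as np

def matching_scores(x, examples):
    """Feedforward part: score each stored example by how many of its bits match the input."""
    X = np.asarray(examples, dtype=float)      # one bipolar (+1/-1) example per row
    x = np.asarray(x, dtype=float)
    n = X.shape[1]
    # Counting matching bits is equivalent to: score_k = (n + example_k . x) / 2
    return (n + X @ x) / 2.0

examples = [[1, 1, 1, -1, -1, -1],
            [1, -1, 1, -1, 1, -1],
            [-1, -1, -1, 1, 1, 1]]
scores = matching_scores([1, 1, 1, -1, -1, 1], examples)
print(scores)                                  # -> [5, 3, 1] matching bits per example
print(int(np.argmax(scores)))                  # -> 0 (the MAXNET's job in the full network)
```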

  39. Unsupervised; comprises two-way interconnections between the input nodes and a MAXNET. Classifies the incoming data into clusters. The 1st example is stored as a model pattern; the 2nd example is compared with the 1st and either joins the same cluster or is stored as a new model pattern. How the differences are measured is defined by the closeness measure. Each new example is compared with all current model patterns in parallel. Adaptive Resonance Theory (ART) Abdul Rahim Ahmad
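A simplified, ART-inspired clustering sketch for binary vectors; the closeness measure, the AND-style prototype update, and the vigilance value are assumptions for illustration and do not reproduce the full ART algorithm shown on the next slide.

```python
import numpy as np

def art_cluster(vectors, vigilance=0.6):
    """Simplified ART-style clustering of binary (0/1) vectors (a sketch only).

    Each new example is compared with every stored model pattern in parallel; if the best
    match is close enough (>= vigilance), the example joins that cluster and the prototype
    is updated, otherwise the example becomes the prototype of a new cluster.
    """
    prototypes, labels = [], []
    for x in np.asarray(vectors, dtype=int):
        best, best_score = None, -1.0
        for i, p in enumerate(prototypes):
            score = np.sum(p & x) / max(np.sum(x), 1)   # closeness measure (assumed form)
            if score > best_score:
                best, best_score = i, score
        if best is not None and best_score >= vigilance:
            prototypes[best] = prototypes[best] & x      # keep only the features both share
            labels.append(best)
        else:
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
labels, protos = art_cluster(data)
print(labels)    # cluster assignment per example: [0, 0, 1, 1]
```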

  40. ART algorithm Abdul Rahim Ahmad

  41. Unsupervised - also called self-organizing maps (SOMs). • Topology: • Processing nodes arranged in a 2-D array (Kohonen layer) • 1-D layer of input nodes, each input node connected to each node in the Kohonen layer. • Used to cluster together similar patterns. • Learning involves competition between the neurons to respond to a particular input vector. • Weights of “winner” set to generate a high output (~1) • Weights on nearby neurons (neighborhood) adjusted to be high. • Weights on the “losers” are unchanged. Kohonen Self-organizing Networks Abdul Rahim Ahmad

  42. When the trained network is presented with an input pattern, one neuron in the Kohonen layer will produce an output larger than the others, and is said to have fired. When a second similar pattern is presented, the same neuron or one in its neighborhood will fire. As similar patterns cause topologically close neurons to fire, clustering of similar patterns is achieved. Kohonen Self-organizing Networks Abdul Rahim Ahmad

  43. This can be demonstrated by training the network using pairs of Cartesian coordinates: the distribution of the firing neurons corresponds with the Cartesian coordinates represented by the input. • Thus, if the input elements fall in the range between –1 and 1, then an input vector of (–0.9, 0.9) will cause a neuron close to one corner of the Kohonen layer to fire, while an input vector of (0.9, –0.9) will cause a neuron close to the opposite corner to fire. • A SOM can form part of a hybrid network for supervised learning, by passing the coordinates of the firing neuron in the SOM to an MLP. • Learning then takes place in two distinct phases: • first, the Kohonen self-organizing network learns, without supervision, to associate regions in the pattern space with clusters of neurons in the Kohonen layer; • second, an MLP learns to associate the coordinates of the firing neuron in the Kohonen layer with the desired class. Kohonen Self-organizing Networks Abdul Rahim Ahmad
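A minimal SOM sketch trained on random coordinate pairs in the range −1 to 1, as in the example above; the grid size, learning rate, fixed neighborhood radius, and the use of minimum Euclidean distance to pick the firing neuron are assumptions for illustration.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=50, eta=0.5, radius=2, seed=0):
    """Train a small Kohonen layer: the winner and its neighborhood move towards each input."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.uniform(-1.0, 1.0, size=(rows, cols, data.shape[1]))   # one weight vector per node
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for _ in range(epochs):
        for x in data:
            dist = np.linalg.norm(W - x, axis=-1)
            winner = np.unravel_index(np.argmin(dist), dist.shape)   # the node that "fires"
            grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
            neighborhood = (grid_dist <= radius)[..., None]          # winner plus nearby nodes
            W += eta * neighborhood * (x - W)                        # losers are left unchanged
    return W

def fire(W, x):
    """Return the grid coordinates of the neuron that fires for input x."""
    return np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)), W.shape[:2])

# Input vectors are pairs of Cartesian coordinates in the range -1 to 1 (as on the slide).
rng = np.random.default_rng(1)
data = rng.uniform(-1.0, 1.0, size=(500, 2))
W = train_som(data)
print(fire(W, np.array([-0.9, 0.9])), fire(W, np.array([0.9, -0.9])))  # typically opposite corners
```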

  44. Unsupervised (in the hidden layer) and feedforward, with an overall architecture similar to a 3-layer perceptron (i.e., an MLP with one hidden layer). The input and output neurons are similar to those of a perceptron. The transfer function of the neurons in the hidden layer is a radially symmetrical function, a radial basis function (RBF). Radial Basis Function (RBF) networks Abdul Rahim Ahmad

  45. The input neurons simply feed the input data into the nodes above. • The neurons in the output layer produce the weighted sum of their inputs, passed through a linear transfer function. • For an input vector (x1, x2, … xn), a neuron i in the hidden layer produces an output yi = fr(ri), where ri = √((x1 − wi1)² + … + (xn − win)²), • wij are the weights on the inputs to neuron i, and fr is a radial basis function (RBF). RBF networks (cont.) Abdul Rahim Ahmad

  46. The most commonly used RBF is a Gaussian function, fr(ri) = exp(−ri² / 2σi²), where σi is the standard deviation of the distribution described by the function. Each neuron, i, in the hidden layer has its own separate value for σi. Gaussian RBF Abdul Rahim Ahmad

  47. If the set of weights (wi1, wi2, … win) on a given neuron i is treated as the coordinates of a point in pattern space, then ri is the Euclidean distance from there to the point represented by the input vector (x1, x2, … xn). During unsupervised learning, the network adjusts the weights (known as centers in an RBF network) so that each point (wi1, wi2, … win) represents the center of a cluster of data points in pattern space. Training Abdul Rahim Ahmad

  48. The sizes of the clusters are defined by adjusting the variables σi (or equivalent variables if an RBF other than the Gaussian is used). Data points within a certain range, e.g., 2σi, from a cluster center might be deemed members of the cluster. An RBF network can be thought of as drawing circles around clusters in 2-D space, or hyperspheres in n-D space. One such cluster can be identified for each neuron in the hidden layer. Training (cont.) Abdul Rahim Ahmad

  49. For a Gaussian function in 2-D space, it can be seen that a fixed output value (e.g., 0.5) defines a circle in the pattern space. • Hidden layer: unsupervised learning, forming clusters. • Output layer: supervised learning, associating each cluster with a particular class. • Several circular clusters of varying center and size can be associated with a single class. Training (cont.) Abdul Rahim Ahmad
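A small RBF-network sketch using the Gaussian RBF above; for simplicity the centers are spread evenly and σ is fixed rather than learned from the data (standing in for the unsupervised stage), and the linear output layer is fitted by least squares (standing in for the supervised stage). The sine data and all parameter values are assumptions.

```python
import numpy as np

def gaussian_rbf(r, sigma):
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def hidden_outputs(X, centers, sigmas):
    """y_i = f_r(r_i), where r_i is the Euclidean distance from each center to the input."""
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)   # (n_samples, n_centers)
    return gaussian_rbf(r, sigmas)

# Toy data: approximate y = sin(x) on [0, 2*pi] (an assumed example, not from the slides).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel()

# Simplified "unsupervised" stage: evenly spaced centers and a shared sigma per neuron.
centers = np.linspace(0.0, 2 * np.pi, 10)[:, None]
sigmas = np.full(10, 0.7)

# Supervised stage: the output layer is a weighted sum, so its weights (plus a bias)
# can be found by linear least squares.
H = np.column_stack([hidden_outputs(X, centers, sigmas), np.ones(len(X))])
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)

X_test = np.array([[1.0], [2.5], [4.0]])
pred = hidden_outputs(X_test, centers, sigmas) @ w_out[:-1] + w_out[-1]
print(np.round(pred, 2), np.round(np.sin(X_test).ravel(), 2))   # predictions vs true values
```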
