Neural Network Implementations
• Back-propagation networks
• Learning vector quantizer networks
• Kohonen self-organizing feature map networks
• Evolutionary multi-layer perceptron networks
The Iris Data Set
• Consists of 150 four-dimensional vectors (50 plants of each of three Iris species)
• Features are: sepal length, sepal width, petal length, and petal width
• We are working with scaled values in the range [0,1]
• Examples of patterns (the last three values are the one-of-three class label):
• 0.637500 0.437500 0.175000 0.025000 1 0 0
• 0.875000 0.400000 0.587500 0.175000 0 1 0
• 0.787500 0.412500 0.750000 0.312500 0 0 1
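The example patterns above are the raw Iris measurements (in cm) divided by a common factor of 8, which puts every feature into [0,1]. A minimal sketch of that scaling (function name is illustrative):

```python
def scale_iris(raw, factor=8.0):
    """Scale raw Iris measurements (cm) into [0, 1] by dividing by a common factor."""
    return [v / factor for v in raw]

# First example pattern above: an Iris setosa plant, (5.1, 3.5, 1.4, 0.2) cm
pattern = scale_iris([5.1, 3.5, 1.4, 0.2])
```

Dividing all four features by the same factor (rather than min-max scaling each feature separately) preserves the relative magnitudes between features.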
Implementation Issues • Topology • Network initialization and normalization • Feedforward calculations • Supervised adaptation versus unsupervised adaptation • Issues in evolving neural networks
Topology • Pattern of PEs and interconnections • Direction of data flow • PE activation functions • Back-propagation uses at least three layers; LVQ and SOFM use two.
Definition: Neural Network Architecture Specifications sufficient to build, train, test, and operate a neural network
Back-propagation Networks • Software on web site • Topology • Network input • Feedforward calculations • Training • Choosing network parameters • Running the implementation
Elements of an artificial neuron (PE) • Set of connection weights • Linear combiner • Activation function
Back-propagation network input
• Number of inputs depends on application
• Don't combine parameters unnecessarily
• Inputs usually continuous valued over the range [0,1]
• Type float in C++ (IEEE 754 single precision): 1 sign bit, 8 exponent bits, 23 stored fraction bits (24 significant bits); ~7 decimal digits of precision
• Scaling usually used as a preprocessing tool
• Usually scale on like groups of channels
• Amplitude
• Time
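The ~7-decimal-digit limit of single precision can be demonstrated by round-tripping a double through the 32-bit format; a small sketch using Python's struct module:

```python
import struct

def to_float32(x):
    """Round-trip a Python float through IEEE 754 single precision ('f')."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Values with more than ~7 significant decimal digits are rounded;
# exactly representable values (e.g. 0.5) survive unchanged.
rounded = to_float32(0.123456789)
```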
Feedforward Calculations
• Input PEs distribute the signal forward along multiple paths
• Fully connected, in general
• No feedback loops, not even self-feedback
• Additive sigmoid PE is used in our implementation
• Activation of the jth hidden PE:
  h_j = f( Σ_{i=0..n} w_ji x_i )
  where f(·) is the sigmoid function and x_0 = 1 is the bias PE
Feedforward calculations, cont'd.
• Sigmoid function performs a job similar to an electronic amplifier (the gain is the slope)
• Once the hidden-layer activations are calculated, the outputs are calculated:
  y_k = f( Σ_{j=0..m} w_kj h_j )
  where h_0 = 1 is the bias PE of the hidden layer
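The two feedforward equations above can be sketched as a single layer routine applied twice; a minimal sketch (function names are illustrative, not from the book's software):

```python
import math

def sigmoid(net):
    """Sigmoid activation: f(net) = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + math.exp(-net))

def layer_forward(inputs, weights):
    """Activations of one layer; weights[j][0] is the bias weight (input 0 clamped to 1)."""
    extended = [1.0] + list(inputs)
    return [sigmoid(sum(w * x for w, x in zip(row, extended)))
            for row in weights]

def feedforward(pattern, hidden_w, output_w):
    """Full feedforward pass: input -> hidden -> output."""
    hidden = layer_forward(pattern, hidden_w)
    return layer_forward(hidden, output_w)
```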
Training by Error Back-propagation
• Error per pattern: E = (1/2) Σ_k (t_k − y_k)²
• Error signal of the kth output PE: δ_k = (t_k − y_k) y_k (1 − y_k)
• We derived this using the chain rule.
Backpropagation Training, Cont'd.
• But the weights must be initialized before they can be updated
• Often (usually) randomized over [-0.3, 0.3]
• Two ways to update weights:
• On-line, or "single pattern," adaptation
• Off-line, or epoch, adaptation (we use this in our back-prop)
Updating Output Weights
• Basic weight update method: Δw_kj = η δ_k h_j (includes bias weights)
• But this tends to get caught in local minima
• So, introduce a momentum term α in [0,1]:
  Δw_kj(t) = η δ_k h_j + α Δw_kj(t−1)
Updating Hidden Weights
• As derived previously, the error signal of the jth hidden PE is:
  δ_j = h_j (1 − h_j) Σ_k δ_k w_kj
• So: Δw_ji(t) = η δ_j x_i + α Δw_ji(t−1)
• Note: δ's are calculated one pattern at a time, and are calculated using the "old" weights
Keep in mind… In offline training:
• The δ's are calculated pattern by pattern, while the weights are updated once per epoch.
• The values of η and α are usually assigned to the entire network, and left constant after good values are found.
• When the δ's are calculated for the hidden layer, the old (existing) weights are used.
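The offline procedure above (per-pattern δ's with old weights, one update per epoch, momentum-smoothed) can be sketched as follows. This is a minimal illustration, not the book's C++ implementation; all names are hypothetical, and for brevity the momentum terms default to zero:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w):
    """One layer; w[j][0] is the bias weight (input 0 clamped to 1)."""
    ext = [1.0] + list(x)
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, ext))) for row in w]

def train_epoch(patterns, targets, hid_w, out_w, eta=0.5, alpha=0.9,
                prev_hid=None, prev_out=None):
    """One offline (epoch) update: accumulate per-pattern gradients using the
    OLD weights, then apply a single momentum-smoothed weight change."""
    acc_out = [[0.0] * len(r) for r in out_w]
    acc_hid = [[0.0] * len(r) for r in hid_w]
    for x, t in zip(patterns, targets):
        h = forward(x, hid_w)
        y = forward(h, out_w)
        # output error signals: delta_k = (t_k - y_k) y_k (1 - y_k)
        d_out = [(tk - yk) * yk * (1.0 - yk) for tk, yk in zip(t, y)]
        # hidden error signals use the OLD output weights (skip bias column)
        d_hid = [hj * (1.0 - hj) *
                 sum(d_out[k] * out_w[k][j + 1] for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, dk in enumerate(d_out):
            for j, hj in enumerate([1.0] + h):
                acc_out[k][j] += dk * hj
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate([1.0] + list(x)):
                acc_hid[j][i] += dj * xi
    prev_out = prev_out or [[0.0] * len(r) for r in out_w]
    prev_hid = prev_hid or [[0.0] * len(r) for r in hid_w]
    new_out = [[w + eta * g + alpha * p for w, g, p in zip(rw, rg, rp)]
               for rw, rg, rp in zip(out_w, acc_out, prev_out)]
    new_hid = [[w + eta * g + alpha * p for w, g, p in zip(rw, rg, rp)]
               for rw, rg, rp in zip(hid_w, acc_hid, prev_hid)]
    return new_hid, new_out
```

Because δ_k already carries the sign of (t_k − y_k), adding η·δ·x descends the error surface.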
Kohonen Networks
• Probably second only to back-propagation in number of applications
• A rigorous mathematical derivation has not yet appeared
• Seem more biologically oriented than most paradigms
• Reduce the dimensionality of inputs
• We'll consider LVQI, LVQII, and Self-Organizing Feature Maps
Initial Weight Settings
1. Randomize weights over [0,1].
2. Normalize each weight vector to unit length: w_ji ← w_ji / ||w_j||, where ||w_j|| is the Euclidean length of PE j's weight vector
• Note: Randomization tends to place weight vectors in the centroid area of the problem space.
Preprocessing Alternatives
1. Transform each variable onto [-1,1]
2. Then normalize by:
a. Dividing each vector component by the total vector length: x_i ← x_i / ||x||
b. "Z-axis normalization" with a "synthetic" variable: shrink each component by a common scale factor and append a synthetic component chosen so that every input vector has unit length, preserving the magnitude information that (a) discards
c. Assigning a fixed interval (perhaps 0.1 or 1/n, whichever is smaller) to a synthetic variable that is the scale factor in (a), scaled to the fixed interval
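One common form of option (b), assuming the components are already on [-1,1], scales every component by 1/√n and sets the synthetic component to √(1 − L²/n), where L is the original vector length; the result always has unit length. A sketch under that assumption:

```python
import math

def z_axis_normalize(x):
    """Z-axis normalization: scale components by 1/sqrt(n) and append a
    synthetic component so the result has unit Euclidean length.
    Assumes each component is already on [-1, 1]."""
    n = len(x)
    length_sq = sum(v * v for v in x)          # L^2 of the original vector
    scaled = [v / math.sqrt(n) for v in x]
    synthetic = math.sqrt(1.0 - length_sq / n)
    return scaled + [synthetic]

v = z_axis_normalize([0.6, -0.8, 0.0])
```

Unlike plain length normalization, two vectors that differ only in magnitude remain distinguishable after this transform (via the synthetic component).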
Euclidean Distance
d_j = sqrt( Σ_i (x_ki − w_ji)² )  for the jth PE and the kth pattern
Distance Measures
The Minkowski metric of order l: d = ( Σ_i |x_i − w_i|^l )^(1/l)
• l = 1: Hamming (city-block) distance
• l = 2: Euclidean distance
• l = 3: ???
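The general Minkowski metric is a one-liner; the familiar distances fall out for l = 1 and l = 2 (function name is illustrative):

```python
def minkowski(x, w, l):
    """Minkowski distance of order l between pattern x and weight vector w."""
    return sum(abs(a - b) ** l for a, b in zip(x, w)) ** (1.0 / l)

# l = 1 gives city-block distance, l = 2 gives Euclidean distance
d1 = minkowski([0.0, 0.0], [3.0, 4.0], 1)
d2 = minkowski([0.0, 0.0], [3.0, 4.0], 2)
```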
Weight Updating
• Weights are adjusted in the winner's neighborhood only: w_ji(t+1) = w_ji(t) + η(t)[x_i − w_ji(t)]
• Sometimes η is decayed linearly: η(t) = η₀(1 − t/z), where z = total no. of iterations
• Rule of thumb: the number of training iterations should be about 500 times the number of output PEs
• Some people start out with η = 1 or near 1
• The initial neighborhood should include most or all of the output PE field
• Options exist for the configuration of the output slab: ring, cylindrical surface, cube, etc.
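The neighborhood update and the linear η decay above can be sketched as follows, using a 1-D output array for simplicity (the names and the 1-D neighborhood are illustrative assumptions):

```python
def sofm_update(weights, pattern, winner, radius, eta):
    """Move every weight vector within `radius` of the winning PE (1-D
    output array) toward the pattern; leave the rest unchanged."""
    new_w = []
    for j, w in enumerate(weights):
        if abs(j - winner) <= radius:
            new_w.append([wi + eta * (xi - wi) for wi, xi in zip(w, pattern)])
        else:
            new_w.append(list(w))
    return new_w

def decayed_eta(eta0, t, z):
    """Linearly decayed learning rate: eta(t) = eta0 * (1 - t/z)."""
    return eta0 * (1.0 - t / z)
```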
Error Measurement
• Unsupervised, so no "right" or "wrong"
• Two approaches – pick or mix:
• Define error as mean error-vector length
• Define error as max error-vector length (adding a PE when this is large could improve performance)
• Convergence metric: max_error_vector_length / η (best when epoch training is used)
Learning Vector Quantizers: Outline • Introduction • Topology • Network initialization and input • Unsupervised training calculations • Giving the network a conscience • LVQII • The LVQI implementation
Learning Vector Quantization: Introduction
• Related to SOFM
• Several versions exist, both supervised and unsupervised
• LVQI is unsupervised; LVQII is supervised (I & II do not correspond to Kohonen's notation)
• Related to perceptrons and the delta rule; however:
• Only one (winner) PE's weights are updated
• Depending on the version, updating is done for correct and/or incorrect classification
• The weight-updating method is analogous to the metric used to pick the winning PE for updating
• Network weight vectors approximate the density function of the input
LVQI Network Initialization and Input • LVQI clusters input data • More common to input raw data (preprocessed) • Usually normalize input vectors, but sometimes better not to • Initial normalization of weight vectors almost always done, but in various ways • In implementation, for p PEs in output layer, first p patterns chosen randomly to initiate weights
Weight and Input Vector Initialization (a) before, (b) after, input vector normalization
LVQ Version I - Unsupervised Training
• Present one pattern at a time, and select the winning output PE based on minimum Euclidean distance
• Update the winner's weights: w_j(t+1) = w_j(t) + η(t)[A_k − w_j(t)]
• Continue until weight changes are acceptably small or the maximum number of iterations is reached
• Ideally, the output will reflect the probability distribution of the input
• But what if we want to more accurately characterize the decision hypersurface?
• It is important to have training patterns near the decision hypersurface
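One LVQ-I presentation step, per the rule above, can be sketched as (a minimal illustration; the function name is hypothetical):

```python
import math

def lvq1_step(weights, pattern, eta):
    """One unsupervised LVQ-I step: find the closest (winning) weight vector
    by Euclidean distance and move only it toward the input pattern."""
    dists = [math.dist(w, pattern) for w in weights]
    winner = dists.index(min(dists))
    weights[winner] = [wi + eta * (xi - wi)
                       for wi, xi in zip(weights[winner], pattern)]
    return winner, weights
```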
Giving the Network a Conscience
• The optimal 1/n representation by each output PE is unlikely (without some "help")
• This is especially serious when the initial weights don't reflect the probability distribution of the input patterns
• DeSieno developed a method for adding a conscience to the network
• In the example: with no conscience, given a uniform distribution of input patterns, w_7 will win about half of the time, and the other weights about 1/12 of the time each.
Conscience Parameters
• Conscience factor f_j, with initial value 1/n (so the initial bias values b_j = γ(1/n − f_j) are all 0)
• Bias factor γ set to approximately 10
• Constant β set to about 0.0001; each presentation updates f_j ← f_j + β(y_j − f_j), where y_j = 1 for the winner and 0 otherwise
• (Set β so that the conscience factors don't reflect noise in the data)
Example of Conscience
• If there are 5 output PEs, then 1/n = 0.2 = all initial f_j values
• Biases are 0 initially, and the first winner is selected based on minimum Euclidean distance
• Conscience factors are then updated:
• Winner's f_j = 0.2 + 0.0001(1.0 − 0.2) = 0.20008
• All others' f_j = 0.2 − 0.00002 = 0.19998
• Winner's b_j = −0.0008; all others' b_j = 0.0002
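The conscience bookkeeping above can be sketched in a few lines; running it with five PEs reproduces the slide's numbers (function name is illustrative):

```python
def conscience_update(f, winner, beta=0.0001, gamma=10.0):
    """DeSieno-style conscience: update win frequencies f, then recompute
    the biases b_j = gamma * (1/n - f_j)."""
    n = len(f)
    y = [1.0 if j == winner else 0.0 for j in range(n)]
    f = [fj + beta * (yj - fj) for fj, yj in zip(f, y)]
    b = [gamma * (1.0 / n - fj) for fj in f]
    return f, b

# 5 output PEs, all f_j start at 1/n = 0.2; suppose PE 0 wins first
f, b = conscience_update([0.2] * 5, winner=0)
```

A frequently winning PE accumulates a negative bias, handicapping it in subsequent winner selection (distance minus bias), which pushes all PEs toward the optimal 1/n win rate.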
Probability Density Function Shows regions of equal area
Learning: No Conscience A = 0.03 for 16,000 iterations
Learning: With Conscience A = 0.03 for 16,000 iterations
LVQ - Version II - Supervised
• Instantiate the first p patterns A_k as the weight vectors w_j
• The relative numbers of weights assigned by class must correspond to the a priori probabilities of the classes
• Assume pattern A_k belongs to class C_r and the winning PE's weight vector belongs to class C_s; then for the winning PE:
  w_j(t+1) = w_j(t) + η(t)[A_k − w_j(t)]  if r = s (correct classification)
  w_j(t+1) = w_j(t) − η(t)[A_k − w_j(t)]  if r ≠ s (incorrect classification)
• For all other PEs, no weight changes are made
• This LVQ version reduces misclassifications
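The supervised rule above differs from LVQ-I only in the sign of the winner's update; a minimal sketch (names are illustrative):

```python
import math

def lvq2_step(weights, classes, pattern, pattern_class, eta):
    """Supervised LVQ step: move the winning weight vector toward the pattern
    if its class label matches the pattern's, away from it otherwise."""
    dists = [math.dist(w, pattern) for w in weights]
    winner = dists.index(min(dists))
    sign = 1.0 if classes[winner] == pattern_class else -1.0
    weights[winner] = [wi + sign * eta * (xi - wi)
                       for wi, xi in zip(weights[winner], pattern)]
    return winner, weights
```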
Evolving Neural Networks: Outline • Introduction and definitions • Artificial neural networks • Adaptation and computational intelligence • Advantages and disadvantages of previous approaches • Using particle swarm optimization (PSO) • An example application • Conclusions
Introduction • Neural networks are very good at some problems, such as mapping input vectors to outputs • Evolutionary algorithms are very good at other problems, such as optimization • Hybrid tools are possible that are better than either approach by itself • Review articles on evolving neural networks: Schaffer, Whitley, and Eshelman (1992); Yao (1995); and Fogel (1998) • Evolutionary algorithms usually used to evolve network weights, but sometimes used to evolve structures and/or learning algorithms
Typical Neural Network [figure: inputs feeding forward through the network to the outputs]
Evolutionary Algorithms (EAs) Applied to Neural Network Attributes
• Network connection weights
• Network topology (structure)
• Network PE transfer function
• Network learning algorithms
Early Approaches to Evolve Weights
• Bremermann (1968) suggested optimizing weights in multilayer neural networks.
• Whitley (1989) used a GA to learn weights in a feedforward network; used for relatively small problems.
• Montana and Davis (1989) used a "steady state" GA to train a 500-weight neural network.
• Schaffer (1990) evolved a neural network with better generalization performance than a human-designed one.
Evolution of Network Architecture • Most work has focused on evolving network topological structure • Less has been done on evolving processing element (PE) transfer functions • Very little has been done on evolving topological structure and PE transfer functions simultaneously
Examples of Approaches • Indirect coding schemes • Evolve parameters that specify network topology • Evolve number of PEs and/or number of hidden layers • Evolve developmental rules to construct network topology • Stork et al. (1990) evolved both network topology and PE transfer functions (Hodgkin-Huxley equation) for neuron in tail-flip circuitry of crayfish (only 7 PEs) • Koza and Rice (1991) used genetic programming to find weights and topology. They encoded a tree structure of Lisp S-expressions in the chromosome.
Examples of Approaches, Cont’d. • Optimization of EA operators used to evolve neural networks (optimize hill-climbing capabilities of GAs) • Summary: • Few quantitative comparisons with other approaches typically given (speed of computation, performance, generalization, etc.) • Comparisons should be between best available approaches (fast EAs versus fast NNs, for example)
Advantages of Previous Approaches • EAs can be used to train neural networks with non-differentiable PE transfer functions. • Not all PE transfer functions in a network need to be the same. • EAs can be used when error gradient or other error information is not available. • EAs can perform a global search in a problem space. • The fitness of a network evolved by an EA can be defined in a way appropriate for the problem. (The fitness function does not have to be continuous or differentiable.)