
Presentation next week: cerebellum and supervised learning


Presentation Transcript


  1. Presentation next week: cerebellum and supervised learning
  Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature. 1998;392(6675):494-7.

  2. Motor learning: learning algorithms
  - network and distributed representations
  - supervised learning
    - perceptrons and LMS
    - backpropagation
  - reinforcement learning
  - unsupervised learning
    - Hebbian networks

  3. Motor learning - supervised learning
  - knowledge of the desired behavior is specified: for every input x, we know the corresponding desired output y
  [figure: desired output y plotted against input x]

  4. Motor learning - supervised learning
  e.g. learning the mapping between joint configuration and end-point position
  Vision gives you information about both values (or proprioception could supply the joint angles)

  5. Motor learning - supervised learning, but with limited feedback from the periphery
  - you just get a 'good' or 'bad' evaluation
  - you have to adjust behavior to maximize the 'good' evaluation => reinforcement learning
  e.g. maze learning: a sequence of actions leads to a reward - how do we learn the appropriate sequence?

  6. Motor learning - unsupervised learning
  - no feedback from the periphery
  - rely on statistics of the inputs (or outputs) to find structure in the data
  e.g. clustering of data [figure: data clusters in the (x1, x2) plane]

  7. Motor learning - unsupervised learning
  - no feedback from the periphery
  - rely on statistics of the inputs (or outputs) to find structure in the data
  e.g. clustering of data [figure: data clusters in the (x1, x2) plane]
  Develop representations based on properties of the data

  8. Motor learning
  - supervised motor learning
    - parameterized models
    - non-parametric, 'neural network' models
  - reinforcement learning
  - unsupervised learning
    - Hebbian learning
    - principal components analysis
    - independent components analysis

  9. Supervised motor learning - learning parameterized models
  Linear regression: we know the general structure of the model, y = a*x + b, but we don't know the parameters a and b.
  We want to estimate a and b from the paired data sets {xi} and {yi}.
  [figure: scatter of (x, y) data with a fitted line]

  10. Parameterized models
  Linear regression y = a*x + b has an analytical solution (Intro stats):
  a = Σ(xi - <x>)(yi - <y>) / Σ(xi - <x>)2
  b = <y> - a<x>
  <x> is the expected value of x, i.e. the mean.
  This is from Intro stats - a single step of calculation across all the data.
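As a quick check of these formulas, here is a minimal Python sketch (not from the slides; the data and true parameters are made up):

```python
import numpy as np

# Minimal sketch: closed-form least squares for y = a*x + b,
# matching the formulas above. Data and true parameters are made up.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)   # noisy line with a = 2, b = 1

xm, ym = x.mean(), y.mean()
a = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)  # slope
b = ym - a * xm                                          # intercept
print(a, b)   # close to 2 and 1
```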

  11. Parameterized models
  Linear regression using iterative gradient descent:
  y* = a* x + b*, where a*, b* are the correct parameters and y* is the observed data.
  Assume initial parameters a and b, and define an error term:
  E = 1/2 (y - y*)2, where y is the value predicted by the current parameters and y* is the target value.
  We want to find the parameters which minimize this error - move the parameters to reduce the error:
  a = a + da, where da is the change in a that reduces the error
  b = b + db, where db is the change in b that reduces the error
  Choose da and db along the (negative) gradient of the error.

  12. Parameterized models
  y* = a* x + b*; E = 1/2 (y - y*)2
  Find the gradient of the error with respect to the parameters:
  dE/da = (y - y*) dy/da = (y - y*) x
  dE/db = (y - y*)
  Choose a = a - m (y - y*) x and b = b - m (y - y*), with 0 < m < 1 to control the speed of learning.

  13. Parameterized models
  e.g. iterative gradient descent for linear regression (see the sketch below)
  [figure: fitted line converging to the (x, y) data]
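A minimal sketch of the procedure from slides 11-12 (the data, learning rate m, and number of sweeps are assumptions, not values from the slides):

```python
import numpy as np

# Sketch: iterative gradient descent on E = 1/2 (y - y*)^2,
# updating a and b one data point at a time.
rng = np.random.default_rng(1)
xs = rng.uniform(0, 1, 200)
ys = 3.0 * xs - 0.5 + rng.normal(0, 0.05, 200)  # true a = 3, b = -0.5

a, b, m = 0.0, 0.0, 0.1          # initial parameters, learning rate (0 < m < 1)
for _ in range(100):              # repeated sweeps through the data
    for x, y_star in zip(xs, ys):
        y = a * x + b             # prediction with the current parameters
        a -= m * (y - y_star) * x # a <- a - m (y - y*) x
        b -= m * (y - y_star)     # b <- b - m (y - y*)
print(a, b)   # converges toward 3 and -0.5
```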

  14. Parameterized models
  Learn the limb parameters for a 2-DOF arm:
  x = l1*cos(q1) + l2*cos(q1+q2)
  y = l1*sin(q1) + l2*sin(q1+q2)
  [figure: two-joint planar arm with end point (x, y)]
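For concreteness, a sketch of this forward-kinematics model in Python (the link lengths l1 and l2 are assumptions):

```python
import numpy as np

# Sketch of the slide-14 model: forward kinematics of a two-joint planar arm.
def end_point(q1, q2, l1=0.3, l2=0.25):
    """End point (x, y) from joint angles q1, q2 (radians)."""
    x = l1 * np.cos(q1) + l2 * np.cos(q1 + q2)
    y = l1 * np.sin(q1) + l2 * np.sin(q1 + q2)
    return x, y

print(end_point(np.pi / 4, np.pi / 6))
```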

  15. Motor learning and representations - how are properties of the limb represented by the CNS?
  Distributed representations:
  - parameters are not explicitly fit
  - both the parameters and the model structure are identified
  Learn the parameters and the model within a distributed network.
  [figure: mapping from joint angle to end position]

  16. Distributed models - network architecture
  [figure: two inputs connected to two outputs through weights w11, w12, w21, w22]
  y1 = w11*x1 + w21*x2
  y2 = w12*x1 + w22*x2
  => y = Wx
  As shown here, this is just linear regression.

  17. Distributed network models
  A simple network: one layer of linear units, y = Wx
  [figure: two inputs fully connected to three outputs through weights W]
  From inputs x and corresponding outputs y*, find the W that best approximates the function.

  18. Distributed network models
  To fit the network parameters:
  - define the error: E = 1/2 (y - y*)2
  - take the derivative with respect to the weights: dE/dW = (y - y*) xT
  - update the weights: W = W - u (y - y*) xT
  or, weight by weight: wij = wij - u (yj - yj*) xi
  i.e. similar to the rule for linear regression.
  This is the Widrow-Hoff / adaline / LMS (least mean squares) rule.
  [figure: the same one-layer linear network, y = Wx]
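A sketch of the LMS rule in its matrix form (the target mapping W_true, learning rate, and trial count are made up for illustration):

```python
import numpy as np

# Sketch of the Widrow-Hoff/LMS rule for a single-layer linear network y = Wx.
rng = np.random.default_rng(2)
W_true = np.array([[1.0, -0.5], [0.3, 2.0]])   # assumed target mapping

W = np.zeros((2, 2))
u = 0.05                               # learning rate
for _ in range(2000):
    x = rng.normal(size=2)
    y_star = W_true @ x                # desired output for this input
    y = W @ x                          # network output
    W -= u * np.outer(y - y_star, x)   # W <- W - u (y - y*) x^T
print(W)                               # approaches W_true
```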

  19. Distributed network models - linear units, single-layer networks
  Batch mode: learn from all the data at once, W = W - u dE/dW
  Online mode: learn from one data point at a time, W = W - u dEi/dW for {xi, yi}, the ith data point
  [figure: the same one-layer linear network]
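The loop in the previous sketch is online mode; a batch-mode variant might look like this (the function name and the averaging over data points are assumptions):

```python
import numpy as np

# Sketch of a batch update for the same linear network: accumulate the
# gradient over the whole data set, then take one step.
def batch_lms_step(W, X, Y_star, u=0.05):
    """X: inputs as columns; Y_star: desired outputs as columns."""
    Y = W @ X                          # outputs for all data points at once
    grad = (Y - Y_star) @ X.T          # sum of (y - y*) x^T over the data
    return W - u * grad / X.shape[1]   # step along the average gradient
```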

  20. Distributed network models
  • linear units, single-layer networks - essentially linear regression
  • the gradient-descent learning rule leads to the LMS update rule for changing the weights iteratively

  21. Distributed network models - more complicated computations
  • classification: learn to assign data points to the correct category
  [figure: two classes of points in the (x1, x2) plane]

  22. Distributed network models - more complicated computations
  • classification: learn to assign data points to the correct category
  We want to map the inputs x to outputs y = {-1, 1}, i.e. categorize the data.
  [figure: the two classes labeled y = 1 and y = -1 in the (x1, x2) plane]

  23. Distributed network models - more complicated computations
  • classification: learn to assign data points to the correct category
  The weight vector acts to project the inputs to produce the outputs: w*x > 0 on one side of the boundary, w*x < 0 on the other.
  If we take y = sign(w*x), we can do classification.
  [figure: weight vector w normal to the decision boundary separating y = w*x > 0 from y = w*x < 0]
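A tiny sketch of this projection-then-sign classifier (the weight vector and sample points are assumptions for illustration):

```python
import numpy as np

# Sketch: classify by the sign of the projection w.x.
w = np.array([1.0, 1.0])                 # normal to the decision boundary
points = np.array([[2.0, 1.0], [-1.0, -0.5], [0.5, -2.0]])
labels = np.sign(points @ w)             # y = sign(w.x): +1 one side, -1 other
print(labels)                            # [ 1. -1. -1.]
```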

  24. Distributed network models - categorization (non-linear transformation)
  Learning in nonlinear networks - the outputs are a non-linear function of their inputs, a sigmoidal 'squashing' function:
  g(Wx) = 1/(1 + exp(-Wx))
  g works like a 'bistable' categorization unit; one can also use g(x) = sign(x) (perceptrons).
  [figure: sigmoid g(Wx) as a function of Wx; network mapping patterns x to a category in (0,1)]

  25. Distributed network models - categorization (non-linear transformation)
  Learning in nonlinear networks: y = g(Wx) = 1/(1 + exp(-Wx))
  Find the gradient of E = 1/2 (y - y*)2:
  dE/dW = (y - y*) g'(Wx) xT
  Note that g'(z) = g(z)(1 - g(z)).
  This is the basic neural network learning rule.
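A sketch of this rule for a single logistic unit (the toy task - category 1 when x1 + x2 > 0 - and the learning rate are assumptions):

```python
import numpy as np

# Sketch of the logistic learning rule, using g'(z) = g(z)(1 - g(z)).
def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
w = np.zeros(2)
u = 0.5
for _ in range(5000):
    x = rng.normal(size=2)
    y_star = 1.0 if x[0] + x[1] > 0 else 0.0   # assumed target category
    y = g(w @ x)
    w -= u * (y - y_star) * y * (1 - y) * x    # dE/dw = (y - y*) g'(w.x) x
print(w)   # grows along the [1, 1] direction that separates the categories
```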

  26. Distributed network models
  • non-linear units, single-layer networks - 'logistic', non-linear regression
  • allows learning of categorization problems

  27. Distributed network models - single-layer classification networks
  Find a network to perform the logical AND function:
  x1 x2 | y
   0  0 | 0
   0  1 | 0
   1  0 | 0
   1  1 | 1
  [figure: the four input points in the (x1, x2) plane]

  28. Distributed network models - single-layer classification networks
  Find a network to perform the logical AND function:
  x1 x2 | y
   0  0 | 0
   0  1 | 0
   1  0 | 0
   1  1 | 1
  [figure: the AND data in the (x1, x2) plane with a separating line]

  29. Distributed network models - single-layer classification networks
  Logical AND: choose W = [1 1] and shift each input by -0.6 (we need an offset to the inputs to shift the origin; this is equivalent to a bias of -1.2):
  x1 x2 | Wx   | threshold(Wx)
   0  0 | -1.2 | 0
   0  1 | -0.2 | 0
   1  0 | -0.2 | 0
   1  1 |  0.8 | 1
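Checking this AND network as reconstructed above (unit weights, inputs shifted by -0.6, threshold at zero):

```python
import numpy as np

# Sketch: verify the AND table for W = [1, 1] with each input offset by -0.6.
W = np.array([1.0, 1.0])
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    z = W @ (np.array(x) - 0.6)   # -1.2, -0.2, -0.2, 0.8
    print(x, z, int(z > 0))       # output is 1 only for (1, 1)
```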

  30. Distributed network models - single-layer classification networks
  Find a network to perform the logical XOR function:
  x1 x2 | y
   0  0 | 0
   0  1 | 1
   1  0 | 1
   1  1 | 0
  What weights will make this work? There are none - single-layer networks are computationally limited.

  31. Distributed network models - multiple-layer networks
  XOR can be solved with a multi-layered network of threshold units, e.g.:
  h1 = step(x1 + x2 - 0.5), h2 = step(x1 + x2 - 1.5), y = step(h1 - 2*h2 - 0.5)
  x1 x2 | h1 h2 | h1 - 2*h2 | y
   0  0 |  0  0 |  0        | 0
   0  1 |  1  0 |  1        | 1
   1  0 |  1  0 |  1        | 1
   1  1 |  1  1 | -1        | 0
  • more complicated computations can be performed with multiple-layer networks
  • they can handle problems which are not linearly separable
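Verifying the XOR network as reconstructed above:

```python
# Sketch: h1 is OR-like, h2 is AND-like, and the output computes
# "OR but not AND", i.e. XOR.
def step(z):
    return 1 if z > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)      # fires if either input is on
    h2 = step(x1 + x2 - 1.5)      # fires only if both inputs are on
    y = step(h1 - 2 * h2 - 0.5)   # OR but not AND
    print((x1, x2), h1, h2, y)    # y = 1 only for (0,1) and (1,0)
```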

  32. Distributed network models - learning in multiple-layer networks
  Consider a linear network: h = Wx, y = Vh
  [figure: inputs x feeding a hidden layer h through W, which feeds outputs y through V]
  NB: there's not much point to multiple layers with linear units, since it can all be re-expressed as a single linear network: y = VWx = W'x, i.e. just redefine your weight matrix.

  33. Distributed network models - learning in multiple-layer networks
  Linear network: h = Wx, y = Vh
  Form the error: E = 1/2 (y - y*)2
  To update the weights V, from h to y:
  dE/dV = (y - y*) dy/dV = (y - y*) hT
  i.e. the same rule as for the single-layer network.

  34. Distributed network models - learning in multiple-layer networks
  Linear network: h = Wx, y = Vh
  To update the weights W, from x to h, use the chain rule:
  dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = (VT (y - y*)) xT
  - this is the gradient for the 'hidden' layer.

  35. Distributed network models - learning in multiple-layer networks
  Non-linear network: h = g(Wx), y = g(Vh)
  Updating the weights V is the same as before:
  dE/dV = ((y - y*) g'(Vh)) hT, with g' applied element-wise.

  36. Distributed network models - learning in multiple-layer networks
  To update the weights W, use the chain rule:
  dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = ((VT ((y - y*) g'(Vh))) g'(Wx)) xT
  Essentially, we're propagating the error backwards through the network, changing the weights according to how much they affect the output => backpropagation learning.

  37. Distributed network models - backpropagation learning in multiple-layer networks
  Linear network: h = Wx, y = Vh
  • Find out how much of the error in the output is due to V
    - the responsibility will be due to the activity of h: dE/dV = (y - y*) hT
    - change V according to this responsibility
  • Find out how much of the error is due to W
    - units in h which have a large output weight V will be more responsible for the error (i.e. weight the error by V): VT (y - y*)
    - values in h will be due to activities in x (i.e. weight the h responsibility by x): dE/dW = (VT (y - y*)) xT
    - change W according to this 'accumulated' responsibility
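A minimal backpropagation sketch for the two-layer sigmoidal network of slides 33-36, trained here on XOR (the network size, learning rate, initialization, and trial count are assumptions, not values from the slides):

```python
import numpy as np

# Sketch: online backpropagation in a two-layer sigmoidal network.
def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([0.0, 1.0, 1.0, 0.0])              # XOR targets

W = rng.normal(0, 1, (3, 3))                    # input (+bias) -> 3 hidden units
V = rng.normal(0, 1, 4)                         # hidden (+bias) -> 1 output
u = 0.5
for _ in range(20000):
    for x, y_star in zip(X, Y):
        xb = np.append(x, 1.0)                  # input with a bias term
        h = g(W @ xb)                           # hidden activities: h = g(Wx)
        hb = np.append(h, 1.0)
        y = g(V @ hb)                           # output: y = g(Vh)
        d_out = (y - y_star) * y * (1 - y)      # output delta: (y - y*) g'(Vh)
        d_hid = d_out * V[:3] * h * (1 - h)     # error propagated back through V
        V -= u * d_out * hb                     # dE/dV
        W -= u * np.outer(d_hid, xb)            # dE/dW
for x in X:
    h = g(W @ np.append(x, 1.0))
    print(x, round(float(g(V @ np.append(h, 1.0))), 2))  # typically -> 0, 1, 1, 0
```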

  38. Learning in multi-layer neural networks - backpropagation learning
  • allows simple learning of arbitrarily complex input/output mappings: with enough 'neurons', almost any mapping is possible
  • results in 'distributed' representations: knowledge of the mapping is distributed across neuronal populations, not individual cells
  • changes in restricted regions of the input state space will result in restricted changes of the output

  39. Learning in multi-layer neural networks - backpropagation learning
  • much slower than parameterized models: the network needs to estimate the parameters and the model structure from scratch
  • convergence can be slow, especially if the error surface is shallow
  • speed can be increased by altering the learning rate (annealing), by using conjugate gradient descent, or with 'momentum':
    W = W - u dE/dW + n*(change in W last time)
  [figure: error as a function of the parameters]
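A sketch of that momentum update (the function name and the rates u and n are assumptions):

```python
import numpy as np

# Sketch: blend the current gradient step with the previous weight change.
def momentum_step(W, dE_dW, prev_dW, u=0.1, n=0.9):
    """W <- W - u*dE/dW + n*(previous change in W)."""
    dW = -u * dE_dW + n * prev_dW
    return W + dW, dW     # keep dW around for the next iteration
```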

  40. Learning in multi-layer neural networks - backpropagation learning
  • local minima: the error surface might have small 'basins' which can trap the network
  • start the network from different initial conditions to find the global minimum
  [figure: error surface over the parameters, with a local minimum and the global minimum]

  41. Learning in multi-layer neural networks - backpropagation learning
  • choosing the learning rate
    - small values of u can make the network take a long time to converge
    - large values can lead to instability
  [figure: learning curves with the learning rate too high vs. ok]

  42. Motor learning: learning algorithms
  • gradient descent - change the model parameters to reduce the error in prediction
    - parameterized models
    - non-parametric models
      - single-layer, linear and non-linear networks - LMS/adaline learning rules
      - multi-layer, non-linear networks - backpropagation learning
  • in all of the above, we knew the correct answer and tried to match it - i.e. 'supervised learning'
  • but what if our knowledge of the outcome is limited? => reinforcement learning

  43. Reinforcement learning - supervised learning, but with limited feedback
  The environment sends back a global evaluation signal saying good or bad (1 or -1) depending on system performance
  e.g. move the limb and bump into things (pain as a reinforcer)
  [figure: a network whose outputs go to the environment, which returns a {good, bad} evaluation]

  44. Reinforcement learning - supervised learning, but with limited feedback
  Using a global reinforcement signal to train a network - the basic idea:
  - start with an initial network
  - produce an output based on a given input, but add noise to the network to explore
  - evaluate the output
  - find those units with large activity
  - change the weights so that they'll be large the next time the input is given

  45. Reinforcement learning - supervised learning, but with limited feedback
  Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
  Consider probabilistic outputs y = {-1, 1} with p(y = 1) = 1/(1 + exp(-Wx))
  The output produced on any given trial is therefore stochastic, with expected value determined by the sigmoid: <y> = 2 p(y = 1) - 1 = tanh(Wx/2)
  We then use gradient descent to get the update rule:
  dW = u+ (y - <y>) x, if r is a reward
  dW = u- (-y - <y>) x, if r is a penalty

  46. Reinforcement learning - supervised learning, but with limited feedback
  Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
  dW = u+ (y - <y>) x, if r is a reward
  dW = u- (-y - <y>) x, if r is a penalty
  1) if the expected value is close to what the network actually did, then don't change things (nothing new)
  2) if the expected value is different from what it did, and it was rewarded, then change W so that it will do it again
  3) if the expected value is different from what it did, and it was penalized, then change W so that it won't do it again
  => trial and error learning
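A minimal AR-P sketch under assumed conditions (the task - reward when the output matches the sign of x1 + x2 - and the rates u+ and u- are made up):

```python
import numpy as np

# Sketch of the A_R-P idea: a stochastic unit explores, and a global
# reward/penalty signal shapes the weights.
rng = np.random.default_rng(5)
w = np.zeros(2)
u_plus, u_minus = 0.1, 0.01

for _ in range(5000):
    x = rng.normal(size=2)
    p = 1.0 / (1.0 + np.exp(-w @ x))        # p(y = +1)
    y = 1.0 if rng.random() < p else -1.0   # stochastic output (exploration)
    y_mean = 2.0 * p - 1.0                  # expected output <y>
    r = 1 if y == np.sign(x[0] + x[1]) else -1   # environment: good or bad
    if r == 1:
        w += u_plus * (y - y_mean) * x      # rewarded: make this more likely
    else:
        w += u_minus * (-y - y_mean) * x    # penalized: push the other way
print(w)   # grows along [1, 1], the direction that earns reward
```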

  47. Reinforcement learning - supervised learning, but with limited feedback
  Using a global reinforcement signal to train a network:
  - much slower than gradient descent on a full error signal
  - more biologically plausible (how would the error be backpropagated in supervised learning?)
  - more directly ethologically plausible - based on direct reward/penalty feedback, i.e. information about survival

  48. Motor learning: learning algorithms
  • gradient descent - change the model parameters to reduce the error in prediction
    - parameterized models
    - non-parametric models
      - single-layer, linear and non-linear networks - LMS/adaline learning rules
      - multi-layer, non-linear networks - backpropagation learning
  • reinforcement learning
    - AR-P networks
    - Q learning, TD learning, dynamic programming
  • unsupervised learning
