
Nens220: Lecture 5



Presentation Transcript


  1. Nens220: Lecture 5 Neural Networks part 2

  2. Topics • Dayan and Abbott chapters 9,10 • Function approximation: Radial Basis Functions • Density estimation: kernel methods • Clustering: Expectation Maximization, k-means, and Kohonen networks • Reinforcement Learning • Independent Components Analysis

  3. Cortical map plasticity • Brain area dedicated to a function increases with the importance of that function. Why? • Need models of allocation of resources. • Information theory tells us the optimal answer, but does not tell us the algorithms needed to find it. • Radial basis functions and kernel methods are models for this process…

  4. Nonlinear networks • Outputs are a nonlinear function of the inputs: y = w·f(x) • Learning the functions f(x) is “feature extraction”.

  5. Radial basis functions • The functions f(x) are called “basis functions”. If they are radially symmetric then they are “radial basis functions” Example: Gaussians… Equivalent to filtering and subsampling.

  6. The curse of dimensionality • If x has D dimensions and you use N basis functions per dimension, then you need N^D basis functions. If D>20, this is almost always impossible, no matter how small N is. • Solution: There isn’t that much data anyway. You never need more basis functions than data points.
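The N^D growth rate on this slide is easy to check directly; a quick sketch (the particular N and D values are illustrative):

```python
# The number of grid-placed basis functions grows as N**D:
# N centers per dimension, D input dimensions.
for N in (2, 5, 10):
    for D in (2, 10, 20):
        count = N ** D
        print(f"N={N:2d}, D={D:2d}: {count:.3e} basis functions")
```

Even the smallest useful grid, N=2, already needs 2^20 ≈ one million basis functions at D=20.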

  7. RBF solution • Place the centers m_i at the locations of the data points x_i. • Make the widths a_i reasonable (usually smaller than the distance between data points). • Make the weights w_i equal to the desired output values y_i.
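A minimal numpy sketch of this recipe, assuming Gaussian basis functions of the form exp(−(x − m_i)²/(2a_i²)); the data values and widths are illustrative:

```python
import numpy as np

def rbf_predict(x, centers, widths, weights):
    """y(x) = sum_i w_i * exp(-(x - m_i)**2 / (2 * a_i**2))."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * widths ** 2))
    return phi @ weights

# One-shot memorization: one Gaussian per data point.
xi = np.array([0.0, 1.0, 2.0, 3.0])   # data locations -> centers m_i
yi = np.sin(xi)                       # desired outputs  -> weights w_i
ai = np.full(4, 0.2)                  # widths smaller than point spacing
y_fit = rbf_predict(xi, xi, ai, yi)   # nearly reproduces yi at the data
```

Because the widths are narrower than the spacing between points, the bumps barely overlap, so setting w_i = y_i reproduces the data almost exactly with no linear solve; with wider bumps the weights would need fitting, and between data points the prediction can be poor (the weak generalization noted on slide 8).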

  8. Why is this good? • You can represent any desired function y=g(x) to arbitrary accuracy if you have enough rbfs. • Learning is easy (sometimes). One-shot memorization. Minimal “un-learning”. • Generalization is often poor.

  9. Kernel Density Estimation • Similar to RBFs • For each data point x_i, add a new “kernel” function centered at that point: p(x) ≈ (1/N) Σ_i K(x − x_i) • The result is a convolution of the kernel with the data points (the same operation as spike-rate reconstruction…)
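A short numpy sketch of the idea, assuming a Gaussian kernel; the bandwidth h=0.3 and the simulated data are illustrative:

```python
import numpy as np

def kde(x_eval, data, h):
    """Gaussian KDE: p(x) ~ (1/N) * sum_i K((x - x_i) / h)."""
    z = (x_eval[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi))).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=500)   # observed data points
grid = np.linspace(-4.0, 4.0, 81)
p = kde(grid, data, h=0.3)              # smoothed density estimate
```

The estimate places one bump per data point, so the density is automatically highest wherever the data are densest.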

  10. Clustering algorithms • Set of categories r = A, B, C, etc… • Set of expected data for each category: p(x|r) • Any given observation x has a probability of being in category r: p(r|x) • Goal is to figure out what the categories are by observing values of x. • “Mixture model”: p(x) = Σ_r p(r) p(x|r)

  11. K-means algorithm • Start with k centers m_r; each r is one category. • New points x_i are assigned the category of the closest center. • Move each center to the mean of all its assigned points. • The points must then be re-categorized every time a mean changes.
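A minimal 1-D sketch of these steps (initialization, cluster locations, and iteration count are all illustrative choices):

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # assign each point the category of its closest center
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # move each center to the mean of its assigned points
        for r in range(k):
            if np.any(labels == r):
                centers[r] = x[labels == r].mean()
    return centers, labels

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 100), rng.normal(3, 0.5, 100)])
centers, labels = kmeans(x, k=2)   # centers settle near the two clusters
```

Note that every iteration revisits all stored points, which is exactly the memory cost the Kohonen network on the next slide avoids.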

  12. Kohonen network • How to do K-means without having to remember all the data points. • “Winner-take-all” network: every time a data point x arrives, only the closest mean moves: m_win ← m_win + ε(x − m_win)
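The online update can be sketched in a few lines; the learning rate ε=0.1 and the simulated input stream are illustrative:

```python
import numpy as np

def winner_take_all_update(centers, x, eta=0.1):
    """Only the closest center moves: m_win <- m_win + eta * (x - m_win)."""
    win = np.argmin(np.abs(centers - x))
    centers[win] += eta * (x - centers[win])

rng = np.random.default_rng(2)
centers = np.array([-1.0, 1.0])   # initial means
for _ in range(2000):             # stream points; none are stored
    x = rng.choice([-3.0, 3.0]) + rng.normal(0.0, 0.3)
    winner_take_all_update(centers, x)
```

Each point is processed once and discarded, yet the means drift to the cluster centers, which is the memory saving relative to batch K-means.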

  13. Expectation Maximization • Need a set of parameterized functions to represent p(x|r) [eg: Gaussians] • When a data point x_i arrives, calculate p(x_i|r) for each category r using the current Gaussians. • Use Bayes’ rule to find p(r|x_i) = p(x_i|r)p(r)/p(x_i) for each r. • Adjust the parameters [mean, variance] of each p(x|r) at every step, weighting the data by the new empirical p(r|x).
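A compact sketch of these steps for a two-Gaussian mixture in 1-D (the initialization scheme and simulated data are illustrative assumptions):

```python
import numpy as np

def em_gmm(x, n_iter=50):
    """EM for a two-Gaussian mixture model of p(x)."""
    mu = np.array([x.min(), x.max()])   # crude initial means
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])           # prior p(r)
    for _ in range(n_iter):
        # E-step: responsibilities p(r|x_i) = p(x_i|r) p(r) / p(x_i)
        lik = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
              / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-fit each Gaussian from responsibility-weighted data
        nr = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nr
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nr
        pi = nr / len(x)
    return mu, var, pi

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
mu, var, pi = em_gmm(x)   # mu converges near the true centers [-2, 2]
```

Unlike K-means, each point contributes fractionally to every category, so the category boundaries are soft.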

  14. EM Convergence • If the data is well-represented by a sum of Gaussian bumps, then this will probably converge to the correct centers and widths of the bumps. • The problem is knowing the shape and number of the kernels in advance.

  15. Map plasticity • RBFs, KDE, K-means, and EM all place more bumps wherever there is more data. • NB: RBFs are a supervised algorithm; the others are unsupervised. • May be models for cortical map formation and “cortical magnification” • If each bump is a cell, then the cell responses will cluster near regions of high data density. • This is also the correct solution for information theory.

  16. Reinforcement Learning • Neither supervised nor unsupervised • There is a reward signal R(x) that is a function of the current state x. • Goal is to maximize the future reward • Try to predict current reward: R(x) = w^T x • Try to predict future reward: V(t) = w^T x(t)

  17. Predicting Reward • Can learn R(x) using the LMS rule • To learn V(t), you need to know what you will do in the future (x(t+1), x(t+2), etc…) • So you need a “Policy” x(t+1)=f(x(t)) • Ask: what is expected total future reward for this particular policy, if I start in state x(t)? • If I have a choice of state x(t), which is best? • Then: how do I optimize the policy?

  18. Temporal Difference Learning • Estimate the value of each state V(x), assuming you follow your policy • Use the LMS rule, but the target is R(t) + V(t+1), giving the TD error δ(t) = R(t) + V(t+1) − V(t) • Now, use a “greedy” policy and always choose the next state with the highest value.
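The TD update can be sketched on a toy chain of states under a fixed policy; the chain length, reward placement, and learning rate are illustrative assumptions:

```python
import numpy as np

# A chain of states 0 -> 1 -> 2 -> 3 -> 4; the policy always steps right,
# and reward 1.0 arrives on entering the final (terminal) state.
n_states = 5
V = np.zeros(n_states)   # value estimate per state
eps = 0.1                # learning rate
for episode in range(500):
    for s in range(n_states - 1):
        r = 1.0 if s + 1 == n_states - 1 else 0.0
        v_next = V[s + 1] if s + 1 < n_states - 1 else 0.0
        # LMS update toward the TD target r + V(next):
        V[s] += eps * (r + v_next - V[s])
# The value "backs up" from the goal: V[0..3] all approach 1.
```

Early in training only the state next to the goal has an accurate value; over episodes that value propagates backward through the chain, which is the "backing up" described on the next slide.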

  19. Backing up • TD “backs up” the value from the goal toward earlier states. • Allows you to predict the future effects of current actions. • May be a model for the role of dopamine in reinforcement learning.

  20. Independent Components Analysis • A linear procedure, like PCA: y = Wx • But not interested in maximizing variance; the goal is to make the outputs independent (if possible): p(y) = Π_j p(y_j) • For Gaussian data, ICA reduces to PCA; it is only interesting for non-Gaussian distributions.
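A toy sketch of why non-Gaussianity matters: after whitening (the PCA step), an unknown rotation remains, and one simple ICA strategy, assumed here rather than taken from the slides, is to search for the rotation that maximizes non-Gaussianity of the outputs (for sub-Gaussian sources such as uniforms, the most negative kurtosis). The mixing matrix and source distributions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.uniform(-1, 1, size=(2, 5000))   # independent sub-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # unknown mixing matrix
x = A @ s                                # observed mixtures

# Whitening (the PCA step): decorrelate and rescale to unit variance.
vals, vecs = np.linalg.eigh(np.cov(x))
z = np.diag(vals ** -0.5) @ vecs.T @ x

# PCA stops here; ICA must still find the rotation making the outputs
# statistically independent, which decorrelation alone cannot determine.
def rotation(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def kurtosis(y):
    return np.mean(y ** 4, axis=1) / np.mean(y ** 2, axis=1) ** 2 - 3

angles = np.linspace(0, np.pi / 2, 400)
best = min(angles, key=lambda a: kurtosis(rotation(a) @ z).sum())
y = rotation(best) @ z   # recovered sources (up to sign, order, and scale)
```

For Gaussian sources every rotation of the whitened data looks the same, which is exactly why ICA collapses to PCA in the Gaussian case.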

  21. ICA [Figure: unmixing diagram, inputs x mapped to outputs y = Wx]
