
Using Fast Weights to Improve Persistent Contrastive Divergence



Presentation Transcript


  1. Using Fast Weights to Improve Persistent Contrastive Divergence • Presented by Tijmen Tieleman • Joint work with Geoff Hinton • University of Toronto

  2. What this is about • An algorithm for unsupervised learning: modeling data density. Not classification or regression. • Using Markov Random Fields • Also known as: • Energy-based models • Log-linear models • Products of experts, or product models • Such as: Restricted Boltzmann Machines
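For reference, the standard Restricted Boltzmann Machine definitions behind the slides (not spelled out in the talk): binary visible units v and hidden units h, weights W, biases b and c, with

```latex
E(v, h) = -b^\top v - c^\top h - v^\top W h,
\qquad
p(v, h) = \frac{e^{-E(v, h)}}{Z},
\qquad
Z = \sum_{v, h} e^{-E(v, h)}
```

The absence of visible–visible and hidden–hidden connections is what makes the block Gibbs updates used later cheap.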

  3. MRF Learning • Increase probability where the training data is • Decrease probability where the model assigns much probability • Problem: finding those places (sampling from the model) is intractable • CD (Contrastive Divergence) or PL (Pseudo-Likelihood): use surrogate samples, a few steps away from the training data. • It works: best application paper, etc. • But it’s not perfect
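The first two bullets correspond to the two terms of the log-likelihood gradient of an energy-based model, a standard identity the slide paraphrases:

```latex
\frac{\partial \log p(v)}{\partial \theta}
= -\,\mathbb{E}_{\mathrm{data}}\!\left[\frac{\partial E}{\partial \theta}\right]
+ \mathbb{E}_{\mathrm{model}}\!\left[\frac{\partial E}{\partial \theta}\right]
```

The positive (data) term is easy to compute; the negative (model) term needs samples from the model, which is the intractable part that CD and PCD approximate.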

  4. The problem with CD (figure slide)

  5. The problem with CD (figure slide)

  6. PCD (part 1) • Gradient descent is iterative. • We can reuse data from the previous gradient estimate. • Use a Markov Chain for getting samples. • Plan: keep the Markov Chain close to equilibrium. • Do a few transitions after each weight update. • Thus the Chain catches up after the model changes. • Do not reset the Markov Chain after a weight update (hence ‘Persistent’ CD). • Thus we always have samples from very close to the model.
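For concreteness, here is a minimal sketch of one such transition for a binary RBM, in Python/NumPy (all names are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_update(v, W, b_vis, b_hid, rng):
    """One full Gibbs transition v -> h -> v' for a binary RBM.

    v: (n_chains, n_visible) array of current chain states.
    W: (n_visible, n_hidden) weights; b_vis, b_hid: bias vectors.
    """
    # Sample hidden units given the visible units.
    p_h = sigmoid(v @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(v.dtype)
    # Sample visible units given the hidden units.
    p_v = sigmoid(h @ W.T + b_vis)
    return (rng.random(p_v.shape) < p_v).astype(v.dtype)
```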

  7. PCD (part 2) • If we did not change the model at all, we would have exact samples (after burn-in); it would be a regular Markov Chain. • The model changes slightly, so the Markov Chain is always a little behind.

  8. PCD Pseudocode • Initialize 100 Markov Chains, negData, arbitrarily • Initialize the model, theta, with small random weights • Repeat: • Get positive gradient on a batch of training data • Get negative gradient on negData. • Do learning on theta using the difference of gradients. • Update negData using a Gibbs update on the current model.
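A minimal sketch of this loop (Python/NumPy, reusing the hypothetical gibbs_update above; the learning rate and model sizes are illustrative):

```python
def pcd_step(batch, neg_data, W, b_vis, b_hid, lr, rng):
    """One PCD update; W, b_vis, b_hid are modified in place."""
    # Positive gradient: statistics of the training batch.
    p_h_data = sigmoid(batch @ W + b_hid)
    pos_grad = batch.T @ p_h_data / len(batch)
    # Negative gradient: statistics of the persistent chains.
    p_h_neg = sigmoid(neg_data @ W + b_hid)
    neg_grad = neg_data.T @ p_h_neg / len(neg_data)
    # Learning on theta: difference of the two gradients.
    W += lr * (pos_grad - neg_grad)
    b_vis += lr * (batch.mean(0) - neg_data.mean(0))
    b_hid += lr * (p_h_data.mean(0) - p_h_neg.mean(0))
    # One Gibbs transition; the chains are NOT reset to the data.
    return gibbs_update(neg_data, W, b_vis, b_hid, rng)

# Setup as on the slide: 100 persistent chains, small random weights.
rng = np.random.default_rng(0)
n_vis, n_hid = 784, 500          # illustrative sizes
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
neg_data = (rng.random((100, n_vis)) < 0.5).astype(float)
```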

  9. Let’s take a step back • Gradient descent is iterative. • We can reuse data from the previous estimate. • Use a Markov Chain for getting samples. • Plan: keep the Markov Chain close to equilibrium. • Do a few transitions after each weight update. • Thus the Chain catches up after the model changes. • Do not reset the Markov Chain after a weight update (hence ‘Persistent’ CD). • Thus we always have samples from very close to the model.

  10. Really? • Gradient descent is iterative. • We can reuse data from the previous estimate. • Use a Markov Chain for getting samples. • Plan: keep the Markov Chain close to equilibrium. • Do a few transitions after each weight update. • Thus the Chain catches up after the model changes. • Do not reset the Markov Chain after a weight update (hence ‘Persistent’ CD). • Thus we always have samples from very close to the model.

  11. The mixing rate • Markov Chain (i.e. PCD with learning rate 0)

  12. The mixing rate • Learning rate: 0.00003

  13. The mixing rate • Learning rate: 0.0001

  14. The mixing rate • Learning rate: 0.0003

  15. The mixing rate • Learning rate: 1

  16. The mixing rate • Learning rate: 3

  17. The mixing rate • Learning rate: 10

  18. Learning accelerates mixing • Negative phase: “wherever negData is, reduce probability (do unlearning)” • Suddenly, those negData Markov Chains are in a low probability area • Therefore, they quickly move away, to a higher probability area • Repeat… repeat… • Similar but deterministic: “Herding Dynamical Weights to Learn” by Max Welling (poster today)

  19. New idea • Learning makes the chain mix • Faster learning makes the chain mix faster… • …but going too fast will mess up the learning • Keep separate ‘fast’ weights that learn rapidly • Keep them close to the regular weights • The chains mix using the fast weights • The chains provide good samples for learning • On the regular weights, the learning rate is small.

  20. FPCD Pseudocode • Initialization: • regular theta = small random weights. • fast theta = all zeros. • Initialize negData arbitrarily. • Repeat: • Get the positive gradient on a batch of training data. • Get the negative gradient on negData. • Do learning on both regular theta and fast theta using the difference of gradients (but with different learning rates). • Update negData using a Gibbs update, with parameters regular theta + fast theta. • Update: fast theta ← fast theta * 0.95
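A minimal sketch of the FPCD loop (Python/NumPy, reusing the hypothetical sigmoid and gibbs_update from above). The 0.95 decay is from the slide; the learning rates are illustrative, and for brevity only the weight matrix gets a fast counterpart. Which parameters to use for the negative statistics is left open by the slide; this sketch uses the combined weights, matching the chains' own dynamics:

```python
def fpcd_step(batch, neg_data, W, W_fast, b_vis, b_hid,
              lr_slow, lr_fast, rng):
    """One FPCD update; all parameter arrays are modified in place."""
    # Positive gradient on the training batch (regular weights).
    p_h_data = sigmoid(batch @ W + b_hid)
    pos_grad = batch.T @ p_h_data / len(batch)
    # Negative gradient on negData (the chains run on W + W_fast).
    p_h_neg = sigmoid(neg_data @ (W + W_fast) + b_hid)
    neg_grad = neg_data.T @ p_h_neg / len(neg_data)
    grad = pos_grad - neg_grad
    # Learn on both parameter sets, with different learning rates.
    W += lr_slow * grad          # e.g. lr_slow = 0.001
    W_fast += lr_fast * grad     # e.g. lr_fast = 0.01 (much larger)
    b_vis += lr_slow * (batch.mean(0) - neg_data.mean(0))
    b_hid += lr_slow * (p_h_data.mean(0) - p_h_neg.mean(0))
    # Gibbs update with parameters: regular theta + fast theta.
    neg_data = gibbs_update(neg_data, W + W_fast, b_vis, b_hid, rng)
    # Decay: fast theta <- fast theta * 0.95, pulling it toward zero.
    W_fast *= 0.95
    return neg_data
```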

  21. Additional notes • With a fast learning rate of zero (so fast theta stays all zeros), FPCD == PCD. • There should be different learning rates for regular theta and fast theta. • Regular theta must learn slowly, to prevent getting bad models. • Fast theta must learn rapidly, to enable fast mixing.

  22. Results • FPCD helps a lot… • …when not many updates can be done (e.g. with high-dimensional data). • For small (e.g. toy) models with lots of training time: same as PCD. • More results: poster.

  23. Conclusion • FPCD = Fast PCD = Persistent Contrastive Divergence using fast weights • Use fast learning to cause mixing, but not on the regular model parameters.

  24. The End

  25. The mixing rate • Markov Chain (i.e. PCD with learning rate 0)
