
Generalizing Backpropagation to Include Sparse Coding



Presentation Transcript


  1. Generalizing Backpropagation to Include Sparse Coding
  David M. Bradley (dbradley@cs.cmu.edu) and Drew Bagnell, Robotics Institute, Carnegie Mellon University

  2. Outline
  • Discuss the value of modular and deep gradient-based systems, especially in robotics
  • Introduce a new and useful family of modules
  • Properties of the new family:
    • Online training with non-Gaussian priors, e.g. priors that encourage sparsity or multi-task weight sharing
    • Modules that internally solve continuous optimization problems, capturing interesting nonlinear effects, such as inhibition, that couple outputs (e.g. sparse approximation)
  • Modules can be jointly optimized by a generalization of backpropagation

  3. Deep Modular Learning Systems
  • Efficiently represent complex functions
  • Particularly efficient for closely related tasks
  • Recently shown to be powerful learning machines
  • Greedy layer-wise training improves initialization
  • Greedy module-wise training is useful for designing complex systems
    • Design and initialize modules independently
    • Jointly optimize the final system with backpropagation
  • Gradient methods allow the incorporation of diverse data sources and losses
  G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets", Neural Computation 2006
  Y. Bengio, P. Lamblin, H. Larochelle, "Greedy layer-wise training of deep networks", NIPS 2007
  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition", 1998

  4. Mobile Robot Perception
  • Sensors: ladar, RGB camera, NIR camera
  • Lots of unlabeled data
  • Traditional supervised-learning labels are hard to define
  • The target task is defined by weakly-labeled, structured-output data

  5. Perception Problem: Scene Labeling
  [Figure: scene labeling produces a cost for each 2-D cell; the cost map is passed to the motion planner.]

  6. Goal System Data Flow
  [Data-flow diagram: inputs (webcam data, camera imagery, LabelMe labels, IMU data, labeled 3-D points, laser, observed wheel heights, human-driven example paths) feed modules (object classification, point classifier, ground plane estimator, max margin planner), which are trained against costs (gradient/lighting-variance cost, object classification cost, proprioception prediction cost, classification cost) and produce motion plans.]

  7. New Modules
  • Modules that are important in this system require two new abilities:
    • Induce new priors on weights
    • Allow modules to solve internal optimization problems

  8. Standard Backpropagation Assumes an L2 Prior
  • Gradient descent with convex loss functions:
    • Small steps with early stopping imply L2 regularization
    • Each step minimizes a regret bound by solving an optimization that trades the linearized loss against a squared-L2 proximity term (sketched below)
    • This in turn bounds the true regret
  M. Zinkevich, "Online Convex Programming and Generalized Infinitesimal Gradient Ascent", '03
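  A sketch, in my own notation, of the optimization referred to above (the slide's equations did not survive extraction): each online gradient-descent step can be read as solving

      w_{t+1} = argmin_w  ⟨∇ℓ_t(w_t), w⟩ + (1/(2η)) ||w - w_t||₂²,   i.e.   w_{t+1} = w_t - η ∇ℓ_t(w_t),

  so every step penalizes movement with the squared L2 norm, which is the sense in which small steps with early stopping imply L2 regularization. For convex losses, Zinkevich ('03) shows this strategy bounds the regret Σ_t ℓ_t(w_t) - min_w Σ_t ℓ_t(w) by O(√T).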

  9. Alternate Priors
  • KL-divergence prior
    • Useful if many features are irrelevant
    • Approximately solved with exponentiated gradient descent (see the sketch below)
  • Multi-task priors (encourage sharing between related tasks)
  Argyriou and Evgeniou, "Multi-task Feature Learning", NIPS 07
  Bradley and Bagnell 2008
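  A minimal sketch of the exponentiated-gradient update referred to above, assuming the weights are kept positive and renormalized onto the probability simplex (the normalization is my assumption for concreteness; the slide does not specify it):

      import numpy as np

      def eg_step(w, grad, eta=0.1):
          # Multiplicative (exponentiated-gradient) update: mirror descent
          # under a KL divergence instead of the squared L2 distance.
          w_new = w * np.exp(-eta * grad)
          return w_new / w_new.sum()  # renormalize onto the simplex

  Compared with an additive gradient step, the multiplicative update drives the weights of irrelevant features toward zero quickly, which is why the KL prior helps when many features are irrelevant.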

  10. L2 Backpropagation
  [Diagram: an input flows through modules M1, M2, and M3 (intermediate signals a, b, c) into loss functions that are summed; gradients flow back through the chain.]

  11. With KL-Prior Modules
  [Diagram: the same module network (input, modules M1, M2, M3, signals a, b, c, summed loss functions), now with KL-prior modules in place of the L2 ones.]

  12. General Mirror Descent
  [Diagram: the same module network (input, modules M1, M2, M3, signals a, b, c, summed loss functions), trained with general mirror-descent updates.]
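  A sketch of the general mirror-descent update, in my own notation: pick a convex potential ψ and update

      w_{t+1} = argmin_w  ⟨∇ℓ_t(w_t), w⟩ + (1/η) D_ψ(w, w_t),

  where D_ψ is the Bregman divergence induced by ψ. Choosing ψ(w) = (1/2)||w||₂² recovers ordinary gradient descent (the implicit L2 prior of slide 10), while ψ(w) = Σᵢ wᵢ log wᵢ recovers exponentiated gradient (the KL prior of slide 11); only the weight-update rule changes, and the rest of the backpropagation machinery is untouched.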

  13. New Modules
  • Modules that are important in this system require two new abilities:
    • Induce new priors on weights
    • Allow modules to solve internal optimization problems, capturing interesting nonlinear effects that involve coupled outputs, such as inhibition (e.g. sparse approximation)

  14. Inhibition
  [Figure: an input and a basis.]

  15. Inhibition
  [Figure: the input, the basis, and the input's projection onto the basis.]

  16. Inhibition
  [Figure: the input, the basis, and the result of the KL-regularized optimization.]

  17. Sparse Approximation
  • Assumes the input is a sparse combination of elements, plus observation noise
    • Many possible elements, but only a few are present in any particular example
    • True for many real-world signals
  • Many applications: compression (JPEG), sensing (MRI), machine learning
  • Produces effects observed in biology: V1 receptive fields, inhibition
  Tropp et al., "Algorithms for Simultaneous Sparse Approximation", 2005
  Raina et al., "Self-Taught Learning: Transfer Learning from Unlabeled Data", ICML '07
  Olshausen and Field, "Sparse Coding of Natural Images Produces Localized, Oriented, Bandpass Receptive Fields", Nature '95
  Doi and Lewicki, "Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units", NIPS 04

  18. Sparse Approximation
  • Semantic meaning is sparse
  • The visual representation is sparse (JPEG)

  19. MNIST Digits Dataset
  • 60,000 28x28-pixel handwritten digits
  • 10,000 reserved for a validation set
  • Separate 10,000-digit test set

  20. Sparse Approximation
  [Diagram: the input is encoded as coefficients w over a basis B; the reconstruction r = Bw is compared to the input under a cross-entropy reconstruction error, and the error gradient flows back to the coefficients.]
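  A minimal sketch of the forward pass and error gradient depicted above, assuming pixel intensities x in (0, 1) and a logistic sigmoid on the reconstruction so that the cross-entropy is well defined (the sigmoid is my assumption, not stated on the slide):

      import numpy as np

      def reconstruction_loss(x, B, w):
          # Reconstruction r = sigmoid(B w); cross-entropy against the input x.
          r = 1.0 / (1.0 + np.exp(-(B @ w)))
          loss = -np.sum(x * np.log(r) + (1.0 - x) * np.log(1.0 - r))
          grad_w = B.T @ (r - x)  # error gradient with respect to the coefficients
          return loss, grad_w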

  21. Sparse Approximation
  [Diagram: the input is encoded as KL-regularized coefficients on a KL-regularized basis, which produce the module's output.]

  22. Sparse Coding
  [Diagram: for each training example i, coefficients w(i) reconstruct the input as r = B w(i) under a cross-entropy reconstruction error; the objective is minimized jointly over the coefficient matrix W and the basis B across all training examples.]

  23. Optimization Modules
  • L1-regularized sparse approximation: regularization term plus reconstruction loss, convex with the basis fixed (a solver sketch follows below)
  • L1-regularized sparse coding: the same objective minimized jointly over coefficients and basis, not convex
  Lee et al., "Efficient Sparse Coding Algorithms", NIPS '06
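  A minimal sketch of one standard solver for the convex sparse-approximation problem above, using a squared reconstruction loss for concreteness (the slides do not pin down the loss or the solver; the authors cite Lee et al. '06 for efficient algorithms):

      import numpy as np

      def ista(x, B, lam=0.1, eta=0.01, iters=200):
          # Iterative shrinkage-thresholding for
          #   min_w 0.5 * ||x - B w||^2 + lam * ||w||_1
          w = np.zeros(B.shape[1])
          for _ in range(iters):
              grad = B.T @ (B @ w - x)   # gradient of the reconstruction term
              w = w - eta * grad          # gradient step
              w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)  # soft-threshold
          return w

  Sparse coding then alternates between solving this problem for the coefficients of each example and updating the basis B, which is why the joint problem is not convex.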

  24. KL-regularized Sparse Approximation
  • Objective: a reconstruction loss plus an unnormalized-KL regularization term
  • Since this objective is continuous and differentiable, at the minimum the gradient with respect to the coefficients is zero
  • Differentiating that optimality condition with respect to B and solving for the k-th row gives the gradient of the optimal coefficients with respect to the basis (sketched below)
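  A sketch of the implicit-differentiation step, in my own notation (the slide's equations did not survive extraction): write the KL-regularized objective as L(w, B) and let w* = argmin_w L(w, B). The first-order condition is

      ∇_w L(w*, B) = 0.

  Differentiating this identity with respect to B and applying the implicit function theorem gives

      ∂w*/∂B = -(∇²_{ww} L)⁻¹ ∇²_{Bw} L,

  so the gradient of any downstream loss with respect to B (and with respect to the module's input) follows from the chain rule, even though w* is defined by an optimization rather than a closed-form map.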

  25. Preliminary Results
  [Plot comparing L1 sparse coding, KL sparse coding, and KL sparse coding with backpropagation: the KL prior improves classification performance, and backpropagation improves it further.]

  26. Main Points
  • Modular, gradient-based systems are an important design tool for large-scale learning systems
  • New tools are needed to include a family of modules with important properties
  • Presented a generalized backpropagation technique that:
    • Allows priors that encourage, e.g., sparsity (the KL prior), using mirror descent to modify the weights
    • Uses implicit differentiation to compute gradients through modules that internally solve optimization problems (e.g. sparse approximation)
  • Demonstrated work in progress on building deep sparse coders using generalized backpropagation

  27. Acknowledgements
  • The authors would like to thank the UPI team, especially Cris Dima, David Silver, and Carl Wellington
  • DARPA and the Army Research Office supported this work through the UPI program and the NDSEG fellowship
