
Reinforcement Learning: Dealing with Complexity and Safety in RL



  1. Reinforcement Learning: Dealing with Complexity and Safety in RL
Subramanian Ramamoorthy, School of Informatics, 27 March 2012

  2. (Why) Isn’t RL Deployed More Widely?
Very interesting discussion at http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning, maintained by Satinder Singh.
• Negative views/myths: RL is hard because of dimensionality, partial observability, the need for function approximation, and so on.
• Positive view: there is no getting away from the fact that RL is the proper statement of the “agent’s problem”. So the question is really one of how to solve it!

  3. A Provocative Claim
“The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world.”
T. Lane and W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.

  4. What is the Issue? (Lane et al.)
• In our efforts to formalize the notion of “learning control”, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant “borrowing” from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).
• These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.
• In their generality, these models have discarded the very qualities — metric, topology, scale, etc. — that have proven to be so valuable for many, many science and engineering disciplines.

  5. What is Missing in POMDPs?
• POMDPs do not describe natural metrics in the environment
  • When driving, we know both global and local distances
• POMDPs do not natively recognize differences between scales
  • Uncertainty in control is entirely different from uncertainty in routing
• POMDPs conflate properties of the environment with properties of the agent
  • Roads and buildings behave differently from cars and pedestrians: we need to generalize over them differently
• POMDPs are defined in a global coordinate frame, often discrete!
  • We may need many different representations in real problems

  6. Specific Insight #1
The metric of a space imposes a “speed limit” on the agent: the agent cannot transition to arbitrary points in the environment in a single step. Consequences:
• The agent can neglect large parts of the state space when planning.
• More importantly, this result implies that control experience can be generalized across regions of the state space: if the agent learns a good policy for one bounded region of the state space, and it can find a second bounded region that is homeomorphic to the first, it can reuse the learned policy in the second region (a minimal sketch of this reuse follows below).
[Figure: metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).]
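To make the policy-reuse idea concrete, here is a minimal sketch, not taken from Lane and Smart: a policy learned on one bounded region A of a gridworld is reused in a second region B related to A by a known homeomorphism phi: A -> B (here just a translation). The helper names are illustrative.

```python
# Illustrative sketch (not from Lane & Smart): reuse a policy learned on region A
# in a region B related to A by a known homeomorphism phi: A -> B.

def make_transferred_policy(base_policy, phi_inverse):
    """Return a policy on region B by mapping each query state back into region A."""
    def transferred_policy(state_in_B):
        state_in_A = phi_inverse(state_in_B)   # pull the state back through phi
        return base_policy(state_in_A)         # act as the learned policy would in A
    return transferred_policy

# Region B is region A shifted by (10, 0) in grid coordinates.
phi_inverse = lambda s: (s[0] - 10, s[1])

# Stand-in for a policy learned on region A.
base_policy = lambda s: "up" if s[1] < 5 else "right"

policy_B = make_transferred_policy(base_policy, phi_inverse)
print(policy_B((12, 3)))   # behaves like base_policy((2, 3)) -> "up"
```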

  7. Insight #2: Manifold Representations
• Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.
• Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric. The collection of charts and their overlap regions is called a manifold.
• We can embed partial value functions (and other models) on these charts and combine them, using the theory of manifolds, to provide a global value function (or model); a small blending sketch follows below.
[Figure annotation: 13 equivalence classes of charts; if rotational symmetry is considered, only 4 classes.]
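The slide does not spell out how the chart-local pieces are combined; one common way is to blend them with weights that are positive on each chart and shared over the overlaps. The sketch below uses made-up chart names and weights purely for illustration.

```python
# Illustrative sketch: a global value estimate assembled from chart-local value
# functions, blended with nonnegative weights wherever charts overlap.

class Chart:
    def __init__(self, contains, local_value, weight):
        self.contains = contains        # state -> bool: does this chart cover the state?
        self.local_value = local_value  # state -> float, computed in the chart's local frame
        self.weight = weight            # state -> nonnegative blending weight

def global_value(charts, state):
    """Blend the local value estimates of every chart covering `state`."""
    active = [c for c in charts if c.contains(state)]
    if not active:
        raise ValueError("state is not covered by any chart")
    total = sum(c.weight(state) for c in active)
    return sum(c.weight(state) * c.local_value(state) for c in active) / total

# Two 1-D charts covering [0, 6) and [4, 10); they overlap on [4, 6).
chart_a = Chart(lambda x: 0 <= x < 6,  lambda x: -x,       lambda x: max(0.0, 6 - x))
chart_b = Chart(lambda x: 4 <= x < 10, lambda x: -x - 0.5, lambda x: max(0.0, x - 4))

print(global_value([chart_a, chart_b], 5.0))   # a blend of the two local estimates: -5.25
```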

  8. What Makes Some POMDP Problems Easy to Approximate?
David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007

  9. Understanding Why PBVI Works
• Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.
• What are the belief-space properties that allow some POMDP problems to be approximated efficiently, explaining the point-based algorithms’ success?

  10. Hardness of POMDPs
• Intractability is due to the curse of dimensionality: the belief space is a simplex whose dimension grows with the number of states |S|, so covering it requires exponentially many points (see the note below).
• In recent years, however, good progress has been made by sampling the belief space and approximating solutions.
• Hsu et al. report approximate solutions for POMDPs with hundreds of states computed in seconds.
• Tag problem: a robot must search for and tag a moving target whose position is unobserved except when the robot bumps into it; the belief space is ~870-dimensional.
• Solved using PBVI methods in under a minute.
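As standard background for the first bullet (not taken verbatim from the slide), a belief is a probability distribution over the states, so the belief space is the probability simplex:

```latex
\[
  b \in \Delta(S) = \Bigl\{\, b \in \mathbb{R}^{|S|} \;:\; b(s) \ge 0,\ \textstyle\sum_{s \in S} b(s) = 1 \,\Bigr\},
  \qquad \dim \Delta(S) = |S| - 1 .
\]
% Covering \Delta(S) with balls of radius \epsilon needs on the order of
% (c/\epsilon)^{|S|-1} balls, i.e. a number of points exponential in |S|.
```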

  11. Initial Observation
• Many point-based algorithms only explore a subset of the belief space B: the reachable space R(b0).
• The reachable space R(b0) contains all points reachable from a given initial belief point b0 under arbitrary sequences of actions and observations (a belief-update sketch follows below).
• Is the reason for PBVI’s success that the reachable space is small?
• Not always: Tag has an approximately 860-dimensional reachable space.
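For concreteness, here is a small sketch of how R(b0) is generated: the standard Bayes belief update is applied to b0 for every action/observation sequence. The two-state POMDP below is made up purely for illustration.

```python
# Illustrative sketch: enumerate (approximately distinct) beliefs reachable from b0
# by applying the standard POMDP belief update for all action/observation sequences.

import numpy as np

# T[a][s, s'] = P(s' | s, a);  O[a][s', o] = P(o | s', a)  -- a made-up 2-state POMDP.
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.6, 0.4], [0.4, 0.6]])}

def belief_update(b, a, o):
    """b'(s') is proportional to O[a][s', o] * sum_s T[a][s, s'] * b(s)."""
    unnormalized = O[a][:, o] * (b @ T[a])
    total = unnormalized.sum()
    return unnormalized / total if total > 0 else None

# Breadth-first enumeration of reachable beliefs up to depth 2.
b0 = np.array([0.5, 0.5])
frontier, reachable = [b0], [b0]
for _ in range(2):
    next_frontier = []
    for b in frontier:
        for a in T:
            for o in range(2):
                nb = belief_update(b, a, o)
                if nb is not None and not any(np.allclose(nb, r) for r in reachable):
                    reachable.append(nb)
                    next_frontier.append(nb)
    frontier = next_frontier
print(len(reachable), "distinct beliefs reachable within two steps")
```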

  12. Covering Number
• The covering number of a space is the minimum number of balls of a given size needed to cover the space fully (a greedy estimate is sketched below).
• Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0).
• The covering number also reveals that the belief space for Tag behaves more like a union of 29-dimensional spaces than like an 870-dimensional space, because the robot’s own position is fully observed.
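A simple way to get a feel for covering numbers on sampled beliefs is the greedy construction below (an illustrative sketch, not the paper's procedure): repeatedly take an uncovered point as a new ball centre until every sampled point is within the given radius of some centre.

```python
# Illustrative sketch: a greedy upper bound on the covering number of a sampled point set.

import numpy as np

def greedy_cover_size(points, delta):
    """Upper-bound the number of delta-balls needed to cover `points`."""
    points = [np.asarray(p, dtype=float) for p in points]
    centres = []
    uncovered = list(points)
    while uncovered:
        centre = uncovered[0]                      # pick any still-uncovered point
        centres.append(centre)
        uncovered = [p for p in uncovered
                     if np.linalg.norm(p - centre) > delta]
    return len(centres)

# Beliefs over 2 states lie on a 1-D segment even though they are written as 2-D
# vectors, so relatively few balls suffice.
rng = np.random.default_rng(0)
p = rng.uniform(size=200)
beliefs = np.stack([p, 1 - p], axis=1)
print(greedy_cover_size(beliefs, delta=0.05))
```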

  13. Further Questions
• Is it possible to compute an approximate solution efficiently under the weaker condition of a small covering number for the optimal reachable space R*(b0), which contains only the points in B reachable from b0 under an optimal policy?
• Unfortunately, this problem is NP-hard. It remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors.
• However, given a suitable set of points that “cover” R*(b0) well, a good approximate solution can be computed in polynomial time.
• Using sampling to approximate the optimal reachable space, and not just the reachable space, may therefore be a promising approach in practice.

  14. Lyapunov Design for Safe Reinforcement Learning
Theodore J. Perkins and Andrew G. Barto, JMLR 2002

  15. Dynamical Systems
• Dynamical systems can be described by states and the evolution of states over time.
• The evolution of states is constrained by the dynamics of the system.
• In other words, a (discrete-time) dynamical system is a mapping from the current state to the next state.
• If that mapping is a contraction, the state will eventually converge to a fixed point (a small numerical illustration follows below).
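A small numerical illustration of the last bullet: iterating a contraction mapping drives the state to its unique fixed point (the Banach fixed-point theorem); the particular map below is chosen only for the example.

```python
# Iterating a contraction mapping converges to its unique fixed point.

def step(x):
    # f(x) = 0.5*x + 1 is a contraction with factor 0.5; its fixed point is x* = 2.
    return 0.5 * x + 1.0

x = 10.0
for t in range(30):
    x = step(x)
print(x)   # approximately 2.0, the fixed point satisfying f(x*) = x*
```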

  16. Reinforcement Learning – Traditional Methods
• The target or goal state may not be a natural attractor.
• Hypothesis: learning is easier if the target is a fixed point (e.g., TD-Gammon).
• People have tried to embed domain knowledge in various ways:
  • Known good actions are specified
  • Sub-goals are explicitly specified

  17. Key Idea
• Use Lyapunov functions to constrain action selection.
• This forces the RL agent to move towards the goal.
• E.g., in a gridworld, a Lyapunov-constrained agent reaches the goal in a finite number of steps (a sketch follows below).
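The following is an illustrative sketch of the idea in the gridworld setting, not Perkins and Barto's construction: take the Manhattan distance to the goal as the Lyapunov function and allow only actions that strictly decrease it; any policy over the remaining actions then reaches the goal in a bounded number of steps.

```python
# Illustrative sketch: Lyapunov-constrained action selection in a gridworld.

GOAL = (0, 0)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def lyapunov(state):
    # Manhattan distance to the goal serves as a Lyapunov function here.
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

def descending_actions(state):
    """Actions the constrained agent is allowed to choose from."""
    allowed = []
    for name, (dx, dy) in ACTIONS.items():
        nxt = (state[0] + dx, state[1] + dy)
        if lyapunov(nxt) < lyapunov(state):
            allowed.append(name)
    return allowed

state, steps = (3, 4), 0
while state != GOAL:
    name = descending_actions(state)[0]           # any allowed choice works
    dx, dy = ACTIONS[name]
    state = (state[0] + dx, state[1] + dy)
    steps += 1
print(steps)   # equals the initial Lyapunov value, here 7
```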

  18. Problem Setup
• Deterministic dynamical system
• Evolution according to an MDP (a standard formulation is sketched below)
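The equations on this slide did not survive the transcript. A standard statement of the setup, consistent with the surrounding discussion but not necessarily in Perkins and Barto's exact notation, is:

```latex
% Hedged reconstruction; the slide's own equations are missing from the transcript.
% A deterministic, discrete-time controlled system:
\[
  s_{t+1} = f(s_t, a_t), \qquad s_t \in S, \; a_t \in A(s_t),
\]
% which, together with a reward r(s, a) and the usual (discounted) return,
\[
  \max_{\pi} \; \sum_{t=0}^{\infty} \gamma^{\,t}\, r\bigl(s_t, \pi(s_t)\bigr), \qquad 0 \le \gamma \le 1,
\]
% defines the MDP that the agent faces.
```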

  19. Lyapunov Functions
• Generalized energy functions (a standard definition is sketched below)
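The definition shown on this slide is also missing from the transcript. One standard discrete-time statement of a Lyapunov ("generalized energy") function L for a goal set G, of the kind used to constrain action selection, is:

```latex
% Hedged reconstruction of a standard discrete-time Lyapunov condition.
\[
  L(s) > 0 \;\text{ for } s \notin G, \qquad L(s) = 0 \;\text{ for } s \in G,
\]
\[
  L\bigl(f(s, a)\bigr) \le L(s) - \delta \quad \text{for some fixed } \delta > 0,\ \text{all } s \notin G,\ \text{and every permitted action } a,
\]
% so any policy over the permitted actions reaches G within L(s_0)/\delta steps.
```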

  20. Pendulum Problem
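The figures and equations for this slide are not in the transcript. For context only, a standard torque-controlled pendulum model (constants and sign conventions may differ from those in the paper), together with the mechanical energy that swing-up controllers typically pump up, is:

```latex
% For context only; not necessarily the exact model or constants used in the paper.
\[
  m \ell^{2}\, \ddot{\theta} = -\mu\, \dot{\theta} - m g \ell \sin\theta + u,
  \qquad
  E = \tfrac{1}{2}\, m \ell^{2} \dot{\theta}^{2} - m g \ell \cos\theta,
\]
% where \theta is measured from the downward vertical, u is the applied torque,
% and \mu is a friction coefficient.
```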

  21. Results 1
• AEA and AAll had shorter trials than AConst.
• AEA outperformed AAll, especially at fine resolutions of discretization.
• AEA trial times seemed independent of the binning.
• AConst alone never worked.
Note: the theorem guarantees that AEA monotonically increases energy.

  22. Results 2
[Figure legend: 1: AEA, G2; 2: AAll, G2; 3: AConst, G2; 4: AAll + sat LQR, G1]

  23. Stochastic Case

  24. Results – Stochastic Case

  25. Some Open Questions
• How can you improve performance using less sophisticated ‘primitive’ actions?
• Perkins and Barto use deep intuition to design local control laws, e.g., to avoid undesired gravity-control equilibria. How do we deal with this when the dynamics are less well understood?
• The stochastic cases have rather weak guarantees. How can they be improved?
