

  1. Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012

  2. Objective of this Lecture • This is a more informational lecture. • Reinforcement Learning: • Planning and Learning. • Relational Reinforcement Learning. • Using heuristics to accelerate RL. • Dimensions of RL - Conclusions. • Today's lecture: • Chapters 9 and 10 of Sutton & Barto. • The Relational RL article, MLJ, 2001. • Bianchi's doctoral thesis.

  3. Planning and Learning Chapter 9 of Sutton and Barto

  4. Objectives • Use of environment models. • Integration of planning and learning methods.

  5. Models • Model: anything the agent can use to predict how the environment will respond to its actions. • Distribution model: a description of all possibilities and their probabilities, e.g. the transition probabilities and expected rewards for every state-action pair. • Sample model: produces sample experiences, e.g. a simulation model. • Both types of models can be used to produce simulated experience. • Often sample models are much easier to come by.
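
To make the distinction concrete, here is a minimal Python sketch (not from the slides) contrasting the two kinds of model on an invented two-state problem; all names and numbers are illustrative assumptions.

import random

# Toy two-state MDP with a single action "go" (states, probabilities and rewards invented).
# Distribution model: lists every possible outcome with its probability.
distribution_model = {
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],   # (prob, next_state, reward)
    ("B", "go"): [(1.0, "A", 0.0)],
}

def sample_model(state, action):
    """Sample model: returns one sampled outcome, like a simulator would."""
    outcomes = distribution_model[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights)[0]
    return next_state, reward

print(sample_model("A", "go"))   # e.g. ('B', 1.0)

The distribution model supports exact full backups, while the sample model only supports sampled backups, which is why sample models are often easier to obtain: a simulator is enough.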

  6. Planning • Planning: any computational process that uses a model to create or improve a policy. • Planning in AI: • state-space planning • plan-space planning (e.g., partial-order planner)

  7. Planning in RL • We take the following (unusual) view: • all state-space planning methods involve computing value functions, either explicitly or implicitly. • they all apply backups to simulated experience.

  8. Planning Cont. • Classical DP methods are state-space planning methods. • Heuristic search methods are state-space planning methods.

  9. Q-Learning Planning • A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
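
The algorithm is only named above; the following is a minimal sketch of random-sample one-step tabular Q-planning, assuming a sample_model(s, a) helper and finite lists of states and actions (all names and constants here are assumptions of the sketch).

import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning: repeatedly pick (s, a) at
    random, ask the sample model for (s', r), and apply a one-step
    Q-learning backup to that simulated experience."""
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = random.choice(states)
        a = random.choice(actions)
        s_next, r = sample_model(s, a)                      # simulated experience
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])           # one-step tabular backup
    return Q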

  10. Learning, Planning, and Acting • Two uses of real experience: • model learning: to improve the model • direct RL: to directly improve the value function and policy • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

  11. Direct vs. Indirect RL • Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions. • Direct methods are simpler and are not affected by bad models. • But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel.

  12. The Dyna Architecture (Sutton 1990)

  13. The Dyna-Q Algorithm (algorithm box, with its three parts annotated: direct RL, model learning, planning)
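
The algorithm box itself is not reproduced in the transcript; below is a hedged Python sketch of tabular Dyna-Q combining the three components, assuming an environment with reset() and step(action) -> (next_state, reward, done) and illustrative constants.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning=5, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s'), learned from real experience
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)            # (a) acting: real experience
            # (b) direct RL: one-step Q-learning backup on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)              # (c) model learning
            # (d) planning: n simulated one-step backups from the learned model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q

Setting n_planning to 0 recovers plain one-step Q-learning; larger values spend more simulated backups per real interaction, which is why the planning agents in the maze experiment below need far fewer real episodes.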

  14. Dyna-Q on a Simple Maze • Reward is 0 on every step until the goal is reached, where it is 1.

  15. Dyna-Q Snapshots: Midway in 2nd Episode

  16. Prioritized Sweeping • Which states or state-action pairs should be generated during planning? • Work backwards from states whose values have just changed: • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change • When a new backup occurs, insert predecessors according to their priorities • Always perform backups from first in queue • Moore and Atkeson 1993; Peng and Williams, 1993
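
A hedged sketch of that planning loop, assuming a learned deterministic model mapping (s, a) to (r, s') and a precomputed predecessors table mapping a state to the (state, action) pairs that lead into it; both names, and the constants, are assumptions of this sketch.

import heapq
from collections import defaultdict

def prioritized_sweeping(model, predecessors, actions, n_sweeps,
                         alpha=0.5, gamma=0.95, theta=1e-4):
    """Planning backups ordered by urgency: pairs whose value would change the
    most are backed up first, and their predecessors are re-queued."""
    Q = defaultdict(float)
    pqueue = []   # entries: (-priority, state, action); heapq pops the largest priority first

    def priority(s, a):
        r, s2 = model[(s, a)]
        return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

    # seed the queue with every modelled pair whose backup would change its value a lot
    for (s, a) in model:
        p = priority(s, a)
        if p > theta:
            heapq.heappush(pqueue, (-p, s, a))

    for _ in range(n_sweeps):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)          # always back up the first (most urgent) pair
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        # the change may make backups of the predecessors of s worthwhile too
        for (ps, pa) in predecessors.get(s, ()):
            p = priority(ps, pa)
            if p > theta:
                heapq.heappush(pqueue, (-p, ps, pa))
    return Q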

  17. Prioritized Sweeping

  18. Prioritized Sweeping vs. Dyna-Q Both use N=5 backups per environmental interaction

  19. Summary • Emphasized close relationship between planning and learning • Important distinction between distribution models and sample models • Looked at some ways to integrate planning and learning • synergy among planning, acting, model learning

  20. Summary • Distribution of backups: focus of the computation • trajectory sampling: backup along trajectories • prioritized sweeping • heuristic search • Size of backups: full vs. sample; deep vs. shallow

  21. Relational Reinforcement Learning Based on a lecture by Tayfun Gürel

  22. Relational Representations • In most applications the state space is too large. • Generalization over states is essential. • Many states are similar in relevant respects. • Representations for RL have to be enriched to support generalization. • RRL was proposed for this purpose (Džeroski, De Raedt, Blockeel, 1998).

  23. Blocks World • An action: move(a,b) • Preconditions: clear(a) ∈ S and clear(b) ∈ S • Example state: s1 = {clear(b), clear(a), on(b,c), on(c,floor), on(a,floor)}
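
As an illustration of such a relational state, here is a small Python sketch (not from the RRL paper) encoding s1 as a set of ground facts and move(a,b) with its clear(...) preconditions; the tuple-of-facts encoding is an assumption of this example.

# A state is a set of ground facts (predicate, arguments...), mirroring s1 above.
s1 = {("clear", "a"), ("clear", "b"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}

def move(state, x, y):
    """Apply move(x, y). Preconditions, as on the slide: clear(x) and clear(y)
    must hold in the state; the floor is treated as always clear here."""
    if ("clear", x) not in state:
        return None
    if y != "floor" and ("clear", y) not in state:
        return None
    below = next(f[2] for f in state if f[0] == "on" and f[1] == x)  # what x rests on
    successor = set(state)
    successor.remove(("on", x, below))
    successor.add(("on", x, y))
    if below != "floor":
        successor.add(("clear", below))       # the block x was sitting on becomes clear
    if y != "floor":
        successor.discard(("clear", y))       # y is no longer clear
    return successor

print(move(s1, "a", "b"))   # a now sits on b; b is no longer clear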

  24. Relational Representations • With relational representations for states: • Abstraction from details, e.g. the Prolog clause q_value(0.72) :- goal_unstack, numberofblocks(A), action_move(B,C), height(D,E), E=2, on(C,D), !. • Flexible to goal changes: retraining from the beginning is not necessary. • Transfer of experience to more complex domains.

  25. Relational Reinforcement Learning • How does it work? • It is an integration of RL with ILP: • Do forever: • use Q-learning to generate sample Q-values for sample state-action pairs; • generalize them using ILP (in this case TILDE).
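
A schematic sketch of this loop, with run_episode and induce_tree as hypothetical placeholders for the Q-learning episode generator and the TILDE call (neither name comes from the paper).

def relational_rl(n_iterations, run_episode, induce_tree):
    """RRL outer loop as described on the slide: generate (state, action, Q)
    examples with Q-learning, then generalize them with a relational tree
    learner (TILDE in the paper; induce_tree is a stand-in here)."""
    examples = []          # accumulated (relational_state, action, q_value) triples
    q_tree = None          # current generalized Q-function (a logical regression tree)
    for _ in range(n_iterations):
        # 1. run a Q-learning episode, using the current tree to estimate Q-values
        examples.extend(run_episode(q_tree))
        # 2. induce a new logical regression tree from all examples gathered so far
        q_tree = induce_tree(examples)
    return q_tree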

  26. TILDE (Top-Down Induction of Logical Decision Trees) • A generalization of Q-values is represented by a logical decision tree. • Logical decision trees: nodes are first-order logic atoms (Prolog queries as tests), e.g. on(A, c): is there any block on c? • Training data is a relational database or a Prolog knowledge base.
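
To make "Prolog queries as tests" concrete, here is a tiny illustrative sketch (not TILDE itself) in which an internal node tests whether the query on(A, c) succeeds against the facts of a state; the holds helper and the string encodings are assumptions of this example.

def holds(state, pred, *args):
    """True if some ground fact in the state matches the query; None plays the
    role of an unbound logical variable, e.g. on(A, c) -> holds(state, "on", None, "c")."""
    query = (pred,) + args
    return any(len(fact) == len(query) and
               all(q is None or q == v for q, v in zip(query, fact))
               for fact in state)

# One internal node of a logical decision tree: the test is the query on(A, c),
# i.e. "is there any block on c?", and the branches are its yes/no outcomes.
def classify(state):
    return "something is on c" if holds(state, "on", None, "c") else "c is clear"

s1 = {("clear", "a"), ("clear", "b"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}
print(classify(s1))   # -> "something is on c", because on(b, c) holds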

  27. Logical Decision Tree vs. Decision Tree • Figure: a decision tree and a logical decision tree, each deciding whether the blocks are stacked.

  28. TILDE Algorithm • Declarative bias, e.g. on(+,-). • Background knowledge: a Prolog program (an example fragment was shown on the slide).

  29. Q-RRL Algorithm • Figure: examples generated by Q-RRL learning.

  30. Figures: the logical regression tree generated by TILDE-RT and the equivalent Prolog program.

  31. EXPERIMENTS • Tested for three different goals: • 1. one-stack • 2. on(a,b) • 3. unstack • Tested for the following as well: • Fixed number of blocks • Number of blocks changed after learning • Number of blocks changed while learning • P-RRL vs. Q-RRL

  32. Results: Fixed number of blocks • Accuracy: percentage of correctly classified (s, a) pairs (optimal vs. non-optimal). • The accuracy of random policies is shown as a baseline.

  33. Results: Fixed number of blocks

  34. Results: Evaluating learned policies on varying number of blocks

  35. Conclusion • RRL has satisfying initial results but needs more research. • RRL is more successful when the number of blocks is increased (it generalizes better to more complex domains). • Theoretical research proving why it works is still missing.

  36. Using Heuristics to Accelerate Reinforcement Learning - pdf
