

  1. Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012

  2. Objective of this Lecture • This is a more informational lecture. • Reinforcement Learning: • Planning and Learning. • Relational Reinforcement Learning. • Using heuristics to accelerate RL. • Dimensions of RL - Conclusions. • Today's lecture: • Chapters 9 and 10 of Sutton & Barto. • The Relational RL article, MLJ, 2001. • Bianchi's doctoral thesis.

  3. Planning and Learning Chapter 9 of Sutton and Barto

  4. Objectives • Use of environment models. • Integration of planning and learning methods.

  5. Models • Model: anything the agent can use to predict how the environment will respond to its actions. • Distribution model: a description of all possibilities and their probabilities, e.g. the transition probabilities and expected rewards for every state-action pair. • Sample model: produces sample experiences, e.g. a simulation model. • Both types of models can be used to produce simulated experience. • Often sample models are much easier to come by.
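
To make the distinction concrete, here is a minimal Python sketch (not from the slides) contrasting the two kinds of model on an invented two-state problem; all names and numbers are illustrative assumptions.

import random

# Toy two-state MDP with a single action "go" (states, probabilities and rewards invented).
# Distribution model: lists every possible outcome with its probability.
distribution_model = {
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],   # (prob, next_state, reward)
    ("B", "go"): [(1.0, "A", 0.0)],
}

def sample_model(state, action):
    """Sample model: returns one sampled outcome, like a simulator would."""
    outcomes = distribution_model[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights)[0]
    return next_state, reward

print(sample_model("A", "go"))   # e.g. ('B', 1.0)

The distribution model supports exact full backups, while the sample model only supports sampled backups, which is why sample models are often easier to obtain: a simulator is enough.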

  6. Planning • Planning: any computational process that uses a model to create or improve a policy. • Planning in AI: • state-space planning • plan-space planning (e.g., partial-order planner)

  7. Planning in RL • We take the following (unusual) view: • all state-space planning methods involve computing value functions, either explicitly or implicitly. • they all apply backups to simulated experience.

  8. Planning Cont. • Classical DP methods are state-space planning methods. • Heuristic search methods are state-space planning methods.

  9. Q-Learning Planning • A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
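
The algorithm is only named above; the following is a minimal sketch of random-sample one-step tabular Q-planning, assuming a sample_model(s, a) helper and finite lists of states and actions (all names and constants here are assumptions of the sketch).

import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning: repeatedly pick (s, a) at
    random, ask the sample model for (s', r), and apply a one-step
    Q-learning backup to that simulated experience."""
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = random.choice(states)
        a = random.choice(actions)
        s_next, r = sample_model(s, a)                      # simulated experience
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])           # one-step tabular backup
    return Q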

  10. Learning, Planning, and Acting • Two uses of real experience: • model learning: to improve the model • direct RL: to directly improve the value function and policy • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

  11. Direct vs. Indirect RL • Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions. • Direct methods are simpler and are not affected by bad models. • But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel.

  12. The Dyna Architecture (Sutton 1990)

  13. The Dyna-Q Algorithm (algorithm box, with its three parts annotated: direct RL, model learning, planning)
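
The algorithm box itself is not reproduced in the transcript; below is a hedged Python sketch of tabular Dyna-Q combining the three components, assuming an environment with reset() and step(action) -> (next_state, reward, done) and illustrative constants.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning=5, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s'), learned from real experience
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)            # (a) acting: real experience
            # (b) direct RL: one-step Q-learning backup on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)              # (c) model learning
            # (d) planning: n simulated one-step backups from the learned model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q

Setting n_planning to 0 recovers plain one-step Q-learning; larger values spend more simulated backups per real interaction, which is why the planning agents in the maze experiment below need far fewer real episodes.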

  14. Dyna-Q on a Simple Maze • Reward is 0 on every step until the goal is reached, where it is 1.

  15. Dyna-Q Snapshots: Midway in 2nd Episode

  16. Prioritized Sweeping • Which states or state-action pairs should be generated during planning? • Work backwards from states whose values have just changed: • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change • When a new backup occurs, insert predecessors according to their priorities • Always perform backups from first in queue • Moore and Atkeson 1993; Peng and Williams, 1993
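
A hedged sketch of that planning loop, assuming a learned deterministic model mapping (s, a) to (r, s') and a precomputed predecessors table mapping a state to the (state, action) pairs that lead into it; both names, and the constants, are assumptions of this sketch.

import heapq
from collections import defaultdict

def prioritized_sweeping(model, predecessors, actions, n_sweeps,
                         alpha=0.5, gamma=0.95, theta=1e-4):
    """Planning backups ordered by urgency: pairs whose value would change the
    most are backed up first, and their predecessors are re-queued."""
    Q = defaultdict(float)
    pqueue = []   # entries: (-priority, state, action); heapq pops the largest priority first

    def priority(s, a):
        r, s2 = model[(s, a)]
        return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

    # seed the queue with every modelled pair whose backup would change its value a lot
    for (s, a) in model:
        p = priority(s, a)
        if p > theta:
            heapq.heappush(pqueue, (-p, s, a))

    for _ in range(n_sweeps):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)          # always back up the first (most urgent) pair
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        # the change may make backups of the predecessors of s worthwhile too
        for (ps, pa) in predecessors.get(s, ()):
            p = priority(ps, pa)
            if p > theta:
                heapq.heappush(pqueue, (-p, ps, pa))
    return Q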

  17. Prioritized Sweeping

  18. Prioritized Sweeping vs. Dyna-Q Both use N=5 backups per environmental interaction

  19. Summary • Emphasized close relationship between planning and learning • Important distinction between distribution models and sample models • Looked at some ways to integrate planning and learning • synergy among planning, acting, model learning

  20. Summary • Distribution of backups: focus of the computation • trajectory sampling: backup along trajectories • prioritized sweeping • heuristic search • Size of backups: full vs. sample; deep vs. shallow

  21. Relational Reinforcement Learning Based on a lecture by Tayfun Gürel

  22. Relational Representations • In most applications the state space is too large. • Generalization over states is essential. • Many states are similar in relevant respects. • Representations for RL have to be enriched to support generalization. • RRL was proposed for this purpose (Džeroski, De Raedt, Blockeel, 1998).

  23. Blocks World • An action: move(a,b) • Preconditions: clear(a) ∈ S and clear(b) ∈ S • Example state: s1 = {clear(b), clear(a), on(b,c), on(c,floor), on(a,floor)}
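
As an illustration of such a relational state, here is a small Python sketch (not from the RRL paper) encoding s1 as a set of ground facts and move(a,b) with its clear(...) preconditions; the tuple-of-facts encoding is an assumption of this example.

# A state is a set of ground facts (predicate, arguments...), mirroring s1 above.
s1 = {("clear", "a"), ("clear", "b"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}

def move(state, x, y):
    """Apply move(x, y). Preconditions, as on the slide: clear(x) and clear(y)
    must hold in the state; the floor is treated as always clear here."""
    if ("clear", x) not in state:
        return None
    if y != "floor" and ("clear", y) not in state:
        return None
    below = next(f[2] for f in state if f[0] == "on" and f[1] == x)  # what x rests on
    successor = set(state)
    successor.remove(("on", x, below))
    successor.add(("on", x, y))
    if below != "floor":
        successor.add(("clear", below))       # the block x was sitting on becomes clear
    if y != "floor":
        successor.discard(("clear", y))       # y is no longer clear
    return successor

print(move(s1, "a", "b"))   # a now sits on b; b is no longer clear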

  24. Relational Representations • With relational representations for states: • Abstraction from details, e.g. the Prolog clause q_value(0.72) :- goal_unstack, numberofblocks(A), action_move(B,C), height(D,E), E=2, on(C,D), !. • Flexible to goal changes: retraining from the beginning is not necessary. • Transfer of experience to more complex domains.

  25. Relational Reinforcement Learning • How does it work? • It is an integration of RL with ILP: • Do forever: • use Q-learning to generate sample Q-values for sample state-action pairs; • generalize them using ILP (in this case TILDE).
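
A schematic sketch of this loop, with run_episode and induce_tree as hypothetical placeholders for the Q-learning episode generator and the TILDE call (neither name comes from the paper).

def relational_rl(n_iterations, run_episode, induce_tree):
    """RRL outer loop as described on the slide: generate (state, action, Q)
    examples with Q-learning, then generalize them with a relational tree
    learner (TILDE in the paper; induce_tree is a stand-in here)."""
    examples = []          # accumulated (relational_state, action, q_value) triples
    q_tree = None          # current generalized Q-function (a logical regression tree)
    for _ in range(n_iterations):
        # 1. run a Q-learning episode, using the current tree to estimate Q-values
        examples.extend(run_episode(q_tree))
        # 2. induce a new logical regression tree from all examples gathered so far
        q_tree = induce_tree(examples)
    return q_tree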

  26. TILDE (Top-Down Induction of Logical Decision Trees) • A generalization of Q-values is represented by a logical decision tree. • Logical decision trees: nodes are first-order logic atoms (Prolog queries as tests), e.g. on(A, c): is there any block on c? • Training data is a relational database or a Prolog knowledge base.
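
To make "Prolog queries as tests" concrete, here is a tiny illustrative sketch (not TILDE itself) in which an internal node tests whether the query on(A, c) succeeds against the facts of a state; the holds helper and the string encodings are assumptions of this example.

def holds(state, pred, *args):
    """True if some ground fact in the state matches the query; None plays the
    role of an unbound logical variable, e.g. on(A, c) -> holds(state, "on", None, "c")."""
    query = (pred,) + args
    return any(len(fact) == len(query) and
               all(q is None or q == v for q, v in zip(query, fact))
               for fact in state)

# One internal node of a logical decision tree: the test is the query on(A, c),
# i.e. "is there any block on c?", and the branches are its yes/no outcomes.
def classify(state):
    return "something is on c" if holds(state, "on", None, "c") else "c is clear"

s1 = {("clear", "a"), ("clear", "b"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}
print(classify(s1))   # -> "something is on c", because on(b, c) holds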

  27. Logical Decision Tree vs. Decision Tree • Figure: a decision tree and a logical decision tree, each deciding whether the blocks are stacked.

  28. TILDE Algorithm • Declarative bias, e.g. on(+,-). • Background knowledge: a Prolog program (an example fragment was shown on the slide).

  29. Q-RRL Algorithm • Figure: examples generated by Q-RRL learning.

  30. Figures: the logical regression tree generated by TILDE-RT and the equivalent Prolog program.

  31. EXPERIMENTS • Tested for three different goals: • 1. one-stack • 2. on(a,b) • 3. unstack • Tested for the following as well: • Fixed number of blocks • Number of blocks changed after learning • Number of blocks changed while learning • P-RRL vs. Q-RRL

  32. Results: Fixed number of blocks • Accuracy: percentage of correctly classified (s, a) pairs (optimal vs. non-optimal). • The accuracy of random policies is shown as a baseline.

  33. Results: Fixed number of blocks

  34. Results: Evaluating learned policies on varying number of blocks

  35. Conclusion • RRL has satisfying initial results but needs more research. • RRL is more successful when the number of blocks is increased (it generalizes better to more complex domains). • Theoretical research proving why it works is still missing.

  36. Using Heuristics to Accelerate Reinforcement Learning - pdf
