
Reward Functions for Accelerated Learning



Presentation Transcript


  1. Reward Functions for Accelerated Learning Presented by Alp Sardağ

  2. Why RL? • RL is a methodology of choice for learning in a variety of different domains. • Convergence property. • Potential biological relevance. • RL is good in: • Game playing • Simulations

  3. Cause of Failure • The fundamental assumption of RL models is that the agent-environment interaction can be modeled as an MDP: • The agent (A) and the environment (E) are synchronized finite state automata. • A and E interact in discrete time intervals. • A can sense the state of E and use it to act. • After A acts, E transitions to a new state. • A receives a reward after performing an action.
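This assumed interaction can be made concrete as a discrete-time loop. The sketch below is illustrative only; the `env`/`agent` interface and names are assumptions, not part of the original slides.

```python
# Minimal sketch of the discrete-time agent-environment loop assumed by MDP-based RL.
# The env/agent interface is an illustrative assumption.

def run_episode(env, agent, max_steps=100):
    state = env.reset()                                # A senses the state of E
    for _ in range(max_steps):
        action = agent.act(state)                      # A acts on the sensed state
        next_state, reward, done = env.step(action)    # E transitions and emits a reward
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```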

  4. States vs. Descriptors • Traditional RL depends on accurate state information, whereas in physical robot environments: • Even for the simplest agents, the state space is very large. • Sensor inputs are noisy. • The agent usually perceives only local information.

  5. Transitions vs. Events • World and agent states change asynchronously, in response to events, not all of which are caused by the agent. • The same event can vary in duration under different circumstances and have different consequences. • Nondeterministic and stochastic models are closer to the real world. However, the information for establishing a stochastic model is usually not available.

  6. Learning Trials • Generating a complete policy requires searching a large portion of the state space. • In the real world, the agent cannot choose which states it will transition to and cannot visit all states. • Convergence in the real world depends on focusing only on the relevant parts of the state space. • The better the problem is formulated, the fewer learning trials are needed.

  7. Reinforcement vs. Feedback • Current RL work uses two types of reward: • Immediate • Delayed • Real-world situations tend to fall in between these two popular extremes: • Some immediate rewards • Plenty of intermittent rewards • A few very delayed rewards

  8. Multiple Goals • Traditional RL deals with specialized problems in which the learning task can be specified with a single goal. The problems: • Only a very specific task is learned • It conflicts with any future learning • The extensions: • Sequentially formulated goals, where the state space explicitly encodes which goals have been reached so far. • A separate state space and reward function for each goal (sketched below). • W-learning: competition among selfish Q-learners.
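A hedged sketch of the "separate state space and reward function per goal" extension: each goal gets its own tabular Q-learner, and a simple arbiter lets the learner with the strongest preference pick the action, as a rough stand-in for W-learning-style competition. All class and function names here are assumptions.

```python
from collections import defaultdict

class GoalLearner:
    """One tabular Q-learner per goal, with its own reward function (illustrative sketch)."""
    def __init__(self, actions, reward_fn, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)            # Q-values indexed by (state, action)
        self.actions, self.reward_fn = actions, reward_fn
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, s_next):
        r = self.reward_fn(s, a, s_next)       # goal-specific reward signal
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def preference(self, s):
        """Return (best action, its value) for this goal in state s."""
        best = max(self.actions, key=lambda a: self.q[(s, a)])
        return best, self.q[(s, best)]

def arbitrate(learners, s):
    # Rough stand-in for W-learning-style competition among selfish Q-learners:
    # the goal whose learner expresses the strongest preference in s chooses the action.
    return max((lrn.preference(s) for lrn in learners), key=lambda p: p[1])[0]
```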

  9. Goal • Given the complexity and uncertainty of real-world domains, the goal is a learning model that minimizes the state space and maximizes the amount of learning at each trial.

  10. Intermediate Rewards • Intermittent rewards can be introduced by: • Reinforcing multiple goals and using progress estimators. • Heterogeneous reinforcement functions: in the real world multiple goals exist, so it is natural to reinforce each goal individually rather than reinforce a single monolithic goal.

  11. Progress Estimators • Partial internal critics associated with specific goals provide a metric of improvement relative to those goals. They are important in noisy worlds: • They decrease the learner's sensitivity to intermittent errors. • They encourage exploration; without them, the agent can thrash, repeatedly attempting inappropriate behaviors.
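A minimal sketch of a progress estimator as a partial internal critic for one goal (here, hypothetically, homing while carrying a puck): it rewards measurable progress and mildly penalizes the lack of it, which is what discourages thrashing. The distance input and reward magnitudes are illustrative assumptions.

```python
# Hypothetical progress estimator for a "homing while carrying a puck" goal.
# The distance measure and the +/-0.1 magnitudes are illustrative assumptions.

class HomingProgressEstimator:
    def __init__(self):
        self.prev_distance = None

    def __call__(self, have_puck, distance_to_home):
        """Return intermittent reinforcement based on progress toward home."""
        if not have_puck:
            self.prev_distance = None
            return 0.0                  # the critic is only active for its own goal
        if self.prev_distance is None:
            self.prev_distance = distance_to_home
            return 0.0
        progress = self.prev_distance - distance_to_home
        self.prev_distance = distance_to_home
        return 0.1 if progress > 0 else -0.1   # reward progress, discourage thrashing
```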

  12. Experimental Design • To validate the proposed approach, experiments were designed to compare the new RL approach with traditional RL: • Robots • Learning Task • Learning Algorithm • Control Algorithm

  13. Robots • The experiments use four fully autonomous R2 mobile robots, each consisting of: • A differentially steerable base • A gripper for lifting objects • A piezo-electric bump sensor for detecting contact-collisions and monitoring the grasping force • A set of IR sensors for obstacle avoidance • Radio transceivers, used for determining absolute position

  14. Robot Algorithm • The robots are programmed in the Behavior Language: • Based on the subsumption architecture. • A parallel control system formed from concurrently active behaviors, some of which gather information, some drive effectors, and some monitor progress and contribute reinforcement.

  15. The Learning Task • The learning task consists of finding a mapping from conditions to behaviors that yields the most efficient policy for group foraging. • Basic behaviors from which to learn behavior selection: • Avoiding • Searching • Resting • Dispersing • Homing

  16. The Learning Task Cont. • The state space can be reduced to the cross-product of the following state variables: • Have-puck? • At-home? • Near-intruder? • Night-time?
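Since the four state variables are binary, the reduced state space has only 2^4 = 16 conditions. A small sketch of one possible encoding (variable and behavior names follow the slides; the representation itself is an assumption):

```python
from itertools import product

STATE_VARS = ("have_puck", "at_home", "near_intruder", "night_time")
BEHAVIORS = ("avoiding", "searching", "resting", "dispersing", "homing")

# All 2^4 = 16 conditions, as tuples of booleans.
CONDITIONS = list(product((False, True), repeat=len(STATE_VARS)))

def condition_index(have_puck, at_home, near_intruder, night_time):
    """Pack the four binary sensor predicates into a single condition index (0..15)."""
    bits = (have_puck, at_home, near_intruder, night_time)
    return sum(int(b) << i for i, b in enumerate(bits))
```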

  17. Learning Task Cont. • Some behaviors are instinctive because learning them has a high cost: • As soon as the robot detects a puck between its fingers, it grasps it. • As soon as the robot reaches the home region, it drops the puck if it is carrying one. • Whenever the robot is too near an obstacle, it avoids it.

  18. The Learning Algorithm • The algorithm produces and maintains a matrix that stores the appropriateness of the behaviors associated with each state. • The values in the matrix fluctuate over time based on received reinforcement and are updated asynchronously, whenever a reward is received.
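A hedged sketch of such a condition-behavior table: entries are indexed by (condition, behavior) and nudged whenever reinforcement arrives, rather than on a fixed schedule. The simple accumulation rule shown is an assumption, not the deck's exact update.

```python
# Sketch of the condition-behavior appropriateness matrix.
# The accumulation update shown here is an illustrative assumption.

class AppropriatenessMatrix:
    def __init__(self, num_conditions, behaviors):
        self.values = {(c, b): 0.0 for c in range(num_conditions) for b in behaviors}

    def update(self, condition, behavior, reinforcement):
        """Called asynchronously, whenever any reward arrives for the active pair."""
        self.values[(condition, behavior)] += reinforcement

    def best_behavior(self, condition, behaviors):
        """Return the currently most appropriate behavior for this condition."""
        return max(behaviors, key=lambda b: self.values[(condition, b)])
```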

  19. The Learning Algorithm

  20. The Learning Algorithm • The algorithm sums the reinforcement over time. • The influence of the different types of feedback is weighted by the values of feedback constants.
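The slide's formulas were carried by the original images and did not survive the transcript. As a hedged illustration of the idea, a weighted sum of the feedback types might look like the following; the constants and term names are assumptions, not the deck's exact formula.

```python
# Hedged sketch: total reinforcement as a weighted sum of feedback types.
# The constants and the specific terms are illustrative assumptions.

C_EVENT, C_PROGRESS_1, C_PROGRESS_2 = 1.0, 0.5, 0.5    # feedback constants (assumed values)

def total_reinforcement(event_feedback, progress_1, progress_2):
    """Combine heterogeneous event feedback with two progress-estimator signals."""
    return (C_EVENT * event_feedback
            + C_PROGRESS_1 * progress_1
            + C_PROGRESS_2 * progress_2)
```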

  21. The Control Algorithm • Whenever an event is detected, the following control sequence is executed: • Appropriate reinforcement is delivered for the current condition-behavior pair, • The current behavior is terminated, • Another behavior is selected. • Behaviors are selected according to the following rule: • Choose an untried behavior if one is available. • Otherwise, choose the best behavior.
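A minimal sketch of this event-driven control sequence and the selection rule (untried behavior first, otherwise the best-valued one). It reuses the AppropriatenessMatrix-style table sketched above; the bookkeeping of tried pairs is an assumed detail.

```python
# Event-driven behavior selection: deliver reinforcement, terminate the current behavior,
# then pick an untried behavior if possible, otherwise the best-valued one.

def on_event(matrix, tried, condition, current_behavior, behaviors, reinforcement):
    # 1. Deliver reinforcement for the current condition-behavior pair.
    matrix.update(condition, current_behavior, reinforcement)
    tried.add((condition, current_behavior))
    # 2. The current behavior is terminated (handled by the caller's behavior system).
    # 3. Select another behavior.
    untried = [b for b in behaviors if (condition, b) not in tried]
    if untried:
        return untried[0]                                 # choose an untried behavior
    return matrix.best_behavior(condition, behaviors)     # otherwise choose the best one
```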

  22. Experimental Results • The following three approaches are compared: • A monolithic single-goal (puck delivery to the home region) reward function using Q-learning: R(t)=P(t) • A heterogeneous reinforcement function using multiple goals: R(t)=E(t) • A heterogeneous reinforcement function using multiple goals and two progress estimator functions: R(t)=E(t)+I(t)+H(t)

  23. Experimental Results • Values are collected twice per minute. • The final learning values are collected after a 15-minute run. • Convergence is defined in terms of the relative ordering of condition-behavior pairs.

  24. Evaluation • Given the nondeterminism and noisy sensor inputs, the single-goal reward provides insufficient feedback and is vulnerable to interference. • The second learning strategy outperforms the first because it detects the achievement of subgoals on the way to the top-level goal of depositing pucks at home. • The complete heterogeneous reinforcement with progress estimators outperforms the others because it uses all of the available information for every condition and behavior.

  25. Additional Evaluation • Each part of the policy was evaluated separately, according to the following criteria: • Number of trials required, • Correctness, • Stability. • Some condition-behavior pairs proved to be much more difficult to learn than others: • those without a progress estimator • those involving rare states

  26. Discussion • Summing reinforcement • Scaling • Transition models

  27. Summing Reinforcement • Allows for oscillations. • In theory, the more reinforcement, the faster the learning; in practice, noise and error can have the opposite effect. • The experiments described here demonstrate that even with a significant amount of noise, multiple reinforcers and progress estimators significantly accelerate learning.

  28. Scaling • Interference was detrimental to all three approaches. • In terms of the amount of time required, the learned group foraging strategy outperformed hand-coded greedy agent strategies. • Foraging can be improved further by minimizing interference, for example by having only one robot move at a time.

  29. Transition Models • In noisy and uncertain environments, a transition model is not available to aid the learner. • The absence of a model made it difficult to compute discounted future reward. • Future work: applying this approach to problems that involve incomplete and approximate state transition models.
