Presentation Transcript


  1. Nov 14th Homework 4 due Project 4 due 11/26

  2. How do we use planning graph heuristics? In progression and in regression. Qn: How far should we grow each planning graph? Ans: At most to the level-off, i.e., the point at which the prop-lists don't change between consecutive levels (a sketch of this stopping test follows below).
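
To make the level-off test concrete, here is a minimal sketch (not the course code) of growing a relaxed planning graph, with delete effects and mutexes ignored, until the proposition list stops changing between consecutive levels. The `Action` record, with precondition and add lists, is an illustrative assumption.

```python
# Minimal sketch: grow a relaxed planning graph until level-off, i.e. until
# the proposition layer is identical between consecutive levels.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconds: frozenset   # propositions required to apply the action
    adds: frozenset       # propositions the action makes true

def build_relaxed_pg(init_props, actions):
    """Return proposition layers P0, P1, ..., grown until level-off."""
    layers = [frozenset(init_props)]
    while True:
        current = layers[-1]
        applicable = [a for a in actions if a.preconds <= current]
        nxt = current.union(*(a.adds for a in applicable))
        if nxt == current:        # level-off: no new propositions appear
            return layers
        layers.append(nxt)
```

The index of the first layer in which a goal proposition appears then serves as a lower-bound estimate of the number of steps needed to achieve it in the relaxed problem.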

  3. Use of PG in Progression vs Regression
  Progression:
  • Need to compute a PG for each child state; with as many PGs as there are leaf nodes, the cost of heuristic computation is a lot higher (though we can try exploiting the overlap between the different PGs).
  • However, the states in progression are consistent, so handling negative interactions is not that important.
  • Overall, the PG gives better guidance even without mutexes.
  Regression:
  • Need to compute the PG only once, for the given initial state, so the cost of computing the heuristic is much lower (see the sketch below).
  • However, states in regression are "partial states" and can thus be inconsistent, so taking negative interactions into account using mutexes is important (costlier PG construction).
  • Overall, the PG's guidance is not as good unless higher-order mutexes are also taken into account.
  Remember the altimeter metaphor. Historically, the heuristic was first used with progression planners; then it was used with regression planners; then progression planners were found to do better; and then combining the two was found to be even better.
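
To illustrate the regression side of the comparison, here is a hedged sketch of a set-level style estimate: one planning graph grown from the initial state (e.g., with `build_relaxed_pg` above) can evaluate any regression node, i.e., any set of subgoals. The optional mutex argument is illustrative, not from the slides.

```python
# Hedged sketch: evaluate a regression partial state (a set of subgoals)
# against a single planning graph by finding the first level at which all
# subgoals appear and, if mutex information is supplied, are pairwise non-mutex.

def set_level(prop_layers, subgoals, mutexes_at=None):
    """First layer index where all subgoals jointly appear; None if never."""
    goals = frozenset(subgoals)
    for level, props in enumerate(prop_layers):
        if not goals <= props:
            continue
        if mutexes_at is None:
            return level
        mutex_pairs = mutexes_at(level)   # e.g. a set of frozenset pairs
        if all(frozenset((p, q)) not in mutex_pairs
               for p in goals for q in goals if p != q):
            return level
    return None   # subgoals not jointly reachable up to level-off
```

In progression, by contrast, each child is a complete state, so a fresh planning graph would have to be grown from every child; that is the cost asymmetry described above.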

  4. There is a whole lot more…
  • Planning graph heuristics can also be made sensitive to:
  • Negative interactions
  • Non-uniform cost actions
  • Partial satisfaction planning, where actions have costs and goals have utilities, so the best plan may not achieve all goals
  • See rakaposhi.eas.asu.edu/pg-tutorial or the AI Magazine paper (Spring 2007)

  5. What if you didn't have any hard goals, got rewards continually, and had stochastic actions? MDPs as utility-based problem-solving agents.

  6. (Repeated figure: the MDP model, including the transition matrix Mij.) [Can generalize to have action costs C(a,s).] If the Mij matrix is not known a priori, then we have a reinforcement-learning scenario.
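
To fix notation, here is a hedged sketch of the ingredients the slide names: a per-action transition matrix M, a reward function R, and the optional action-cost generalization C(a, s). The container itself and the discount factor are illustrative assumptions, not the course's code.

```python
# Illustrative MDP container matching the slide's notation: M[a][i][j] is the
# probability of reaching state j when action a is taken in state i, and R[i]
# is the reward of state i. C(a, s) is the optional action-cost generalization.
# If M is not known a priori and must be learned from experience, we are in
# the reinforcement-learning setting mentioned above.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[int]                         # states indexed 0..n-1
    actions: List[str]
    M: Dict[str, List[List[float]]]           # M[a][i][j] = P(j | i, a)
    R: List[float]                            # R[i] = reward of state i
    C: Callable[[str, int], float] = lambda a, s: 0.0   # optional action costs
    gamma: float = 0.95                       # discount factor (assumed)
```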

  7. What does a solution to an MDP look like?
  • The solution should tell the optimal action to do in each state (called a "policy").
  • A policy is a function from states to actions (*see the finite-horizon case below*); it is not a sequence of actions anymore. This is needed because of the non-deterministic actions.
  • If there are |S| states and |A| actions that we can do in each state, then there are |A|^|S| policies.
  • How do we get the best policy? Pick the policy that gives the maximal expected reward: for each policy p, simulate the policy (take the actions it suggests) to get behavior traces, evaluate the behavior traces, and take their average value (a sketch follows below).
  We will concentrate on infinite-horizon problems. (Infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state.)
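
The "simulate and average" recipe in the slide can be written down directly. A hedged sketch, assuming the MDP container sketched above and a policy given as a dict from state to action; the trace count, horizon cutoff, and discounting are my own choices, not from the slides.

```python
import random

def evaluate_policy_by_simulation(mdp, policy, start, n_traces=1000, horizon=200):
    """Average discounted value of behavior traces generated by `policy`."""
    total = 0.0
    for _ in range(n_traces):
        s, ret, discount = start, 0.0, 1.0
        for _ in range(horizon):              # truncate the "infinite" traces
            ret += discount * mdp.R[s]
            a = policy[s]                     # policy: state -> action
            s = random.choices(mdp.states, weights=mdp.M[a][s])[0]
            discount *= mdp.gamma
        total += ret
    return total / n_traces
```

Picking the policy whose simulated value is highest is exactly the brute-force search over the |A|^|S| policies mentioned above; value and policy iteration avoid that enumeration.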

  8. The value function U*: think of these as related to h*() values. (Repeated equation from an earlier slide; a reconstruction follows below.) U*(s) is the maximal expected utility (value) of s, assuming the optimal policy is followed.
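
A hedged reconstruction of the equation the repeated figure presumably shows, written in the Mij transition-matrix notation of slide 6, with a discount factor gamma (my assumption):

```latex
U^*(s) \;=\; R(s) + \gamma \, \max_{a} \sum_{s'} M^{a}_{s\,s'} \, U^*(s')
```

This is the MDP analogue of defining h*(s) as the optimal cost-to-go in deterministic search.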

  9. Why are they called Markov decision processes?
  • The Markov property means that the state contains all the information needed to decide the reward or the transition:
  • The reward of a state Sn is independent of the path used to get to Sn.
  • The effect of doing an action A in state Sn doesn't depend on the way we reached state Sn.
  • (As a consequence of the above) the maximal expected utility of a state S doesn't depend on the path used to get to S.
  • Markov properties are assumed (to make life simple).
  • It is possible to have non-Markovian rewards (e.g., you get a reward in state Si only if you came to Si through Sj). E.g., if you picked up a coupon before going to the theater, then you will get a reward.
  • It is possible to convert non-Markovian rewards into Markovian ones, but it leads to a blow-up in the state space. In the theater example above, add "coupon" as part of the state: it becomes an additional state variable, increasing the state space two-fold (a sketch follows below).
  • It is also possible to have non-Markovian effects, especially if you have partial observability. E.g., suppose there are two states of the world where the agent can get the banana smell…
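
The theater/coupon example can be made concrete: the fix is to fold the relevant bit of history into the state itself, which doubles the state space. A hedged sketch; the names and the reward value are illustrative.

```python
# Converting the non-Markovian "coupon" reward into a Markovian one by adding
# a state variable: the state becomes (location, has_coupon), so the reward
# depends only on the current state, not on the path taken to reach it.

from typing import NamedTuple

class State(NamedTuple):
    location: str      # e.g. "store", "theater"
    has_coupon: bool

def reward(s: State) -> float:
    # Reward at the theater only if the coupon was picked up earlier;
    # the path information now lives inside the state itself.
    return 10.0 if s.location == "theater" and s.has_coupon else 0.0
```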

  10. MDPs and Deterministic Search
  • Problem-solving agent search corresponds to what special case of MDP? Actions are deterministic; goal states are all equally valued, and are all sink states.
  • Is it worth solving the problem using MDPs? Constructing the optimal policy is overkill: the policy, in effect, gives us the optimal path from every state to the goal state(s).
  • The value function, or its approximations, on the other hand, is useful. How? As a heuristic for the problem-solving agent's search.
  • This shows an interesting connection between dynamic programming and "state search" paradigms: DP solves many related problems on the way to solving the one problem we want; state search tries to solve just the problem we want; we can use DP to find heuristics to run state search.

  11. Optimal policies depend on the rewards… (Repeated figure, showing how the optimal policy changes as the reward values change.)

  12. (Repeated equation: the value of a "sequence of states", i.e., of a "behavior".) How about the deterministic case? Then U(si) is the (length of the) shortest path to the goal; a value-iteration sketch follows below.
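
Tying slides 8 through 12 together, here is a hedged value-iteration sketch over the MDP container sketched earlier; it simply repeats the Bellman update until the values stop changing. In the deterministic, goal-as-sink special case of slide 10 (with suitable step costs and no discounting), the resulting values correspond to shortest-path distances to the goal, which is why they can serve as heuristics for state search.

```python
def value_iteration(mdp, iterations=1000, tol=1e-6):
    """Repeat U(s) <- R(s) + gamma * max_a sum_j M[a][s][j] * U(j) to convergence."""
    U = [0.0 for _ in mdp.states]
    for _ in range(iterations):
        new_U = []
        for s in mdp.states:
            best = max(sum(p * U[j] for j, p in enumerate(mdp.M[a][s]))
                       for a in mdp.actions)
            new_U.append(mdp.R[s] + mdp.gamma * best)
        if max(abs(u - v) for u, v in zip(U, new_U)) < tol:
            return new_U
        U = new_U
    return U
```

The greedy policy with respect to U (or U itself, used as a heuristic) is the dynamic-programming-to-state-search connection of slide 10.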
