
Partially-Observable Markov Decision Processes


Presentation Transcript


  1. Partially-Observable Markov Decision Processes Tom Dietterich MCAI 2013

  2. Markov Decision Process as a Decision Diagram • Note: We observe before we choose • All states, actions, and rewards are observed

  3. What If We Can’t Directly Observe the State? • Note: We observe before we choose • Only the observations are observed, not the underlying states

  4. POMDPs are Hard to Solve • Tradeoff between taking actions to gain information and taking actions to change the world • Some actions can do both

  5. Optimal Management of Difficult-to-Observe Invasive Species [Regan et al., 2011] • Branched Broomrape (Orobanche ramosa) • Annual parasitic plant • Attaches to root system of host plant • Results in 75–90% reduction in host biomass • Each plant makes ~50,000 seeds • Seeds are viable for 12 years

  6. Quarantine Area in S. Australia • 375 farms; 70 km × 70 km area (image: Google Maps)

  7. Formulation as a POMDP: Single Farm • States: {Empty, Seeds, Plants & Seeds} • Actions: {Nothing, Host Denial, Fumigation} • Observations: {Absent, Present}, with imperfect detection probability • Rewards: Cost(Nothing) < Cost(Host Denial) < Cost(Fumigation) • Objective: 20-year discounted reward (discount = 0.96) • (state diagram)
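This single-farm model can be written down directly as a small POMDP. Below is a minimal sketch of that specification in Python; the transition and detection probabilities are illustrative placeholders, not the calibrated values from Regan et al. (2011).

```python
# Minimal sketch of the single-farm POMDP as plain data structures.
# All numeric probabilities are illustrative placeholders.

STATES = ["Empty", "Seeds", "Plants&Seeds"]
ACTIONS = ["Nothing", "HostDenial", "Fumigation"]
OBSERVATIONS = ["Absent", "Present"]

# T[action][state] -> {next_state: probability}
T = {
    "Nothing": {
        "Empty":        {"Empty": 1.0},
        "Seeds":        {"Seeds": 0.4, "Plants&Seeds": 0.5, "Empty": 0.1},
        "Plants&Seeds": {"Plants&Seeds": 1.0},
    },
    "HostDenial": {
        "Empty":        {"Empty": 1.0},
        "Seeds":        {"Seeds": 0.9, "Empty": 0.1},         # seed bank decays slowly
        "Plants&Seeds": {"Seeds": 0.9, "Plants&Seeds": 0.1},  # plants die without host roots
    },
    "Fumigation": {
        "Empty":        {"Empty": 1.0},
        "Seeds":        {"Empty": 0.8, "Seeds": 0.2},
        "Plants&Seeds": {"Empty": 0.8, "Seeds": 0.2},
    },
}

# P(observe "Present" | state): detection is imperfect
DETECT = {"Empty": 0.0, "Seeds": 0.0, "Plants&Seeds": 0.5}

# Per-step costs, ordered Cost(Nothing) < Cost(HostDenial) < Cost(Fumigation)
COST = {"Nothing": 0.0, "HostDenial": 10.0, "Fumigation": 100.0}
GAMMA = 0.96   # discount factor for the 20-year objective
```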

  8. Optimal MDP Policy • If plant is detected, Fumigate; Else Do Nothing • Assumes perfect detection (image: www.grdc.com.au)

  9. Optimal POMDP Policy (high detection probability) • Same as the Optimal MDP Policy • (policy FSM diagram: decision states 0 and 1; Do Nothing while the observation is ABSENT, Fumigate when it is PRESENT)

  10. Optimal Policy (lower detection probability) • (policy FSM diagram: Fumigate on PRESENT; while ABSENT, Deny Host through states 0, 1, 2, …, 16 before switching to Nothing) • Deny Host for 15 years before switching to Nothing • For an even lower detection probability, Deny Host for 17 years before switching to Nothing

  11. Probability of Eradication

  12. Discussion • The POMDP is exactly solvable because the state space is very small • The real problem is more complex • Each farm can have many fields, each with its own hidden state • There are 375 farms in the quarantine area • 3^375 states if we treat each farm as a single unit • Exact solution of large POMDPs is beyond the state of the art • Notice that there is no tradeoff between acting to gather information and acting to change the world: none of the actions gain information

  13. Ways to Avoid a POMDP (1) • State Estimation and State Tracking • In many problems, we have (or can acquire) enough sensors so that we can estimate the state quite well • The belief P(s_t | o_1, …, o_t) has low uncertainty • Let ŝ_t be the most likely hidden state • In such problems, we can pretend that we have an MDP and that we directly observe ŝ_t • We do not need to take actions to gain information, so we do not face this difficult tradeoff
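A minimal sketch of this approximation, assuming a hypothetical `update_belief` state tracker and an `mdp_policy` defined over fully observed states:

```python
# Sketch of "pretend it's an MDP": track a belief, take the most likely
# state, and act with a policy defined over fully observed states.
# `update_belief` and `mdp_policy` are hypothetical stand-ins.

def most_likely_state(belief):
    """belief: dict state -> probability. Return the MAP state."""
    return max(belief, key=belief.get)

def act(belief, observation, update_belief, mdp_policy):
    belief = update_belief(belief, observation)   # state tracking
    s_hat = most_likely_state(belief)             # point estimate of the hidden state
    return belief, mdp_policy(s_hat)              # act as if s_hat were observed
```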

  14. Ways to Avoid a POMDP (2) • Pure Information-Gathering POMDPs • Consider a medical diagnosis case for a specific disease where there are n tests that can be performed. Our goal is to decide whether the patient has the disease by choosing tests to perform • Each test has two possible outcomes, positive and negative • Each test i has a cost c_i • Given any subset of the outcomes, we can compute the probability that the patient has the disease • There is a “false positive” cost C_FP for incorrectly saying that the patient has the disease and a “false negative” cost C_FN for incorrectly saying that the patient does not
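As a sketch of the "compute the probability" step, here is a posterior update under the added assumption that test outcomes are conditionally independent given the disease (a naive-Bayes model the slides do not commit to); the prior, sensitivities, and specificities are hypothetical.

```python
# Sketch: P(disease | observed test outcomes) under a naive-Bayes assumption.

def p_disease(outcomes, prior, sens, spec):
    """
    outcomes: dict test_index -> 1 (positive) or 0 (negative), for tests done so far
    prior:    P(disease)
    sens[i]:  P(test i positive | disease)       (sensitivity)
    spec[i]:  P(test i negative | no disease)    (specificity)
    """
    p_d, p_not = prior, 1.0 - prior
    for i, result in outcomes.items():
        p_d   *= sens[i] if result == 1 else (1.0 - sens[i])
        p_not *= (1.0 - spec[i]) if result == 1 else spec[i]
    return p_d / (p_d + p_not)

# Example: tests 0 and 2 performed; test 0 positive, test 2 negative.
print(p_disease({0: 1, 2: 0}, prior=0.05,
                sens=[0.9, 0.8, 0.7], spec=[0.95, 0.9, 0.85]))
```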

  15. Formulation as an MDP • States: vectors recording the outcome of each test performed so far (1, 0, or ? for not yet performed); the starting state is (?, ?, …, ?) • Actions: the n medical tests, plus two “declare” actions • A “declare negative” action says “the patient does not have the disease” and terminates with cost 0 if correct and cost C_FN if incorrect • A “declare positive” action says “the patient has the disease” and terminates with cost 0 if correct and cost C_FP if incorrect • State Transitions: when we perform test i in state s, the resulting state sets the i-th entry of s to the observed outcome, according to that outcome’s probability given the results so far • When we perform a “declare” action, the problem transitions to a terminal state with probability 1 • If there aren’t too many tests and we know the underlying probability model, we can enumerate the states and solve this via standard MDP methods
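A sketch of solving this MDP by enumerating the (at most 3^n) states with memoized recursion. It uses the same naive-Bayes posterior idea as above, and the prior, sensitivities, specificities, and costs are illustrative assumptions rather than values from the talk.

```python
# Sketch: solve the test-selection MDP by recursion over the finite state
# space.  States are tuples with entries 1, 0, or None ("?").

from functools import lru_cache

PRIOR = 0.05
SENS  = (0.9, 0.8, 0.7)      # P(test i positive | disease)
SPEC  = (0.95, 0.9, 0.85)    # P(test i negative | no disease)
COSTS = (1.0, 2.0, 1.5)      # c_i: cost of performing test i
C_FP, C_FN = 50.0, 200.0     # costs of a wrong "declare" action

def p_disease(state):
    """P(disease | outcomes recorded in state), assuming independent tests."""
    p_d, p_not = PRIOR, 1.0 - PRIOR
    for i, r in enumerate(state):
        if r is None:
            continue
        p_d   *= SENS[i] if r == 1 else 1.0 - SENS[i]
        p_not *= 1.0 - SPEC[i] if r == 1 else SPEC[i]
    return p_d / (p_d + p_not)

@lru_cache(maxsize=None)
def value(state):
    """Minimum expected cost achievable from `state`."""
    p = p_disease(state)
    # Declare actions terminate immediately.
    best = min(p * C_FN,          # declare "no disease": pay C_FN if wrong
               (1 - p) * C_FP)    # declare "disease":    pay C_FP if wrong
    # Test actions: pay c_i, then branch on the random outcome.
    for i, r in enumerate(state):
        if r is not None:
            continue
        q = p * SENS[i] + (1 - p) * (1 - SPEC[i])   # P(test i comes back positive)
        pos = state[:i] + (1,) + state[i + 1:]
        neg = state[:i] + (0,) + state[i + 1:]
        best = min(best, COSTS[i] + q * value(pos) + (1 - q) * value(neg))
    return best

print(value((None, None, None)))   # expected cost of an optimal testing policy
```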

  16. Belief States • In general, we can think of a POMDP as being an MDP over a Belief State • In the medical diagnosis case, the belief states have the form (0, 1, ?, ?, 0, ?) • In the Broomrape case, the belief state is a probability distribution over the 3 states: Empty, Seeds, Weeds + Seeds

  17. Belief State Reasoning • Each observation updates the belief state • Example: observing the presence of weeds means weeds are present and seeds might also be present • (diagram: belief over Empty / Seeds / Weeds + Seeds before and after observing PRESENT)
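A minimal sketch of this observation update as a Bayes-rule step over the three farm states; the detection probability used is a placeholder, not a value from the talk.

```python
# Sketch of a Bayesian belief update from an observation.

STATES = ["Empty", "Seeds", "Weeds+Seeds"]
# P(observe "Present" | state): only growing plants can be detected, imperfectly
P_PRESENT = {"Empty": 0.0, "Seeds": 0.0, "Weeds+Seeds": 0.5}

def update_on_observation(belief, obs):
    """belief: dict state -> prob.  obs: 'Present' or 'Absent'.  Bayes rule."""
    likelihood = {s: P_PRESENT[s] if obs == "Present" else 1.0 - P_PRESENT[s]
                  for s in belief}
    unnorm = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(unnorm.values())                     # P(obs) under the current belief
    return {s: p / z for s, p in unnorm.items()}

# Observing "Present" rules out Empty and Seeds-only:
print(update_on_observation({"Empty": 0.3, "Seeds": 0.4, "Weeds+Seeds": 0.3},
                            "Present"))
```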

  18. Taking Actions • Each action updates the belief state • Example: Fumigate • (diagram: belief over Empty / Seeds / Weeds + Seeds before and after fumigation)
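The corresponding sketch of the action (prediction) update: push the belief through the transition model. The fumigation transition probabilities below are again hypothetical placeholders.

```python
# Sketch of the belief update for an action: b'(s') = sum_s b(s) * T[s][s'].

STATES = ["Empty", "Seeds", "Weeds+Seeds"]
# T[s][s'] = P(s' | s, Fumigate) -- hypothetical numbers
T_FUMIGATE = {
    "Empty":       {"Empty": 1.0, "Seeds": 0.0, "Weeds+Seeds": 0.0},
    "Seeds":       {"Empty": 0.8, "Seeds": 0.2, "Weeds+Seeds": 0.0},
    "Weeds+Seeds": {"Empty": 0.8, "Seeds": 0.2, "Weeds+Seeds": 0.0},
}

def update_on_action(belief, T):
    """Apply the transition model to the belief (prediction step)."""
    return {s2: sum(belief[s] * T[s][s2] for s in belief) for s2 in STATES}

print(update_on_action({"Empty": 0.0, "Seeds": 0.0, "Weeds+Seeds": 1.0},
                       T_FUMIGATE))
```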

  19. Belief MDP • State space: all reachable belief states • Action space: same actions as the POMDP • Reward function: expected rewards derived from the underlying states • Transition function: moves in belief space • Problem: Belief space is continuous and there can be an immense number of reachable states
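Putting the two updates together gives one step of the belief MDP. The sketch below samples a single observation, which is all the Monte Carlo evaluation on the next slides needs; the model arguments T, P_OBS, and R are generic placeholders.

```python
# Sketch of one step of the belief MDP: prediction, observation sampling,
# correction, and expected reward.

import random

def belief_mdp_step(belief, action, T, P_OBS, R, rng=random):
    """
    belief: dict state -> prob       T[a][s][s'] : transition probabilities
    P_OBS[s'][o] : observation probs R[s][a]     : immediate reward
    Returns (next_belief, sampled_observation, expected_reward).
    """
    reward = sum(b * R[s][action] for s, b in belief.items())   # E[r | b, a]

    predicted = {}                                               # prediction step
    for s, b in belief.items():
        for s2, p in T[action][s].items():
            predicted[s2] = predicted.get(s2, 0.0) + b * p

    p_obs = {}                                                   # P(o | b, a)
    for s2, p in predicted.items():
        for o, q in P_OBS[s2].items():
            p_obs[o] = p_obs.get(o, 0.0) + p * q
    obs = rng.choices(list(p_obs), weights=list(p_obs.values()))[0]

    next_belief = {s2: p * P_OBS[s2][obs] / p_obs[obs]           # correction step
                   for s2, p in predicted.items()}
    return next_belief, obs, reward
```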

  20. Monte Carlo Policy Evaluation • Key Insight: It is just as easy to evaluate a policy via Monte Carlo trials in a POMDP as it is in an MDP! • Approach: • Define a space of policies • Evaluate them by Monte Carlo trials • Pick the best one
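A sketch of this evaluation loop; `simulator` and `policy` are hypothetical interfaces (a generative model we can sample from, and a policy that maps observations to actions), not APIs from the talk.

```python
# Sketch of Monte Carlo policy evaluation for a POMDP: the policy sees
# only observations, never the hidden state.

def evaluate_policy(policy, simulator, n_trials=1000, horizon=20, gamma=0.96):
    """Average discounted return of `policy` over Monte Carlo trials."""
    total = 0.0
    for _ in range(n_trials):
        state = simulator.sample_initial_state()
        policy.reset()                      # e.g. reset an FSM node or belief
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            obs = simulator.sample_observation(state)
            action = policy.act(obs)        # policy sees only the observation
            state, reward = simulator.step(state, action)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_trials

# Policy search: evaluate each candidate policy and keep the best one, e.g.
# best = max(candidate_policies, key=lambda pi: evaluate_policy(pi, sim))
```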

  21. Finite State Machine Policies • In many POMDPs (and MDPs), a policy can be represented as a finite state machine • We can design a set of FSM policies and then evaluate them • There are algorithms for incrementally improving FSM policies • (diagram: the Deny-Host FSM from slide 10, with states 0, 1, 2, …, 16)
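A sketch of an FSM policy class, plus one plausible reading of the Deny-Host machine from slide 10 (the exact node structure on the slide may differ). The interface matches the hypothetical `policy` used in the evaluation sketch above.

```python
class FSMPolicy:
    """A policy as a finite state machine: each node has an action, and
    observations move the machine between nodes."""

    def __init__(self, actions, transitions, start):
        self.actions = actions          # node -> action to take
        self.transitions = transitions  # (node, observation) -> next node
        self.start = start
        self.node = start

    def reset(self):
        self.node = self.start

    def act(self, observation):
        self.node = self.transitions[(self.node, observation)]
        return self.actions[self.node]


def deny_host_policy(n_years=17):
    """Hypothetical reading of the slide-10 FSM: Fumigate whenever the weed
    is PRESENT; while ABSENT, Deny Host, switching to Nothing after n_years
    consecutive ABSENT observations."""
    actions = {"fumigate": "Fumigation", "stop": "Nothing"}
    actions.update({i: "HostDenial" for i in range(n_years)})   # year counters
    transitions = {}
    for node in actions:
        transitions[(node, "Present")] = "fumigate"             # detection: fumigate, restart count
        if node == "stop":
            transitions[(node, "Absent")] = "stop"
        elif node == "fumigate":
            transitions[(node, "Absent")] = 0
        else:
            transitions[(node, "Absent")] = node + 1 if node + 1 < n_years else "stop"
    return FSMPolicy(actions, transitions, start="fumigate")
```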

  22. Summary • Many problems in AI can be formulated as POMDPs • Formulating a problem as a POMDP doesn’t help much by itself, because POMDPs are so hard to solve (PSPACE-hard for finite horizon; undecidable for infinite horizon) • Can we do state estimation and pretend the state is fully observed? • Are we performing pure observation actions? • Can the policy be divided into a pure observation phase and a pure action phase? • If so, we can use MDP methods instead • Unfortunately, many problems in ecosystem management are “essential” POMDPs that mix information gathering and world-changing actions • Monte Carlo methods (based on policy space search) are one of the most practical ways of finding good POMDP solutions
