
KI2 – MDP / POMDP


Presentation Transcript


  1. KI2 – MDP / POMDP Kunstmatige Intelligentie (Artificial Intelligence) / RuG

  2. Decision Processes • The agent perceives the environment (St) flawlessly • chooses an action (a) • which alters the state of the world (St+1)

  3. Finite state machine (diagram): states A1 "loiter around a bit", A2 "follow object", A3 "keep your distance"; transitions labelled "no signals", "see BALL", "see obstacle"

  4. Stochastic Decision Processes • The agent perceives the environment (St) flawlessly • chooses an action (a) according to P(a|S) • which alters the state of the world (St+1) according to P(St+1|St,a)

  5. Markov Decision Processes • The agent perceives the environment (St) flawlessly • chooses an action (a) according to P(a|S) • which alters the state of the world (St+1) according to P(St+1|St,a) • → If there are no longer-term dependencies, this is a 1st-order Markov process

  6. Assumptions • The observation of St is noise-free; all required information is observable • Actions a are selected with probability P(a|S) (random generator) • The consequences of a in (St+1) occur stochastically with probability P(St+1|St,a)

  7. A policy (grid-world figure: terminal states +1 and -1, START cell)

  8. A policy (grid-world figure: terminal states +1 and -1, START cell)

  9. MDP • States • Actions • Transitions between states • P(ai|sk) "policy": which actions a one decides to take, given the possible circumstances s

  10. Policy π • "argmax ai P(ai|sk)" • How can an agent learn this? • Cost minimisation • Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911) • → Reinforcements R(a,S) • → Structure of the world T = P(St+1|St)

  11. Reinforcements • Given a history of states, actions and the resulting reinforcements, an agent can learn to estimate the value of an action • How: the sum of reinforcements R? The average? • → exponential weighting: • the first step determines all later ones (learning from the past) • an immediate reward is more useful (reasoning about the future) • impatience & mortality

  12. Assigning Utility (Value) to Sequences • Discounted rewards: V(s0, s1, s2, …) = R(s0) + γR(s1) + γ²R(s2) + … , where 0 < γ ≤ 1, R is the reinforcement value, s refers to a state, and γ is the discount factor
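
As an aside, the discounted sum above is easy to compute for a finite reward sequence; a minimal Python sketch (ours, assuming γ = 0.9 and a made-up reward list):

    # Discounted return of a finite reward sequence (illustrative sketch).
    def discounted_return(rewards, gamma=0.9):
        # Sum over t of gamma^t * R(s_t)
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Hypothetical rewards observed along a trajectory s0, s1, s2, s3:
    print(discounted_return([0.0, 0.0, 1.0, -1.0]))   # 0.81 - 0.729 = 0.081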

  13. Assigning Utility to States • Can we say V(s) = R(s)? NO! • "The utility of a state is the expected utility of all the states that will follow it, when policy π is applied" • Transition probability T(s,a,s')

  14. Assigning Utility to States • Can we say V(s) = R(s)? • Vπ(s) is specific to each policy π • Vπ(s) = E( Σt γ^t R(st) | π, s0 = s ) • V(s) = Vπ*(s) • V(s) = R(s) + γ max_a Σ_s' T(s,a,s') V(s')   (Bellman equation) • If we solve the function V(s) for each state, we will also have solved the optimal π* for the given MDP

  15. Value Iteration Algorithm • We have to solve |S| simultaneous Bellman equations • We can't solve them directly, so we use an iterative approach: 1. Begin with arbitrary initial values V0 2. For each s, calculate V(s) from R(s) and V0 3. Use these new utility values to update V0 4. Repeat steps 2-3 until V0 converges • This equilibrium is a unique solution! (see R&N, p. 621, for the proof)
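
As an illustration of steps 1-4, here is a compact value-iteration sketch in Python; the two-state MDP below is entirely made up and is not the grid world from the slides:

    # Value iteration on a hypothetical 2-state, 2-action MDP (sketch).
    # T[s][a] maps next states to probabilities; R[s] is the state reward.
    T = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
         1: {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}}
    R = {0: 0.0, 1: 1.0}
    gamma = 0.9

    V = {s: 0.0 for s in T}                    # step 1: arbitrary initial values
    while True:                                # steps 2-4: repeated Bellman backups
        V_new = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in T[s][a].items())
                                       for a in T[s])
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < 1e-6:
            V = V_new
            break                              # converged (unique fixed point)
        V = V_new

    # Greedy policy extracted from the converged values:
    policy = {s: max(T[s], key=lambda a: sum(p * V[s2] for s2, p in T[s][a].items()))
              for s in T}
    print(V, policy)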

  16. Search space • T: S × A × S • Explicit enumeration of the combinations is often not feasible (cf. chess, Go) • Chunking within T • Problem: what if S is real-valued?

  17. MDP → POMDP • In an MDP the world is stochastic and Markovian, but: • the observation of that world is itself reliable; no further assumptions have to be made • Most 'real' problems involve: • noise in the observation itself • incompleteness of information

  18. MDP → POMDP • Most 'real' problems involve: • noise in the observation itself • incompleteness of information • In these cases the agent must be able to develop a system of "beliefs" on the basis of series of partial observations

  19. Partially Observable Markov Decision Processes (POMDPs) • A POMDP has: • States S • Actions A • Probabilistic transitions • Immediate Rewards on actions • A discount factor • +Observations Z • +Observation probabilities (reliabilities) • +An initial belief b0
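
Purely as an illustration, the ingredients listed on this slide could be collected in one container; the Python sketch below is ours and all field names are assumptions, not part of the original material:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class POMDP:
        states: List[str]
        actions: List[str]
        observations: List[str]                      # the added Observations Z
        T: Dict[Tuple[str, str, str], float]         # T[(s, a, s2)] = P(s2 | s, a)
        O: Dict[Tuple[str, str, str], float]         # O[(a, s2, z)] = P(z | s2, a)
        R: Dict[Tuple[str, str], float]              # immediate reward R(s, a)
        gamma: float                                 # discount factor
        b0: Dict[str, float]                         # initial belief over states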

  20. A POMDP example: The Tiger Problem

  21. The Tiger Problem • Description: • 2 states: Tiger_Left, Tiger_Right • 3 actions: Listen, Open_Left, Open_Right • 2 observations: Hear_Left, Hear_Right

  22. The Tiger Problem • Rewards are: • -1 for the Listen action • -100 for Open(x) when the tiger is at x • +10 for Open(x) when the tiger is not at x

  23. The Tiger Problem • Furthermore: • The Listen action does not change the state • The Open(x) action reveals what is behind door x and resets the trial, with the tiger again behind either door with 50% chance • The Listen action gives the correct information 85% of the time: p(hear_left | Listen, tiger_left) = 0.85, p(hear_right | Listen, tiger_left) = 0.15
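
The numbers from slides 22 and 23 can be written down as a small specification; the Python encoding below is ours and purely illustrative:

    # Tiger problem parameters as stated on the slides (encoding is ours).
    STATES = ["tiger_left", "tiger_right"]
    ACTIONS = ["listen", "open_left", "open_right"]
    OBSERVATIONS = ["hear_left", "hear_right"]

    # Immediate rewards R(state, action)
    R = {("tiger_left",  "listen"):       -1,
         ("tiger_right", "listen"):       -1,
         ("tiger_left",  "open_left"):  -100,   # opened the tiger's door
         ("tiger_right", "open_right"): -100,
         ("tiger_left",  "open_right"):   10,
         ("tiger_right", "open_left"):    10}

    # Observation model P(z | state, listen): listening is 85% reliable.
    P_OBS = {("tiger_left",  "hear_left"):  0.85,
             ("tiger_left",  "hear_right"): 0.15,
             ("tiger_right", "hear_right"): 0.85,
             ("tiger_right", "hear_left"):  0.15}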

  24. The Tiger Problem • Question: which policy gives the highest return in rewards? • Actions depend on beliefs! • If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5 × (-100) + 0.5 × 10 = -45 • Beliefs are updated with observations (which may be wrong)

  25. The Tiger Problem, horizon t=1 • Optimal policy:

  26. The Tiger Problem, horizon t=2 • Optimal policy:

  27. The Tiger Problem, horizon t=Inf • Optimal policy: • listen a few times • choose a door • next trial • listen1: Tigerleft (p=0.85), listen2: Tigerleft (p=0.96), listen3: ... (binomial) • Good news: the optimal policy can be learned if actions are followed by rewards!

  28. The Tiger Problem, belief updates on "Listen" • With listening reliability p = 0.85 and current belief bt = P(Tigerleft), hearing "left" gives the Bayesian update bt+1 = p·bt / ( p·bt + (1-p)·(1-bt) ) • Example: • initial: (Tigerleft) (p=0.5000), listen1: Tigerleft (p=0.8500), listen2: Tigerleft (p=0.9698), listen3: Tigerleft (p=0.9945), listen4: ... (Note: underlying binomial distribution)

  29. The Tiger Problem, belief updates on "Listen" • The same update, with p and (1-p) swapped when "right" is heard • Example 2, noise in the observation: • initial: (Tigerleft) (p=0.5000), listen1: Tigerleft (p=0.8500), listen2: Tigerleft (p=0.9698), listen3: Tigerright (p=0.8500), the belief drops... listen4: Tigerleft (p=0.9698), and recovers, listen5: ...
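
The sequences on slides 28 and 29 can be reproduced with a few lines of Python (ours); p_listen = 0.85 is the listening reliability from slide 23:

    # Belief update for P(tiger_left) after one Listen observation (sketch).
    def update_belief(b_left, heard_left, p_listen=0.85):
        like_left = p_listen if heard_left else 1.0 - p_listen    # P(z | tiger_left)
        like_right = 1.0 - like_left                              # P(z | tiger_right)
        return like_left * b_left / (like_left * b_left + like_right * (1.0 - b_left))

    b = 0.5
    for heard_left in [True, True, False, True]:   # the noisy sequence of slide 29
        b = update_belief(b, heard_left)
        print(round(b, 4))                         # 0.85, 0.9698, 0.85, 0.9698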

  30. Solving a POMDP • To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward E[ Σt γ^t rt ]

  31. The belief state • Instead of maintaining the complete action/observation history, we maintain a belief state b. • The belief is a probability distribution over the states. Dim(b) = |S|-1

  32. The belief space Here is a representation of the belief space when we have two states (s0,s1): a line segment (1-simplex)

  33. The belief space Here is a representation of the belief space when we have three states (s0,s1,s2): a triangle (2-simplex)

  34. The belief space Here is a representation of the belief space when we have four states (s0,s1,s2,s3): a tetrahedron (3-simplex)

  35. The belief space • The belief space is continuous but we only visit a countable number of belief points.

  36. The Bayesian update
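
The slide's formula itself is not in the transcript; what it refers to is the standard Bayesian belief update for a POMDP, which can be sketched as follows (Python, our own notation: T[(s,a,s2)] is the transition probability, O[(a,s2,z)] the observation probability):

    # Generic belief update b' = tau(b, a, z) (sketch, notation is ours).
    def bayesian_update(b, a, z, states, T, O):
        # Unnormalised posterior over next states.
        unnorm = {s2: O.get((a, s2, z), 0.0) *
                      sum(T.get((s, a, s2), 0.0) * b[s] for s in states)
                  for s2 in states}
        norm = sum(unnorm.values())          # equals P(z | b, a)
        return {s2: p / norm for s2, p in unnorm.items()}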

  37. Value Function in POMDPs • We will compute the value function over the belief space • Hard: the belief space is continuous! • But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex • We can represent any finite-horizon solution by a finite set of alpha-vectors • V(b) = max_α [ Σ_s α(s) b(s) ]

  38. Alpha-Vectors • They are a set of hyperplanes that define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.
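
A toy Python illustration (ours, with made-up alpha-vectors for a two-state problem) of taking the maximum over the hyperplanes, i.e. V(b) = max_α Σ_s α(s) b(s):

    # Value of a belief as the upper surface of a set of alpha-vectors (toy numbers).
    alpha_vectors = [(0.2, 1.0),    # hyperplane associated with one strategy
                     (0.9, 0.1)]    # hyperplane associated with another

    def value(b, alphas=alpha_vectors):
        # b = (b(s0), b(s1)); take the best dot product alpha . b
        return max(a0 * b[0] + a1 * b[1] for a0, a1 in alphas)

    print(value((0.5, 0.5)))   # 0.6: the first hyperplane dominates at this belief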

  39. Belief Transform • Assumptions: • a finite action set • a finite observation set • Next belief state = T(cbf, a, z), where cbf is the current belief state, a the action and z the observation • There is a finite number of possible next belief states

  40. PO-MDP into continuous CO-MDP • The process is Markovian; the next belief state depends on: • the current belief state • the current action • the observation • A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem in which the continuous space is the belief space

  41. Problem • Using VI in continuous state space. • No nice tabular representation as before.

  42. PWLC • Restrictions on the form of the solutions to the continuous-space CO-MDP: • The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length • The value of a belief point is simply the dot product of the belief vector and an alpha-vector • GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function

  43. Steps in Value Iteration (VI) • Represent the value function for each horizon as a set of vectors • This overcomes the problem of representing a value function over a continuous space • Find the vector that has the largest dot product with the belief state

  44. PO-MDP Value Iteration Example • Assumptions: two states (s1, s2), two actions (a1, a2), three observations • Example: horizon length is 1, belief b = [0.25 0.75] • Immediate rewards over (s1, s2): a1 → [1, 0], a2 → [0, 1.5] • V(a1,b) = 0.25x1 + 0.75x0 = 0.25 • V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125 • (figure: a1 is the best in one region of the belief space, a2 is the best in the other)
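
The horizon-1 values above can be checked directly (Python, ours; only the numbers from this slide are used):

    # Horizon-1 value of belief b = [0.25, 0.75] given slide 44's rewards.
    R = {"a1": [1.0, 0.0],     # reward over (s1, s2) for action a1
         "a2": [0.0, 1.5]}     # reward over (s1, s2) for action a2
    b = [0.25, 0.75]

    V = {a: sum(p * r for p, r in zip(b, rew)) for a, rew in R.items()}
    print(V)                   # {'a1': 0.25, 'a2': 1.125}
    print(max(V, key=V.get))   # a2 is the best action at this belief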

  45. PO-MDP Value Iteration Example • The value of a belief state for horizon length 2, given b, a1, z1: • the immediate reward plus the value of the next action • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1

  46. PO-MDP Value Iteration Example • Find the value for all the belief points given this fixed action and observation. • The Transformed value function is also PWLC.

  47. PO-MDP Value Iteration Example • How to compute the value of a belief state given only the action? • The horizon-2 value of the belief state is the value for each observation weighted by that observation's probability: • values per observation: z1: 0.8, z2: 0.7, z3: 1.2 • P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15 • 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835
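
The weighted sum can be reproduced in a couple of lines (Python, ours; the per-observation values and probabilities are taken from this slide):

    # Expected horizon-2 value under action a1, weighting each observation's
    # value by its probability P(z | b, a1).
    values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}
    p_obs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}
    print(sum(p_obs[z] * values[z] for z in values))   # 0.835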

  48. Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.

  49. Best Value For Belief States • The value of every single belief point is the sum of: • the immediate reward • the line segments from the S() functions for each observation's future strategy • Since adding lines gives you lines, the result is linear

  50. Best Strategy for any Belief Points • All the useful future strategies are easy to pick out:
