
Advances in Point-Based POMDP Solvers

Presentation Transcript


  1. Advances in Point-Based POMDP Solvers Guy Shani Ronen Brafman Solomon E. Shimony

  2. Overview • Agenda: • Introduce point-based POMDP solvers. • Overview recent advancements. • Structure: • Background – MDPs, POMDPs. • Point-based solvers – • Belief set selection. • Value function computation. • Experiments.

  3. Markov Decision Process - MDP • Models agents in a stochastic environment. • State – an encapsulation of all the relevant environment information: • Agent location • Can the agent eat monsters? • Monster locations • Gold coin locations • Actions – affect the environment: • Moving up, down, left, right • Stochastic effects – • Movement can sometimes fail • Monster movements are random • Reward – received for achieving goals • Collecting coins • Eating a monster

  4. MDP Formal Definition • Markov property – action effects depend only on the current state. • MDP is defined by the tuple <S,A,tr,R>. • S – state space • A – action set • tr – state transition function: tr(s,a,s’)=pr(s’|s,a) • R – reward function: R(s,a)
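A minimal sketch of how the <S,A,tr,R> tuple might be held in code, assuming states and actions are indexed by integers; the toy dynamics and reward below are illustrative, not taken from the slides:

```python
import numpy as np

# tr[a, s, s'] = pr(s' | s, a);  R[s, a] = immediate reward (illustrative shapes)
n_states, n_actions = 4, 2
tr = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_states, n_actions))

# Toy dynamics: action 0 advances to the next state with probability 0.8, else stays put.
for s in range(n_states):
    s_next = min(s + 1, n_states - 1)
    tr[0, s, s_next] += 0.8
    tr[0, s, s] += 0.2
tr[1] = np.eye(n_states)          # action 1: stay in place
R[n_states - 1, :] = 1.0          # reward for being in the last state
assert np.allclose(tr.sum(axis=2), 1.0)   # every tr(s,a,·) is a distribution
```

The later sketches reuse these array shapes.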

  5. Policies and Value Functions • Policy – specifies an action for each state. • Optimal policy – maximizes the collected rewards under some criterion: • Sum: ∑_t r_t • Average: lim_{T→∞} (1/T) ∑_{t<T} r_t • Discounted sum: ∑_t γ^t r_t, 0 ≤ γ < 1 • Value function – assigns a value to each state

  6. Value Iteration (Bellman 1957) • Dynamic programming method. • Value is updated from reward states backwards. • Update is known as a backup.

  7. Value Iteration (Bellman 1957) • Initialize – V_0(s) = 0, n = 0 • While V has not converged: • n = n + 1 • For each s: V_{n+1}(s) = max_a [ R(s,a) + γ ∑_{s’} tr(s,a,s’) V_n(s’) ] – the Bellman update. • Known to converge to V* – the optimal value function. • π* – the optimal policy – corresponds to the optimal value function.
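A minimal value-iteration sketch over arrays shaped as in the earlier MDP sketch (tr[a,s,s'], R[s,a]); the discount factor and convergence threshold are illustrative choices:

```python
import numpy as np

def value_iteration(tr, R, gamma=0.95, eps=1e-6):
    n_actions, n_states, _ = tr.shape
    V = np.zeros(n_states)                           # V_0(s) = 0
    while True:
        # Bellman update: Q(s,a) = R(s,a) + gamma * sum_s' tr(s,a,s') V(s')
        Q = R + gamma * np.einsum('ast,t->sa', tr, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:          # "V has not converged" test
            return V_new, Q.argmax(axis=1)           # V* and the greedy policy pi*
        V = V_new
```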

  8. Policy Iteration (Howard 1960) • Intuition – we care about policies, not about value functions. • Changes in the value function may not affect the policy. • Expectation-Maximization. • Expectation – fix the policy and compute its value. • Maximization – change the policy to maximize the values.
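A minimal policy-iteration sketch in the same spirit: the expectation step evaluates the fixed policy exactly by solving a linear system, the maximization step improves it greedily. Array shapes follow the earlier MDP sketch; gamma is an illustrative choice:

```python
import numpy as np

def policy_iteration(tr, R, gamma=0.95):
    n_actions, n_states, _ = tr.shape
    pi = np.zeros(n_states, dtype=int)               # start from an arbitrary policy
    while True:
        # Expectation: solve V = r_pi + gamma * T_pi V for the current policy
        T_pi = tr[pi, np.arange(n_states)]           # T_pi[s, s'] = tr(s, pi(s), s')
        r_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, r_pi)
        # Maximization: greedy improvement with respect to V
        Q = R + gamma * np.einsum('ast,t->sa', tr, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):               # policy unchanged => done
            return V, pi
        pi = pi_new
```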

  9. Partial Observability • Real agents cannot directly observe the state. • Sensors – provide partial and noisy information about the world.

  10. Partially Observable MDP - POMDP • The environment is Markovian. • The agent cannot directly view the state. • Sensors give observations over the current state. • Formal POMDP model: • <S, A, tr, R> – an MDP (the environment) • Ω – set of possible observations • O(a,s,o) – observation probability given action and state – pr(o|a,s).

  11. Value of Information • POMDPs capture the value of information. • Example – we don’t know where the larger reward is – should we go and read the map? • Answer – it depends on: • The difference between the rewards. • The cost of reading the map. • The accuracy of the map. • POMDPs take all such considerations into account and provide an optimal policy.

  12. Belief States • The agent does not directly observe the environment state. • Due to noisy and insufficient information, the agent has a belief over the current world state. • b(s) is the probability of being at state s. • τ(b,a,o) - a deterministic function computing the next belief state given action a and observation o. • The agent knows its initial belief state – b0
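A sketch of the belief update τ(b,a,o), assuming the transition model is stored as tr[a,s,s'] and the observation model as O[a,s',o] = pr(o | a,s'); names and shapes are carried over from the earlier sketches:

```python
import numpy as np

def tau(b, a, o, tr, O):
    """Next belief: b'(s') proportional to O(a,s',o) * sum_s tr(s,a,s') b(s)."""
    b_next = O[a, :, o] * (tr[a].T @ b)   # unnormalized, elementwise over s'
    norm = b_next.sum()                   # equals pr(o | b, a)
    if norm == 0:
        raise ValueError("observation o has zero probability under (b, a)")
    return b_next / norm
```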

  13. Value Function (Sondik 1973) • A value function V assigns a value to a belief state b. • V* – the optimal value function. • V*(b) – the expected reward if the agent behaves optimally starting from belief state b. • V is traditionally represented as a set of α-vectors. • V(b) = max_α α·b (the upper envelope), where α·b = ∑_s α(s)b(s). [Figure: two α-vectors α0 and α1 over states s0, s1; the belief b = <0.4, 0.6> is evaluated against the upper envelope.]
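A small sketch of evaluating this representation, with the α-vectors stacked in a 2-D array (one row per vector):

```python
import numpy as np

def value_at(b, alpha_vectors):
    """V(b) = max_alpha alpha . b, plus the maximizing alpha-vector."""
    values = alpha_vectors @ b            # one dot product per alpha-vector
    return values.max(), alpha_vectors[values.argmax()]
```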

  14. Exact Value Iteration • Creates a new set of α-vectors. • Exponential explosion of vectors. • Dominated vectors can be pruned. (Littman et al. 1997) • Pruning process is time consuming.

  15. Point-Based Backups (Pineau et al. 2001) • Bellman update (backup): • V_{n+1}(b) = max_a [ r_a·b + γ ∑_o pr(o|b,a) V_n(τ(b,a,o)) ] • Can be written using vector notation: • backup(b) = argmax_{g_{b,a} : a∈A} g_{b,a}·b • g_{b,a} = r_a + γ ∑_o argmax_{g_{α,a,o} : α∈V} g_{α,a,o}·b • g_{α,a,o}(s) = ∑_{s’} O(a,s’,o) tr(s,a,s’) α(s’) • Computes a new α-vector optimal for a specific input belief point b. • Known as a point-based backup.
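A sketch of this point-based backup, with the same array conventions as the earlier sketches (tr[a,s,s'], O[a,s',o], R[s,a]) and the current value function given as a stack of α-vectors:

```python
import numpy as np

def point_based_backup(b, alpha_vectors, tr, O, R, gamma=0.95):
    """Return a new alpha-vector that is optimal at the belief point b."""
    n_actions = tr.shape[0]
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        g_ba = R[:, a].copy()                            # r_a as a vector over states
        for o in range(n_obs):
            # g_{alpha,a,o}(s) = sum_s' O(a,s',o) tr(s,a,s') alpha(s'), one column per alpha
            g_aos = (tr[a] * O[a, :, o][None, :]) @ alpha_vectors.T
            g_ba += gamma * g_aos[:, (g_aos.T @ b).argmax()]   # argmax_alpha g . b
        val = g_ba @ b
        if val > best_val:
            best_vec, best_val = g_ba, val               # argmax_a g_{b,a} . b
    return best_vec
```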

  16. Point-based Solvers • Compute a value function V over a subset B of the belief space. • Usually only reachable belief points are used. • Use α-vectors to represent V. • Assumption: an optimal value function over B will generalize well to other, unobserved belief points. • Advantage – each vector must maximize some b in B. Dominated vectors are pruned implicitly.

  17. Variations of Point-Based Solvers • A number of algorithms were suggested: • PBVI (Pineau et al. 2001) • Perseus (Spaan and Vlassis 2003) • HSVI (Smith and Simmons 2005) • PVI (Shani et al. 2006) • FSVI (Shani et al. 2007) • SCVI (Virin et al. 2007) • Differences between algorithms: • Selection of B – fixed/expanding set, traversal/distance. • Computation of V – which points are updated, and in what order backups are executed.

  18. Belief Set Selection • Option 1 – expanding belief set • PBVI [Pineau et al. 2001] • B_0 = {b0} – the initial belief state • B_{n+1} – for each b in B_n, add an immediate successor b’ = τ(b,a,o) s.t. dist(B_n, b’) is maximal. • Assumption – in the limit, B will include all reachable belief states and therefore V will converge to an optimal value function.
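A sketch of this expansion step, assuming the τ helper from the earlier belief-update sketch is in scope; the L1 distance used for dist(B, b') is an illustrative choice:

```python
import numpy as np

def expand_belief_set(B, tr, O):
    """For each b in B, add the immediate successor farthest from the current set."""
    n_actions, n_obs = tr.shape[0], O.shape[2]
    new_points = []
    for b in B:
        candidates = []
        for a in range(n_actions):
            for o in range(n_obs):
                pr_o = (O[a, :, o] * (tr[a].T @ b)).sum()    # pr(o | b, a)
                if pr_o > 0:
                    candidates.append(tau(b, a, o, tr, O))
        dist = lambda bp: min(np.abs(bp - b2).sum() for b2 in B + new_points)
        new_points.append(max(candidates, key=dist))          # maximal dist(B, b')
    return B + new_points
```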

  19. [Figure: expanding the belief set – the current set B, candidate successor points, the initial belief b0, and the goal.]

  20. Belief Set Selection • Option 2 – Random walk • Perseus [Spaan & Vlassis 2004] • Run a number of trials beginning at b0. • n is the trial length: • for i = 0 to n • a_i = random action • o_i = random observation • b_{i+1} = τ(b_i, a_i, o_i) • B is the set of all observed belief states. • Assumption – a sufficiently long exploration will visit all “important” belief points. • Disadvantage – may add many “irrelevant” belief points.
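A sketch of collecting B by random walk, assuming the τ helper from the earlier sketch; observations are sampled from pr(o|b,a), and the trial length is an illustrative parameter:

```python
import numpy as np

def collect_beliefs_random_walk(b0, tr, O, n=100, seed=0):
    rng = np.random.default_rng(seed)
    n_actions, n_obs = tr.shape[0], O.shape[2]
    B, b = [b0], b0
    for _ in range(n):
        a = rng.integers(n_actions)                    # random action
        pr_o = O[a].T @ (tr[a].T @ b)                  # pr(o | b, a) for every o
        o = rng.choice(n_obs, p=pr_o / pr_o.sum())     # sample the observation
        b = tau(b, a, o, tr, O)
        B.append(b)
    return B
```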

  21. [Figure: the belief set B collected by random walks, shown relative to the goal.]

  22. Belief Set Selection • Option 3 – Heuristic Exploration • Run a number of trials beginning at b0. • while the stopping criterion has not been reached: • a_i = choose action • o_i = choose observation • b_{i+1} = τ(b_i, a_i, o_i) • i++ • HSVI [Smith & Simmons 2005] – • Maintains a lower bound and an upper bound on V*. • Chooses the best a according to the upper bound. • Chooses o such that b_{i+1} has the largest gap between the bounds.
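A sketch of a single HSVI-style exploration step. The bounds are passed in as callables V_upper(b) and V_lower(b), an interface assumed here for illustration rather than the actual HSVI bound representation; the τ helper from the earlier sketch is assumed to be in scope:

```python
import numpy as np

def hsvi_explore_step(b, tr, O, R, V_upper, V_lower, gamma=0.95):
    n_actions, n_obs = tr.shape[0], O.shape[2]

    def obs_probs(a):
        return O[a].T @ (tr[a].T @ b)                  # pr(o | b, a) for every o

    # Choose the action greedily with respect to the upper bound.
    def q_upper(a):
        pr_o = obs_probs(a)
        return R[:, a] @ b + gamma * sum(
            p * V_upper(tau(b, a, o, tr, O)) for o, p in enumerate(pr_o) if p > 0)
    a = max(range(n_actions), key=q_upper)

    # Choose the observation whose successor has the largest weighted bound gap.
    pr_o = obs_probs(a)
    def gap(o):
        b_next = tau(b, a, o, tr, O)
        return pr_o[o] * (V_upper(b_next) - V_lower(b_next))
    o = max((o for o in range(n_obs) if pr_o[o] > 0), key=gap)
    return a, o, tau(b, a, o, tr, O)
```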

  23. Forward Search Value Iteration[Shani et al. 2007] • A POMDP agent cannot directly obtain the environment state. • In simulation we may assume that the environment state is available. • Idea – use the simulated environment state to guide exploration in belief space.

  24. Forward Search Value Iteration [Shani, Brafman, Shimony 2007] • a_i* ← best action for s_i • s_{i+1} ← choose from tr(s_i, a_i*, ·) • o_i ← choose from O(a_i*, s_{i+1}, ·) • b_{i+1} ← τ(b_i, a_i*, o_i) [Figure: a trial advances through the MDP state space (s0, s1, s2, s3) and, in parallel, through the POMDP belief space (b0, b1, b2, b3), linked by the chosen actions a_i and observations o_i.]
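A sketch of one such trial: the simulated MDP state drives action selection, while the belief sequence is recorded for the subsequent backups. Assumes the τ helper and an MDP policy pi_mdp (e.g. from the value_iteration sketch); the trial length is illustrative:

```python
import numpy as np

def fsvi_trial(b0, s0, pi_mdp, tr, O, n=50, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_obs = tr.shape[1], O.shape[2]
    beliefs, b, s = [b0], b0, s0
    for _ in range(n):
        a = pi_mdp[s]                            # best MDP action for s_i
        s = rng.choice(n_states, p=tr[a, s])     # s_{i+1} ~ tr(s_i, a, .)
        o = rng.choice(n_obs, p=O[a, s])         # o_i ~ O(a, s_{i+1}, .)
        b = tau(b, a, o, tr, O)                  # b_{i+1} = tau(b_i, a, o_i)
        beliefs.append(b)
    return beliefs    # point-based backups are then run over this list in reverse
```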

  25. Value Function Update • PBVI – An α-vector for each belief point in B. Arbitrary order of update. • Perseus – Randomly select next point to update from the points which were not yet improved. • HSVI+FSVI – Over each belief state traversal, execute backups in reversed order.

  26. PBVI • Many vectors may not participate in the upper envelope. • All points are updated before a point can be updated twice (synchronous update). • It is possible for a successor of a point to be updated after that point, causing slow update of values. [Figure: a chain of belief points b0 … b6 updated synchronously.]

  27. HSVI & FSVI • Advantage – backups exploit previous backups on successors. [Figure: backups executed in reversed order along the traversal b0 … b6.]

  28. Perseus • Advantage – • Small number of vectors in each iteration. • All points are improved, but not all are updated. • Disadvantage – • May choose points that are only slightly improved and avoid points that could be highly improved. [Figure: Perseus updates a random subset of the points b0 … b6 in each iteration.]

  29. Backup Selection • Perseus generates good value functions. • Can we accelerate the convergence of V? • Idea – choose points to backup smartly so that value function improves considerably after each backup.

  30. Prioritizing Backups • Update a point b where the Bellman error e(b) = HV(b) - V(b) is maximal. • Well known in MDPs. • Problem – unlike MDPs, after improving b, updating the error for all other points is difficult: • The list of predecessors of b cannot be computed. • A new α-vector may improve the value for more than a single belief point. • Solution – recompute the error over a sampled subset of B. Select the point with maximal error from that set.
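A sketch of that selection rule: estimate HV(b) with a point-based backup on a sampled subset of B and return the point with the largest Bellman error. Assumes the point_based_backup sketch from earlier; the sample size is an illustrative parameter:

```python
import numpy as np

def select_point_by_bellman_error(B, alpha_vectors, tr, O, R,
                                  gamma=0.95, sample_size=20, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(B), size=min(sample_size, len(B)), replace=False)

    def bellman_error(i):
        b = B[i]
        hv = point_based_backup(b, alpha_vectors, tr, O, R, gamma) @ b   # HV(b)
        v = (alpha_vectors @ b).max()                                    # current V(b)
        return hv - v

    return B[max(idx, key=bellman_error)]    # point with maximal e(b) in the sample
```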

  31. PVI [Shani, Brafman, Shimony 2006] • Advantage – • All backups result in value function improvement. • Backups are optimal locally. • Disadvantage – • HV(B) computations are expensive. [Figure: HV is evaluated over the belief points b0 … b6 to pick the next point to back up.]

  32. Clustered Value Iteration [Virin, Shani, Shimony, Brafman, 2007] • Compute a clustering of the belief space. • Iterate over the clusters and back up only belief points from the current cluster. • Clusters are built such that a point is usually updated after its successors.

  33. Value Directed Clustering • Compute the MDP optimal value function. • Cluster MDP states by their MDP value. • Define a soft clustering over belief space: pr(b ∈ c) = ∑_{s∈c} b(s). • Iterate over the clusters by decreasing cluster value: V(c) = (1/|c|) ∑_{s∈c} V(s). • Update all belief points b such that pr(b ∈ c) exceeds a threshold.
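A sketch of the value-directed clustering idea: order states by their MDP value, split them into clusters, and test cluster membership of a belief point by its probability mass in the cluster. The number of clusters, the equal-size split, and the membership threshold are illustrative choices; value_iteration is the earlier MDP sketch:

```python
import numpy as np

def value_directed_clusters(tr, R, n_clusters=3, gamma=0.95):
    V_mdp, _ = value_iteration(tr, R, gamma)
    order = np.argsort(V_mdp)                         # states sorted by MDP value
    clusters = np.array_split(order, n_clusters)      # contiguous groups of states
    clusters.sort(key=lambda c: -V_mdp[c].mean())     # decreasing cluster value V(c)
    return clusters

def points_in_cluster(B, cluster, threshold=0.3):
    # belief points with pr(b in c) = sum_{s in c} b(s) above the threshold
    return [b for b in B if b[cluster].sum() > threshold]
```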

  34. Results – example domains

  35. Experimental Results – CPU Time: HSVI vs. SCVI

  36. Experimental Results – CPU Time: HSVI vs. FSVI

  37. Summary • Point-based solvers are able to scale up to POMDPs with millions of states. • Algorithms differ in the selection of the belief points and the order of backups. • A smart order of backups can be computed using prioritization and clustering. • Trial-based algorithms are an alternative. FSVI is the fastest algorithm of this family.
