
Logistics






Presentation Transcript


1. Logistics
• Reading for Mon
• No class Wed 11/26
(c) 2002-3, C. Boutilier, E. Hansen, D. Weld

2. Classical AI Planning vs. Operations Research

  Classical AI planning           | Operations Research
  --------------------------------|--------------------------
  No uncertainty                  | Uncertainty
  Achieve goals                   | Maximize utility
  Knowledge-based representation  | Markov decision process
  Heuristic search                | Dynamic programming

  Their intersection is an active research area.

3. Review
• MDPs
• Bayesian networks
• DBNs
• Factored MDPs
• BDDs & ADDs

4. Markov Decision Processes
• S = set of states (|S| = n)
• A = set of actions (|A| = m)
• Pr = transition function Pr(s,a,s')
  – represented by a set of m n×n stochastic matrices
  – each row of a matrix defines a distribution over S
• R(s) = bounded, real-valued reward function
  – represented by an n-vector
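A minimal sketch of this flat representation in Python; the 3-state, 2-action domain and all numbers are illustrative, not from the slides:

```python
import numpy as np

# Flat MDP representation, as on the slide:
# m stochastic n x n matrices Pr, and an n-vector of rewards R.
n, m = 3, 2
Pr = np.zeros((m, n, n))
Pr[0] = [[0.9, 0.1, 0.0],   # action 0: row s gives Pr(s' | s, a=0)
         [0.0, 0.8, 0.2],
         [0.0, 0.0, 1.0]]
Pr[1] = [[0.5, 0.5, 0.0],   # action 1
         [0.5, 0.0, 0.5],
         [0.0, 0.5, 0.5]]
R = np.array([0.0, 1.0, 10.0])  # reward attached to each state

assert np.allclose(Pr.sum(axis=2), 1.0)  # every row is a distribution over S
```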

5. Planning
• Plan? Objective?
• Policy? Objective?

6. Dynamic Programming (DP)
• Value iteration [Bellman, 1957]: initial value function → DP improves the value function → ε-optimal value function
• Policy iteration [Howard, 1960]: initial policy → evaluate policy → DP improves the policy → ε-optimal policy
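A compact value-iteration loop over the flat representation above; the discount factor gamma and tolerance eps are assumptions, since the slide does not fix the optimality criterion:

```python
import numpy as np

def value_iteration(Pr, R, gamma=0.95, eps=1e-6):
    """Repeat Bellman backups until the value function changes by < eps.
    Pr: (m, n, n) stochastic matrices; R: (n,) reward vector."""
    n = len(R)
    V = np.zeros(n)                      # initial value function
    while True:
        # Q[a, s] = R(s) + gamma * sum_s' Pr(s,a,s') V(s')
        Q = R + gamma * (Pr @ V)         # shape (m, n)
        V_new = Q.max(axis=0)            # DP improves the value function
        if np.abs(V_new - V).max() < eps:
            return V_new, Q.argmax(axis=0)   # eps-optimal V, greedy policy
        V = V_new
```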

7. Bellman’s Curse of Dimensionality

8. Bayes Nets: Compact Representation of a Joint Probability Distribution
• Network: Burglary → Alarm ← Earthquake; Earthquake → Radio; Alarm → Nbr1Calls, Nbr2Calls
• Pr(B=t) = 0.05, Pr(B=f) = 0.95
• Pr(A=t|E,B):  e,b: 0.9   e,¬b: 0.2   ¬e,b: 0.85   ¬e,¬b: 0.01   (Pr(A=f) in parentheses on the slide: 0.1, 0.8, 0.15, 0.99)
• The joint then factors as Pr(E,B,R,A,N1,N2) = Pr(E)·Pr(B)·Pr(R|E)·Pr(A|E,B)·Pr(N1|A)·Pr(N2|A)
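A sketch of that factored computation. Only Pr(B) and Pr(A|E,B) come from the slide; the other numbers are made-up placeholders:

```python
# Each table stores Pr(node = True | parents); bern() converts to Pr(node = x).
def bern(p_true, x):
    return p_true if x else 1.0 - p_true

p_B = 0.05                                    # from the slide
p_E = 0.01                                    # placeholder assumption
p_R_given_E = {True: 0.8, False: 0.001}       # placeholder assumptions
p_A_given_EB = {(True, True): 0.9, (True, False): 0.2,
                (False, True): 0.85, (False, False): 0.01}  # from the slide
p_N_given_A = {True: 0.9, False: 0.05}        # placeholder (both neighbors)

def joint(e, b, r, a, n1, n2):
    """Pr(E,B,R,A,N1,N2) = Pr(E) Pr(B) Pr(R|E) Pr(A|E,B) Pr(N1|A) Pr(N2|A)."""
    return (bern(p_E, e) * bern(p_B, b) * bern(p_R_given_E[e], r)
            * bern(p_A_given_EB[(e, b)], a)
            * bern(p_N_given_A[a], n1) * bern(p_N_given_A[a], n2))

print(joint(False, True, False, True, True, False))  # one full assignment
```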

9. DBN Representation: DelC
• Persistence CPTs:
    fT(Tt,Tt+1):      Tt=T → Pr(Tt+1=T) = 0.91;  Tt=F → Pr(Tt+1=T) = 0.0
    fRHM(RHMt,RHMt+1): RHMt=T → Pr(RHMt+1=T) = 1.0;  RHMt=F → Pr(RHMt+1=T) = 0.0
• fCR(Lt,CRt,RHCt,CRt+1), giving Pr(CRt+1=T) (Pr(CRt+1=F) in parentheses):
    L=O, CR=T, RHC=T: 0.2 (0.8)     L=E, CR=T, RHC=T: 1.0 (0.0)
    L=O, CR=F, RHC=T: 0.0 (1.0)     L=E, CR=F, RHC=T: 0.0 (1.0)
    L=O, CR=T, RHC=F: 1.0 (0.0)     L=E, CR=T, RHC=F: 1.0 (0.0)
    L=O, CR=F, RHC=F: 0.0 (1.0)     L=E, CR=F, RHC=F: 0.0 (1.0)
  (i.e., an outstanding coffee request is satisfied with probability 0.8 only when the robot is in the office holding coffee; otherwise it persists)

10. Benefits of DBN Representation
• Only 48 parameters vs. 25,440 for the explicit transition matrix (160 states × 159 free entries per row, since each row sums to 1)
[Figure: the DelC DBN over RHM, M, T, L, CR, RHC alongside a fragment of the full 160×160 matrix]

11. OBDD
• Example: (x3 ∧ x2) ∨ ¬x1
[Figure: the full binary decision tree vs. the reduced OBDD for this function; the OBDD shares subgraphs and is therefore much smaller]
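A minimal sketch of an OBDD node and its evaluation for the slide's example, using variable order x3, x2, x1 (this shows the shared structure, not a full reduction algorithm):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    var: str          # variable tested at this node
    low: "BDD"        # branch when var = False
    high: "BDD"       # branch when var = True

BDD = Union[Node, bool]   # leaves are the constants True/False

def evaluate(node: BDD, assignment: dict) -> bool:
    """Follow one root-to-terminal path."""
    while isinstance(node, Node):
        node = node.high if assignment[node.var] else node.low
    return node

# OBDD for (x3 and x2) or (not x1).  The x1 node is shared by both
# branches of the root: that sharing is exactly what makes the OBDD
# smaller than the binary decision tree.
not_x1 = Node("x1", low=True, high=False)
x2_node = Node("x2", low=not_x1, high=True)
root = Node("x3", low=not_x1, high=x2_node)

assert evaluate(root, {"x3": True, "x2": True, "x1": True}) is True
assert evaluate(root, {"x3": False, "x2": False, "x1": True}) is False
```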

12. Action Representation – DBN/ADD
• Algebraic Decision Diagram (ADD): a decision diagram over state variables whose leaves are numbers rather than truth values
[Figure: the ADD for fCR(Lt,CRt,RHCt,CRt+1), testing CR, RHC, and L and terminating in the leaves 0.0, 1.0, 0.2, 0.8]

13. Today – Solving the Curse
• Abstraction
• Approximation
• Reachability

14. Structured Computation
• Given a compact representation, can we solve an MDP without explicit state space enumeration?
• Can we avoid O(|S|) computations by exploiting regularities made explicit by propositional or first-order representations?
• Two general schemes:
  – abstraction/aggregation
  – decomposition

15. State Space Abstraction
• General method: state aggregation
  – group states, treat each aggregate as a single state
  – commonly used in OR [SchPutKin85, BertCast89]
  – can be viewed as automata minimization [DeanGivan96]
• Abstraction is a specific aggregation technique
  – aggregate by ignoring details (features)
  – ideally, focus on relevant features

16. Dimensions of Abstraction
[Figure: abstractions classified along three dimensions – uniform vs. nonuniform, exact vs. approximate, fixed vs. adaptive – illustrated by aggregations of states over variables A, B, C; exact clusters share identical values (e.g., 5.3, 2.9, 9.3), approximate clusters merge nearby values (e.g., 5.2–5.5, 2.7–2.9, 9.0–9.3)]

17. A Fixed, Uniform, Approximate Abstraction Method
• Uniformly delete features from the domain [BD94/AIJ97]
• Ignore features based on degree of relevance
  – the representation is used to determine a feature's importance to solution quality
• Allows a tradeoff between abstract MDP size and solution quality
[Figure: aggregation that ignores one variable (C), collapsing pairs of states and their transition probabilities, e.g. 0.5/0.8 and 0.5/0.2]

18. Immediately Relevant Variables
• Rewards are determined by particular variables
  – their impact on reward is clear from the STRIPS/ADD representation of R
  – e.g., the difference between CR/¬CR states is 10, between T/¬T states is 3, and between MW/¬MW states is 5
• Approximate MDP: focus on the "important" goals
  – e.g., we might plan only for CR
  – we call CR an immediately relevant (IR) variable
  – generally, the IR-set is a subset of the reward variables

19. Relevant Variables
• We want to control the IR variables
  – so we must know which actions influence them, and under what conditions
• A variable is relevant if, in the DBN for some action a, it is a parent of some relevant variable
  – grounded (fixed-point) definition: the IR variables themselves are relevant
  – an analogous definition works for PSTRIPS
  – e.g., CR is (directly or indirectly) influenced by L, RHC, CR
• A simple "backchaining" algorithm constructs the set (see the sketch below)
  – linear in the size of the domain description and the number of relevant variables
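A sketch of that backchaining fixed point, assuming each action's DBN is given as a map from each time-t+1 variable to its time-t parent set (the dictionary layout is hypothetical):

```python
def relevant_variables(ir_vars, dbn_parents):
    """Start from the immediately relevant (reward) variables and add any
    variable that is a DBN parent, under some action, of a variable
    already known to be relevant, until nothing changes.

    dbn_parents[action][var] = set of time-t parents of var at time t+1."""
    relevant = set(ir_vars)
    changed = True
    while changed:
        changed = False
        for parents_of in dbn_parents.values():
            for var in list(relevant):
                new = parents_of.get(var, set()) - relevant
                if new:
                    relevant |= new
                    changed = True
    return relevant

# Coffee-robot fragment from the slides: CR's parents under DelC are L, RHC, CR.
dbn = {"DelC": {"CR": {"L", "RHC", "CR"}, "RHM": {"RHM"}, "T": {"T"}}}
print(relevant_variables({"CR"}, dbn))   # {'CR', 'L', 'RHC'} (order may vary)
```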

20. Constructing an Abstract MDP
• Simply delete all irrelevant atoms from the domain
• State space S': the set of assignments to the relevant variables
• Transitions: Pr(s',a,t') = Σ_{t ∈ t'} Pr(s,a,t) for any s ∈ s'
  – the construction ensures this sum is identical for every s ∈ s'
• Reward: R(s') = (max{R(s) : s ∈ s'} + min{R(s) : s ∈ s'}) / 2
  – the midpoint gives tight error bounds
• Constructing a DBN/PSTRIPS representation with these properties involves little more than simplifying action descriptions by deletion
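A flat-state illustration of these definitions; in practice the whole point is to do this at the level of DBN/PSTRIPS descriptions without enumerating states, so the enumeration here is only to make the construction concrete:

```python
import numpy as np
from collections import defaultdict

def abstract_mdp(Pr, R, block_of):
    """Aggregate states into abstract states.  block_of[s] = block index of s.
    Assumes the aggregation is uniform: every state in a block has the same
    block-to-block transition probabilities (as the FUA construction
    guarantees), so we can read them off any representative."""
    m, n, _ = Pr.shape
    k = max(block_of) + 1
    members = defaultdict(list)
    for s, b in enumerate(block_of):
        members[b].append(s)
    Pr_abs, R_abs = np.zeros((m, k, k)), np.zeros(k)
    for b, states in members.items():
        rep = states[0]                        # any s in the block works
        for a in range(m):
            for t, states_t in members.items():
                Pr_abs[a, b, t] = Pr[a, rep, states_t].sum()
        rs = R[states]
        R_abs[b] = (rs.max() + rs.min()) / 2   # midpoint reward
    return Pr_abs, R_abs
```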

21. Example
• Abstract MDP over only 3 variables (L, CR, RHC)
  – 20 states instead of 160
  – some actions become identical, so the action space is simplified
  – reward distinguishes only CR and ¬CR (but "averages" the penalties for MW and ¬T)
[Figure: the reduced DelC DBN over Lt, CRt, RHCt, and the abstract reward]

22. Solving the Abstract MDP
• The abstract MDP can be solved using standard methods
• Error bounds on policy quality are derivable:
  – let δ be the max reward span over abstract states
  – let V' be the optimal value function for the abstract MDP M', V* for the original M
  – let π' be the optimal policy for M' and π* for the original M
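The slide sets up these quantities but the bound itself did not survive in this transcript; for a discounted criterion with factor γ, standard reward-perturbation arguments give bounds of the following shape (a reconstruction under that assumption, not the slide's verbatim statement):

```latex
% Midpoint rewards err by at most \delta/2 per state, so
\|V' - V^*\|_\infty \;\le\; \frac{\delta}{2(1-\gamma)},
\qquad
V^*(s) - V^{\pi'}(s) \;\le\; \frac{\delta}{1-\gamma} \quad \forall s,
% where V^{\pi'} is the value, in the original MDP M, of executing
% the abstract policy \pi'.
```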

23. FUA Abstraction: Relative Merits
• FUA is easily computed (fixed polynomial cost)
• FUA prioritizes objectives nicely
  – a priori error bounds are computable (anytime tradeoffs)
  – can be refined online (heuristic search) [DeaBou97]
• FUA is inflexible
  – can't capture conditional relevance
  – approximate (we may want an exact solution)
  – can't be adjusted during computation
  – may ignore the only achievable objectives

24. Dimensions of Abstraction (revisited)
[Figure: the same uniform/nonuniform, exact/approximate, fixed/adaptive chart as slide 16]

25. Constructing Abstract MDPs
• There are many ways to abstract an MDP
  – the methods here exploit the logical representation
• Abstraction can be viewed as a form of automaton minimization
  – but general minimization schemes require state space enumeration
• Instead, exploit the logical structure of the domain (states, actions, rewards) to construct logical descriptions of the abstract states, avoiding state enumeration

26. Decision-Theoretic Regression
• Abstraction based on an analog of goal regression
  – as abstraction: dynamic, nonuniform, exact or approximate
  – exploits the logical representation of the MDP
• Overview:
  – value iteration as variable elimination
  – propositional decision-theoretic regression
  – approximate decision-theoretic regression
  – first-order decision-theoretic regression

27. Classical Regression
• Goal regression is a classical abstraction method
• The regression of a logical condition/formula G through an action a is the weakest logical formula C = Regr(G,a) such that: if C is true before doing a, then G is guaranteed to be true after doing a
• That is, C is the weakest precondition for G wrt a
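For STRIPS actions the regression step is easy to state concretely. A minimal sketch, treating goals and effects as sets of positive atoms (this is the classical construction, not the decision-theoretic version developed on the next slides):

```python
def regress(goal, pre, add, delete):
    """Weakest precondition of a conjunctive goal through a STRIPS action.
    goal, pre, add, delete: sets of atoms (positive literals only, for brevity).
    Returns None if the action deletes part of the goal (cannot achieve it)."""
    if goal & delete:
        return None                  # the action clobbers part of G
    return (goal - add) | pre        # what a doesn't add must already hold,
                                     # plus a's preconditions

# Hypothetical coffee-domain flavor: DelC achieves HCU when the robot is in
# the office holding coffee (atom names are illustrative).
print(regress(goal={"HCU"}, pre={"Office", "RHC"}, add={"HCU"}, delete={"RHC"}))
# {'Office', 'RHC'} (order may vary)
```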

28. Example: Regression in SitCalc
• In the situation calculus, Regr(G(do(a,s))) is the logical condition C(s) under which a leads to G (it aggregates the C states and the ¬C states)
• Regression in sitcalc is straightforward:
  – Regr(F(x̄, do(a,s))) ≡ Φ_F(x̄, a, s)   (the right-hand side of F's successor-state axiom)
  – Regr(¬φ1) ≡ ¬Regr(φ1)
  – Regr(φ1 ∧ φ2) ≡ Regr(φ1) ∧ Regr(φ2)
  – Regr(∃x.φ1) ≡ ∃x.Regr(φ1)

29. Decision-Theoretic Regression
• In MDPs we don't have goals, but regions of distinct value
• Decision-theoretic analog: given a "logical description" of Vt+1, produce such a description of Vt, or of the optimal policy (e.g., using ADDs)
• Cluster together states, at any point in the calculation, that have the same best action (policy) or the same value (VF)

30. Decision-Theoretic Regression
• Decision-theoretic complications:
  – multiple formulae G describe the fixed value partitions
  – an action a can lead to multiple partitions (stochastically)
[Figure: regressing the value partitions G1, G2, G3 of Vt through a, with branch probabilities p1, p2, p3, yields a partition C1 of Qt+1(a)]

31. Functional View of DTR
• Generally, Vt+1 depends on only a subset of the variables at time t (usually in a structured way)
• What is the value of action a at time t (at any state s)?
[Figure: the DelC DBN with the functions fRm, fM, fT, fL, fCr, fRc attached to their arcs, and Vt+1 shown as a tree over CR and M with leaves -10 and 0]

32. Bellman Backup (Regression)
[Figure: a Bellman backup performed directly on ADDs: the reward ADD (leaves 0, -10) is combined with the action's CPT ADDs (e.g., fCR with leaves 0.0, 1.0, 0.2, 0.8) to produce the ADD for the backed-up value function, with leaves such as 10, 9, 12]

33. A Simple Action/Reward Example
[Figure: network representation for action A over the variables W, X, Y, Z, with CPT trees (probabilities 0.0, 0.9, 1.0), and the reward function R as a tree with leaves 10 and 0]

34. Example: Generation of V1
• V0 = R
• V1(s) = R(s) + max_a Σ_{s'} Pr(s'|a,s) · V0(s')
[Figure: the trees for P(Z|a,s), P(Z|a,s)·V0, and V0 are combined; the resulting V1 tree has leaves such as 19.0, 10.0, 9.0, 8.1, and 0.0]

35. Example: Generation of V2
[Figure: V1 (a tree over X, Y, Z with leaves such as 8.1, 9.0, 0.0) is regressed through P(Y|a,s) and P(Z|a,s) to produce V2]

36. Some Results: Natural Examples

37. Some Results: Worst-case

38. Some Results: Best-case

39. DTR: Relative Merits
• An adaptive, nonuniform, exact abstraction method
  – provides an exact solution to the MDP
  – much more efficient on certain problems (time/space)
  – 400-million-state problems solved (with ADDs) in a couple of hours
• Some drawbacks
  – produces a piecewise-constant VF
  – some problems admit no compact solution representation (though the ADD overhead is "minimal")
  – approximation may be desirable or necessary

40. Criticisms of SPUDD

41. Future Work

42. Dimensions of Abstraction (revisited)
[Figure: the same uniform/nonuniform, exact/approximate, fixed/adaptive chart as slide 16]

43. Approximate DTR
• It is easy to approximate the solution using DTR
• Simple pruning of the value function
  – can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
• Gives regions of approximately the same value

44. A Pruned Value ADD
[Figure: a value ADD over HCU, HCR, W, R, U, and Loc with exact leaves (9.00, 10.00, 5.19, 5.62, 6.19, 6.64, 6.81, 7.45, 7.64, 8.36, 8.45) pruned to the ranged leaves [9.00, 10.00], [5.19, 6.19], [7.45, 8.45], [6.64, 7.64]]

45. Approximate Structured VI
• Run normal SVI using ADDs/DTs, recording at each leaf the range of values it covers
• At each stage, prune interior nodes whose leaves all have values within some threshold δ
  – the tolerance can be chosen to minimize error or size
  – the tolerance can be adjusted to the magnitude of the VF
• Convergence requires some care: if the max span over the leaves is < δ and the termination tolerance is < ε, the final approximation error can be bounded in terms of δ and ε
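A sketch of the leaf-pruning step on a value tree; plain nested tuples stand in for the ADD/decision-tree machinery, and the numbers echo slide 44's ranges:

```python
def bounds(t):
    """(min, max) over all leaf values under tree t."""
    if not isinstance(t, tuple):
        return t, t
    _, low, high = t
    l1, h1 = bounds(low)
    l2, h2 = bounds(high)
    return min(l1, l2), max(h1, h2)

def prune(tree, delta):
    """Collapse any subtree whose leaf values span < delta into a single
    [lo, hi] range leaf.  tree is a number (leaf) or
    (var, low_subtree, high_subtree)."""
    if not isinstance(tree, tuple):
        return tree
    lo, hi = bounds(tree)
    if hi - lo < delta:
        return [lo, hi]                  # ranged leaf, as in the pruned ADD
    var, low, high = tree
    return (var, prune(low, delta), prune(high, delta))

v = ("HCU", ("Loc", 5.19, 5.62), ("W", 9.00, 10.00))
print(prune(v, delta=0.5))   # ('HCU', [5.19, 5.62], ('W', 9.0, 10.0))
```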

46. Approximate DTR: Relative Merits
• Relative merits of ADTR
  – fewer regions implies faster computation
  – can provide leverage for optimal computation
  – 30-40-billion-state problems solved in a couple of hours
  – allows fine-grained control of time vs. solution quality, with dynamic (a posteriori) error bounds
  – technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
• Some drawbacks
  – (still) produces a piecewise-constant VF
  – doesn't exploit the additive structure of the VF at all

47. Reachability

48. DP vs. Heuristic Search
• DP solves the problem for all possible starting states; each iteration, DP improves the solution for every state
• Given a start state, heuristic search can find an optimal solution without evaluating all states
• Implicit graph: all states
• Explicit graph: the states evaluated during search
• Solution graph: the states reachable from the start state under an optimal solution

49. Solution Structures
• Solution path
• Acyclic solution graph
• Cyclic solution graph

50. DP vs. Heuristic Search
Heuristic search = dynamic programming
                   + starting state
                   + forward expansion of the solution
                   + admissible heuristic
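One way to assemble those four ingredients, in the spirit of trial-based methods such as RTDP (a sketch over the flat representation used earlier; h is an assumed admissible heuristic, i.e. an optimistic initial value, and the trial cutoff is arbitrary):

```python
import numpy as np

def rtdp(Pr, R, h, s0, gamma=0.95, trials=1000, rng=np.random.default_rng(0)):
    """Trial-based DP from start state s0.  Only states actually reached
    from s0 are ever backed up; h must be optimistic (an upper bound on V*,
    e.g. np.full(n, R.max() / (1 - gamma))) for greedy expansion to be sound."""
    V = np.asarray(h, dtype=float).copy()   # admissible initial value function
    n = len(R)
    for _ in range(trials):
        s, steps = s0, 0
        while steps < 4 * n:                # crude per-trial cutoff
            Q = R[s] + gamma * Pr[:, s, :] @ V   # backup at s only
            a = int(Q.argmax())
            V[s] = Q[a]                     # DP improvement, restricted to s
            s = rng.choice(n, p=Pr[a, s])   # forward expansion of the solution
            steps += 1
    return V
```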
