

  1. Introduction of MDP Speaker: Xu Jia-Hao Adviser: Ke Kai-Wei

  2. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  3. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  4. Simple Markov Process (1) • Definition: The probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i. • Example: Frog in a lily pond.

  5. Simple Markov Process (2) • We may specify a set of conditional probabilities that a system which now occupies state i will occupy state j after its next transition.
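
These conditional probabilities form the transition matrix P. A minimal statement, writing $s_n$ for the state after n transitions (the symbol $s_n$ is an assumption; Howard's $p_{ij}$ notation is standard):

```latex
p_{ij} = \Pr\{\, s_{n+1} = j \mid s_n = i \,\}, \qquad
p_{ij} \ge 0, \qquad \sum_{j=1}^{N} p_{ij} = 1 .
```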

  6. Toymaker Example (1) • Has two states (a discrete-time Markov process): First state: the toy he is currently producing has found great favor with the public. Second state: his toy is out of favor. • Define the state probability $\pi_i(n)$: the probability that the system will occupy state i after n transitions if its state at n = 0 is known.

  7. Toymaker Example (2) • Transition matrix: given below. • Transition diagram: two nodes, state 1 and state 2, with an arc for each nonzero transition probability.
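
The numbers come from Howard (1960): state 1 is “toy in favor,” state 2 is “toy out of favor,” and

```latex
P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix} .
```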

  8. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  9. Markov Processes with Rewards • Definition: Suppose that an N-state Markov process earns $r_{ij}$ “dollars” when it makes a transition from state i to state j. • We call $r_{ij}$ (which may be negative) the “reward” associated with the transition from i to j; the $r_{ij}$ form the reward matrix R. • Example: Frog in a lily pond.

  10. Definition • $v_i(n)$: the expected total earnings in the next n transitions if the system is now in state i. • $q_i$: the reward to be expected in the next transition out of state i; it will be called the expected immediate reward for state i.

  11. Formula of $v_i(n)$ • Expected immediate reward: $q_i = \sum_{j=1}^{N} p_{ij} r_{ij}$. • Recurrence: $v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1)$, with $v_i(0) = 0$.

  12. Toymaker with Reward Example (1) • Reward matrix R and transition matrix P are given below. • From them we can find the expected immediate rewards $q_i$.
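
With the transition matrix P shown earlier, Howard's reward matrix and the resulting immediate rewards are

```latex
R = \begin{bmatrix} 9 & 3 \\ 3 & -7 \end{bmatrix}, \qquad
q_1 = 0.5(9) + 0.5(3) = 6, \qquad
q_2 = 0.4(3) + 0.6(-7) = -3 .
```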

  13. Toymaker with Reward Example ( 2 )

  14. Toymaker with Reward Example ( 3 )
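
A minimal Python sketch of the recurrence on these data; the printed values should reproduce Howard's table ($v_1(n) = 0, 6, 7.5, 8.55, \dots$ and $v_2(n) = 0, -3, -2.4, -1.44, \dots$):

```python
import numpy as np

# Toymaker data from Howard (1960): state 1 = toy in favor, state 2 = out of favor.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])      # transition probabilities p_ij
R = np.array([[9.0, 3.0],
              [3.0, -7.0]])     # rewards r_ij

q = (P * R).sum(axis=1)         # q_i = sum_j p_ij r_ij  ->  [6, -3]

v = np.zeros(2)                 # v_i(0) = 0
for n in range(1, 5):
    v = q + P @ v               # v_i(n) = q_i + sum_j p_ij v_j(n-1)
    print(n, v)                 # n=1: [6, -3]; n=2: [7.5, -2.4]; n=3: [8.55, -1.44]
```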

  15. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  16. Introduction of Alternatives • The concept of an alternative for an N-state system: in each state the decision maker may choose among several courses of action, each with its own set of transition probabilities and rewards.

  17. The Constraint on Alternatives • The number of alternatives in any state must be finite, but the number of alternatives may differ from state to state.
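
In Howard's notation a superscript k indexes the alternative, so alternative k in state i carries its own probabilities and rewards:

```latex
p_{ij}^{k}, \qquad r_{ij}^{k}, \qquad
q_i^{k} = \sum_{j=1}^{N} p_{ij}^{k}\, r_{ij}^{k} .
```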

  18. Concepts of Decision (1) • $d_i(n)$: the number of the alternative in the ith state that will be used at stage n. • We call $d_i(n)$ the “decision” in state i at the nth stage. • When $d_i(n)$ has been specified for all i and all n, a “policy” has been determined. The optimal policy is the one that maximizes the total expected return for each i and n.

  19. Concepts of Decision (2) • Here we redefine $v_i(n)$ as the total expected return in n stages starting from state i if an optimal policy is followed.
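
Under this redefinition the recurrence chooses the best alternative at every stage; this is the value-iteration equation:

```latex
v_i(n) = \max_{k} \Bigl[\, q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j(n-1) \Bigr],
\qquad v_i(0) = 0 .
```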

  20. The Toymaker’s Problem Solved by Value Iteration ( 1 )

  21. The Toymaker’s Problem Solved by Value Iteration ( 2 )
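
A Python sketch of that computation, using the two alternatives per state from Howard's book (state 1: no advertising vs. advertising; state 2: no research vs. research); the decisions should switch to (2,2) from n = 2 onward, matching Howard's table:

```python
import numpy as np

# Toymaker alternatives from Howard (1960).
# State 1: k=1 no advertising, k=2 advertising.
# State 2: k=1 no research,    k=2 research.
P = {1: [np.array([0.5, 0.5]), np.array([0.8, 0.2])],
     2: [np.array([0.4, 0.6]), np.array([0.7, 0.3])]}
R = {1: [np.array([9.0, 3.0]), np.array([4.0, 4.0])],
     2: [np.array([3.0, -7.0]), np.array([1.0, -19.0])]}

v = np.zeros(2)                                  # v_i(0) = 0
for n in range(1, 5):
    new_v, decision = np.zeros(2), {}
    for i in (1, 2):
        # q_i^k + sum_j p_ij^k v_j(n-1)  ==  sum_j p_ij^k (r_ij^k + v_j(n-1))
        returns = [p @ (r + v) for p, r in zip(P[i], R[i])]
        decision[i] = int(np.argmax(returns)) + 1
        new_v[i - 1] = max(returns)
    v = new_v
    print(n, v, decision)                        # n=1: d=(1,1); n>=2: d=(2,2)
```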

  22. Value-Iteration • The method that has just been described for the solution of the sequential process may be called the value-iteration method because the $v_i(n)$, or “values,” are determined iteratively. • Even if we were patient enough to solve a long-duration process by value iteration, the convergence to the best alternative in each state is asymptotic and difficult to measure analytically.

  23. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  24. Policy

  25. Decision and Policy • The alternative thus selected is called the “decision” for that state; it is no longer a function of n. • The set of such decisions, one for each state, is called a “policy.” • Selection of a policy thus determines the Markov process with rewards that will describe the operations of the system.

  26. Problems of Finding Optimal Policies • In the previous figure there are 4 × 3 × 2 × 1 × 5 = 120 different policies. Enumerating them is still conceivable here, but it becomes infeasible for very large problems. • For example, a problem with 50 states and 50 alternatives in each state contains $50^{50}$ policies. • Solution: a method composed of two parts, the value-determination operation and the policy-improvement routine.

  27. Formula • The Value-Determination Operation: for the current policy, solve $g + v_i = q_i + \sum_{j=1}^{N} p_{ij}\, v_j$ ($i = 1, \dots, N$) for the gain $g$ and the relative values $v_i$, with $v_N = 0$. • The Policy-Improvement Routine: for each state i, find the alternative k that maximizes $q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j$ using the relative values of the current policy; that k becomes the new decision in state i.

  28. The Iteration Cycle • The two operations are applied alternately: value determination evaluates the current policy, and policy improvement produces a better one. The cycle stops when the policy repeats on successive iterations, at which point it is optimal.

  29. The Toymaker’s Problem • Two alternatives in each state: in state 1, no advertising (1) or advertising (2); in state 2, no research (1) or research (2). • Four possible policies: (1,1), (1,2), (2,1), (2,2); see the sketch below.
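
A compact Python sketch of the iteration cycle on this problem. Value determination solves the linear system with $v_2 = 0$ so the unknowns are $(g, v_1)$; policy improvement then re-selects the alternatives:

```python
import numpy as np

# Toymaker alternatives from Howard (1960); zero-based state i, alternative k.
P = [[np.array([0.5, 0.5]), np.array([0.8, 0.2])],   # state 1: no adv. / advertising
     [np.array([0.4, 0.6]), np.array([0.7, 0.3])]]   # state 2: no res. / research
R = [[np.array([9.0, 3.0]), np.array([4.0, 4.0])],
     [np.array([3.0, -7.0]), np.array([1.0, -19.0])]]

policy = [0, 0]                                      # start with alternative 1 everywhere
while True:
    # Value determination: solve g + v_i = q_i + sum_j p_ij v_j, with v_2 = 0.
    q = np.array([P[i][policy[i]] @ R[i][policy[i]] for i in range(2)])
    p11 = P[0][policy[0]][0]
    p21 = P[1][policy[1]][0]
    A = np.array([[1.0, 1.0 - p11],                  # unknowns are (g, v_1)
                  [1.0, -p21]])
    g, v1 = np.linalg.solve(A, q)
    v = np.array([v1, 0.0])
    # Policy improvement: maximize q_i^k + sum_j p_ij^k v_j over k.
    new_policy = [int(np.argmax([P[i][k] @ (R[i][k] + v) for k in range(2)]))
                  for i in range(2)]
    if new_policy == policy:
        break                                        # policy repeated: optimal
    policy = new_policy

print([k + 1 for k in policy], g)                    # Howard's result: [2, 2], g = 2.0
```

Starting from policy (1,1) the value determination gives g = 1 and v = (10, 0); one improvement step then yields policy (2,2), which repeats and is therefore optimal.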

  30. Result • The iteration converges to the policy (2,2): advertise in state 1 and do research in state 2, with gain g = 2 per transition, compared with g = 1 for the initial policy (1,1).

  31. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  32. Conclusion • In a Markov decision process we can use the iteration cycle to select an optimal policy, and thereby earn the maximum long-run expected reward.

  33. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  34. Reference • Ronald A. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.

  35. z-Transform • For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-Transform. • The relationship between a time function f(n) and its transform F(z) is unique; each time function has only one transform, and inverse transformation of the transform reproduces the original time function.
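
In Howard's convention the transform is the ordinary generating function; for example, the geometric sequence $f(n) = \alpha^{n}$ has transform $1/(1 - \alpha z)$:

```latex
F(z) = \sum_{n=0}^{\infty} f(n)\, z^{n}, \qquad
\sum_{n=0}^{\infty} \alpha^{n} z^{n} = \frac{1}{1 - \alpha z} .
```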
