

  1. Introduction of MDP Speaker: Xu Jia-Hao Adviser: Ke Kai-Wei

  2. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  3. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  4. Simple Markov Process (1) • Definition: The probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i. • Example: Frog in a lily pond.

  5. Simple Markov Process (2) • We may specify a set of conditional probabilities that a system which now occupies state i will occupy state j after its next transition.
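
These conditional probabilities form the transition matrix P. A minimal statement, writing $s_n$ for the state after n transitions (the symbol $s_n$ is an assumption; Howard's $p_{ij}$ notation is standard):

```latex
p_{ij} = \Pr\{\, s_{n+1} = j \mid s_n = i \,\}, \qquad
p_{ij} \ge 0, \qquad \sum_{j=1}^{N} p_{ij} = 1 .
```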

  6. Toymaker Example (1) • Has two states (a discrete-time Markov process): First state: the toy he is currently producing has found great favor with the public. Second state: his toy is out of favor. • Define the state probability $\pi_i(n)$: the probability that the system will occupy state i after n transitions if its state at n = 0 is known.

  7. Toymaker Example (2) • Transition matrix: given below. • Transition diagram: two nodes, state 1 and state 2, with an arc for each nonzero transition probability.
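
The numbers come from Howard (1960): state 1 is “toy in favor,” state 2 is “toy out of favor,” and

```latex
P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix} .
```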

  8. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  9. Markov Processes with Rewards • Definition: Suppose that an N-state Markov process earns $r_{ij}$ “dollars” when it makes a transition from state i to state j. • We call $r_{ij}$ (which may be negative) the “reward” associated with the transition from i to j; the $r_{ij}$ form the reward matrix R. • Example: Frog in a lily pond.

  10. Definition • $v_i(n)$: the expected total earnings in the next n transitions if the system is now in state i. • $q_i$: the reward to be expected in the next transition out of state i; it will be called the expected immediate reward for state i.

  11. Formula of $v_i(n)$ • Expected immediate reward: $q_i = \sum_{j=1}^{N} p_{ij} r_{ij}$. • Recurrence: $v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1)$, with $v_i(0) = 0$.

  12. Toymaker with Reward Example (1) • Reward matrix R and transition matrix P are given below. • From them we can find the expected immediate rewards $q_i$.
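
With the transition matrix P shown earlier, Howard's reward matrix and the resulting immediate rewards are

```latex
R = \begin{bmatrix} 9 & 3 \\ 3 & -7 \end{bmatrix}, \qquad
q_1 = 0.5(9) + 0.5(3) = 6, \qquad
q_2 = 0.4(3) + 0.6(-7) = -3 .
```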

  13. Toymaker with Reward Example ( 2 )

  14. Toymaker with Reward Example ( 3 )
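
A minimal Python sketch of the recurrence on these data; the printed values should reproduce Howard's table ($v_1(n) = 0, 6, 7.5, 8.55, \dots$ and $v_2(n) = 0, -3, -2.4, -1.44, \dots$):

```python
import numpy as np

# Toymaker data from Howard (1960): state 1 = toy in favor, state 2 = out of favor.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])      # transition probabilities p_ij
R = np.array([[9.0, 3.0],
              [3.0, -7.0]])     # rewards r_ij

q = (P * R).sum(axis=1)         # q_i = sum_j p_ij r_ij  ->  [6, -3]

v = np.zeros(2)                 # v_i(0) = 0
for n in range(1, 5):
    v = q + P @ v               # v_i(n) = q_i + sum_j p_ij v_j(n-1)
    print(n, v)                 # n=1: [6, -3]; n=2: [7.5, -2.4]; n=3: [8.55, -1.44]
```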

  15. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  16. Introduction of Alternatives • The concept of an alternative for an N-state system: in each state the decision maker may choose among several courses of action, each with its own set of transition probabilities and rewards.

  17. The Constraint on Alternatives • The number of alternatives in any state must be finite, but the number of alternatives may differ from state to state.
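
In Howard's notation a superscript k indexes the alternative, so alternative k in state i carries its own probabilities and rewards:

```latex
p_{ij}^{k}, \qquad r_{ij}^{k}, \qquad
q_i^{k} = \sum_{j=1}^{N} p_{ij}^{k}\, r_{ij}^{k} .
```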

  18. Concepts of Decision (1) • $d_i(n)$: the number of the alternative in the ith state that will be used at stage n. • We call $d_i(n)$ the “decision” in state i at the nth stage. • When $d_i(n)$ has been specified for all i and all n, a “policy” has been determined. The optimal policy is the one that maximizes the total expected return for each i and n.

  19. Concepts of Decision (2) • Here we redefine $v_i(n)$ as the total expected return in n stages starting from state i if an optimal policy is followed.
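
Under this redefinition the recurrence chooses the best alternative at every stage; this is the value-iteration equation:

```latex
v_i(n) = \max_{k} \Bigl[\, q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j(n-1) \Bigr],
\qquad v_i(0) = 0 .
```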

  20. The Toymaker’s Problem Solved by Value Iteration ( 1 )

  21. The Toymaker’s Problem Solved by Value Iteration ( 2 )
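
A Python sketch of that computation, using the two alternatives per state from Howard's book (state 1: no advertising vs. advertising; state 2: no research vs. research); the decisions should switch to (2,2) from n = 2 onward, matching Howard's table:

```python
import numpy as np

# Toymaker alternatives from Howard (1960).
# State 1: k=1 no advertising, k=2 advertising.
# State 2: k=1 no research,    k=2 research.
P = {1: [np.array([0.5, 0.5]), np.array([0.8, 0.2])],
     2: [np.array([0.4, 0.6]), np.array([0.7, 0.3])]}
R = {1: [np.array([9.0, 3.0]), np.array([4.0, 4.0])],
     2: [np.array([3.0, -7.0]), np.array([1.0, -19.0])]}

v = np.zeros(2)                                  # v_i(0) = 0
for n in range(1, 5):
    new_v, decision = np.zeros(2), {}
    for i in (1, 2):
        # q_i^k + sum_j p_ij^k v_j(n-1)  ==  sum_j p_ij^k (r_ij^k + v_j(n-1))
        returns = [p @ (r + v) for p, r in zip(P[i], R[i])]
        decision[i] = int(np.argmax(returns)) + 1
        new_v[i - 1] = max(returns)
    v = new_v
    print(n, v, decision)                        # n=1: d=(1,1); n>=2: d=(2,2)
```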

  22. Value-Iteration • The method that has just been described for the solution of the sequential process may be called the value-iteration method because the $v_i(n)$, or “values,” are determined iteratively. • Even if we were patient enough to solve a long-duration process by value iteration, the convergence to the best alternative in each state is asymptotic and difficult to measure analytically.

  23. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  24. Policy

  25. Decision and Policy • The alternative thus selected is called the “decision” for that state; it is no longer a function of n. • The set of such decisions, one for each state, is called a “policy.” • Selection of a policy thus determines the Markov process with rewards that will describe the operations of the system.

  26. Problems of Finding Optimal Policies • In the previous figure there are 4 × 3 × 2 × 1 × 5 = 120 different policies. Enumerating them is still conceivable here, but it becomes infeasible for very large problems. • For example, a problem with 50 states and 50 alternatives in each state contains $50^{50}$ policies. • Solution: a method composed of two parts, the value-determination operation and the policy-improvement routine.

  27. Formula • The Value-Determination Operation: for the current policy, solve $g + v_i = q_i + \sum_{j=1}^{N} p_{ij}\, v_j$ ($i = 1, \dots, N$) for the gain $g$ and the relative values $v_i$, with $v_N = 0$. • The Policy-Improvement Routine: for each state i, find the alternative k that maximizes $q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j$ using the relative values of the current policy; that k becomes the new decision in state i.

  28. The Iteration Cycle • The two operations are applied alternately: value determination evaluates the current policy, and policy improvement produces a better one. The cycle stops when the policy repeats on successive iterations, at which point it is optimal.

  29. The Toymaker’s Problem • Two alternatives in each state: in state 1, no advertising (1) or advertising (2); in state 2, no research (1) or research (2). • Four possible policies: (1,1), (1,2), (2,1), (2,2); see the sketch below.
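
A compact Python sketch of the iteration cycle on this problem. Value determination solves the linear system with $v_2 = 0$ so the unknowns are $(g, v_1)$; policy improvement then re-selects the alternatives:

```python
import numpy as np

# Toymaker alternatives from Howard (1960); zero-based state i, alternative k.
P = [[np.array([0.5, 0.5]), np.array([0.8, 0.2])],   # state 1: no adv. / advertising
     [np.array([0.4, 0.6]), np.array([0.7, 0.3])]]   # state 2: no res. / research
R = [[np.array([9.0, 3.0]), np.array([4.0, 4.0])],
     [np.array([3.0, -7.0]), np.array([1.0, -19.0])]]

policy = [0, 0]                                      # start with alternative 1 everywhere
while True:
    # Value determination: solve g + v_i = q_i + sum_j p_ij v_j, with v_2 = 0.
    q = np.array([P[i][policy[i]] @ R[i][policy[i]] for i in range(2)])
    p11 = P[0][policy[0]][0]
    p21 = P[1][policy[1]][0]
    A = np.array([[1.0, 1.0 - p11],                  # unknowns are (g, v_1)
                  [1.0, -p21]])
    g, v1 = np.linalg.solve(A, q)
    v = np.array([v1, 0.0])
    # Policy improvement: maximize q_i^k + sum_j p_ij^k v_j over k.
    new_policy = [int(np.argmax([P[i][k] @ (R[i][k] + v) for k in range(2)]))
                  for i in range(2)]
    if new_policy == policy:
        break                                        # policy repeated: optimal
    policy = new_policy

print([k + 1 for k in policy], g)                    # Howard's result: [2, 2], g = 2.0
```

Starting from policy (1,1) the value determination gives g = 1 and v = (10, 0); one improvement step then yields policy (2,2), which repeats and is therefore optimal.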

  30. Result • The iteration converges to the policy (2,2): advertise in state 1 and do research in state 2, with gain g = 2 per transition, compared with g = 1 for the initial policy (1,1).

  31. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  32. Conclusion • In a Markov decision process we can use the iteration cycle to select an optimal policy, and thereby earn the maximum long-run expected reward.

  33. Outline • Simple Markov Process • Markov Process with Reward • Introduction of Alternative • The Policy-Iteration Method for the Solution of Sequential Decision Processes • Conclusion • Reference

  34. Reference • Ronald A. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.

  35. z-Transform • For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-Transform. • The relationship between a time function f(n) and its transform F(z) is unique; each time function has only one transform, and inverse transformation of the transform reproduces the original time function.
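
In Howard's convention the transform is the ordinary generating function; for example, the geometric sequence $f(n) = \alpha^{n}$ has transform $1/(1 - \alpha z)$:

```latex
F(z) = \sum_{n=0}^{\infty} f(n)\, z^{n}, \qquad
\sum_{n=0}^{\infty} \alpha^{n} z^{n} = \frac{1}{1 - \alpha z} .
```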
