
# MDP Problems and Exact Solutions I

MDP Problems and Exact Solutions I. Ryan Christiansen Department of Mechanical Engineering and Materials Science Rice University Slides adapted from Mausam and Andrey Kolobov. MDP Problems: At a Glance. MDP Definition (2.1) Solutions of an MDP (2.2) Solution Existence (2.3)

Presentation Transcript

• Ryan Christiansen

• Department of Mechanical Engineering and Materials Science

• Rice University

• Slides adapted from Mausam and Andrey Kolobov

• MDP Definition (2.1)

• Solutions of an MDP (2.2)

• Solution Existence (2.3)

• Stochastic Shortest-Path MDPs (2.4)

• Complexity of Solving MDPs (2.6)

• MDP: an MDP is a tuple <S, A, D, T, R>

• S is a finite state space

• A is a finite action set

• D is a sequence of discrete decision epochs (time steps)

• T: S x A x S x D → [0, 1] is a transition function (probability)

• R: S x A x S x D → ℝ is a reward function

• How does an MDP problem work?

• Initial Conditions: the starting state

• Actions: actions are chosen at each decision epoch to traverse the MDP

• Termination: reach a terminating state or the final decision epoch

• The goal is to end up with the highest net reward at termination
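The execution loop described above can be sketched in Python. This is a minimal illustration, not the presenter's code: the `FlatMDP` container, the `rollout` helper, and the two-state "low"/"high" example are all hypothetical names and numbers, and termination is modeled only by reaching the final decision epoch.

```python
import random
from dataclasses import dataclass


@dataclass
class FlatMDP:
    """Hypothetical container for the tuple <S, A, D, T, R>."""
    states: list
    actions: list
    horizon: int  # D = 1, ..., horizon
    T: dict       # T[(s, a, t)] -> {s_next: probability}
    R: dict       # R[(s, a, s_next, t)] -> reward


def rollout(mdp, policy, start, rng):
    """Follow policy(s, t) from `start` until the final decision epoch."""
    s, total = start, 0.0
    for t in range(1, mdp.horizon + 1):
        a = policy(s, t)
        dist = mdp.T[(s, a, t)]
        # Sample the next state according to the transition probabilities.
        s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
        total += mdp.R[(s, a, s_next, t)]
        s = s_next
    return total


# Made-up example: "wait" keeps the state, "work" usually moves to "high".
states, actions, L = ["low", "high"], ["wait", "work"], 3
T, R = {}, {}
for t in range(1, L + 1):
    for s in states:
        T[(s, "wait", t)] = {s: 1.0}
        T[(s, "work", t)] = {"high": 0.8, "low": 0.2}
        for a in actions:
            for s_next in states:
                reward = 1.0 if s_next == "high" else 0.0
                effort = 0.5 if a == "work" else 0.0
                R[(s, a, s_next, t)] = reward - effort

mdp = FlatMDP(states, actions, L, T, R)
always_wait = lambda s, t: "wait"
total = rollout(mdp, always_wait, "high", random.Random(0))
print(total)  # net reward of one execution
```

Waiting in "high" is deterministic in this toy model, so this particular rollout collects a reward of 1 at each of the three epochs.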

• Policy: a rule for choosing actions

• Global/Complete: a policy must always be applicable for the entire MDP

• In general, policies will be

• Probabilistic: able to choose between multiple actions randomly

• History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed

• π: H x A → [0, 1]

• Markovian Policy: a history-dependent policy that only depends on the current state and time step

• For any two histories $h_{s,t}$ and $h'_{s,t}$, both of which end at the same state $s$ and timestep $t$, and for any action $a$, a Markovian policy $\pi$ satisfies $\pi(h_{s,t}, a) = \pi(h'_{s,t}, a)$

• In practice, it functions as a history-independent policy

• π: S x D x A → [0, 1]

• For several important types of MDPs, at least one optimal solution is necessarily Markovian

• Stationary Markovian Policy: a Markovian policy that does not depend on time

• For any two timesteps $t_1$ and $t_2$, any state $s$, and any action $a$, a stationary Markovian policy $\pi$ satisfies $\pi(s, t_1, a) = \pi(s, t_2, a)$

• π: S x A → [0, 1]
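One way to picture a stationary Markovian probabilistic policy is as a table of action probabilities per state. The sketch below assumes a nested-dict encoding and represents a history simply as a list of visited states ending in the current one; both choices, and the helper names, are illustrative assumptions rather than anything from the slides.

```python
import math


def is_valid_policy(pi, states, actions):
    """Check that pi[s] is a probability distribution over actions for every state."""
    for s in states:
        probs = [pi[s].get(a, 0.0) for a in actions]
        if any(p < 0.0 for p in probs) or not math.isclose(sum(probs), 1.0):
            return False
    return True


def as_history_policy(pi):
    """Lift a stationary policy pi: S x A -> [0, 1] to the signature H x A -> [0, 1]."""
    def policy(history, a):
        last_state = history[-1]  # only the current state matters
        return pi[last_state].get(a, 0.0)
    return policy


pi = {"s0": {"a0": 0.3, "a1": 0.7}, "s1": {"a0": 1.0}}
valid = is_valid_policy(pi, ["s0", "s1"], ["a0", "a1"])

h_policy = as_history_policy(pi)
# Two different histories that end in the same state get the same probability.
p_short = h_policy(["s0"], "a1")
p_long = h_policy(["s1", "s1", "s0"], "a1")
```

A deterministic policy is the special case where each state puts probability 1 on a single action, as `pi["s1"]` does here.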

• Value Function: a function mapping the domain of the policy excluding the action set to a scalar value.

• History dependent: V: H → [-∞, ∞]

• Markovian: V: S x D → [-∞, ∞]

• Stationary Markovian: V: S → [-∞, ∞]

• Value Function of a Policy: the utility function of the reward sequence returned from executing the policy, or the total utility of the total reward

• $V^\pi(h_{s,t}) = u(R_t^{\pi h_{s,t}}, R_{t+1}^{\pi h_{s,t}}, \ldots)$

• A solution to an MDP is an optimal policy, or a policy that maximizes utility.

• Policy π* is optimal if the value function, V*, is greater than or equal to the value function of any other policy.

• V*(h) ≥ Vπ(h) for all h and π

• Need to be careful when defining utility u(R1, R2, …)

• For the same h, utility can be different across policy executions

• Existence and uniqueness are not guaranteed for many types of MDPs.

• Expected Linear Additive Utility (ELAU): $u(R_1, R_2, \ldots) = E[R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots]$, where $\gamma$ is the discount factor

• Assume γ = 1 unless stated otherwise

• 0 ≤ γ < 1 : more immediate rewards are more valuable

• γ = 1 : rewards are equally valuable, independently of time

• γ > 1 : more distant rewards are more valuable
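The effect of the discount factor can be seen on a single realized reward sequence. The helper below is a small sketch (the name `discounted_return` and the sample rewards are made up); the utility defined on this slide is the expectation of this quantity over executions.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute R1 + gamma*R2 + gamma^2*R3 + ... for one realized reward sequence."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total


rewards = [1.0, 1.0, 1.0]
immediate_favored = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
undiscounted = discounted_return(rewards, gamma=1.0)       # 1 + 1 + 1
distant_favored = discounted_return(rewards, gamma=2.0)    # 1 + 2 + 4
```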

• The Optimality Principle: if every policy’s quality can be measured by this policy’s ELAU, there exists a policy that is optimal at every timestep

• There are some situations where it may not apply:

• When stuck in a repeating sequence of states (a loop)

• Infinite decision epochs

• Infinite utility

• Oscillating Utility

• Unbounded Utility

• Risk averse, or risk taking (ELAU is risk neutral)

• \$1 million guaranteed (risk averse)

• 50% chance of \$2 million, 50% chance of \$0 (risk taking)

• Expected value is the same, so risk neutral would choose either
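As a quick check of the example above, in millions of dollars:

$$E[\text{guaranteed}] = 1 \cdot 1 = 1, \qquad E[\text{gamble}] = 0.5 \cdot 2 + 0.5 \cdot 0 = 1$$

To illustrate why a risk-averse agent still prefers the guaranteed option, assume a hypothetical concave utility $u(x) = \sqrt{x}$ (this function is not from the slides):

$$E[u(\text{guaranteed})] = \sqrt{1} = 1, \qquad E[u(\text{gamble})] = 0.5\sqrt{2} + 0.5\sqrt{0} \approx 0.71$$

An expectation of linear rewards cannot express this preference, which is the sense in which ELAU is risk neutral.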

• Finite-horizon MDPs

• Infinite-horizon discounted-reward MDPs

• Stochastic shortest-path MDPs

• Assume the agent acts for a finite # of time steps, L

• Example applications:

• Inventory management: “How much X to order from the supplier every day ‘til the end of the season?”

• Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”

Puterman, 1994

• FH MDP: an FH MDP is a tuple <S, A, D, T, R>

• S is a finite state space

• A is a finite action set

• D is a sequence of discrete decision epochs (time steps) up to a finite horizon L

• T: S x A x S x D → [0, 1] is a transition function (probability)

• R: S x A x S x D → ℝ is a reward function

• For an FH MDP with horizon |D| = L < ∞, let:

• $V^\pi(h_{s,t}) = E^\pi_{h_{s,t}}[R_1 + \ldots + R_{L-t}]$ for all $1 \le t \le L$

• $V^\pi(h_{s,L+1}) = 0$

• Then

• V* exists and is Markovian, π* exists and is deterministic Markovian

• For all $s$ and $1 \le t \le L$: $V^*(s,t) = \max_{a \in A} \sum_{s' \in S} T(s,a,s',t)\,[R(s,a,s',t) + V^*(s',t+1)]$

• $\pi^*(s,t) = \arg\max_{a \in A} \sum_{s' \in S} T(s,a,s',t)\,[R(s,a,s',t) + V^*(s',t+1)]$


• $V^\pi$ is an ELAU. For every history, the value of every policy is well-defined: each $E[R_i]$ is finite and the number of terms in the series is finite.


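• Reading the equation for V*(s, t) term by term:

• V*(s, t): highest utility derivable from s at time t

• max over a in A: if you act optimally now

• Σ over s′ weighted by T(s, a, s′, t): in expectation

• R(s, a, s′, t): immediate utility of the next action

• V*(s′, t+1): highest utility derivable from the next state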

• If V* and π* are Markovian, then we only need to consider Markovian V and π

• Easy to compute π*

• For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
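The backward sweep over t = L, …, 1 can be sketched as follows. The function name, the callable interface for T and R, and the two-state example are assumptions made for illustration; only the recursion itself comes from the optimality equations for FH MDPs.

```python
def backward_induction(states, actions, L, T, R):
    """Compute V*(s, t) and pi*(s, t) for t = L, ..., 1.

    T(s, a, s_next, t) and R(s, a, s_next, t) are callables (assumed interface).
    """
    V = {(s, L + 1): 0.0 for s in states}  # nothing left to earn after epoch L
    pi = {}
    for t in range(L, 0, -1):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = sum(T(s, a, s2, t) * (R(s, a, s2, t) + V[(s2, t + 1)])
                        for s2 in states)
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q
            pi[(s, t)] = best_a
    return V, pi


# Made-up example: "wait" keeps the state, "work" costs 0.5 and usually reaches "high".
states, actions = ["low", "high"], ["wait", "work"]


def T(s, a, s2, t):
    if a == "wait":
        return 1.0 if s2 == s else 0.0
    return 0.8 if s2 == "high" else 0.2


def R(s, a, s2, t):
    return (1.0 if s2 == "high" else 0.0) - (0.5 if a == "work" else 0.0)


V, pi = backward_induction(states, actions, 2, T, R)
```

Each (s, t) pair is visited once and each visit scans all actions and successor states, so the sweep costs on the order of L · |S|² · |A| operations.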

• Why go beyond the finite horizon?

• Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)

• Infinite Horizon

• Known to be infinite (can continue indefinitely)

• Indefinite Horizon

• Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)

• Due to the infinite nature of D, we must define stationary, or time independent functions:

• T: S x A x S → [0, 1] is a transition function (probability)

• R: S x A x S → ℝ is a reward function

• π: S → A (this is also Markovian)

• V: S → [-∞, ∞] (this is also Markovian)

• IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ>

• S is a finite state space

• A is a finite action set

• T: S x A x S → [0, 1] is a transition function (probability)

• R: S x A x S → ℝ is a reward function

• γ is a discount factor between 0 and 1 (favors immediate rewards)

• Policy value = discounted ELAU over infinite time steps

• For an IHDR MDP, let:

• $V^\pi(h) = E^\pi_h[R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots]$ for all $h$

• Then

• V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

• For all $s$: $V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

• $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$


• $V^\pi$ is a discounted ELAU. For every history, the value of a policy is well-defined thanks to 0 ≤ γ < 1: all $E[R_i]$ are bounded by some finite K, so the discounted terms $\gamma^{i-1} E[R_i]$ shrink geometrically and the series converges.
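The geometric bound can be written out explicitly. Assuming $|E[R_i]| \le K$ for every $i$ (with finite S and A and a stationary R, one valid choice is $K = \max_{s,a,s'} |R(s,a,s')|$):

$$\left| V^\pi(h) \right| = \left| \sum_{i=1}^{\infty} \gamma^{i-1} E[R_i] \right| \le \sum_{i=1}^{\infty} \gamma^{i-1} K = \frac{K}{1-\gamma}$$

The bound is finite only for $\gamma < 1$, and it grows without limit as $\gamma$ approaches 1.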


• Compared with the finite-horizon equations: future utility is discounted (the factor γ on V*(s′)), and optimal utility is time independent (V* has no t argument)

• If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π

The Meaning of γ

• γ can affect the optimal policy significantly

• γ = 0 + ε: yields myopic policies for impatient agents

• γ = 1 - ε: yields far-sighted policies, inefficient to compute

• How to set it?

• Sometimes suggested by data (inflation rate, interest rate, tax rate)

• Often set arbitrarily to a value that gives a reasonable policy

• Assume the agent pays a cost to achieve a goal

• Example applications:

• Controlling a Mars rover: “How to collect scientific data without damaging the rover?”

• Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”

• Cost is often time or a physical resource

• SSP MDP: an SSP MDP is a tuple <S, A, T, C, G>

• S is a finite state space

• A is a finite action set

• T: S x A x S → [0, 1] is a stationary transition function (probability)

• C: S x A x S → ℝ is a stationary cost function (R = -C)

• G ⊆ S is a set of absorbing cost-free goal states

• Under two conditions:

• There is at least one proper policy (reaches goal with P = 1 from all states)

• Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1

SSP MDP Details

• In SSP, maximizing ELAU = minimizing expected cost

• Every cost-minimizing policy is proper

• Thus, an optimal policy is the cheapest way to reach the goal

• Why are SSP MDPs called “indefinite-horizon?”

• If a policy is optimal, it will take a finite, but a priori unknown time to reach the goal (time to goal is dependent on the evaluation of the policy).

• At the limit as t approaches infinity, the probability that a goal state has been reached approaches P = 1
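A simple special case illustrates the limit. Assume, purely for illustration, that a proper policy reaches the goal on each step with a fixed probability $p > 0$, independently of the past:

$$P(\text{goal reached within } t \text{ steps}) = 1 - (1-p)^t \;\longrightarrow\; 1 \quad \text{as } t \to \infty$$

$$E[\text{steps to goal}] = \sum_{t=1}^{\infty} t\, p\, (1-p)^{t-1} = \frac{1}{p}$$

The expected time to the goal is finite, yet no fixed horizon L bounds every execution, which is the sense of "indefinite horizon".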

SSP MDP Example (figures omitted from the transcript)

SSP MDPs: Optimality Principle

• For an SSP MDP, let:

• $V^\pi(h) = E^\pi_h[C_1 + C_2 + \ldots]$ for all $h$

• Then

• V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

• For all $s$: $V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V^*(s')]$

• $\pi^*(s) = \arg\min_{a \in A} \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V^*(s')]$


• $V^\pi$ is an ELAU over costs. For every history, the value of a policy is well-defined: every policy either takes a finite expected number of steps to reach a goal, or has infinite cost.

The MDP Hierarchy

• FH => SSP: turn all states (s, L) into goals

• IHDR => SSP: add (1 – γ)-probability transitions to goal

• Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.
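The IHDR => SSP reduction can be sketched in code. The slide only states that (1 – γ)-probability transitions to a goal are added; the dict encoding, the function name, and in particular the cost assigned to the new goal transition (the expected one-step cost of the action) are assumptions chosen here so that the SSP value equals the negated IHDR value.

```python
def ihdr_to_ssp(states, actions, T, R, gamma, goal="goal"):
    """Convert an IHDR MDP into an SSP MDP with one extra absorbing goal state.

    T[(s, a)] -> {s_next: probability}, R[(s, a, s_next)] -> reward.
    Returns (T_ssp, C, goals).
    """
    T_ssp, C = {}, {}
    for s in states:
        for a in actions:
            dist = T[(s, a)]
            # Expected one-step cost of (s, a) in the original MDP, with C = -R.
            expected_cost = sum(p * -R[(s, a, s2)] for s2, p in dist.items())
            # Scale the original transitions by gamma and send the rest to the goal.
            T_ssp[(s, a)] = {s2: gamma * p for s2, p in dist.items()}
            T_ssp[(s, a)][goal] = 1.0 - gamma
            for s2 in dist:
                C[(s, a, s2)] = -R[(s, a, s2)]
            C[(s, a, goal)] = expected_cost  # assumed choice, see lead-in
    return T_ssp, C, {goal}


# One state, one action, reward 1 per step: the IHDR value is 1 / (1 - 0.9) = 10.
T = {("s", "a"): {"s": 1.0}}
R = {("s", "a", "s"): 1.0}
T_ssp, C, goals = ihdr_to_ssp(["s"], ["a"], T, R, gamma=0.9)

# Evaluate the only policy of the SSP by repeated substitution; the goal has value 0.
v = 0.0
for _ in range(2000):
    v = sum(p * (C[("s", "a", s2)] + (v if s2 == "s" else 0.0))
            for s2, p in T_ssp[("s", "a")].items())
```

Because every action now reaches the goal with probability at least 1 – γ on each step, every policy of the converted problem is proper.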

• We are only concerned with using flat representation

• This is the name for the representation already introduced on the definition slides

• It is easier to solve MDPs in flat representation, but larger MDPs are much easier to describe in factored representation

• If you are interested in factored representation, read Section 2.5

Computational Complexity of MDPs

• Solving IHDR, SSP in flat representation is P-complete

• Solving FH in flat representation is P-hard

• They don’t benefit from parallelization, but are solvable in polynomial time

MDP Exact Solutions I: At a Glance

• Brute-Force Algorithm (3.1)

• Policy Evaluation (3.2)

Brute Force Algorithm

• Go over all policies π

• How many? $|A|^{|S|}$, a finite number

• Evaluate each policy

• Vπ(s), the expected cost of reaching the goal from s

• Choose the best, π*

• SSP optimality principle tells us that a best exists

• Vπ*(s) ≤ Vπ(s)
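The brute-force procedure can be sketched as below. The helper names, the dict encoding, and the small SSP are made up for illustration; the evaluation step uses a fixed number of substitution sweeps, which is only adequate here because every policy in the example is proper. Goal states need no action, so the sketch enumerates |A| to the power of the number of non-goal states.

```python
from itertools import product


def evaluate(policy, states, goals, T, C, sweeps=1000):
    """Approximate V^pi by repeated substitution (assumes the policy is proper)."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: 0.0 if s in goals else
             sum(p * (C[(s, policy[s], s2)] + V[s2])
                 for s2, p in T[(s, policy[s])].items())
             for s in states}
    return V


def brute_force(states, actions, goals, T, C):
    """Enumerate every deterministic stationary policy and keep a dominating one."""
    non_goal = [s for s in states if s not in goals]
    best_pi, best_V = None, None
    for choice in product(actions, repeat=len(non_goal)):
        pi = dict(zip(non_goal, choice))
        V = evaluate(pi, states, goals, T, C)
        # Keep pi if its expected cost is no worse in every state.
        if best_V is None or all(V[s] <= best_V[s] for s in non_goal):
            best_pi, best_V = pi, V
    return best_pi, best_V


states, actions, goals = ["s0", "s1", "g"], ["a0", "a1"], {"g"}
T = {("s0", "a0"): {"s1": 1.0}, ("s0", "a1"): {"g": 0.5, "s0": 0.5},
     ("s1", "a0"): {"g": 1.0}, ("s1", "a1"): {"g": 1.0}}
C = {("s0", "a0", "s1"): 5.0, ("s0", "a1", "g"): 4.0, ("s0", "a1", "s0"): 4.0,
     ("s1", "a0", "g"): 1.0, ("s1", "a1", "g"): 3.0}

best_pi, best_V = brute_force(states, actions, goals, T, C)
```

The pointwise comparison relies on the optimality principle: since some policy is at least as cheap as every other policy in every state, it is kept once the enumeration reaches it.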

Policy Evaluation

• Given a policy π, compute Vπ

• To start out, assume that π is proper

• Execution of π reaches a goal from any state

Deterministic SSPs

• Policy graph for π

• π(s0) = a0; π(s1) = a1

• Vπ(s1) = 1

• Vπ(s0) = 5 + 1 = 6

Acyclic SSPs

• Policy graph for π

• Vπ(s1) = 1

• Vπ(s2) = 4

• Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6

Cyclic SSPs

• Policy graph for π

• Vπ(s1) = 1

• Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0))

• Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))

Cyclic SSPs

• Generalized system of equations

• Vπ(sg) = 0

• Vπ(s1) = 1 + Vπ(sg)

• Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))

• Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))

• Constructing the system of equations

• Vπ(s) = 0 if s ∈ G

• Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)]

• |S| variables

• $O(|S|^3)$ running time
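The system for the cyclic example can be solved directly. The sketch below rewrites the equations as (I − P) V = c over the non-goal states and uses a small hand-written Gaussian elimination; the function names and the dict encoding are illustrative assumptions, and any linear solver could be substituted.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (cubic time)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def evaluate_policy_exact(non_goal, P, cost):
    """P[s][s2]: probability of moving between non-goal states under pi.
    cost[s]: expected one-step cost of pi(s). Goal states have value 0."""
    A = [[(1.0 if si == sj else 0.0) - P[si].get(sj, 0.0) for sj in non_goal]
         for si in non_goal]
    b = [cost[s] for s in non_goal]
    return dict(zip(non_goal, solve_linear(A, b)))


# Numbers taken from the cyclic SSP equations above.
P = {"s0": {"s1": 0.6, "s2": 0.4}, "s1": {}, "s2": {"s0": 0.3}}
cost = {"s0": 0.6 * 5 + 0.4 * 2, "s1": 1.0, "s2": 0.7 * 4 + 0.3 * 3}
V = evaluate_policy_exact(["s0", "s1", "s2"], P, cost)
```

Substituting by hand gives the same result: V(s0) = 5.88 / 0.88 ≈ 6.68 and V(s2) = 3.7 + 0.3 · V(s0) ≈ 5.70.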

Policy Evaluation with Iteration

• Iterative solution: $V_n^\pi(s) \leftarrow \sum_{s' \in S} T(s, \pi(s), s')\,[C(s, \pi(s), s') + V_{n-1}^\pi(s')]$

• System of equations, for comparison: $V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,[C(s, \pi(s), s') + V^\pi(s')]$

Convergence and Optimality

• For a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e. $\lim_{n \to \infty} V_n^\pi = V^\pi$, irrespective of the initialization $V_0^\pi$

Termination and Error Bounds

• Residual: the magnitude of the change in the value of state s at iteration n in the algorithm.

• $\text{residual}_n(s) = |V_n^\pi(s) - V_{n-1}^\pi(s)|$

• $\text{residual}_n = \max_{s \in S} \text{residual}_n(s)$

• ϵ-consistency: the residual of the value function in iteration n + 1 is less than ϵ

• A state s is ϵ-consistent if the value function is ϵ-consistent at s

• When iterative policy evaluation is run to ϵ-consistency, $V_n^\pi$ satisfies the following inequality: $\forall s \in S: |V_n^\pi(s) - V^\pi(s)| < \epsilon N^\pi(s)$, where $N^\pi(s)$ is the expected number of steps to reach goal g from s by following π

• Each iteration takes $O(|S|^2)$ time
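Iterative policy evaluation with a residual-based stopping rule can be sketched as follows. The function name and the dict encoding are assumptions, the policy is assumed to be proper (otherwise the loop may never terminate), and the numbers reuse the cyclic SSP example from the earlier slides.

```python
def iterative_policy_evaluation(states, goals, T_pi, C_pi, epsilon=1e-6):
    """Iterate V_n from V_{n-1} until the largest residual drops below epsilon.

    T_pi[s] -> {s_next: probability} and C_pi[(s, s_next)] -> cost,
    both already restricted to the action pi(s).
    """
    V = {s: 0.0 for s in states}  # V_0; any initialization works for a proper pi
    n = 0
    while True:
        n += 1
        V_new = {s: 0.0 if s in goals else
                 sum(p * (C_pi[(s, s2)] + V[s2]) for s2, p in T_pi[s].items())
                 for s in states}
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual < epsilon:
            return V, n


states, goals = ["s0", "s1", "s2", "sg"], {"sg"}
T_pi = {"s0": {"s1": 0.6, "s2": 0.4}, "s1": {"sg": 1.0}, "s2": {"sg": 0.7, "s0": 0.3}}
C_pi = {("s0", "s1"): 5, ("s0", "s2"): 2, ("s1", "sg"): 1,
        ("s2", "sg"): 4, ("s2", "s0"): 3}

V, sweeps = iterative_policy_evaluation(states, goals, T_pi, C_pi, epsilon=1e-9)
```

Each sweep touches every (state, successor) pair once, which matches the quadratic per-iteration cost; the values agree with the direct solution of the linear system, V(s0) ≈ 6.68 and V(s2) ≈ 5.70.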

MDP Problems: At a Glance

• MDP Definition (2.1)

• Solutions of an MDP (2.2)

• Solution Existence (2.3)

• Stochastic Shortest-Path MDPs (2.4)

• Complexity of Solving MDPs (2.6)

MDP Exact Solutions I: At a Glance

• Brute-Force Algorithm (3.1)

• Policy Evaluation (3.2)

Summary

• General components of an MDP

• Three types of MDPs (FH, IHDR, SSP)

• Utility and cost (ELAU)

• Ways to solve MDPs (Sys of equations, Iterative)