
MDP Problems and Exact Solutions I

  • Ryan Christiansen

  • Department of Mechanical Engineering and Materials Science

  • Rice University

  • Slides adapted from Mausam and Andrey Kolobov


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Definition

  • MDP: an MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps)

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function
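
To make the tuple concrete, here is a minimal sketch of how a flat MDP could be stored in Python. The dictionary layout and all names are illustrative choices (and, for brevity, it drops the explicit decision-epoch argument of T and R); nothing here is prescribed by the slides.

```python
# A minimal flat representation of a (stationary) MDP <S, A, T, R>.
# Dictionary keys and structure are illustrative, not prescribed by the slides.

states = ["s0", "s1", "sg"]          # S: finite state space
actions = ["a0", "a1"]               # A: finite action set

# T[(s, a)] is a dict mapping successor state s' -> probability T(s, a, s').
T = {
    ("s0", "a0"): {"s1": 0.6, "s0": 0.4},
    ("s0", "a1"): {"sg": 1.0},
    ("s1", "a0"): {"sg": 1.0},
    ("s1", "a1"): {"s0": 1.0},
    ("sg", "a0"): {"sg": 1.0},
    ("sg", "a1"): {"sg": 1.0},
}

# R[(s, a, s')] is the reward for taking a in s and landing in s'.
R = {
    ("s0", "a0", "s1"): -5.0, ("s0", "a0", "s0"): -2.0,
    ("s0", "a1", "sg"): -10.0,
    ("s1", "a0", "sg"): -1.0,
    ("s1", "a1", "s0"): -3.0,
    ("sg", "a0", "sg"): 0.0, ("sg", "a1", "sg"): 0.0,
}

# Sanity check: every (s, a) distribution should sum to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```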


An MDP Problem

  • How does an MDP problem work?

    • Initial Conditions: the starting state

    • Actions: actions are chosen at each decision epoch to traverse the MDP

    • Termination: reach a terminating state or the final decision epoch

  • The goal is to end up with the highest net reward at termination


The Policy

  • Policy: a rule for choosing actions

    • Global/Complete: a policy must prescribe behavior for every situation the agent can encounter in the MDP

  • In general, policies will be

    • Probabilistic: able to choose between multiple actions randomly

    • History-Dependent: able to utilize the execution history, or the sequence of states and actions previously traversed

    • π: H x A → [0, 1]


Markovian Policy

  • Markovian Policy: a history-dependent policy that only depends on the current state and time step

    • For any two histories hs,t and h′s,t, both of which end at the same state s and timestep t, and for any action a, a Markovian policy π will satisfy π(hs,t, a) = π(h′s,t, a)

    • In practice, it functions as a history-independent policy

    • π: S x D x A → [0, 1]

  • For several important types of MDPs, at least one optimal solution is necessarily Markovian


Stationary Markovian Policy

  • Stationary Markovian Policy: a Markovian policy that does not depend on time

    • For any two timesteps t1 and t2, any state s, and any action a, a stationary Markovian policy π will satisfy π(s, t1, a) = π(s, t2, a)

    • π: S x A → [0, 1]
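
As a rough illustration of these narrowing domains, the sketch below represents a stationary deterministic policy and a stationary stochastic Markovian policy as plain dictionaries; all names are illustrative, not notation from the slides.

```python
import random

# Stationary deterministic Markovian policy: one action per state.
pi_det = {"s0": "a0", "s1": "a0", "sg": "a0"}

# Stationary stochastic Markovian policy: pi(s, a) = probability of choosing a in s.
pi_stoch = {
    "s0": {"a0": 0.8, "a1": 0.2},
    "s1": {"a0": 1.0},
    "sg": {"a0": 1.0},
}

def choose_action(pi, s, rng=random):
    """Sample an action from a stochastic policy (or read off a deterministic one)."""
    rule = pi[s]
    if isinstance(rule, str):          # deterministic: maps state -> action
        return rule
    actions, probs = zip(*rule.items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(choose_action(pi_det, "s0"))     # always a0
print(choose_action(pi_stoch, "s0"))   # a0 with prob 0.8, a1 with prob 0.2
```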


Evaluate a Policy with the Value Function

  • Value Function: a function mapping the domain of the policy excluding the action set to a scalar value.

    • History dependent: V: H → [-∞, ∞]

    • Markovian: V: S x D → [-∞, ∞]

    • Stationary Markovian: V: S → [-∞, ∞]

  • Value Function of a Policy: the utility of the reward sequence obtained by executing the policy

    • Vπ(hs,t) = u(Rt, Rt+1, …), where Ri is the reward earned at step i while executing π starting from history hs,t


Solutions of an MDP

  • A solution to an MDP is an optimal policy, or a policy that maximizes utility.

  • Policy π* is optimal if its value function, V*, is greater than or equal to the value function of any other policy.

    • V*(h) ≥ Vπ(h) for all h and π

  • Need to be careful when defining utility u(R1, R2, …)

    • For the same h, utility can be different across policy executions

  • Existence and uniqueness are not guaranteed for many types of MDPs.


Expected Linear Additive Utility (ELAU)

  • u(R1, R2, …) = E(R1 + γR2 + γ²R3 + …), where γ is the discount factor

  • Assume γ = 1 unless stated otherwise

    • 0 ≤ γ < 1 : more immediate rewards are more valuable

    • γ = 1 : rewards are equally valuable, independently of time

    • γ > 1 : more distant rewards are more valuable
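
A small, self-contained sketch of the linear additive part of ELAU for one sampled reward sequence (the expectation is over many executions of the policy); the function name and numbers are illustrative.

```python
def discounted_utility(rewards, gamma=1.0):
    """Linear additive utility of one reward sequence: R1 + gamma*R2 + gamma^2*R3 + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1.0, 2.0, 4.0]
print(discounted_utility(rewards, gamma=1.0))   # 7.0: all rewards weighted equally
print(discounted_utility(rewards, gamma=0.5))   # 1 + 1 + 1 = 3.0: immediate rewards favored
```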


The Optimality Principle

  • The Optimality Principle: if every policy’s quality can be measured by its ELAU, then there exists a policy that is optimal at every timestep

  • There are some situations where it may not apply:

    • When stuck in a repeating sequence of states (a loop)

    • Infinite decision epochs

    • Infinite utility


The Optimality Principle does not Hold

  • Oscillating Utility

  • Unbounded Utility


Further Utility Considerations

  • Risk averse, or risk taking (ELAU is risk neutral)

    • $1 million guaranteed (risk averse)

    • 50% chance of $2 million, 50% chance of $0 (risk taking)

    • The expected value is the same, so a risk-neutral agent is indifferent between the two


3 Models with Well-Defined Policy ELAU

  • Finite-horizon MDPs

  • Infinite-horizon discounted-reward MDPs

  • Stochastic shortest-path MDPs


Finite-Horizon MDPs: Motivation

  • Assume the agent acts for a finite # of time steps, L

  • Example applications:

    • Inventory management: “How much X to order from the supplier every day ’til the end of the season?”

    • Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”


Finite-Horizon MDPs: Definition

Puterman, 1994

  • FH MDP: an FH MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps) up to a finite horizon L

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function


Finite-Horizon MDPs: Optimality Principle

  • For an FH MDP with horizon |D| = L < ∞, let:

    • Vπ(hs,t) = Eπhs,t[R1 + … + RL–t] for all 1 ≤ t ≤ L

    • Vπ(hs,L+1) = 0

  • Then

    • V* exists and is Markovian, and π* exists and is deterministic Markovian

    • For all s and 1 ≤ t ≤ L:

      V*(s, t) = maxa in A [ Σs′ in S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ]

      π*(s, t) = argmaxa in A [ Σs′ in S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ]

  • For every history, the value of every policy (its ELAU) is well-defined: the series has a finite number of terms, and each E[Ri] is finite

  • Reading the equation for V*(s, t), the highest utility derivable from s at time t: R(s, a, s′, t) is the immediate utility of the next action, V*(s′, t+1) is the highest utility derivable from the next state if you act optimally from now on, and the sum weighted by T(s, a, s′, t) takes the expectation

Perks of the FH MDP Optimality Principle

  • If V* and π* are Markovian, then we only need to consider Markovian V and π

  • Easy to compute π*

    • For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
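
The recipe in the last bullet, sweeping t = L, …, 1, is a short dynamic program. Below is a hedged sketch of that backward induction on a toy finite-horizon MDP; the dictionary layout mirrors the earlier sketch, T and R are taken to be the same at every epoch for brevity, and all names and numbers are illustrative.

```python
# Backward induction for a finite-horizon MDP (toy problem, illustrative names).
# For simplicity T and R are the same at every decision epoch.

states = ["s0", "s1"]
actions = ["stay", "go"]
L = 3  # horizon: decision epochs 1..L

T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): -1.0,   # reward depends only on (s, a) here
    ("s1", "stay"): 2.0, ("s1", "go"): -1.0,
}

# V[t][s] = V*(s, t); the value at epoch L+1 is 0 by definition.
V = {L + 1: {s: 0.0 for s in states}}
pi = {}

for t in range(L, 0, -1):             # sweep t = L, ..., 1
    V[t], pi[t] = {}, {}
    for s in states:
        q = {
            a: sum(p * (R[(s, a)] + V[t + 1][s2]) for s2, p in T[(s, a)].items())
            for a in actions
        }
        pi[t][s] = max(q, key=q.get)  # argmax over actions
        V[t][s] = q[pi[t][s]]         # max over actions

print(V[1])   # optimal value of each state at the first decision epoch
print(pi[1])  # optimal first action in each state
```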


To Infinity and Beyond

  • Why go beyond the finite horizon?

    • Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)

  • Infinite Horizon

    • Known to be infinite (can continue indefinitely)

  • Indefinite Horizon

    • Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)


Analyzing MDPs with In(de)finite Horizon

  • Due to the infinite nature of D, we must use stationary, or time-independent, functions:

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • π: S → A (this is also Markovian)

    • V: S → [-∞, ∞] (this is also Markovian)


Infinite-Horizon Discounted-Reward MDPs: Definition

  • IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • γ is a discount factor between 0 and 1 (favors immediate rewards)

  • Policy value = discounted ELAU over infinite time steps


Infinite-Horizon Discounted-Reward MDPs: Optimality Principle

  • For an IHDR MDP, let:

    • Vπ(h) = Eπh[R1 + γR2 + γ²R3 + …] for all h

  • Then

    • V* exists and is stationary Markovian, and π* exists and is stationary deterministic Markovian

    • For all s:

      V*(s) = maxa in A [ Σs′ in S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ]

      π*(s) = argmaxa in A [ Σs′ in S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ]

  • For every history, the value of a policy (its ELAU) is well-defined thanks to 0 ≤ γ < 1: each E[Ri] is bounded by some finite K, so the discounted series converges geometrically

  • Reading the equation for V*(s): future utility is discounted by γ, and the optimal utility is time-independent


Perks of the IHDR MDP Optimality Principle

  • If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π
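
One way to see the stationary fixed point in code is to repeatedly apply the Bellman backup from the optimality principle until the values stop changing. This is only a hedged illustration on a toy IHDR MDP, not an algorithm from these slides; names, numbers, and the stopping threshold are illustrative.

```python
# Repeatedly apply V(s) <- max_a sum_s' T(s,a,s') [ R(s,a) + gamma * V(s') ] (toy IHDR MDP).

gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]

T = {
    ("s0", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s1": 1.0},
}
R = {
    ("s0", "a0"): 1.0, ("s0", "a1"): 0.0,   # reward depends only on (s, a) here
    ("s1", "a0"): 0.0, ("s1", "a1"): 2.0,
}

V = {s: 0.0 for s in states}
for _ in range(1000):
    newV = {}
    for s in states:
        newV[s] = max(
            sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
            for a in actions
        )
    if max(abs(newV[s] - V[s]) for s in states) < 1e-8:   # values stopped changing
        V = newV
        break
    V = newV

# Greedy (stationary deterministic Markovian) policy extracted from V.
pi = {
    s: max(actions,
           key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[(s, a)].items()))
    for s in states
}
print(V, pi)
```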


The Meaning of γ

  • γ can affect the optimal policy significantly

    • γ = 0 + ε: yields myopic policies for impatient agents

    • γ = 1 - ε: yields far-sighted policies, which can be inefficient to compute

  • How to set it?

    • Sometimes suggested by data (inflation rate, interest rate, tax rate)

    • Often set arbitrarily to a value that gives a reasonable policy


Stochastic Shortest-Path MDPs: Motivation

  • Assume the agent pays a cost to achieve a goal

  • Example applications:

    • Controlling a Mars rover: “How to collect scientific data without damaging the rover?”

    • Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”

  • Cost is often time or a physical resource


Stochastic Shortest-Path MDPs: Definition

  • SSP MDP: an SSP MDP is a tuple <S, A, T, C, G>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a stationary transition function (probability)

    • C: S x A x S → ℝ is a stationary cost function (R = -C)

    • G ⊆ S is a set of absorbing cost-free goal states

  • Under two conditions:

    • There is at least one proper policy (reaches goal with P = 1 from all states)

    • Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1


SSP MDP Details

  • In SSP, maximizing ELAU = minimizing expected cost

  • Every cost-minimizing policy is proper

  • Thus, an optimal policy is the cheapest way to reach the goal

  • Why are SSP MDPs called “indefinite-horizon”?

    • An optimal policy takes a finite, but a priori unknown, time to reach the goal (the time to the goal depends on the particular execution of the policy).

    • In the limit as t approaches infinity, the probability that a goal state has been reached approaches P = 1


SSP MDP Example

SSP MDPs: Optimality Principle

  • For an SSP MDP, let:

    • Vπ(h) = Eπh[C1 + C2 + …] for all h

  • Then

    • V* exists and is stationary Markovian, and π* exists and is stationary deterministic Markovian

    • For all s:

      V*(s) = mina in A [ Σs′ in S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ]

      π*(s) = argmina in A [ Σs′ in S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ]

  • For every history, the value of a policy (its ELAU) is well-defined: every policy either takes a finite expected number of steps to reach a goal, or has infinite cost
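
To connect the min/argmin form to code: given any value function V, the greedy action at a state is the argmin of expected cost-to-go. A minimal sketch, assuming the dictionary MDP layout used earlier; all names and numbers are illustrative.

```python
def greedy_action(s, V, actions, T, C):
    """argmin_a sum_s' T(s,a,s') [ C(s,a,s') + V(s') ] -- the SSP optimality equation."""
    def q(a):
        return sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
    return min(actions, key=q)

# Tiny example: from s0, "risky" is cheap but may loop back, "safe" always reaches the goal.
T = {("s0", "safe"):  {"g": 1.0},
     ("s0", "risky"): {"g": 0.5, "s0": 0.5}}
C = {("s0", "safe", "g"): 3.0,
     ("s0", "risky", "g"): 1.0, ("s0", "risky", "s0"): 1.0}
V = {"s0": 2.0, "g": 0.0}   # some current value estimate; V(goal) = 0

print(greedy_action("s0", V, ["safe", "risky"], T, C))   # "risky": 0.5*1 + 0.5*(1+2) = 2 < 3
```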


The MDP Hierarchy

  • FH => SSP: turn all states (s, L) into goals

  • IHDR => SSP: add (1 – γ)-probability transitions to goal

  • Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.
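
The IHDR => SSP reduction can be made concrete: scale every original transition probability by γ, send the leftover (1 – γ) mass to a fresh goal state, and charge the negated expected immediate reward as the cost of the action. The sketch below uses one standard bookkeeping choice, not the only one; the function name and dictionary layout are illustrative.

```python
def ihdr_to_ssp(states, actions, T, R, gamma, goal="GOAL"):
    """Reduce an IHDR MDP <S, A, T, R, gamma> to an SSP MDP <S + {goal}, A, T', C, {goal}>.

    Every action now reaches `goal` with probability (1 - gamma); the cost charged for
    taking a in s is the negated *expected* immediate reward, so the expected total SSP
    cost from s equals -V_IHDR(s).  (One bookkeeping choice among several.)
    """
    T_ssp, C_ssp = {}, {}
    for s in states:
        for a in actions:
            dist = {s2: gamma * p for s2, p in T[(s, a)].items()}
            dist[goal] = dist.get(goal, 0.0) + (1.0 - gamma)
            T_ssp[(s, a)] = dist
            # Expected immediate reward under the original dynamics, negated into a cost.
            exp_r = sum(p * R[(s, a, s2)] for s2, p in T[(s, a)].items())
            C_ssp[(s, a)] = -exp_r
    return T_ssp, C_ssp, {goal}

# Example: a one-state IHDR MDP with reward 1 per step and gamma = 0.9.
T = {("s0", "a"): {"s0": 1.0}}
R = {("s0", "a", "s0"): 1.0}
T2, C2, goals = ihdr_to_ssp(["s0"], ["a"], T, R, 0.9)
print(T2)   # {('s0', 'a'): {'s0': 0.9, 'GOAL': 0.1}}
print(C2)   # {('s0', 'a'): -1.0} -> expected SSP cost from s0 is -10 = -V_IHDR(s0)
```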


Flat vs. Factored Representation of MDPs

  • We are only concerned with using flat representation

    • This is the name for the representation already introduced on the definition slides

    • It is easier to present solution algorithms for MDPs in flat representation; large MDPs are more compactly described in factored representation

  • If you are interested in factored representation, read Section 2.5


Computational Complexity of MDPs

  • Solving IHDR, SSP in flat representation is P-complete

  • Solving FH in flat representation is P-hard

  • They are solvable in polynomial time but, being P-hard, are unlikely to benefit significantly from parallelization


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Brute Force Algorithm

  • Go over all policies π

    • How many? |A|^|S|, a finite number

  • Evaluate each policy

    • Vπ(s), the expected cost of reaching the goal from s

  • Choose the best, π*

    • The SSP optimality principle tells us that a best policy exists

    • Vπ*(s) ≤ Vπ(s) for all s and all π
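
A direct, if hopelessly inefficient, rendering of the idea: enumerate all |A|^|S| stationary deterministic policies, evaluate each by solving the linear system described on the policy-evaluation slides, and keep a policy whose value dominates. A hedged sketch on a toy SSP in which every policy happens to be proper; all names and numbers are illustrative.

```python
# Brute-force search over all |A|^|S| stationary deterministic policies (toy SSP MDP).
# Assumes every enumerated policy is proper, so each linear system is solvable.
from itertools import product
import numpy as np

states = ["s0", "s1"]          # non-goal states
goal = "g"
actions = ["a", "b"]

T = {
    ("s0", "a"): {"s1": 1.0},
    ("s0", "b"): {"g": 0.5, "s0": 0.5},
    ("s1", "a"): {"g": 1.0},
    ("s1", "b"): {"g": 0.8, "s1": 0.2},
}
C = {
    ("s0", "a"): 2.0, ("s0", "b"): 3.0,   # cost depends only on (s, a) here
    ("s1", "a"): 5.0, ("s1", "b"): 1.0,
}

def evaluate(pi):
    """Solve V(s) = sum_s' T(s, pi(s), s') [ C + V(s') ] with V(goal) = 0."""
    n = len(states)
    A_mat, b = np.eye(n), np.zeros(n)
    for i, s in enumerate(states):
        a = pi[s]
        b[i] = C[(s, a)]
        for s2, p in T[(s, a)].items():
            if s2 != goal:
                A_mat[i, states.index(s2)] -= p
    return dict(zip(states, np.linalg.solve(A_mat, b)))

best_pi, best_V = None, None
for choice in product(actions, repeat=len(states)):      # all |A|^|S| policies
    pi = dict(zip(states, choice))
    V = evaluate(pi)
    if best_V is None or all(V[s] <= best_V[s] for s in states):
        best_pi, best_V = pi, V          # keep the policy whose value dominates

print(best_pi, best_V)
```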


Policy Evaluation

  • Given a policy π, compute Vπ

  • To start out, assume that π is proper

    • Executing π reaches a goal with P = 1 from any state


Deterministic SSPs

  • Policy graph for π

    • π(s0) = a0; π(s1) = a1

  • Vπ(s1) = 1

  • Vπ(s0) = 5 + 1 = 6


Acyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 4

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6


Cyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))


Cyclic SSPs

  • Generalized system of equations

  • Vπ(sg) = 0

  • Vπ(s1) = 1 + Vπ(sg)

  • Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))


Policy Evaluation with a System of Equations

  • Constructing the system of equations

    • Vπ(s) = 0 if s ∈ G

    • Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)]

  • |S| variables

  • O(|S|³) running time
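
As a concrete check, the cyclic example above can be evaluated by building and solving its |S|-variable linear system with numpy; the transition probabilities and costs below mirror the example equations, and the rest of the names are illustrative.

```python
# Policy evaluation by solving the linear system (cyclic example from the previous slides).
import numpy as np

states = ["s0", "s1", "s2"]           # goal sg handled implicitly, with V(sg) = 0
# Under policy pi: list of (successor, probability, cost) for each state.
trans = {
    "s0": [("s1", 0.6, 5.0), ("s2", 0.4, 2.0)],
    "s1": [("sg", 1.0, 1.0)],
    "s2": [("sg", 0.7, 4.0), ("s0", 0.3, 3.0)],
}

# Build (I - T_pi) V = c_pi over the non-goal states.
n = len(states)
A = np.eye(n)
b = np.zeros(n)
for i, s in enumerate(states):
    for s2, p, cost in trans[s]:
        b[i] += p * cost
        if s2 in states:              # V(sg) = 0, so goal terms drop out
            A[i, states.index(s2)] -= p

V = np.linalg.solve(A, b)
print(dict(zip(states, V)))   # roughly V(s0) ≈ 6.68, V(s1) = 1.0, V(s2) ≈ 5.70
```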


Iterative Evaluation of Cyclic SSPs


Policy Evaluation with Iteration

  • Vnπ(s) ← ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vn-1π(s′)]

    • Iterative solution

  • Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)]

    • System of equations, for comparison


Convergence and Optimality

  • For a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e. limn→∞ Vnπ = Vπ, irrespective of the initialization V0π


Termination and Error Bounds

  • Residual: the magnitude of the change in the value of state s at iteration n in the algorithm.

    • residualn(s) = | Vnπ(s) – Vn-1π(s)|

    • residualn = maxs ∈ S residualn(s)

  • ϵ-consistency: the residual of the value function in iteration n + 1 is less than ϵ

    • A state s is ϵ-consistent if residualn(s) < ϵ

  • When iterative policy evaluation is run to ϵ-consistency, Vnπ satisfies the following inequality: ∀s ∈ S: | Vnπ(s) – Vπ(s)| < ϵ Nπ(s), where Nπ(s) is the expected number of steps to reach the goal from s by following π


Policy Evaluation with Iteration: Algorithm

Each iteration takes O(|S|²) time
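
Putting the update rule and the ε-consistency test together, here is a hedged sketch of iterative policy evaluation run on the same cyclic example; the ε value and all names are illustrative choices.

```python
# Iterative policy evaluation with epsilon-consistency termination (cyclic example).
trans = {
    "s0": [("s1", 0.6, 5.0), ("s2", 0.4, 2.0)],
    "s1": [("sg", 1.0, 1.0)],
    "s2": [("sg", 0.7, 4.0), ("s0", 0.3, 3.0)],
}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "sg": 0.0}   # arbitrary initialization; V(sg) stays 0
epsilon = 1e-6

while True:
    residual = 0.0
    newV = dict(V)
    for s, outcomes in trans.items():                 # one sweep over all non-goal states
        newV[s] = sum(p * (cost + V[s2]) for s2, p, cost in outcomes)
        residual = max(residual, abs(newV[s] - V[s]))
    V = newV
    if residual < epsilon:                            # epsilon-consistent: stop
        break

print({s: round(v, 4) for s, v in V.items()})   # approaches V(s0) ≈ 6.6818, V(s2) ≈ 5.7045
```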


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Summary

  • General components of an MDP

  • Three types of MDPs (FH, IHDR, SSP)

  • Utility and cost (ELAU)

  • Ways to solve MDPs (system of equations, iterative evaluation)

