MDP Problems and Exact Solutions I

  • Ryan Christiansen

  • Department of Mechanical Engineering and Materials Science

  • Rice University

  • Slides adapted from Mausam and Andrey Kolobov


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Definition

  • MDP: an MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps)

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function
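
As a rough sketch (not from the slides; the class name and container choices are assumptions), this tuple could be written down in Python as follows:

from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable

@dataclass
class MDP:
    """A finite MDP <S, A, D, T, R> with time-dependent dynamics."""
    states: Sequence[State]      # S: finite state space
    actions: Sequence[Action]    # A: finite action set
    horizon: int                 # D: decision epochs 1, ..., horizon
    T: Callable[[State, Action, State, int], float]  # transition probability T(s, a, s', t)
    R: Callable[[State, Action, State, int], float]  # reward R(s, a, s', t)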


An MDP Problem

  • How does an MDP problem work?

    • Initial Conditions: the starting state

    • Actions: actions are chosen at each decision epoch to traverse the MDP

    • Termination: reach a terminating state or the final decision epoch

  • The goal is to end up with the highest net reward at termination


The Policy

  • Policy: a rule for choosing actions

    • Global/Complete: a policy must always be applicable for the entire MDP

  • In general, policies will be

    • Probabilistic: able to choose between multiple actions randomly

    • History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed

    • π: H x A → [0, 1]


Markovian Policy

  • Markovian Policy: a history-dependent policy that only depends on the current state and time step

    • For any two histories hs,t and h′s,t, both of which end at the same state s and timestep t, and for any action a, a Markovian policy π will satisfy π(hs,t, a) = π(h′s,t, a)

    • In practice, it functions as a history-independent policy

    • π: S x D x A → [0, 1]

  • For several important types of MDPs, at least one optimal solution is necessarily Markovian


Stationary Markovian Policy

  • Stationary Markovian Policy: a Markovian policy that does not depend on time

    • For any two timesteps t1 and t2 and any state s, and for any action a, a stationary Markovian policy π will satisfy π(s, t1, a) = π(s, t2, a)

    • π: S x A → [0, 1]
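
As a rough illustration in Python (the type aliases are assumptions, not from the slides), the three classes of policies differ only in what they condition on:

from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable
History = Tuple  # alternating states and actions, ending in the current state

# History-dependent probabilistic policy: pi(h, a) = P(choose a | history h)
HistoryDependentPolicy = Callable[[History, Action], float]

# Markovian policy: conditions only on the current state and time step
MarkovianPolicy = Callable[[State, int, Action], float]

# Stationary Markovian policy: conditions only on the current state
StationaryMarkovianPolicy = Callable[[State, Action], float]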


Evaluate a Policy with the Value Function

  • Value Function: a function mapping the domain of the policy excluding the action set to a scalar value.

    • History dependent: V: H → [-∞, ∞]

    • Markovian: V: S x D → [-∞, ∞]

    • Stationary Markovian: V: S → [-∞, ∞]

  • Value Function of a Policy: the utility of the reward sequence obtained by executing the policy

    • Vπ(hs,t) = u(Rtπ, Rt+1π, …), where Riπ is the reward received at step i when π is executed starting from history hs,t


Solutions of an MDP

  • A solution to an MDP is an optimal policy, or a policy that maximizes utility.

  • Policy π* is optimal if the value function, V*, is greater than or equal to the value function of any other policy.

    • V*(h) ≥ Vπ(h) for all h and π

  • Need to be careful when defining utility u(R1, R2, …)

    • For the same h, utility can be different across policy executions

  • Existence and uniqueness are not guaranteed for many types of MDPs.


Expected Linear Additive Utility (ELAU)

  • u(R1, R2, …) = E[R1 + γR2 + γ²R3 + …], where γ is the discount factor

  • Assume γ = 1 unless stated otherwise

    • 0 ≤ γ < 1 : more immediate rewards are more valuable

    • γ = 1 : rewards are equally valuable, independently of time

    • γ > 1 : more distant rewards are more valuable
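
A quick numeric sketch (the reward sequence below is made up purely for illustration; for a fixed sequence the expectation drops out):

def linear_additive_utility(rewards, gamma=1.0):
    # u(R1, R2, ...) = R1 + gamma*R2 + gamma^2*R3 + ...
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [1.0, 1.0, 10.0]
print(linear_additive_utility(rewards, gamma=1.0))  # 12.0: all rewards count equally
print(linear_additive_utility(rewards, gamma=0.5))  # 4.0: immediate rewards dominate (1 + 0.5 + 2.5)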


The Optimality Principle

  • The Optimality Principle: if every policy’s quality can be measured by its ELAU, then there exists a policy that is optimal at every timestep

  • There are some situations where it may not apply:

    • When stuck in a repeating sequence of states (a loop)

    • Infinite decision epochs

    • Infinite utility


The Optimality Principle Does Not Hold

  • Oscillating Utility

  • Unbounded Utility


Further Utility Considerations

  • Risk averse, or risk taking (ELAU is risk neutral)

    • $1 million guaranteed (risk averse)

    • 50% chance of $2 million, 50% chance of $0 (risk taking)

    • Expected value is the same, so a risk-neutral agent is indifferent between the two


3 Models with Well-Defined Policy ELAU

  • Finite-horizon MDPs

  • Infinite-horizon discounted-reward MDPs

  • Stochastic shortest-path MDPs


Finite-Horizon MDPs: Motivation

  • Assume the agent acts for a finite # of time steps, L

  • Example applications:

    • Inventory management: “How much X to order from the supplier every day ’til the end of the season?”

    • Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”


Finite-Horizon MDPs: Definition

Puterman, 1994

  • FH MDP: an FH MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps) up to a finite horizon L

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function


Finite-Horizon MDPs: Optimality Principle

  • For an FH MDP with horizon |D| = L < ∞, let:

    • Vπ(hs,t) = Eπhs,t[R1 + … + RL–t] for all 1 ≤ t ≤ L

    • Vπ(hs,L+1) = 0

  • Then

    • V* exists and is Markovian, π* exists and is det. Markovian

    • For all s and 1 ≤ t ≤ L:
      V*(s, t) = maxa ∈ A [ Σs′ ∈ S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ]
      π*(s, t) = argmaxa ∈ A [ Σs′ ∈ S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ]


Finite-Horizon MDPs: Optimality Principle (annotations)

  • Why every policy’s ELAU is well-defined for every history: each E[Ri] is finite, and the number of terms in the series is finite.

  • Reading the equation: Σs′ ∈ S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] is, in expectation, the immediate utility of the next action plus the highest utility derivable from the next state if you act optimally from now on; V*(s, t) is therefore the highest utility derivable from s at time t.


Perks of the FH MDP Optimality Principle

  • If V* and π* are Markovian, then we only need to consider Markovian V and π

  • Easy to compute π*

    • For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
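
A sketch of that backward pass in Python, assuming the MDP is given as dictionaries (the data layout and names are assumptions, not from the slides):

def solve_finite_horizon(S, A, T, R, L):
    """Backward induction for an FH MDP: compute V*(s, t) and a deterministic
    Markovian pi*(s, t) for t = L, ..., 1, using the base case V*(s, L+1) = 0.
    T[(s, a, t)] is a dict {s2: probability}; R(s, a, s2, t) returns the reward."""
    V = {(s, L + 1): 0.0 for s in S}
    pi = {}
    for t in range(L, 0, -1):
        for s in S:
            q = {a: sum(p * (R(s, a, s2, t) + V[(s2, t + 1)])
                        for s2, p in T[(s, a, t)].items())
                 for a in A}
            best_a = max(q, key=q.get)
            V[(s, t)], pi[(s, t)] = q[best_a], best_a
    return V, pi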


To Infinity and Beyond

  • Why go beyond the finite horizon?

    • Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)

  • Infinite Horizon

    • Known to be infinite (can continue indefinitely)

  • Indefinite Horizon

    • Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)


Analyzing MDPs with In(de)finite Horizon

  • Due to the infinite nature of D, we must define stationary (time-independent) functions:

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • π: S → A (this is also Markovian)

    • V: S → [-∞, ∞] (this is also Markovian)


Infinite-Horizon Discounted-Reward MDPs: Definition

  • IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • γ is a discount factor between 0 and 1 (favors immediate rewards)

  • Policy value = discounted ELAU over infinite time steps


Infinite-Horizon Discounted-Reward MDPs: Optimality Principle

  • For an IHDR MDP, let:

    • Vπ(h) = Eπh[R1 + γR2 + γ²R3 + …] for all h

  • Then

    • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

    • For all s:
      V*(s) = maxa ∈ A [ Σs′ ∈ S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ]
      π*(s) = argmaxa ∈ A [ Σs′ ∈ S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ]
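
The two equations above can be read as a single “backup” at each state. A minimal sketch, assuming transitions are stored as dictionaries (names are illustrative, not from the slides):

def bellman_backup(s, A, T, R, V, gamma):
    """Apply the optimality equation at state s given a current value estimate V:
    returns (new estimate of V*(s), greedy action pi*(s)).
    T[(s, a)] is a dict {s2: probability}; R(s, a, s2) returns the reward."""
    q = {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                for s2, p in T[(s, a)].items())
         for a in A}
    best_a = max(q, key=q.get)
    return q[best_a], best_a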


Infinite-Horizon Discounted-Reward MDPs: Optimality Principle (annotations)

  • Why every policy’s ELAU is well-defined for every history: thanks to 0 ≤ γ < 1, the terms γ^(i−1) E[Ri] are bounded by a geometric series (each E[Ri] is bounded by some finite K), so the sum converges.

  • Reading the equation: future utility is discounted by γ, and the optimal utility V*(s) is time-independent.


Perks of the IHDR MDP Optimality Principle

  • If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π


The Meaning of γ

  • γ can affect the optimal policy significantly

    • γ = 0 + ε: yields myopic policies for impatient agents

    • γ = 1 - ε: yields far-sighted policies, inefficient to compute

  • How to set it?

    • Sometimes suggested by data (inflation rate, interest rate, tax rate)

    • Often set arbitrarily to a value that gives a reasonable policy


Stochastic Shortest-Path MDPs: Motivation

  • Assume the agent pays a cost to achieve a goal

  • Example applications:

    • Controlling a Mars rover: “How to collect scientific data without damaging the rover?”

    • Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”

  • Cost is often time or a physical resource


Stochastic Shortest-Path MDPs: Definition

  • SSP MDP: an SSP MDP is a tuple <S, A, T, C, G>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a stationary transition function (probability)

    • C: S x A x S → ℝ is a stationary cost function (R = -C)

    • G ⊆ S is a set of absorbing cost-free goal states

  • Under two conditions:

    • There is at least one proper policy (reaches goal with P = 1 from all states)

    • Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1


SSP MDP Details

  • In SSP, maximizing ELAU = minimizing expected cost

  • Every cost-minimizing policy is proper

  • Thus, an optimal policy is the cheapest way to reach the goal

  • Why are SSP MDPs called “indefinite-horizon?”

    • An optimal policy takes a finite but a priori unknown amount of time to reach the goal (the time to the goal depends on how the policy’s execution unfolds).

    • In the limit as t approaches infinity, the probability that a goal state has been reached approaches 1.


SSP MDP Example

  • (A worked SSP example, shown as figures on the original slides; no transcript text.)

SSP MDPs: Optimality Principle

  • For an SSP MDP, let:

    • Vπ(h) = Eπh[C1 + C2 + …] for all h

  • Then

    • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

    • For all s:
      V*(s) = mina ∈ A [ Σs′ ∈ S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ]
      π*(s) = argmina ∈ A [ Σs′ ∈ S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ]
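
Mirroring the equations above, a minimal backup sketch with assumed dictionary inputs (names are illustrative, not from the slides):

def ssp_backup(s, A, T, C, V):
    """Optimality backup for an SSP MDP at state s: the cheapest expected
    cost-to-go and the action achieving it, given a current estimate V.
    T[(s, a)] is a dict {s2: probability}; C(s, a, s2) returns the cost."""
    q = {a: sum(p * (C(s, a, s2) + V[s2]) for s2, p in T[(s, a)].items())
         for a in A}
    best_a = min(q, key=q.get)
    return q[best_a], best_a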


SSP MDPs: Optimality Principle (annotations)

  • Why every policy’s ELAU is well-defined for every history: every policy either takes a finite expected number of steps to reach a goal, or has infinite cost.


The MDP Hierarchy

  • FH => SSP: turn all states (s, L) into goals

  • IHDR => SSP: add (1 – γ)-probability transitions to goal

  • Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.
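
A sketch of the IHDR-to-SSP reduction mentioned above, assuming dictionary-style dynamics (the representation and the name of the added goal state are assumptions):

def ihdr_to_ssp(S, A, T, R, gamma, goal="GOAL"):
    """Reduce an IHDR MDP to an SSP MDP: scale every transition by gamma, send the
    remaining (1 - gamma) probability mass to an absorbing, cost-free goal state,
    and turn expected rewards into costs (C = -R).
    T[(s, a)] is a dict {s2: probability}; R(s, a, s2) returns the reward."""
    T_ssp, C_ssp = {}, {}
    for s in S:
        for a in A:
            probs = {s2: gamma * p for s2, p in T[(s, a)].items()}
            probs[goal] = probs.get(goal, 0.0) + (1.0 - gamma)
            T_ssp[(s, a)] = probs
            # The cost of taking a in s is the negated expected reward of the original MDP.
            C_ssp[(s, a)] = -sum(p * R(s, a, s2) for s2, p in T[(s, a)].items())
    return list(S) + [goal], A, T_ssp, C_ssp, {goal}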


Flat vs. Factored Representation of MDPs

  • We are only concerned with using flat representation

    • This is the name for the representation already introduced on the definition slides

    • It is easier to state and solve algorithms for MDPs in flat representation; describing very large MDPs compactly is what factored representations are for

  • If you are interested in factored representation, read Section 2.5


Computational Complexity of MDPs

  • Solving IHDR, SSP in flat representation is P-complete

  • Solving FH in flat representation is P-hard

  • They are unlikely to benefit from parallelization (a consequence of P-completeness), but they are solvable in polynomial time


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Brute Force Algorithm

  • Go over all policies π

    • How many? |A|^|S|, a finite number

  • Evaluate each policy

    • Vπ(s), the expected cost of reaching the goal from s

  • Choose the best, π*

    • The SSP optimality principle tells us that a best one exists

    • Vπ*(s) ≤ Vπ(s)
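
A sketch of the enumeration (only workable for tiny MDPs, since there are |A|^|S| policies). The helper evaluate_policy is an assumption: it should return Vπ as a dictionary, e.g. via the policy-evaluation methods later in the deck, with infinite values for states from which π never reaches the goal:

from itertools import product

def brute_force(S, A, evaluate_policy):
    """Enumerate all |A|^|S| deterministic stationary policies, evaluate each one,
    and keep a policy whose value is no worse at every state (the SSP optimality
    principle guarantees such a dominating policy exists)."""
    S = list(S)
    best_pi, best_V = None, None
    for assignment in product(A, repeat=len(S)):   # one action per state
        pi = dict(zip(S, assignment))
        V = evaluate_policy(pi)                    # {s: expected cost-to-goal under pi}
        if best_V is None or all(V[s] <= best_V[s] for s in S):
            best_pi, best_V = pi, V
    return best_pi, best_V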


Policy Evaluation

  • Given a policy π, compute Vπ

  • To start out, assume that π is proper

    • Execution of π reaches a goal from any state
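
One way to sanity-check that assumption (a sketch; function and variable names are assumptions): with absorbing goals and a finite state space, π is proper exactly when every state has a positive-probability path to a goal in π's transition graph.

def is_proper(S, T, G, pi):
    """Check whether following the deterministic policy pi reaches a goal from every
    state. With absorbing goal states and a finite state space, this reachability
    test is equivalent to reaching G with probability 1 from everywhere.
    T[(s, a)] is a dict {s2: probability}."""
    can_reach = set(G)
    changed = True
    while changed:
        changed = False
        for s in S:
            if s in can_reach:
                continue
            if any(p > 0 and s2 in can_reach for s2, p in T[(s, pi[s])].items()):
                can_reach.add(s)
                changed = True
    return all(s in can_reach for s in S)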


Deterministic SSPs

  • Policy graph for π

    • π(s0) = a0; π(s1) = a1

  • Vπ(s1) = 1

  • Vπ(s0) = 5 + 1 = 6


Acyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 4

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6


Cyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))


Cyclic SSPs

  • Generalized system of equations

  • Vπ(sg) = 0

  • Vπ(s1) = 1 + Vπ(sg)

  • Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))


Policy Evaluation with a System of Equations

  • Constructing the system of equations

    • Vπ(s) = 0 if s ∈ G

    • Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)]

  • |S| variables

  • O(|S|³) running time
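
For instance, the cyclic example above reduces to a 3-by-3 linear system once Vπ(sg) = 0 is substituted in; a sketch with NumPy (the state ordering [s0, s1, s2] is assumed):

import numpy as np

# Rearranged from the slide: Vπ(s0) - 0.6 Vπ(s1) - 0.4 Vπ(s2) = 3.8
#                            Vπ(s1)                           = 1
#                            Vπ(s2) - 0.3 Vπ(s0)              = 3.7
A = np.array([[1.0, -0.6, -0.4],
              [0.0,  1.0,  0.0],
              [-0.3, 0.0,  1.0]])
b = np.array([3.8, 1.0, 3.7])
print(np.linalg.solve(A, b))  # ≈ [6.68, 1.00, 5.70]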


Iterative Evaluation of Cyclic SSPs


Policy Evaluation with Iteration

  • Vnπ(s) ← ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vn-1π(s′)]

    • Iterative solution

  • Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)]

    • System of equations, for comparison


Convergence and Optimality

  • For a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e. limn→∞ Vnπ = Vπ, irrespective of the initialization V0π


Termination and Error Bounds

  • Residual: the magnitude of the change in the value of state s at iteration n in the algorithm.

    • residualn(s) = | Vnπ(s) – Vn-1π(s)|

    • residualn = maxs ∈ S residualn(s)

  • ϵ-consistency: the residual of the value function at iteration n + 1 is less than ϵ

    • A state s is ϵ-consistent if residualn(s) < ϵ, i.e. the value function has (nearly) stopped changing at s

  • When iterative policy evaluation is run to ϵ-consistency, Vnπ satisfies the following inequality: ∀s ∈ S: | Vnπ(s) – Vπ(s)| < ϵ Nπ(s), where Nπ(s) is the expected number of steps to reach the goal from s by following π


Policy Evaluation with Iteration: Algorithm

Each iteration is O(|S|²) time
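
The algorithm figure itself is not reproduced in this transcript, but a sketch of iterative policy evaluation with the ϵ-consistency stopping rule might look as follows (dictionary representation assumed; names are illustrative):

def iterative_policy_evaluation(S, T, C, G, pi, eps=1e-6, V0=None):
    """Repeat Vn(s) <- sum_s2 T(s, pi(s), s2) [C(s, pi(s), s2) + Vn-1(s2)] for every
    non-goal state until the largest residual |Vn(s) - Vn-1(s)| drops below eps.
    Each sweep touches every (s, s2) pair, hence O(|S|^2) time per iteration."""
    V = dict(V0) if V0 is not None else {s: 0.0 for s in S}
    for g in G:
        V[g] = 0.0                         # goal states are absorbing and cost-free
    while True:
        V_new = dict(V)
        for s in S:
            if s in G:
                continue
            V_new[s] = sum(p * (C(s, pi[s], s2) + V[s2])
                           for s2, p in T[(s, pi[s])].items())
        residual = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if residual < eps:
            return V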


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Summary

  • General components of an MDP

  • Three types of MDPs (FH, IHDR, SSP)

  • Utility and cost (ELAU)

  • Ways to solve MDPs (system of equations, iterative evaluation)

