MDP Problems and Exact Solutions I

  • Ryan Christiansen

  • Department of Mechanical Engineering and Materials Science

  • Rice University

  • Slides adapted from Mausam and Andrey Kolobov


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Definition

  • MDP: an MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps)

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function
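
To make the definition concrete, here is a minimal Python sketch of one way the tuple <S, A, D, T, R> might be encoded in a flat representation. The class and field names are our own illustration, not notation from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = int     # states and actions are just labels; integers suffice for a flat MDP
Action = int

@dataclass
class MDP:
    states: Sequence[State]                          # S, a finite state space
    actions: Sequence[Action]                        # A, a finite action set
    horizon: int                                     # |D|, decision epochs 1, ..., horizon
    T: Callable[[State, Action, State, int], float]  # T(s, a, s', t) -> transition probability
    R: Callable[[State, Action, State, int], float]  # R(s, a, s', t) -> reward
```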


An MDP Problem

  • How does an MDP problem work?

    • Initial Conditions: the starting state

    • Actions: actions are chosen at each decision epoch to traverse the MDP

    • Termination: reach a terminating state or the final decision epoch

  • The goal is to end up with the highest net reward at termination


The Policy

  • Policy: a rule for choosing actions

    • Global/Complete: a policy must be applicable everywhere in the MDP, i.e., it must prescribe an action choice in every situation the agent can encounter

  • In general, policies will be

    • Probabilistic: able to choose between multiple actions randomly

    • History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed

    • π: H x A → [0, 1]


Markovian Policy

  • Markovian Policy: a history-dependent policy that only depends on the current state and time step

    • For any two histories h_{s,t} and h′_{s,t}, both of which end at the same state s and timestep t, and for any action a, a Markovian policy π satisfies π(h_{s,t}, a) = π(h′_{s,t}, a)

    • In practice, it functions as a history-independent policy

    • π: S x D x A → [0, 1]

  • For several important types of MDPs, at least one optimal solution is necessarily Markovian


Stationary Markovian Policy

  • Stationary Markovian Policy: a Markovian policy that does not depend on time

    • For any two timesteps t_1 and t_2, any state s, and any action a, a stationary Markovian policy π satisfies π(s, t_1, a) = π(s, t_2, a)

    • π: S x A → [0, 1]
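
The three signatures above can be written down as Python type aliases, which makes the differences easy to compare. This is only an illustrative sketch; the type names are ours.

```python
from typing import Callable, Sequence, Tuple

State, Action, TimeStep = int, int, int
# An execution history: the (state, action) pairs traversed so far.
History = Sequence[Tuple[State, Action]]

# π: H x A → [0, 1]   (probabilistic, history-dependent)
HistoryDependentPolicy = Callable[[History, Action], float]

# π: S x D x A → [0, 1]   (Markovian: depends only on current state and time step)
MarkovianPolicy = Callable[[State, TimeStep, Action], float]

# π: S x A → [0, 1]   (stationary Markovian: time-independent)
StationaryMarkovianPolicy = Callable[[State, Action], float]
```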


Evaluate a Policy with the Value Function

  • Value Function: a function that maps the domain of the policy, excluding the action set, to a scalar value

    • History dependent: V: H → [-∞, ∞]

    • Markovian: V: S x D → [-∞, ∞]

    • Stationary Markovian: V: S → [-∞, ∞]

  • Value Function of a Policy: the utility of the reward sequence obtained by executing the policy, i.e., the total utility of the accumulated rewards

    • Vπ(h_{s,t}) = u(R_t, R_{t+1}, …), where R_t, R_{t+1}, … is the (random) reward sequence obtained by executing π starting from history h_{s,t}


Solutions of an MDP

  • A solution to an MDP is an optimal policy, or a policy that maximizes utility.

  • Policy π* is optimal if the value function, V*, is greater than or equal to the value function of any other policy.

    • V*(h) ≥ Vπ(h) for all h and π

  • Need to be careful when defining utility u(R1, R2, …)

    • For the same h, utility can be different across policy executions

  • Existence and uniqueness are not guaranteed for many types of MDPs.


Expected Linear Additive Utility (ELAU)

  • u(R_1, R_2, …) = E[R_1 + γR_2 + γ²R_3 + …], where γ is the discount factor

  • Assume γ = 1 unless stated otherwise

    • 0 ≤ γ < 1 : more immediate rewards are more valuable

    • γ = 1 : rewards are equally valuable, independently of time

    • γ > 1 : more distant rewards are more valuable
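
As a small illustration of the formula above, the snippet below computes the linear additive utility of a single sampled reward sequence; ELAU is the expectation of this quantity over executions of a policy. The function name and example numbers are ours.

```python
def linear_additive_utility(rewards, gamma=1.0):
    """u(R1, R2, ...) for one sampled reward sequence; ELAU averages this over executions."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# The same rewards are worth less under a smaller discount factor:
print(linear_additive_utility([1, 1, 1], gamma=1.0))   # 3
print(linear_additive_utility([1, 1, 1], gamma=0.5))   # 1.75
```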


The Optimality Principle

  • The Optimality Principle: if every policy’s quality can be measured by its ELAU, then there exists a policy that is optimal at every timestep

  • There are some situations where it may not apply:

    • When stuck in a repeating sequence of states (a loop)

    • Infinite decision epochs

    • Infinite utility


The Optimality Principle does not Hold

  • Oscillating Utility (e.g., a reward sequence such as +1, −1, +1, −1, … whose partial sums never settle)

  • Unbounded Utility (e.g., a positive-reward loop that can be traversed forever, so the total grows without bound)


Further Utility Considerations

  • Risk averse, or risk taking (ELAU is risk neutral)

    • $1 million guaranteed (risk averse)

    • 50% chance of $2 million, 50% chance of $0 (risk taking)

    • The expected value is the same, so a risk-neutral agent would choose either


3 Models with Well-Defined Policy ELAU

  • Finite-horizon MDPs

  • Infinite-horizon discounted-reward MDPs

  • Stochastic shortest-path MDPs


Finite-Horizon MDPs: Motivation

  • Assume the agent acts for a finite # of time steps, L

  • Example applications:

    • Inventory management: “How much X to order from the supplier every day ’til the end of the season?”

    • Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”


Finite-Horizon MDPs: Definition

Puterman, 1994

  • FH MDP: an FH MDP is a tuple <S, A, D, T, R>

    • S is a finite state space

    • A is a finite action set

    • D is a sequence of discrete decision epochs (time steps) up to a finite horizon L

    • T: S x A x S x D → [0, 1] is a transition function (probability)

    • R: S x A x S x D → ℝ is a reward function


Finite-Horizon MDPs: Optimality Principle

  • For an FH MDP with horizon |D| = L < ∞, let:

    • Vπ(h_{s,t}) = Eπ_{h_{s,t}}[R_1 + … + R_{L–t}] for all 1 ≤ t ≤ L

    • Vπ(h_{s,L+1}) = 0

  • Then

    • V* exists and is Markovian, π* exists and is deterministic Markovian

    • For all s and 1 ≤ t ≤ L:
      V*(s, t) = max_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ]
      π*(s, t) = argmax_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ]


  • Why these values are well-defined: each E[R_i] is finite and the number of terms in the series is finite, so for every history the value (ELAU) of every policy is well-defined.

  • Reading the equations: the R(s, a, s′, t) term is the immediate utility of the next action (in expectation over s′), V*(s′, t+1) is the highest utility derivable from the next state if you act optimally from now on, and V*(s, t) is the highest utility derivable from s at time t.


Perks of the FH MDP Optimality Principle

  • If V* and π* are Markovian, then we only need to consider Markovian V and π

  • Easy to compute π*

    • For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
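
The last bullet above is backward induction. A minimal sketch, assuming the flat finite-horizon representation from the definition slide; the function name is ours and T, R are callables supplied by the caller:

```python
def finite_horizon_solve(S, A, L, T, R):
    """Backward induction for an FH MDP: returns V[(s, t)] and a deterministic
    Markovian policy pi[(s, t)], computed for t = L, ..., 1."""
    V = {(s, L + 1): 0.0 for s in S}        # boundary condition: V*(s, L+1) = 0
    pi = {}
    for t in range(L, 0, -1):               # t = L, ..., 1
        for s in S:
            # Q(s, a): expected reward of taking a now, then acting optimally afterwards
            q = {a: sum(T(s, a, s2, t) * (R(s, a, s2, t) + V[(s2, t + 1)]) for s2 in S)
                 for a in A}
            best = max(q, key=q.get)
            V[(s, t)], pi[(s, t)] = q[best], best
    return V, pi
```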


To Infinity and Beyond

  • Why go beyond the finite horizon?

    • Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)

  • Infinite Horizon

    • Known to be infinite (can continue indefinitely)

  • Indefinite Horizon

    • Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)


Analyzing MDPs with In(de)finite Horizon

  • Due to the infinite nature of D, we must define stationary, or time independent functions:

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • π: S → A (this is also Markovian)

    • V: S → [-∞, ∞] (this is also Markovian)


Infinite-Horizon Discounted-Reward MDPs: Definition

  • IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a transition function (probability)

    • R: S x A x S → ℝ is a reward function

    • γ is a discount factor between 0 and 1 (favors immediate rewards)

  • Policy value = discounted ELAU over infinite time steps


Infinite-Horizon Discounted-Reward MDPs: Optimality Principle

  • For an IHDR MDP, let:

    • Vπ(h) = Eπ_h[R_1 + γR_2 + γ²R_3 + …] for all h

  • Then

    • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

    • For all s:
      V*(s) = max_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′) [ R(s, a, s′) + γV*(s′) ]
      π*(s) = argmax_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′) [ R(s, a, s′) + γV*(s′) ]


  • Why these values are well-defined: since 0 ≤ γ < 1 and each E[R_i] is bounded by some finite K, the discounted terms γ^{i–1} E[R_i] converge geometrically, so for every history the value of a policy is well-defined.

  • Reading the equations: future utility is discounted by γ, and the optimal utility V*(s) is time-independent.
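
The bound behind that annotation: if every |R_t| ≤ K, the discounted reward still to be collected after the first n steps is at most γⁿ·K/(1 − γ). A tiny numerical check, with arbitrary example constants of our choosing:

```python
gamma, K = 0.9, 10.0         # arbitrary example values: discount factor and reward bound

def tail_bound(n):
    """Upper bound on the discounted reward collected after the first n steps."""
    return gamma ** n * K / (1.0 - gamma)

for n in (10, 50, 100):
    print(n, tail_bound(n))  # shrinks geometrically: ~34.9, ~0.52, ~0.0027
```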

Perks of the IHDR MDP Optimality Principle

  • If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π


The Meaning of γ

  • γ can affect the optimal policy significantly

    • γ = 0 + ε: yields myopic policies for impatient agents

    • γ = 1 - ε: yields far-sighted policies, inefficient to compute

  • How to set it?

    • Sometimes suggested by data (inflation rate, interest rate, tax rate)

    • Often set arbitrarily to a value that gives a reasonable policy


Stochastic Shortest-Path MDPs: Motivation

  • Assume the agent pays a cost to achieve a goal

  • Example applications:

    • Controlling a Mars rover: “How to collect scientific data without damaging the rover?”

    • Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”

  • Cost is often time or a physical resource


Stochastic Shortest-Path MDPs: Definition

  • SSP MDP: an SSP MDP is a tuple <S, A, T, C, G>

    • S is a finite state space

    • A is a finite action set

    • T: S x A x S → [0, 1] is a stationary transition function (probability)

    • C: S x A x S → ℝ is a stationary cost function (R = -C)

    • G ⊆ S is a set of absorbing cost-free goal states

  • Under two conditions:

    • There is at least one proper policy (reaches goal with P = 1 from all states)

    • Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1
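
The first condition can be checked for a given stationary deterministic policy with a simple reachability argument: in a finite Markov chain whose goal states are absorbing, the goal set is reached with probability 1 from every state exactly when every state has a positive-probability path to the goal under the policy. A sketch, with T and pi as hypothetical flat-representation inputs:

```python
def is_proper(S, G, T, pi):
    """True iff executing the stationary deterministic policy pi reaches a goal
    state in G with probability 1 from every state in S."""
    # Positive-probability successors of each non-goal state under pi.
    succ = {s: ([] if s in G else [s2 for s2 in S if T(s, pi[s], s2) > 0]) for s in S}

    # Backward reachability from the goal set: which states can reach G under pi?
    can_reach = set(G)
    changed = True
    while changed:
        changed = False
        for s in S:
            if s not in can_reach and any(s2 in can_reach for s2 in succ[s]):
                can_reach.add(s)
                changed = True

    # Proper from all states iff every state can reach the goal set.
    return all(s in can_reach for s in S)
```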


SSP MDP Details

  • In SSP, maximizing ELAU = minimizing expected cost

  • Every cost-minimizing policy is proper

  • Thus, an optimal policy is the cheapest way to reach the goal

  • Why are SSP MDPs called “indefinite-horizon?”

    • If a policy is optimal, it will take a finite, but a priori unknown, time to reach the goal (the time to reach the goal depends on the stochastic outcomes of executing the policy).

    • At the limit as t approaches infinity, the probability that a goal state has been reached approaches P = 1


SSP MDP Example

  • (A worked example was presented as a sequence of figures on the original slides; the figures are not reproduced in this transcript.)

SSP MDPs: Optimality Principle

  • For an SSP MDP, let:

    • Vπ(h) = Eπ_h[C_1 + C_2 + …] for all h

  • Then

    • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian

    • For all s:
      V*(s) = min_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′) [ C(s, a, s′) + V*(s′) ]
      π*(s) = argmin_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′) [ C(s, a, s′) + V*(s′) ]
      (min/argmin because C is a cost; with rewards R = -C these become max/argmax)


  • Why these values are well-defined: every policy either takes a finite expected number of steps to reach a goal or has infinite cost, so for every history the value of a policy is well-defined.
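
A one-step backup for these (cost-minimizing) equations, as a sketch only; T and C are callables and V is a dictionary of current value estimates supplied by the caller:

```python
def ssp_backup(s, S, A, G, T, C, V):
    """One Bellman backup at state s of an SSP MDP; costs are minimized.
    Returns the backed-up value and the greedy action (None at a goal state)."""
    if s in G:
        return 0.0, None                         # goal states are absorbing and cost-free
    q = {a: sum(T(s, a, s2) * (C(s, a, s2) + V[s2]) for s2 in S) for a in A}
    best = min(q, key=q.get)
    return q[best], best
```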

The MDP Hierarchy

  • FH => SSP: turn all states (s, L) into goals

  • IHDR => SSP: add (1 – γ)-probability transitions to goal

  • Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.
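
One common way to realize the IHDR ⇒ SSP reduction mentioned above: scale every transition by γ, send the remaining 1 − γ probability to a fresh absorbing goal, and charge (as a cost) the negated expected immediate reward of each (s, a) on every outgoing transition, so the expected total SSP cost equals −V. This is a sketch under those assumptions, and GOAL is a name we introduce.

```python
GOAL = "goal"   # a fresh absorbing goal state (name is ours)

def ihdr_to_ssp(S, A, T, R, gamma):
    """Reduce an IHDR MDP <S, A, T, R, gamma> to an SSP MDP <S + {GOAL}, A, T2, C, {GOAL}>."""
    # Expected immediate reward of each (s, a), folded over next states.
    exp_r = {(s, a): sum(T(s, a, s2) * R(s, a, s2) for s2 in S) for s in S for a in A}

    def T2(s, a, s2):
        if s == GOAL:
            return 1.0 if s2 == GOAL else 0.0    # goal is absorbing
        if s2 == GOAL:
            return 1.0 - gamma                   # the added (1 - gamma) transition to goal
        return gamma * T(s, a, s2)               # continue as before with probability gamma

    def C(s, a, s2):
        return 0.0 if s == GOAL else -exp_r[(s, a)]

    return list(S) + [GOAL], A, T2, C, {GOAL}
```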


Flat vs. Factored Representation of MDPs

  • We are only concerned with using flat representation

    • This is the name for the representation already introduced on the definition slides

    • It is easier to solve and reason about MDPs in flat representation; factored representation exists to describe large MDPs compactly

  • If you are interested in factored representation, read Section 2.5


Computational Complexity of MDPs

  • Solving IHDR, SSP in flat representation is P-complete

  • Solving FH in flat representation is P-hard

  • They don’t benefit from parallelization, but are solvable in polynomial time


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Brute-Force Algorithm

  • Go over all policies π

    • How many? |A|^|S|, a finite number

  • Evaluate each policy

    • Vπ(s), the expected cost of reaching the goal from s

  • Choose the best, π*

    • The SSP optimality principle guarantees that a best policy exists

    • Vπ*(s) ≤ Vπ(s)
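
A deliberately naive sketch of the enumeration just described, assuming a policy-evaluation routine (here called evaluate_policy, our name) like the ones developed on the following slides:

```python
from itertools import product

def brute_force_solve(S, A, evaluate_policy):
    """Enumerate all |A|**|S| deterministic stationary policies and keep the best.
    evaluate_policy(pi) is assumed to return {s: V_pi(s)}, the expected cost-to-goal."""
    best_pi, best_V = None, None
    for choice in product(A, repeat=len(S)):     # one action choice per state
        pi = dict(zip(S, choice))
        V = evaluate_policy(pi)
        # The SSP optimality principle guarantees some policy dominates all others
        # state by state, so it will eventually pass this test and be retained.
        if best_V is None or all(V[s] <= best_V[s] for s in S):
            best_pi, best_V = pi, V
    return best_pi, best_V
```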


Policy Evaluation

  • Given a policy π, compute Vπ

  • To start out, assume that π is proper

    • Execution of π reaches a goal from any state


Deterministic SSPs

  • Policy graph for π

    • π(s0) = a0; π(s1) = a1

  • Vπ(s1) = 1

  • Vπ(s0) = 5 + 1 = 6


Acyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 4

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6


Cyclic SSPs

  • Policy graph for π

  • Vπ(s1) = 1

  • Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))


Cyclic SSPs

  • Generalized system of equations

  • Vπ(sg) = 0

  • Vπ(s1) = 1 + Vπ(sg)

  • Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))

  • Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))


Policy Evaluation with a System of Equations

  • Constructing the system of equations

    • Vπ(s) = 0 if s ∈ G

    • Vπ(s) = Σ_{s′ ∈ S} T(s, π(s), s′) [ C(s, π(s), s′) + Vπ(s′) ] otherwise

  • |S| variables

  • O(|S|³) running time
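
A sketch of the linear-system approach applied to the cyclic example above, with V(sg) = 0 substituted directly; the NumPy encoding below is our own.

```python
import numpy as np

# Unknowns: x = [V(s0), V(s1), V(s2)].
#   V(s0) = 0.6*(5 + V(s1)) + 0.4*(2 + V(s2))
#   V(s1) = 1 + V(sg) = 1
#   V(s2) = 0.7*(4 + V(sg)) + 0.3*(3 + V(s0))
A = np.array([
    [ 1.0, -0.6, -0.4],   # V(s0) - 0.6 V(s1) - 0.4 V(s2) = 0.6*5 + 0.4*2
    [ 0.0,  1.0,  0.0],   # V(s1)                          = 1
    [-0.3,  0.0,  1.0],   # V(s2) - 0.3 V(s0)              = 0.7*4 + 0.3*3
])
b = np.array([0.6 * 5 + 0.4 * 2, 1.0, 0.7 * 4 + 0.3 * 3])

V = np.linalg.solve(A, b)                         # O(|S|^3) in general
print(dict(zip(["s0", "s1", "s2"], V.round(3))))  # {'s0': 6.682, 's1': 1.0, 's2': 5.705}
```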



Policy Evaluation with Iteration

  • Vπ_n(s) ← Σ_{s′ ∈ S} T(s, π(s), s′) [ C(s, π(s), s′) + Vπ_{n–1}(s′) ]

    • Iterative solution

  • Vπ(s) = Σ_{s′ ∈ S} T(s, π(s), s′) [ C(s, π(s), s′) + Vπ(s′) ]

    • System of equations, for comparison


Convergence and Optimality

  • For a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e., lim_{n→∞} Vπ_n = Vπ, irrespective of the initialization Vπ_0


Termination and Error Bounds

  • Residual: the magnitude of the change in the value of state s at iteration n in the algorithm.

    • residual_n(s) = |Vπ_n(s) – Vπ_{n–1}(s)|

    • residual_n = max_{s ∈ S} residual_n(s)

  • ϵ-consistency: the residual of the value function in iteration n + 1 is less than ϵ

    • A state s is ϵ-consistent if residual_n(s) < ϵ, i.e., the value function is ϵ-consistent at s

  • When iterative policy evaluation is run to ϵ-consistency, Vπ_n satisfies the following inequality: ∀s ∈ S: |Vπ_n(s) – Vπ(s)| < ϵ Nπ(s), where Nπ(s) is the expected number of steps to reach the goal from s by following π


Policy Evaluation with Iteration: Algorithm

Each iteration is O(|S|²) time
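
The algorithm itself appeared as a figure on the slide; below is a sketch of iterative policy evaluation run to ϵ-consistency, matching the update and residual definitions above (S, G, T, C, pi are flat-representation inputs supplied by the caller).

```python
def iterative_policy_evaluation(S, G, T, C, pi, eps=1e-6, V0=None):
    """Evaluate a proper stationary policy pi on an SSP MDP by repeated backups.
    Stops when the largest residual |V_n(s) - V_{n-1}(s)| drops below eps."""
    V = {s: 0.0 if (s in G or V0 is None) else V0[s] for s in S}
    while True:
        V_new, residual = {}, 0.0
        for s in S:
            if s in G:
                V_new[s] = 0.0                   # goal states are cost-free
                continue
            # Each sweep touches every (s, s') pair: O(|S|^2) time per iteration.
            V_new[s] = sum(T(s, pi[s], s2) * (C(s, pi[s], s2) + V[s2]) for s2 in S)
            residual = max(residual, abs(V_new[s] - V[s]))
        V = V_new
        if residual < eps:                       # eps-consistency reached
            return V
```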


MDP Problems: At a Glance

  • MDP Definition (2.1)

  • Solutions of an MDP (2.2)

  • Solution Existence (2.3)

  • Stochastic Shortest-Path MDPs (2.4)

  • Complexity of Solving MDPs (2.6)


MDP Exact Solutions I: At a Glance

  • Brute-Force Algorithm (3.1)

  • Policy Evaluation (3.2)


Summary

  • General components of an MDP

  • Three types of MDPs (FH, IHDR, SSP)

  • Utility and cost (ELAU)

  • Ways to solve MDPs (system of equations, iterative policy evaluation)