MDP Problems and Exact Solutions I

- Ryan Christiansen
- Department of Mechanical Engineering and Materials Science
- Rice University
- Slides adapted from Mausam and Andrey Kolobov

- MDP Definition (2.1)
- Solutions of an MDP (2.2)
- Solution Existence (2.3)
- Stochastic Shortest-Path MDPs (2.4)
- Complexity of Solving MDPs (2.6)

- MDP: an MDP is a tuple <S, A, D, T, R>
- S is a finite state space
- A is a finite action set
- D is a sequence of discrete decision epochs (time steps)
- T: S x A x S x D → [0, 1] is a transition function (probability)
- R: S x A x S x D → ℝ is a reward function

- How does an MDP problem work?
- Initial Conditions: the starting state
- Actions: actions are chosen at each decision epoch to traverse the MDP
- Termination: reach a terminating state or the final decision epoch

- The goal is to end up with the highest net reward at termination

- Policy: a rule for choosing actions
- Global/Complete: a policy must always be applicable for the entire MDP

- In general, policies will be
- Probabilistic: able to choose between multiple actions randomly
- History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed
- π: H x A → [0, 1]

- Markovian Policy: a history-dependent policy that only depends on the current state and time step
- For any two histories $h_{s,t}$ and $h'_{s,t}$, both of which end at the same state s and timestep t, and for any action a, a Markovian policy π will satisfy $\pi(h_{s,t}, a) = \pi(h'_{s,t}, a)$
- In practice, it functions as a history-independent policy
- π: S x D x A → [0, 1]

- For several important types of MDPs, at least one optimal solution is necessarily Markovian

- Stationary Markovian Policy: a Markovian policy that does not depend on time
- For any two timesteps $t_1$ and $t_2$ and state s, and for any action a, a stationary Markovian policy π will satisfy $\pi(s, t_1, a) = \pi(s, t_2, a)$
- π: S x A → [0, 1]

- Value Function: a function mapping the domain of the policy excluding the action set to a scalar value.
- History dependent: V: H → [-∞, ∞]
- Markovian: V: S x D → [-∞, ∞]
- Stationary Markovian: V: S → [-∞, ∞]

- Value Function of a Policy: the utility of the reward sequence obtained by executing the policy from a given history
- $V^\pi(h_{s,t}) = u(R_t^{\pi_{h_{s,t}}}, R_{t+1}^{\pi_{h_{s,t}}}, \dots)$

- A solution to an MDP is an optimal policy, or a policy that maximizes utility.
- Policy π* is optimal if the value function, V*, is greater than or equal to the value function of any other policy.
- V*(h) ≥ Vπ(h) for all h and π

- Need to be careful when defining utility u(R1, R2, …)
- For the same h, utility can be different across policy executions

- Existence and uniqueness are not guaranteed for many types of MDPs.

- Expected linear additive utility (ELAU): $u(R_1, R_2, \dots) = E[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots]$, where γ is the discount factor
- Assume γ = 1 unless stated otherwise
- 0 ≤ γ < 1 : more immediate rewards are more valuable
- γ = 1 : rewards are equally valuable, independently of time
- γ > 1 : more distant rewards are more valuable
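
The effect of γ on the sum inside the expectation can be sketched in a few lines of Python. This is a minimal illustration for a single observed reward sequence (ELAU is the expectation of this quantity over all sequences); the function name `discounted_return` and the sample rewards are assumptions made for the example.

```python
def discounted_return(rewards, gamma=1.0):
    """Return R1 + gamma*R2 + gamma^2*R3 + ... for one reward sequence."""
    total = 0.0
    weight = 1.0  # gamma**0, the weight of the first reward
    for r in rewards:
        total += weight * r
        weight *= gamma  # each later reward is weighted by one more factor of gamma
    return total


rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical reward sequence
undiscounted = discounted_return(rewards, gamma=1.0)  # all rewards count equally
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5 + 0.25 + 0.125
```

With γ = 0.5 the last reward contributes only one eighth of the first, which is the sense in which more immediate rewards are more valuable.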

- The Optimality Principle: if the quality of every policy can be measured by its ELAU, there exists a policy that is optimal at every timestep
- There are some situations where it may not apply:
- When stuck in a repeating sequence of states (a loop)
- Infinite decision epochs
- Infinite utility

- Oscillating Utility
- Unbounded Utility

- Risk averse, or risk taking (ELAU is risk neutral)
- $1 million guaranteed (risk averse)
- 50% chance of $2 million, 50% chance of $0 (risk taking)
- Expected value is the same ($1 million), so a risk-neutral agent is indifferent between the two

- Finite-horizon MDPs
- Infinite-horizon discounted-reward MDPs
- Stochastic shortest-path MDPs

- Assume the agent acts for a finite # of time steps, L
- Example applications:
- Inventory management: “How much X to order from the supplier every day ‘til the end of the season?”
- Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”

Puterman, 1994

- FH MDP: an FH MDP is a tuple <S, A, D, T, R>
- S is a finite state space
- A is a finite action set
- D is a sequence of discrete decision epochs (time steps) up to a finite horizon L
- T: S x A x S x D → [0, 1] is a transition function (probability)
- R: S x A x S x D → ℝ is a reward function

- For an FH MDP with horizon |D| = L < ∞, let:
- $V^\pi(h_{s,t}) = E^\pi_{h_{s,t}}[R_1 + \dots + R_{L-t}]$ for all 1 ≤ t ≤ L
- $V^\pi(h_{s,L+1}) = 0$

- Then
- V* exists and is Markovian, π* exists and is deterministic Markovian
- For all s and 1 ≤ t ≤ L:
- $V^*(s,t) = \max_{a \in A} \sum_{s' \in S} T(s, a, s', t)\,[R(s, a, s', t) + V^*(s', t+1)]$
- $\pi^*(s,t) = \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s', t)\,[R(s, a, s', t) + V^*(s', t+1)]$

- Reading the theorem:
- Under ELAU, the value of every policy is well-defined for every history, because each E[R_i] is finite and the number of terms in the series is finite
- V*(s, t): the highest utility derivable from s at time t
- max over a ∈ A: if you act optimally now
- Sum over s′ weighted by T(s, a, s′, t): in expectation
- R(s, a, s′, t): the immediate utility of the next action
- V*(s′, t+1): the highest utility derivable from the next state

- If V* and π* are Markovian, then we only need to consider Markovian V and π
- Easy to compute π*
- For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
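
The backward sweep from t = L down to t = 1 can be sketched as follows. This is a minimal illustration, not a reference implementation: the callable interface `T(s, a, s2, t)` / `R(s, a, s2, t)` and the two-state toy problem are assumptions made for the example.

```python
def backward_induction(states, actions, T, R, L):
    """Compute V*(s, t) and pi*(s, t) for t = L, ..., 1 in a finite-horizon MDP."""
    V = {(s, L + 1): 0.0 for s in states}  # boundary condition V*(s, L+1) = 0
    pi = {}
    for t in range(L, 0, -1):  # sweep backwards in time
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                # Expected immediate reward plus optimal value of the successor
                q = sum(T(s, a, s2, t) * (R(s, a, s2, t) + V[(s2, t + 1)])
                        for s2 in states)
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q
            pi[(s, t)] = best_a
    return V, pi


# Hypothetical toy problem: "stay" keeps the state, "move" flips it,
# and landing in "high" pays a reward of 1.
states = ["low", "high"]
actions = ["stay", "move"]


def T(s, a, s2, t):
    target = s if a == "stay" else ("high" if s == "low" else "low")
    return 1.0 if s2 == target else 0.0


def R(s, a, s2, t):
    return 1.0 if s2 == "high" else 0.0


V, pi = backward_induction(states, actions, T, R, L=3)
```

Each (s, t) pair is visited once, and each visit looks at every action and successor, so the sweep costs on the order of L·|S|²·|A| operations in this dense form.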

- Why go beyond the finite horizon?
- Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)

- Infinite Horizon
- Known to be infinite (can continue indefinitely)

- Indefinite Horizon
- Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)

- Due to the infinite nature of D, we must define stationary, or time independent functions:
- T: S x A x S → [0, 1] is a transition function (probability)
- R: S x A x S → ℝ is a reward function
- π: S → A (this is also Markovian)
- V: S → [-∞, ∞] (this is also Markovian)

- IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ>
- S is a finite state space
- A is a finite action set
- T: S x A x S → [0, 1] is a transition function (probability)
- R: S x A x S → ℝ is a reward function
- γ is a discount factor with 0 ≤ γ < 1 (favors immediate rewards)

- Policy value = discounted ELAU over infinite time steps

- For an IHDR MDP, let:
- $V^\pi(h) = E^\pi_h[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots]$ for all h

- Then
- V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian
- For all s:
- $V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
- $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
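
The infinite sum in the definition of the policy value stays finite because of the discount factor. As a short worked bound, assume every reward satisfies |R_i| ≤ K for some finite K (the symbol K is introduced here for illustration; with finite S and A and a stationary R such a bound exists):

```latex
\left| V^\pi(h) \right|
  \;\le\; \sum_{i=1}^{\infty} \gamma^{\,i-1} \, E\!\left[\,|R_i|\,\right]
  \;\le\; \sum_{i=1}^{\infty} \gamma^{\,i-1} K
  \;=\; \frac{K}{1-\gamma},
  \qquad 0 \le \gamma < 1 .
```

The bound grows without limit as γ approaches 1, which is why the undiscounted infinite-horizon case needs separate treatment.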

- Reading the theorem:
- Under ELAU, the value of a policy is well-defined for every history thanks to 0 ≤ γ < 1
- All E[R_i] are bounded by some finite K, so the discounted terms shrink geometrically and the series converges
- γV*(s′): future utility is discounted
- V*(s): optimal utility is time independent
- If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π

- γ can affect the optimal policy significantly
- γ = 0 + ε: yields myopic policies for impatient agents
- γ = 1 - ε: yields far-sighted policies, inefficient to compute

- How to set it?
- Sometimes suggested by data (inflation rate, interest rate, tax rate)
- Often set arbitrarily to a value that gives a reasonable policy

- Assume the agent pays a cost to achieve a goal
- Example applications:
- Controlling a Mars rover: “How to collect scientific data without damaging the rover?”
- Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”

- Cost is often time or a physical resource

- SSP MDP: an SSP MDP is a tuple <S, A, T, C, G>
- S is a finite state space
- A is a finite action set
- T: S x A x S → [0, 1] is a stationary transition function (probability)
- C: S x A x S → ℝ is a stationary cost function (R = -C)
- G ⊆ S is a set of absorbing cost-free goal states

- Under two conditions:
- There is at least one proper policy (reaches goal with P = 1 from all states)
- Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1

- In SSP, maximizing ELAU = minimizing expected cost
- Every cost-minimizing policy is proper
- Thus, an optimal policy is the cheapest way to reach the goal
- Why are SSP MDPs called “indefinite-horizon?”
- If a policy is optimal, it will take a finite, but a priori unknown time to reach the goal (time to goal is dependent on the evaluation of the policy).
- At the limit as t approaches infinity, the probability that a goal state has been reached approaches P = 1

- For an SSP MDP, let:
- $V^\pi(h) = E^\pi_h[C_1 + C_2 + \dots]$ for all h

- Then
- V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian
- For all s (V and π are stated in terms of cost, so the optimum is a minimum):
- $V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s, a, s')\,[C(s, a, s') + V^*(s')]$
- $\pi^*(s) = \arg\min_{a \in A} \sum_{s' \in S} T(s, a, s')\,[C(s, a, s') + V^*(s')]$

- Reading the theorem:
- Under ELAU, the value of a policy is well-defined for every history
- Every policy either takes a finite expected number of steps to reach a goal, or has infinite cost

- FH => SSP: turn all states (s, L) into goals
- IHDR => SSP: add (1 – γ)-probability transitions to goal
- Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.
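
The IHDR => SSP reduction can be checked numerically on a small example. The sketch below works on the Markov chain induced by one fixed policy to stay short. Two things are assumptions rather than statements from the slide: the cost scaling C = -R/γ (one choice that makes the expected values line up, valid for γ > 0) and all names and numbers in the toy chain.

```python
def ihdr_chain_to_ssp(T, R, gamma, goal="goal"):
    """Scale transitions by gamma and send the remaining 1 - gamma to a goal."""
    T_ssp, C_ssp = {}, {}
    for s in T:
        T_ssp[s] = {s2: gamma * p for s2, p in T[s].items()}
        T_ssp[s][goal] = 1.0 - gamma  # added transition to the goal
        # Assumed cost scaling: dividing by gamma cancels the scaled probability
        C_ssp[s] = {s2: -R[s][s2] / gamma for s2 in T[s]}
        C_ssp[s][goal] = 0.0
    return T_ssp, C_ssp


def evaluate_chain(T, payoff, discount, sweeps=500):
    """Repeated backups V(s) <- sum_s2 T * (payoff + discount * V(s2))."""
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        # States missing from V (the goal) are treated as having value 0
        V = {s: sum(p * (payoff[s][s2] + discount * V.get(s2, 0.0))
                    for s2, p in T[s].items())
             for s in T}
    return V


# Hypothetical two-state chain under some fixed policy
T = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 1.0}}
R = {"a": {"a": 1.0, "b": 0.0}, "b": {"a": 2.0}}
gamma = 0.9

V_ihdr = evaluate_chain(T, R, discount=gamma)
T_ssp, C_ssp = ihdr_chain_to_ssp(T, R, gamma)
V_ssp = evaluate_chain(T_ssp, C_ssp, discount=1.0)
```

Under these assumptions the expected cost in the SSP equals the negated discounted reward in the original chain, so minimizing cost in one corresponds to maximizing reward in the other.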

- We are only concerned with using flat representation
- This is the name for the representation already introduced on the definition slides
- It is easier to solve MDPs in flat representation, but it is much easier to describe larger MDPs in factored representation

- If you are interested in factored representation, read Section 2.5

- Solving IHDR, SSP in flat representation is P-complete
- Solving FH in flat representation is P-hard
- They are solvable in polynomial time, but are not expected to benefit much from parallelization

- Brute-Force Algorithm (3.1)
- Policy Evaluation (3.2)

- Go over all policies π
- How many? $|A|^{|S|}$ deterministic stationary policies, a finite amount

- Evaluate each policy
- Vπ(s), the expected cost of reaching the goal from s

- Choose the best, π*
- SSP optimality principle tells us that a best exists
- $V^{\pi^*}(s) \le V^\pi(s)$ for all s and π
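
The enumeration step can be sketched with `itertools.product`, which makes the exponential count concrete. The state and action names below are placeholders chosen for the example.

```python
from itertools import product


def enumerate_policies(states, actions):
    """Yield every deterministic stationary policy as a dict state -> action."""
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))


states = ["s0", "s1", "s2"]  # hypothetical non-goal states
actions = ["a0", "a1"]       # hypothetical actions
policies = list(enumerate_policies(states, actions))
num_policies = len(policies)  # equals len(actions) ** len(states)
```

Adding one more state doubles the number of policies here, which is why brute force is only a conceptual baseline.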

- Given a policy π, compute Vπ
- To start out, assume that π is proper
- Execution of π reaches a goal from any state

- Policy graph for π
- π(s0) = a0; π(s1) = a1

- Vπ(s1) = 1
- Vπ(s0) = 5 + 1 = 6

- Policy graph for π
- Vπ(s1) = 1
- Vπ(s2) = 4
- Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6

- Policy graph for π
- Vπ(s1) = 1
- Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0))
- Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))

- Generalized system of equations
- Vπ(sg) = 0
- Vπ(s1) = 1 + Vπ(sg)
- Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))
- Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))

- Constructing the system of equations
- Vπ(s) = 0 if s ∈ G
- $V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,[C(s, \pi(s), s') + V^\pi(s')]$ otherwise

- |S| variables
- $O(|S|^3)$ running time
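
As a sketch of the system-of-equations approach, the four-state example above can be solved directly. With Vπ(sg) = 0 substituted in, the three remaining equations are linear in Vπ(s0), Vπ(s1), Vπ(s2). The small Gaussian-elimination routine below is a generic textbook solver written for this illustration, not something taken from the slides.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Unknowns ordered as [V(s0), V(s1), V(s2)]; V(sg) = 0 is already substituted.
#   V(s0) - 0.6 V(s1) - 0.4 V(s2) = 0.6*5 + 0.4*2 = 3.8
#   V(s1)                         = 1
#   V(s2) - 0.3 V(s0)             = 0.7*4 + 0.3*3 = 3.7
A = [
    [1.0, -0.6, -0.4],
    [0.0, 1.0, 0.0],
    [-0.3, 0.0, 1.0],
]
b = [3.8, 1.0, 3.7]
v_s0, v_s1, v_s2 = solve_linear_system(A, b)
```

Solving by hand gives the same result: substituting the other two equations into the first yields Vπ(s0) = 5.88 + 0.12 Vπ(s0), so Vπ(s0) = 5.88 / 0.88.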

- $V_n^\pi(s) \leftarrow \sum_{s' \in S} T(s, \pi(s), s')\,[C(s, \pi(s), s') + V_{n-1}^\pi(s')]$
- Iterative solution

- $V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,[C(s, \pi(s), s') + V^\pi(s')]$
- System of equations, for comparison

- For a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e. $\lim_{n \to \infty} V_n^\pi = V^\pi$, irrespective of the initialization $V_0^\pi$

- Residual: the magnitude of the change in the value of state s at iteration n in the algorithm.
- $\text{residual}_n(s) = |V_n^\pi(s) - V_{n-1}^\pi(s)|$
- $\text{residual}_n = \max_{s \in S} \text{residual}_n(s)$

- ϵ-consistency: the residual of the value function in iteration n + 1 is less than ϵ
- A state s is ϵ-consistent if the value function is ϵ-consistent at s

- When iterative policy evaluation is run to ϵ-consistency, $V_n^\pi$ satisfies the following inequality: $\forall s \in S: |V_n^\pi(s) - V^\pi(s)| < \epsilon N^\pi(s)$, where $N^\pi(s)$ is the expected number of steps to reach a goal from s by following π

Each iteration takes $O(|S|^2)$ time
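
Iterative policy evaluation with a residual-based stopping rule can be sketched as below, using the same four-state example. The dictionary layout (`T[s]` and `C[s]` keyed by successor and already restricted to the action π(s)) and the default ϵ are assumptions made for this illustration.

```python
def iterative_policy_evaluation(states, goals, T, C, epsilon=1e-8):
    """Evaluate a proper policy by repeated backups until the residual < epsilon."""
    V = {s: 0.0 for s in states}  # any initialization works for a proper policy
    while True:
        new_V = {}
        for s in states:
            if s in goals:
                new_V[s] = 0.0  # goal states are absorbing and cost-free
            else:
                new_V[s] = sum(p * (C[s][s2] + V[s2])
                               for s2, p in T[s].items())
        # Largest change of any state's value in this iteration
        residual = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if residual < epsilon:
            return V


states = ["s0", "s1", "s2", "sg"]
goals = {"sg"}
T = {"s0": {"s1": 0.6, "s2": 0.4},
     "s1": {"sg": 1.0},
     "s2": {"sg": 0.7, "s0": 0.3}}
C = {"s0": {"s1": 5.0, "s2": 2.0},
     "s1": {"sg": 1.0},
     "s2": {"sg": 4.0, "s0": 3.0}}
V = iterative_policy_evaluation(states, goals, T, C)
```

The loop never terminates for an improper policy with positive costs, since the values of states that cannot reach the goal keep growing; that is why the slides start from the assumption that π is proper.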

- MDP Definition (2.1)
- Solutions of an MDP (2.2)
- Solution Existence (2.3)
- Stochastic Shortest-Path MDPs (2.4)
- Complexity of Solving MDPs (2.6)

- Brute-Force Algorithm (3.1)
- Policy Evaluation (3.2)

- General components of an MDP
- Three types of MDPs (FH, IHDR, SSP)
- Utility and cost (ELAU)
- Ways to solve MDPs (Sys of equations, Iterative)