Passive Reinforcement Learning Ruti Glick, Bar-Ilan University
Passive Reinforcement Learning • We assume the environment is fully observable • The agent has a fixed policy π • It always executes π(s) • Goal – to learn how good the policy is • similar to policy evaluation • But – the agent doesn't have all the knowledge • It doesn't know the transition model T(s, a, s') • It doesn't know the reward function R(s)
[Figure: the familiar 4x3 grid world, with start state (1,1), terminal reward +1 at (4,3) and -1 at (4,2)] Example • Our familiar 4x3 world • The policy is known (shown on the grid in the original slide) • The agent executes trials using this policy • Each trial starts at (1,1) and experiences a sequence of states until a terminal state is reached
Example (cont.) • Typical trials might be: • (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1 • (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1 • (1,1)-.04 → (2,1)-.04 → (3,1)-.04 → (3,2)-.04 → (4,2)-1 • (the number after each state is the reward received in that state)
The goal • The utility Uπ(s): • the expected sum of discounted rewards obtained when following policy π from s (written out below) • Learning may also include building a model of the environment
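Written out, this is the standard definition (not spelled out on the slide), with discount factor γ and R(s_t) the reward received in state s_t:

U^{\pi}(s) \;=\; E\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t) \;\middle|\; s_0 = s,\ \text{actions chosen by } \pi \right]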
Algorithms • Direct utility estimation (DUE) • Adaptive dynamic programming (ADP) • Temporal difference (TD)
Direct utility estimation • Idea: • The utility of a state is the expected total reward from that state onward • Each trial supplies one or more samples of this value for every visited state • Reward-to-go (of a state): • the sum of the rewards from that state until a terminal state is reached
Example • For the first trial (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1, the observed rewards-to-go are: • U(1,1) = 0.72 • U(1,2) = 0.76, 0.84 • U(1,3) = 0.80, 0.88 • U(2,3) = 0.92 • U(3,3) = 0.96
Algorithm • Run over a sequence of states (following the policy) • Calculate the observed "reward-to-go" for every visited state • Keep a running average of each state's utility in a table (a minimal sketch follows below)
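A minimal Python sketch of direct utility estimation, assuming trials are given as lists of (state, reward) pairs and γ = 1 as in the 4x3 world (the function name and data layout are illustrative, not from the slides):

from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    # Estimate U(s) under the fixed policy as the average observed reward-to-go.
    # Each trial is a list of (state, reward) pairs.
    totals = defaultdict(float)   # sum of observed rewards-to-go per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        # Compute the reward-to-go of every position, scanning back to front.
        togo = [0.0] * len(trial)
        running = 0.0
        for i in range(len(trial) - 1, -1, -1):
            _, r = trial[i]
            running = r + gamma * running
            togo[i] = running
        # Every visit to a state contributes one sample of its utility.
        for (state, _), g in zip(trial, togo):
            totals[state] += g
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# The first trial from the slides:
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
U = direct_utility_estimation([trial1])
print(U[(1, 1)], U[(1, 2)])   # about 0.72, and 0.80 (the average of 0.76 and 0.84)

Averaging over repeated visits is exactly the "table of average utilities" mentioned on the slide.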
Properties • After an infinite number of trials, the average will converge to the true expectation • Advantage • Easy to compute • No special actions are needed • Disadvantage • This is really just an instance of supervised learning
Disadvantage – expanded • Similarity to supervised learning • Each example has an input (the state) and an output (the observed reward-to-go) • Reduces reinforcement learning to inductive learning • What is lacking • It misses the dependency between neighboring states • Utility of s = reward of s + expected utility of its successors (see the equation below) • It doesn't use the connections between states for learning • It searches a hypothesis space larger than it needs to • The algorithm converges very slowly
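The dependency being missed is the fixed-policy Bellman equation (standard form, using the slides' T and R notation):

U^{\pi}(s) \;=\; R(s) \;+\; \gamma \sum_{s'} T\big(s, \pi(s), s'\big)\, U^{\pi}(s')

Direct utility estimation ignores these constraints, which is why it converges slowly.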
Example • Second trial: (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1 • (3,2) hasn't been seen before • (3,3) has been visited before and already has a high utility • Direct utility estimation learns about (3,2) only at the end of the sequence, even though its connection to (3,3) already tells us a lot • It searches over too many possibilities
Adaptive dynamic programming • Takes advantage of the connections between states • Learn the transition model • Solve the Markov decision process • Run the known policy • Estimate T(s, π(s), s') from the observed transitions • Get R(s) from the observed states • Calculate the utilities of the states • Plug T(s, π(s), s') and R(s) into the Bellman equations • Solve the resulting linear equations • Alternatively, use a simplified value iteration
Example • In our three trials, the action Right is executed 3 times in (1,3) • In 2 of these cases the resulting state is (2,3) • So T((1,3), Right, (2,3)) is estimated as 2/3
The algorithm

function PASSIVE-ADP-AGENT(percept) returns an action
  input: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          Nsa, a table of frequencies for state-action pairs, initially zero
          Nsas', a table of frequencies for state-action-state triples, initially zero
          s, a, the previous state and action, initially null

  if s' is new then U[s'] ← r'; R[s'] ← r'
  if s is not null then
      increment Nsa[s,a] and Nsas'[s,a,s']
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← VALUE-DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s,a ← null else s,a ← s', π[s']
  return a
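A rough Python rendering of the same agent, assuming a dictionary-based policy and replacing VALUE-DETERMINATION with a few sweeps of the fixed-policy Bellman update (the class and method names are illustrative choices, not from the slides):

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, pi, gamma=1.0):
        self.pi = pi                      # fixed policy: state -> action
        self.gamma = gamma
        self.U = {}                       # utility estimates
        self.R = {}                       # observed rewards
        self.N_sa = defaultdict(int)      # counts of (s, a)
        self.N_sas = defaultdict(int)     # counts of (s, a, s')
        self.T = defaultdict(dict)        # T[(s, a)][s'] = estimated probability
        self.s = self.a = None            # previous state and action

    def perceive(self, s_new, r_new, terminal=False):
        if s_new not in self.U:
            self.U[s_new] = r_new
            self.R[s_new] = r_new
        if self.s is not None:
            # Update the counts and re-estimate the transition model.
            self.N_sa[(self.s, self.a)] += 1
            self.N_sas[(self.s, self.a, s_new)] += 1
            for (s, a, t), n in self.N_sas.items():
                if (s, a) == (self.s, self.a):
                    self.T[(s, a)][t] = n / self.N_sa[(s, a)]
            self._value_determination()
        if terminal:
            self.s = self.a = None
        else:
            self.s, self.a = s_new, self.pi[s_new]
        return self.a

    def _value_determination(self, sweeps=20):
        # Simplified value determination: a few sweeps of the fixed-policy
        # Bellman update instead of solving the linear system exactly.
        for _ in range(sweeps):
            for s in self.U:
                probs = self.T.get((s, self.pi.get(s)), {})
                self.U[s] = self.R[s] + self.gamma * sum(
                    p * self.U.get(t, 0.0) for t, p in probs.items())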
Properties • Can be seen as supervised learning • input = state-action pair • output = resulting state • Learning the model is easy • because the environment is fully observable • The algorithm does as well as possible • It provides a standard against which to measure other reinforcement learning algorithms • However, it is intractable for large state spaces • in backgammon it would mean solving 10^50 equations with 10^50 unknowns • Disadvantage – a lot of work on each iteration (the MDP is re-solved after every observation)
Temporal difference learning • Best of both worlds • Approximates the constraint equations • No need to solve equations for all possible states • Method • Run according to policy π • Use the observed transitions to adjust the utilities so that they agree with the constraint equations
Example • As a result of the first trial • Uπ(1,3) = 0.84 • Uπ(2,3) = 0.92 • If the transition (1,3) → (2,3) always occurred, we would expect U(1,3) = -0.04 + U(2,3) = 0.88 • So the current estimate of 0.84 is a bit low and should be increased
In practice • When a transition from s to s' is observed, apply the • update equation: Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s)) • α is the learning-rate parameter • This is called temporal-difference learning because the update rule uses the difference in utilities between successive states (a minimal sketch follows below)
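A minimal Python sketch of this single update (the function name and the fixed α = 0.1 are illustrative assumptions):

def td_update(U, s, s_next, r, alpha, gamma=1.0):
    # One temporal-difference update toward the target r + gamma * U[s_next].
    U[s] = U[s] + alpha * (r + gamma * U[s_next] - U[s])

# The example above: observing (1,3) -> (2,3) with R((1,3)) = -0.04.
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), r=-0.04, alpha=0.1)
print(U[(1, 3)])   # about 0.844: moved from 0.84 toward the target 0.88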
The algorithm

function PASSIVE-TD-AGENT(percept) returns an action
  input: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          U, a table of utilities, initially empty
          Ns, a table of frequencies for states, initially zero
          s, a, r, the previous state, action, and reward, initially null

  if s' is new then U[s'] ← r'
  if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s,a,r ← null else s,a,r ← s', π[s'], r'
  return a
Properties • The update involves only the observed successor s' • It doesn't take all possible successors into account • but remains efficient over a large number of transitions • Does not learn the model • The environment supplies the connection between neighboring states in the form of observed transitions • The average value of Uπ(s) will converge to the correct value
Quality • The average value of Uπ(s) will converge to the correct value • If α is defined as a function α(n) that decreases as the number of times a state has been visited increases, then U(s) itself will converge to the correct value • We require: Σₙ α(n) = ∞ and Σₙ α(n)² < ∞ • The function α(n) = 1/n satisfies these conditions (see the sketch below)
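A quick check of the α(n) = 1/n choice: the incremental update x ← x + (1/n)(sample − x) is exactly a running sample mean (a small illustrative Python sketch; the function name is an assumption):

def running_mean(samples):
    # Incremental averaging with alpha(n) = 1/n:  x <- x + (1/n) * (sample - x)
    x, n = 0.0, 0
    for sample in samples:
        n += 1
        x += (1.0 / n) * (sample - x)
    return x

# The two observed rewards-to-go for state (1,2) in the first trial:
print(running_mean([0.76, 0.84]))   # 0.80, the same as their plain average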
TD vs. ADP • TD: • Doesn't learn as fast as ADP • Shows higher variability than ADP • Simpler than ADP • Much less computation per observation than ADP • Does not need a model to perform its updates • Adjusts a state to agree with its observed successor (instead of with all successors, as ADP does) • TD can be viewed as a crude, yet efficient, first approximation to ADP
TD vs. ADP – the core of each agent, side by side

PASSIVE-ADP-AGENT:
  if s' is new then U[s'] ← r'; R[s'] ← r'
  if s is not null then
      increment Nsa[s,a] and Nsas'[s,a,s']
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← VALUE-DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s,a ← null else s,a ← s', π[s']
  return a

PASSIVE-TD-AGENT:
  if s' is new then U[s'] ← r'
  if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s,a,r ← null else s,a,r ← s', π[s'], r'
  return a