Learning-based methods for MDP and POMDP (Reinforcement Learning & Q-Learning) • Presented by Niketan Pansare (np6@rice.edu) • Rice University
Outline - Reinforcement Learning - Exploration-Exploitation tradeoff - Q-Learning and SARSA - Learning-based methods to POMDP - POMDP with noisy sensor - Offline Learning
Quick recap
[Diagram: agent-environment loop. The environment is a stochastic finite state machine with transition function T(.) taking St to St+1; the agent sends an action and receives a short-term reward; the agent's policy is ∏(St) → At]
How long do we care about rewards?
- Finite horizon
- Infinite horizon (discounted or average)
Why discount?
- Economic ($1 today ≈ $0.80 tomorrow)
- Limited lifetime of the agent
- Mathematical convenience
What actions should the agent take so as to maximize long-term reward?
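For reference, the two infinite-horizon criteria mentioned above, written out in their standard textbook form (these formulas are standard definitions, not taken from the original slide):

V^\pi(s) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, s_0 = s \Big], \quad 0 \le \gamma < 1 \quad \text{(discounted)}

\rho^\pi(s) = \lim_{n \to \infty} \frac{1}{n} \, \mathbb{E}\Big[ \sum_{t=0}^{n-1} R_t \,\Big|\, s_0 = s \Big] \quad \text{(average reward)}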
Quick recap
[Same diagram: environment with transition T(.) from St to St+1; the agent receives an observation and a reward, sends an action, and follows policy ∏(St) → At]
- Modeled by an MDP <S, A, T, R>: the agent has access to the state St (St is trivially recovered from the observation Ot)
- Modeled by a POMDP: knowledge of the environment is uncertain
Focus of reinforcement learning: we don't know the model (plus the fundamental problems on the next slide). Note also that even with a known model, computation time of policy iteration is O(states^2) = O(2^2k) when the state is described by k variables.
For ML people: the supervisor provides labels of the form <input, some output, grade> instead of <input, correct output>. E.g., in chess: <board, arbitrary move, how good the move is> instead of <board, move to win>.
Fundamental problems in reinforcement learning
[Same diagram: environment T(.), St → St+1; observation, action, reward; agent policy ∏(St) → At]
1. Exploration v/s exploitation: should I improve my knowledge now so that later I get more rewards, or take the actions that maximize long-term reward given what I already know?
2. Delayed reward (e.g., chess: a win, or a queen sacrifice, pays off only many moves later)
3. Generalization
Outline - Reinforcement Learning - Exploration-Exploitation tradeoff - Q-Learning and SARSA - Learning-based methods to POMDP - POMDP with noisy sensor - Offline Learning
Exploration v/s Exploitation tradeoff
- n-arm bandit = n actions, with uncertain knowledge about Mean(reward_i) of each arm
- Planning (selecting the optimal policy) is trivial once the means are known; it is during learning that we face the exploration-exploitation tradeoff
Epsilon-greedy algorithm
- With probability 1 - epsilon: select the best arm based on previous observations (Exploit)
- With probability epsilon: select an arm at random (Explore)
The distinction between planning and learning is blurry in reinforcement learning! An R simulation of epsilon-greedy follows.
# R program to simulate the epsilon-greedy algorithm for a 5-arm bandit
library(ggplot2)

trueMeans <- c(10, 12, 15, 20, 18)   # true mean reward of each arm
trueSd    <- c(10,  5,  8, 10,  5)   # true reward std-dev of each arm

# Stochastic reward of pulling an arm
reward <- function(arm) { rnorm(1, mean = trueMeans[arm], sd = trueSd[arm]) }

myTempFile <- "/Users/niketan/Desktop/myTemp.dat"
cat("", file = myTempFile, append = FALSE)

for (epsilon in c(0.99, 0.5, 0.2, 0.1)) {
  # Reset the estimates for each epsilon so the runs are comparable
  approxSum <- rep(sum(trueMeans), length(trueMeans))  # initial estimates (start high)
  approxNum <- rep(1, length(trueMeans))
  maxArm <- 1
  sumRewardTillNow <- 0
  for (i in 1:100) {
    if (rbinom(1, 1, epsilon)) {
      # Explore: pull a randomly chosen arm and update its estimated mean
      arm <- sample(1:5, 1)
      myReward <- reward(arm)
      sumRewardTillNow <- sumRewardTillNow + myReward
      approxSum[arm] <- approxSum[arm] + myReward
      approxNum[arm] <- approxNum[arm] + 1
      if (approxSum[arm] / approxNum[arm] > approxSum[maxArm] / approxNum[maxArm]) {
        maxArm <- arm
      }
    } else {
      # Exploit: pull the arm with the best estimated mean so far
      sumRewardTillNow <- sumRewardTillNow + reward(maxArm)
    }
    cat(epsilon, i, sumRewardTillNow / i, "\n", file = myTempFile, append = TRUE)
  }
}

dat1 <- read.table(myTempFile)
names(dat1) <- c("epsilon", "t", "expectedReward")
ggplot(data = dat1, aes(x = t, y = expectedReward, colour = epsilon)) +
  facet_wrap(~ epsilon) + geom_line()
Exploration-exploitation
Boltzmann (softmax) exploration:
→ Sensitive to action-values (actions with higher estimated value are tried more often)
→ Infinite exploration: hold the temperature fixed
→ Move toward pure exploitation: decrease the temperature over time
∊-greedy exploration:
→ Insensitive to action-values (all non-greedy actions are equally likely)
→ Infinite exploration: hold ∊ fixed
→ Move toward pure exploitation: decrease ∊ over time
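A minimal sketch of Boltzmann (softmax) action selection in R, assuming a vector of current action-value estimates Qvals and a temperature parameter (the function and variable names are my own illustration, not from the slides):

# Boltzmann (softmax) action selection over estimated action-values
boltzmannAction <- function(Qvals, temperature) {
  # Subtract max(Qvals) for numerical stability before exponentiating
  prefs <- exp((Qvals - max(Qvals)) / temperature)
  probs <- prefs / sum(prefs)
  sample(seq_along(Qvals), 1, prob = probs)   # higher-valued actions are chosen more often
}

# Example: high temperature behaves near-uniformly, low temperature near-greedily
Qvals <- c(10, 12, 15, 20, 18)
table(replicate(1000, boltzmannAction(Qvals, temperature = 5)))
table(replicate(1000, boltzmannAction(Qvals, temperature = 0.5)))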
Outline - Reinforcement Learning - Exploration-Exploitation tradeoff - Q-Learning and SARSA - Learning-based methods to POMDP - POMDP with noisy sensor - Offline Learning
Stochastic Approximation
Deterministic algorithm: xt+1 = (1 - αt) xt + αt F(xt)
We only have access to a stochastic estimate of F(xt): F(xt) + noise, where the noise has zero mean and bounded variance.
Stochastic algorithm: xt+1 = (1 - αt) xt + αt (F(xt) + noise)
Under the usual step-size conditions (Σ αt = ∞, Σ αt² < ∞), the stochastic algorithm converges with probability 1 to the same limit as the deterministic algorithm.
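A toy illustration of this result in R (my own example, not from the slides): with step sizes αt = 1/t, which satisfy the conditions above, the noisy iteration converges to the same limit (here, the true mean of a reward distribution) as its noise-free counterpart.

# Stochastic approximation demo: estimate a mean from noisy samples
set.seed(1)
trueMean <- 15
x <- 0                                       # initial estimate
for (t in 1:10000) {
  alpha <- 1 / t                             # sum(alpha) diverges, sum(alpha^2) converges
  obs <- rnorm(1, mean = trueMean, sd = 8)   # stochastic estimate: trueMean + zero-mean noise
  x <- (1 - alpha) * x + alpha * obs         # same form as the update above
}
x   # close to trueMean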
Quick recap
[Agent-environment loop once more: environment with transition T(.) from St to St+1; the agent receives an observation and a reward, sends an action, and follows policy ∏(St) → At. What actions should it take so as to maximize long-term reward?]
Model-free learning using Q-Learning
Q(s, a) = expected discounted reward of executing action a in state s and behaving optimally afterwards
        = E[ r + γ maxa' Q(s', a') ]   (expected discounted reward of the next state, using the best available action a')
Q-Learning algorithm:
  Initialize Q(s, a)
  Loop:
    Choose a using the ∊-greedy policy (Exploration)
    Execute a, observe the reward r and the new state s'
    Update using: Q(s, a) ← Q(s, a) + αt [ r + γ maxa' Q(s', a') - Q(s, a) ]
This update rule + the stochastic approximation theorem ensures convergence.
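A minimal runnable sketch of tabular Q-learning in R on a made-up 5-state chain MDP (the environment, rewards, and parameter values are my own toy example, not from the slides):

# Tabular Q-learning on a toy 5-state chain: actions are left/right,
# reward 10 for reaching state 5, 0 otherwise (episode ends at state 5).
set.seed(1)
nStates <- 5; nActions <- 2            # action 1 = left, action 2 = right
gamma <- 0.9; alpha <- 0.1; epsilon <- 0.1
Q <- matrix(0, nrow = nStates, ncol = nActions)

step <- function(s, a) {
  sNext <- if (a == 2) min(s + 1, nStates) else max(s - 1, 1)
  list(sNext = sNext, r = if (sNext == nStates) 10 else 0, done = (sNext == nStates))
}

greedyAction <- function(s) {
  best <- which(Q[s, ] == max(Q[s, ]))       # break ties at random
  best[sample(length(best), 1)]
}

for (episode in 1:500) {
  s <- 1
  repeat {
    # epsilon-greedy action selection (exploration)
    a <- if (runif(1) < epsilon) sample(1:nActions, 1) else greedyAction(s)
    out <- step(s, a)
    # Q-learning update: off-policy, bootstraps from the best next action
    Q[s, a] <- Q[s, a] + alpha * (out$r + gamma * max(Q[out$sNext, ]) - Q[s, a])
    s <- out$sNext
    if (out$done) break
  }
}
round(Q, 2)   # "right" (column 2) should dominate in every non-terminal state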
SARSA(λ)
Problem with Q-Learning: it ignores the policy the agent is actually following, including the good policy it has already discovered, because it is an "off-policy" algorithm.
SARSA(λ): three modifications to the update rule of Q-Learning (see the sketch after this slide):
1. Use the next action actually chosen by the ∊-greedy policy (i.e., remove the "max") ⇒ hence an "on-policy" algorithm
2. Apply the update to all states and actions.
3. Multiply the update of those other state-action pairs (modification 2, not modification 1) by an exponentially decaying factor.
What about POMDPs?
- Replace Q(s, a) by Q(o, a)
- Called "reactive memoryless" policies [Littman, 1995]
- Assumes the existence of a POMDP, hence a model-based approach
- Another model-based approach: treat the POMDP as an HMM without actions [Chrisman, 1992]
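A sketch of one SARSA(λ) update step in R, combining the three modifications above (the matrix shapes, accumulating traces, and parameter values are illustrative assumptions, not code from the slides):

# One SARSA(lambda) update for transition (s, a, r, sNext, aNext).
# Q and E are |S| x |A| matrices of action-values and eligibility traces.
# aNext is the action the epsilon-greedy policy actually takes next (on-policy: no "max").
sarsaLambdaUpdate <- function(Q, E, s, a, r, sNext, aNext,
                              alpha = 0.1, gamma = 0.9, lambda = 0.8) {
  delta <- r + gamma * Q[sNext, aNext] - Q[s, a]   # TD error w.r.t. the action actually taken
  E[s, a] <- E[s, a] + 1                           # mark (s, a) as eligible
  Q <- Q + alpha * delta * E                       # modification 2: update all states and actions
  E <- gamma * lambda * E                          # modification 3: exponentially decay the traces
  list(Q = Q, E = E)
}
# Usage: res <- sarsaLambdaUpdate(Q, E, s, a, r, sNext, aNext); Q <- res$Q; E <- res$E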
Outline - Reinforcement Learning - Exploration-Exploitation tradeoff - Q-Learning and SARSA - Learning-based methods to POMDP - POMDP with noisy sensor - Offline Learning
Model-free methods for POMDP:
• Store the most recent observations and use Q-learning/SARSA over them (a toy sketch of this idea follows the list). How recent?
• Finite fixed-length histories
  • History window + branch & bound [Littman, 1994]
  • 2-3 observations + SARSA [Loch & Singh, 1998]
• Variable-length histories [McCallum, 1996, dissertation]
  • Store past interactions in terms of instances, It = <It-1, At, Rt, Ot>
  • Q(It, A)-learning using kNN → Nearest Sequence Memory (NSM)
    • Converged faster than a model-based approach (HMM)
  • Tree structure → Utile Suffix Memory (USM)
    • Leaves hold the Q-values of instances
    • Split a node if there is a statistical difference in the Q-values of its children → Kolmogorov-Smirnov test
    • Q-learning using the ∊-greedy policy
• Use Bayesian learning to select from a set of candidate models (possible history trees) [Suematsu et al., 1997, 1999]
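A toy sketch of the first idea: tabular Q-learning keyed on a fixed window of the last k observations instead of the hidden state (all names, helpers, and observation strings are my own illustration, not from the cited papers):

# Q-learning over a fixed-length observation history (window of the last k observations).
# Q-values are stored in an environment keyed by "history + action" strings.
k <- 2
Qtable <- new.env()
qGet <- function(key, a) {
  nm <- paste(key, a)
  if (exists(nm, envir = Qtable, inherits = FALSE)) get(nm, envir = Qtable) else 0
}
qSet <- function(key, a, v) assign(paste(key, a), v, envir = Qtable)
historyKey <- function(obsWindow) paste(obsWindow, collapse = "|")

# One learning step, given the last k observations before (obsWindow) and after
# (obsWindowNext) taking action a and receiving reward r.
updateStep <- function(obsWindow, a, r, obsWindowNext, nActions,
                       alpha = 0.1, gamma = 0.9) {
  key <- historyKey(obsWindow); keyNext <- historyKey(obsWindowNext)
  bestNext <- max(sapply(1:nActions, function(b) qGet(keyNext, b)))
  qSet(key, a, qGet(key, a) + alpha * (r + gamma * bestNext - qGet(key, a)))
}

# Example call with made-up observations:
updateStep(c("wall", "door"), a = 1, r = 0, c("door", "open"), nActions = 2)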
POMDP with noisy sensors
[Diagram: the environment (transition T(.), St → St+1) has state features; each sensor is designed to detect the state of a certain feature of the environment; the agent receives observations and rewards, sends actions, and follows policy ∏(St) → At]
- Noise: multiple sensor readings can correspond to the same feature, each with some probability; this noise model is modeled offline
- USM can detect false positives (a different observation at the same location) using the agent's history, but not false negatives
- So, noisy USM (NUSM) uses a "noisy sim"
Model-based offline learning
• A model-free method on a noisy POMDP is only a crude approximation
• The following method works better:
  • Learn the state space by executing the model-free method.
  • Compute the transition and observation functions (see the sketch after this list) by either:
    • using the already gathered experience, or
    • executing the policy of the model-free method with some exploration.
  • Use the resulting POMDP.
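A toy sketch of computing the transition and observation functions from gathered experience by normalized counts, assuming a logged data frame of (state, action, nextState, obs) tuples produced by the model-free phase (the column names and data here are illustrative assumptions):

# Estimate T(s' | s, a) and O(o | s') from logged experience by normalized counts
estimateModel <- function(experience) {
  Tcounts <- table(experience$state, experience$action, experience$nextState)
  Ocounts <- table(experience$nextState, experience$obs)
  That <- prop.table(Tcounts, margin = c(1, 2))   # P(s' | s, a); NaN for unvisited (s, a)
  Ohat <- prop.table(Ocounts, margin = 1)         # P(o | s')
  list(T = That, O = Ohat)
}

# Example with a few fake logged tuples
experience <- data.frame(state = c(1, 1, 2, 2), action = c(1, 1, 2, 1),
                         nextState = c(2, 1, 2, 1), obs = c("a", "b", "a", "b"))
estimateModel(experience)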
Summary - Reinforcement Learning and how it fits into our MDP/POMDP framework - Exploration-Exploitation tradeoff - Q-Learning and its variant SARSA - Model-based/Model-free learning approaches to POMDP - Dealing with noisy sensors - Offline learning
References
- Guy Shani, PhD thesis, Chapter 3
- Reinforcement Learning: Tutorial, by Satinder Singh
- Reinforcement Learning: An Introduction, by Sutton and Barto, http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
- Technical Note: Q-Learning, by Watkins and Dayan
- Planning with Markov Decision Processes, by Mausam and Kolobov