
Learning-based methods for MDP and POMDP (Reinforcement Learning & Q-Learning)

Presentation Transcript


  1. Learning-based methods for MDP and POMDP (Reinforcement Learning & Q-Learning) • Presented by Niketan Pansare (np6@rice.edu) • Rice University

  2. Outline
  - Reinforcement Learning
  - Exploration-Exploitation tradeoff
  - Q-Learning and SARSA
  - Learning-based methods for POMDP
  - POMDP with noisy sensors
  - Offline Learning

  3. Quick recap
  - Environment: a stochastic finite state machine T(.): St → St+1 (input: the agent's action; output: an observation + a short-term reward)
  - Agent: policy ∏(St) → At
  - Question: what actions to take so as to maximize long-term reward?
  - How long? Finite horizon, or infinite horizon (discounted or average reward)
  - Why discount? Economic ($1 today vs. $0.80 tomorrow), limited lifetime of the agent, mathematical convenience
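  For concreteness, the three horizon criteria above correspond to the following standard long-term reward objectives (written out here for reference; these formulas are not on the original slide):

    \text{Finite horizon: } \mathbb{E}\Big[\sum_{t=0}^{H-1} r_t\Big] \qquad
    \text{Discounted: } \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big],\ 0 \le \gamma < 1 \qquad
    \text{Average: } \lim_{H \to \infty} \tfrac{1}{H}\, \mathbb{E}\Big[\sum_{t=0}^{H-1} r_t\Big]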

  4. Quick recap
  - If the agent has access to the state St (St is trivially recovered from the observation Ot), the environment is modeled by an MDP <S,A,T,R>
  - If knowledge of the environment is uncertain, it is modeled by a POMDP
  - Two problems: (1) we don't know the model T(.); (2) computation time: policy iteration is O(|states|^2) = O(2^2k) when the state is described by k variables
  - These two problems are the focus of reinforcement learning
  - For ML people: the supervisor labels <input, some output, grade> instead of <input, correct output>. E.g., in chess: <board, arbitrary move, how good the move is> instead of <board, move to win>

  5. Three fundamental problems in reinforcement learning
  1. Exploration vs. exploitation: should I improve my knowledge so that later I get more rewards, or use what I already know to maximize long-term reward now?
  2. Delayed reward (e.g., in chess: winning the game vs. sacrificing the queen)
  3. Generalization

  6. Outline
  - Reinforcement Learning
  - Exploration-Exploitation tradeoff
  - Q-Learning and SARSA
  - Learning-based methods for POMDP
  - POMDP with noisy sensors
  - Offline Learning

  7. Exploration vs. exploitation tradeoff
  - n-armed bandit = n actions, with uncertain knowledge about Mean(reward_i)
  - Planning (selecting the optimal policy) is trivial; learning is faced with the exploration-exploitation tradeoff
  - Epsilon-greedy algorithm:
    - With probability 1 - epsilon, select the best arm based on previous observations (Exploit)
    - With probability epsilon, select one of the other arms at random (Explore)
  - The distinction between planning and learning is blurry in reinforcement learning!

  8. # R program to simulate the epsilon-greedy algorithm
  library(ggplot2)

  trueMeans <- c(10, 12, 15, 20, 18)      # true mean reward of each arm
  trueSd    <- c(10, 5, 8, 10, 5)         # true reward standard deviation of each arm

  reward <- function(arm) rnorm(1, mean = trueMeans[arm], sd = trueSd[arm])

  myTempFile <- "/Users/niketan/Desktop/myTemp.dat"
  cat("", file = myTempFile, append = FALSE)

  for (epsilon in c(0.99, 0.5, 0.2, 0.1)) {
    # reset the estimates for each epsilon so the settings are compared independently
    approxSum <- rep(sum(trueMeans), length(trueMeans))   # optimistic initial sums
    approxNum <- rep(1, length(trueMeans))
    maxArm <- 1
    sumRewardTillNow <- 0
    for (i in 1:100) {
      if (rbinom(1, 1, epsilon)) {
        # Explore: pull a random arm and update its estimated mean
        arm <- sample(1:5, 1)
        myReward <- reward(arm)
        sumRewardTillNow <- sumRewardTillNow + myReward
        approxSum[arm] <- approxSum[arm] + myReward
        approxNum[arm] <- approxNum[arm] + 1
        if (approxSum[arm] / approxNum[arm] > approxSum[maxArm] / approxNum[maxArm]) {
          maxArm <- arm
        }
      } else {
        # Exploit: pull the arm with the best estimated mean so far
        sumRewardTillNow <- sumRewardTillNow + reward(maxArm)
      }
      cat(epsilon, i, sumRewardTillNow / i, "\n", file = myTempFile, append = TRUE)
    }
  }

  dat1 <- read.table(myTempFile)
  names(dat1) <- c("epsilon", "t", "expectedReward")
  ggplot(data = dat1, aes(x = t, y = expectedReward, colour = epsilon)) +
    facet_wrap(~ epsilon) + geom_line()

  9. Exploration-exploitation
  Boltzmann exploration: select action a with probability proportional to exp(Q(a)/T), for temperature T
  → Sensitive to the action values
  → Infinite exploration: hold the temperature fixed
  → Exploitation: decrease the temperature over time
  ∊-greedy exploration:
  → Insensitive to the action values
  → Infinite exploration: hold ∊ fixed
  → Exploitation: decrease ∊ over time
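  A minimal R sketch of Boltzmann (softmax) action selection, to make the "sensitive to action values" point concrete. The action-value estimates and temperatures below are illustrative values of my own, not from the slides.

    # Boltzmann (softmax) exploration over estimated action values Q with temperature tau
    boltzmannSelect <- function(Q, tau) {
      p <- exp(Q / tau)
      p <- p / sum(p)                    # selection probabilities depend on the action values
      sample(seq_along(Q), 1, prob = p)
    }
    Qhat <- c(10, 12, 15, 20, 18)
    table(replicate(1000, boltzmannSelect(Qhat, tau = 100)))   # high temperature: close to uniform (explore)
    table(replicate(1000, boltzmannSelect(Qhat, tau = 1)))     # low temperature: concentrates on arm 4 (exploit)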

  10. Outline
  - Reinforcement Learning
  - Exploration-Exploitation tradeoff
  - Q-Learning and SARSA
  - Learning-based methods for POMDP
  - POMDP with noisy sensors
  - Offline Learning

  11. Stochastic approximation
  Deterministic algorithm: xt+1 = (1 - αt) xt + αt F(xt)
  We only have access to a stochastic estimate f(xt) of F(xt): noise with mean zero and bounded variance
  Stochastic algorithm: xt+1 = (1 - αt) xt + αt f(xt)
  If the step sizes satisfy Σ αt = ∞ and Σ αt² < ∞, the stochastic algorithm converges with probability 1 to the same limit as the deterministic algorithm
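  A small R illustration of the theorem's flavor (my own example, under the stated assumptions: mean-zero noise, bounded variance, step sizes αt = 1/t). The stochastic iterate converges to the same limit, the true mean, that its deterministic counterpart would reach.

    set.seed(1)
    trueMean <- 15
    x <- 0                                    # the deterministic iterate x <- x + (1/t) * (trueMean - x) converges to trueMean
    for (t in 1:10000) {
      f <- rnorm(1, mean = trueMean, sd = 8)  # stochastic estimate: trueMean plus mean-zero, bounded-variance noise
      x <- x + (1 / t) * (f - x)              # xt+1 = (1 - αt) xt + αt f(xt), with αt = 1/t
    }
    x                                         # close to 15; converges with probability 1 as t grows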

  12. Quick recap
  - Environment T(.): St → St+1; the agent receives an observation + reward and chooses an action via its policy ∏(St) → At
  - What actions to take so as to maximize long-term reward?

  13. Model-free learning using Q-Learning
  Q(s,a) = expected discounted reward of executing action a in state s and behaving optimally afterwards
  Q-Learning algorithm:
    Initialize Q(s,a)
    Loop:
      Choose a using the ∊-greedy policy (Exploration)
      Execute a, observe the reward r and the new state s'
      Update: Q(s,a) ← Q(s,a) + α [ r + γ maxa' Q(s',a') - Q(s,a) ]
        (γ maxa' Q(s',a'): the expected discounted reward of the next state, using the best available action)
  This update + the stochastic approximation theorem ensures convergence
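  A minimal sketch of this algorithm in R on a hypothetical 2-state, 2-action toy MDP of my own (not from the slides): action 1 stays in the current state, action 2 switches states, and the agent is rewarded for ending up in state 2.

    set.seed(1)
    step <- function(s, a) {                  # toy environment: stay (a=1) or switch (a=2); reward 1 for reaching state 2
      s2 <- if (a == 1) s else 3 - s
      list(s2 = s2, r = if (s2 == 2) 1 else 0)
    }
    Q <- matrix(0, nrow = 2, ncol = 2)        # Q[s, a], initialized to 0
    alpha <- 0.1; gamma <- 0.9; epsilon <- 0.1
    s <- 1
    for (t in 1:5000) {
      a <- if (runif(1) < epsilon) sample(1:2, 1) else which.max(Q[s, ])  # epsilon-greedy choice
      out <- step(s, a)                       # execute a, observe reward and new state
      # off-policy update: the target uses the best available action in the next state
      Q[s, a] <- Q[s, a] + alpha * (out$r + gamma * max(Q[out$s2, ]) - Q[s, a])
      s <- out$s2
    }
    round(Q, 2)                               # greedy policy: switch in state 1, stay in state 2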

  14. SARSA(⋋)
  Problem with Q-Learning: it ignores the already discovered policy when updating (it is an "off-policy" algorithm)
  SARSA(⋋): three modifications to the Q-Learning update rule:
  1. Use the action chosen by the ∊-greedy policy in the target (i.e., remove the "max") ⇒ hence an "on-policy" algorithm
  2. Apply the update to all states and actions.
  3. Multiply the update part by an exponentially decaying factor (this applies to point 2., not point 1.)
  What about POMDP?
  - Replace Q(s, a) by Q(o, a): so-called "reactive memoryless" policies [Littman, 1995]
  - Assumes the existence of a POMDP, hence a model-based approach.
  - Another model-based approach: treat the POMDP as an HMM without actions [Chrisman, 1992]
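  A minimal sketch of these modifications in R, on the same hypothetical toy environment used in the Q-Learning sketch above (again my own illustration, not the presenter's code): the target uses the action actually chosen by the ∊-greedy policy, and an eligibility trace e(s,a) spreads each update over recently visited state-action pairs with exponential decay.

    set.seed(1)
    step <- function(s, a) {                  # same toy environment as in the Q-Learning sketch
      s2 <- if (a == 1) s else 3 - s
      list(s2 = s2, r = if (s2 == 2) 1 else 0)
    }
    Q <- matrix(0, 2, 2); e <- matrix(0, 2, 2)          # action values and eligibility traces
    alpha <- 0.1; gamma <- 0.9; lambda <- 0.8; epsilon <- 0.1
    pick <- function(s) if (runif(1) < epsilon) sample(1:2, 1) else which.max(Q[s, ])
    s <- 1; a <- pick(s)
    for (t in 1:5000) {
      out <- step(s, a)
      a2 <- pick(out$s2)                      # next action from the same epsilon-greedy policy (no "max")
      delta <- out$r + gamma * Q[out$s2, a2] - Q[s, a]
      e[s, a] <- e[s, a] + 1                  # mark the visited state-action pair
      Q <- Q + alpha * delta * e              # apply the update to all states and actions
      e <- gamma * lambda * e                 # exponentially decaying factor
      s <- out$s2; a <- a2
    }
    round(Q, 2)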

  15. Outline
  - Reinforcement Learning
  - Exploration-Exploitation tradeoff
  - Q-Learning and SARSA
  - Learning-based methods for POMDP
  - POMDP with noisy sensors
  - Offline Learning

  16. Model-free methods for POMDP:
  • Store the most recent observations and use Q-Learning/SARSA. How recent?
  • Finite fixed-length histories
    • History window + branch & bound [Littman, 1994]
    • 2-3 observations + SARSA [Loch & Singh, 1998]
  • Variable-length histories [McCallum, 1996 - dissertation]
    • Store past interactions as instances It = <It-1, At, Rt, Ot>
    • Q(It, A)-learning using kNN → Nearest Sequence Memory (NSM)
      • Converged faster than the model-based approach (HMM)
    • Tree structure → Utile Suffix Memory (USM)
      • Leaves hold the Q-values of instances
      • Split a node if there is a statistical difference in the Q-values of its children → Kolmogorov-Smirnov test
      • Q-learning using an ∊-greedy policy
  • Use Bayesian learning to select from a set of candidate models (possible history trees) [Suematsu et al., 1997, 1999]
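  A minimal sketch (my own illustration, not code from the cited papers) of the fixed-length-history idea: key the Q table on the last k observations instead of the hidden state, and then apply the usual Q-Learning/SARSA update to that key.

    k <- 2
    Qtab <- new.env()                         # maps a history key to a vector of action values
    qget <- function(h) if (exists(h, envir = Qtab, inherits = FALSE)) Qtab[[h]] else rep(0, 2)
    histKey <- function(obs) paste(tail(obs, k), collapse = ",")
    # Inside the learning loop one would then do, for example:
    #   h  <- histKey(obsSoFar)
    #   a  <- if (runif(1) < epsilon) sample(1:2, 1) else which.max(qget(h))
    #   ... execute a, observe the reward r and a new observation, append it to obsSoFar ...
    #   h2 <- histKey(obsSoFar)
    #   q  <- qget(h); q[a] <- q[a] + alpha * (r + gamma * max(qget(h2)) - q[a])
    #   Qtab[[h]] <- q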

  17. Noise
  - A sensor is designed to detect the state of a certain feature of the environment
  - Multiple observations can correspond to the same state feature, each with some probability (modeled offline)
  - USM can detect false positives (different observations at the same location) using the agent's history, but not false negatives
  - So noisy USM (NUSM) uses a "noisy" similarity measure

  18. Model-based offline learning
  • A model-free approach to a noisy POMDP is only a crude approximation
  • The following method works better:
    • Learn the state space by executing a model-free method.
    • Compute the transition and observation functions by either:
      • Using the already gathered experience, or
      • Executing the policy of the model-free method with some exploration.
    • Use the resulting POMDP.
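  A minimal sketch (my own, with hypothetical names) of the "use the already gathered experience" step: estimate the transition function T(s, a, s') and the observation function O(s', o) by maximum-likelihood counts over logged tuples <s, a, s', o> collected while running the model-free method. Here `experience` is assumed to be a data frame with columns s, a, s2, o.

    estimateModel <- function(experience, nS, nA, nO) {
      Tr <- array(0, dim = c(nS, nA, nS))     # Tr[s, a, s'] transition counts
      Ob <- matrix(0, nrow = nS, ncol = nO)   # Ob[s', o] observation counts
      for (i in seq_len(nrow(experience))) {
        Tr[experience$s[i], experience$a[i], experience$s2[i]] <-
          Tr[experience$s[i], experience$a[i], experience$s2[i]] + 1
        Ob[experience$s2[i], experience$o[i]] <- Ob[experience$s2[i], experience$o[i]] + 1
      }
      # normalize the counts into probabilities (rows with no data stay all zero)
      for (s in 1:nS) for (a in 1:nA) {
        if (sum(Tr[s, a, ]) > 0) Tr[s, a, ] <- Tr[s, a, ] / sum(Tr[s, a, ])
      }
      for (s in 1:nS) if (sum(Ob[s, ]) > 0) Ob[s, ] <- Ob[s, ] / sum(Ob[s, ])
      list(T = Tr, O = Ob)                    # estimated POMDP model components
    }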

  19. Summary
  - Reinforcement Learning and how it fits into our MDP/POMDP framework
  - Exploration-Exploitation tradeoff
  - Q-Learning and its variant SARSA
  - Model-based/model-free learning approaches to POMDP
  - Dealing with noisy sensors
  - Offline learning

  20. References
  - Guy Shani, PhD thesis, Chapter 3
  - Reinforcement Learning: Tutorial, Satinder Singh
  - Reinforcement Learning: An Introduction, Sutton and Barto, http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
  - Technical Note: Q-Learning, Watkins and Dayan
  - Planning with Markov Decision Processes, Mausam and Kolobov
