Relating Reinforcement Learning Performance to Classification Performance Presenter: Hui Li Sept. 11, 2006
Outline • Motivation • Reduction from reinforcement learning to classifier learning • Results • Conclusion
Motivation A simple relationship between two goals: • The goal of reinforcement learning: • The goal of (binary) classifier learning:
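The formulas these two colons pointed to did not survive extraction; a reconstruction in standard notation (the symbols π, h, and D are my own labels, not from the slides):

```latex
% Reinforcement learning: choose a policy maximizing expected total reward
\max_{\pi}\; \mathbb{E}_{D,\pi}\!\left[\sum_{t} r_t\right]
% (Binary) classifier learning: choose a hypothesis minimizing expected error
\min_{h}\; \Pr_{(x,y)\sim D}\!\left[h(x) \neq y\right]
```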
Motivation Question: • The problem of classification has been intensively investigated • The problem of reinforcement learning is still under active investigation • Is it possible to reduce reinforcement learning to classifier learning?
Reduction Definition: • What is a reinforcement learning problem? A reinforcement learning problem D is defined as a conditional probability table D(o', r | (o,a,r)*, o, a) on a set of observations O and rewards r ∈ [0, ∞), given any history of past observations (o,a,r)*, actions (from an action set A), and rewards.
Reduction 2. What is the reinforcement learning goal? Given some horizon T, find a policy π maximizing the expected sum of rewards:
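The objective itself is missing from the extracted slide; a plausible reconstruction consistent with the definition of D above:

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{D,\pi}\!\left[\sum_{t=1}^{T} r_t\right]
```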
Reduction How to reduce a reinforcement learning problem to a cost-sensitive classification problem: • How to obtain training examples • How to obtain training labels • How to define the cost of misclassification
Reduction An illustration of a trajectory tree • M = {S, A, D, P_{s,a}} • Two actions {0, 1} • Non-stationary policy
Reduction The value of a single-step policy is estimated by an empirical average over the sampled trajectories, which can be written explicitly as a sum over realizations; the superscript i denotes the i-th realization.
Reduction [Figure: trajectory tree for the value of a single-step policy — sampled initial states s0^1, s0^2, …, s0^n at the roots, with branches for actions a = 0, 1, …, L-1 leading to next states s1|0, s1|1, …, s1|L-1; the i-th realization is one such branch.]
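The single-step value estimate described above can be sketched as a Monte Carlo average over the sampled tree roots. Everything here (the function names, the toy reward) is illustrative, not from the slides:

```python
# Sketch: estimating the value of a single-step policy from n sampled
# trajectory-tree roots. sample_reward(s, a) returns the i-th realization
# of the reward for taking action a at initial state s.
import random

def estimate_policy_value(states, sample_reward, policy):
    """Average, over the sampled initial states, of the reward obtained
    by following the policy's chosen action at each state."""
    total = 0.0
    for s0 in states:                    # s0^1, ..., s0^n
        a = policy(s0)                   # action chosen by the policy
        total += sample_reward(s0, a)    # realized reward for that branch
    return total / len(states)

# Toy usage: two actions {0, 1}; action 1 always pays 1, action 0 pays 0.
random.seed(0)
states = [random.random() for _ in range(100)]
value = estimate_policy_value(states,
                              lambda s, a: float(a),  # deterministic toy reward
                              lambda s: 1)            # policy: always pick 1
print(value)  # 1.0 for the always-pick-1 policy
```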
Reduction One-step reduction: a one-step reinforcement learning problem is converted into a cost-sensitive classifier learning problem
Reduction where • s0^i: the i-th sample (initial state) • the label: the action treated as correct for example i • w_i: the costs of classifying example i as each of the possible labels.
Reduction Properties of cost • The cost for misclassification is always positive • The cost for correct classification is zero • The larger the difference between the possible actions in terms of future reward, the larger the cost (or weight)
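Putting the pieces above together, one sampled state yields one weighted classification example. The helper below is a hypothetical sketch (the slides leave the reward table and data types abstract); it satisfies all three cost properties listed:

```python
# Sketch of the one-step reduction: each sampled state becomes a
# cost-sensitive classification example (state, label, cost vector).

def make_example(s0_i, rewards_per_action):
    """rewards_per_action[a] = observed reward for action a at state s0_i.
    Label = best action; cost of predicting a = best reward - reward(a),
    so the correct label costs 0, every error costs > 0, and the cost
    grows with the reward gap between actions."""
    best = max(rewards_per_action)
    label = rewards_per_action.index(best)
    costs = [best - r for r in rewards_per_action]
    return s0_i, label, costs

state, label, costs = make_example(0.3, [0.25, 1.0])
print(label, costs)  # 1 [0.75, 0.0]
```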
Reduction T-step MDP reduction: how to find good policies for a T-step MDP by solving a sequence of weighted classification problems. T-step policy π = (π0, π1, …, πT-1) • When updating πt, hold the rest constant • When updating πt, the trees are pruned from the root to stage t by keeping only the branch that agrees with the controls π0, π1, …, πt-1
Reduction The realization of the reward following the actions at stage t decomposes into the immediate reward plus the rewards accumulated along the branch that agrees with the controls πt+1, πt+2, …, πT-1
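The alternating update can be sketched for T = 2 and actions {0, 1}. Here each sampled "trajectory tree" stores, per action pair (a0, a1), the total reward along that branch, and the "classifier" degenerates to picking the best constant action per stage — a toy stand-in for a real cost-sensitive learner, not the method on the slides:

```python
# Toy sketch of the T-step reduction for T = 2: update one stage's policy
# at a time, holding the other stage fixed and pruning each tree along the
# branch that agrees with the fixed choices.

def improve(policies, trees, sweeps=2):
    for _ in range(sweeps):
        for t in (0, 1):
            # Summed reward of each candidate action at stage t.
            avg = [0.0, 0.0]
            for tree in trees:
                for a in (0, 1):
                    if t == 0:
                        a0, a1 = a, policies[1]   # later stage held fixed
                    else:
                        a0, a1 = policies[0], a   # prune by stage-0 choice
                    avg[a] += tree[(a0, a1)]
            # Degenerate "classifier": best constant action at this stage.
            policies[t] = max((0, 1), key=lambda a: avg[a])
    return policies

# Two sampled trees in which the branch (1, 0) pays best.
trees = [{(0, 0): 0.0, (0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.4},
         {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.5}]
print(improve([0, 0], trees))  # converges to [1, 0]
```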
Illustrative Example Two-step MDP problem: • Continuous state space S = [0, 1] • Binary action space A = {0, 1} • Uniform distribution over the initial state
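The example's dynamics and rewards did not survive extraction, so the snippet below is a stand-in two-step MDP using only the stated ingredients (S = [0, 1], A = {0, 1}, uniform initial state); the reward and transition functions are invented purely so the code runs:

```python
# Stand-in two-step MDP matching the stated state/action spaces.
import random

def rollout(policy, steps=2):
    s = random.random()                  # uniform initial state on [0, 1]
    total = 0.0
    for t in range(steps):
        a = policy[t](s)                 # binary action in {0, 1}
        total += 1.0 - abs(a - s)        # invented reward, in [0, 1]
        s = min(1.0, 0.5 * (s + a))      # invented transition, stays in [0, 1]
    return total

random.seed(1)
policy = [lambda s: int(s > 0.5)] * 2    # threshold policy at each stage
avg = sum(rollout(policy) for _ in range(1000)) / 1000
print(round(avg, 2))
```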
Illustrative Example Value function
Illustrative Example Path taken by the algorithm