
COSC 6342 Project 1 Spring 2014 Q-Learning for a Pickup Dropoff World



Presentation Transcript


  1. COSC 6342 Project 1, Spring 2014: Q-Learning for a Pickup Dropoff World
[Title slide figure: the 5x5 PD-World grid with pickup (P) and dropoff (D) cells marked]

  2. PD-World
Goal: Transport blocks from the pickup cells to the dropoff cells!
The world is a 5x5 grid of cells (1,1) through (5,5).
Pickup cells: (1,1), (3,3), (5,5)
Dropoff cells: (5,1), (5,3), (4,5)
Initial state: the agent is in cell (1,5) and the pickup cells contain 5 blocks each.
Terminal state: the dropoff cells contain 5 blocks each.

  3. PD-World Operators
There are six operators:
• North, South, East, and West are applicable in each state and move the agent to the adjacent cell in that direction, except that leaving the grid is not allowed.
• Pickup is only applicable if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block.
• Dropoff is only applicable if the agent is in a dropoff cell that contains fewer than 5 blocks and the agent carries a block.
Initial state of the PD-World: each pickup cell contains 5 blocks, the dropoff cells contain 0 blocks, and the agent always starts in position (1,5).
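To make the applicability rules concrete, here is a minimal Python sketch; the names applicable_operators, PICKUP_CELLS, DROPOFF_CELLS, and MOVES are illustrative and not part of the project handout, and it assumes rows are numbered 1 (top) to 5 (bottom).

```python
# Illustrative sketch of the six PD-World operators and their applicability rules.
PICKUP_CELLS = [(1, 1), (3, 3), (5, 5)]
DROPOFF_CELLS = [(5, 1), (5, 3), (4, 5)]
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def applicable_operators(pos, carrying, blocks):
    """Return the operators applicable for an agent at pos.

    pos      -- (row, col) of the agent, rows/columns numbered 1..5
    carrying -- True if the agent currently holds a block
    blocks   -- dict mapping each pickup/dropoff cell to its block count
    """
    ops = []
    for name, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if 1 <= r <= 5 and 1 <= c <= 5:   # leaving the grid is not allowed
            ops.append(name)
    if not carrying and pos in PICKUP_CELLS and blocks[pos] >= 1:
        ops.append("pickup")              # pickup cell with at least one block
    if carrying and pos in DROPOFF_CELLS and blocks[pos] < 5:
        ops.append("dropoff")             # dropoff cell with fewer than 5 blocks
    return ops
```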

  4. State Space of the PD-World
The actual state space of the PD-World is as follows: states have the form (i, j, x, a, b, c, d, e, f) with
• (i,j) is the position of the agent
• x is 1 if the agent carries a block and 0 if not
• (a,b,c,d,e,f) are the numbers of blocks in cells (1,1), (3,3), (5,5), (5,1), (5,3), and (4,5), respectively
• Initial state: (1,5,0,5,5,5,0,0,0)
• Terminal state: (*,*,0,0,0,0,5,5,5)
Remark: The actual reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single state of the reinforcement learning state space.
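A possible in-code representation of this nine-component state, written as a Python sketch under the assumption that the tuple ordering matches the slide; WorldState, INITIAL_STATE, and is_terminal are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    """Full PD-World state (i, j, x, a, b, c, d, e, f)."""
    i: int  # agent row
    j: int  # agent column
    x: int  # 1 if the agent carries a block, 0 otherwise
    a: int  # blocks in pickup cell (1,1)
    b: int  # blocks in pickup cell (3,3)
    c: int  # blocks in pickup cell (5,5)
    d: int  # blocks in dropoff cell (5,1)
    e: int  # blocks in dropoff cell (5,3)
    f: int  # blocks in dropoff cell (4,5)

# Agent starts in (1,5); every pickup cell holds 5 blocks, every dropoff cell 0.
INITIAL_STATE = WorldState(1, 5, 0, 5, 5, 5, 0, 0, 0)

def is_terminal(s: WorldState) -> bool:
    """Terminal when all dropoff cells are full (and no block is carried)."""
    return (s.d, s.e, s.f) == (5, 5, 5) and s.x == 0
```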

  5. Rewards in the PD-World
• Picking up a block from a pickup cell: +12
• Dropping off a block in a dropoff cell: +12
• Applying north, south, east, or west: -1
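The reward structure can be expressed in a few lines; this reward_of function is only an illustrative sketch of the scheme above, not code required by the project.

```python
def reward_of(operator: str) -> float:
    """Reward received when the given operator is applied, per the list above."""
    if operator in ("pickup", "dropoff"):
        return 12.0   # +12 for picking up or dropping off a block
    return -1.0       # -1 for every north/south/east/west move
```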

  6. Project 1 Policies
• PRandom: If pickup or dropoff is applicable, choose that operator; otherwise, choose an applicable operator randomly.
• PExploit1: If pickup or dropoff is applicable, choose that operator; otherwise, with probability 0.6 apply the applicable operator with the highest q-value (break ties by rolling a die among operators with the same utility), and with probability 0.4 choose an applicable operator randomly.
• PExploit2: If pickup or dropoff is applicable, choose that operator; otherwise, with probability 0.85 apply the applicable operator with the highest q-value (break ties by rolling a die), and with probability 0.15 choose an applicable operator randomly.
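All three policies share the same structure and differ only in the probability of acting greedily, so they can be captured by one routine. The sketch below assumes a Q-table stored as a dictionary keyed by (state, operator); choose_operator and p_greedy are illustrative names, not part of the handout.

```python
import random

def choose_operator(state, applicable, q_table, p_greedy):
    """Select an operator following the PRandom/PExploit pattern.

    p_greedy -- probability of the greedy choice:
                0.0 for PRandom, 0.6 for PExploit1, 0.85 for PExploit2
    """
    # Pickup/dropoff always take priority when applicable.
    for op in ("pickup", "dropoff"):
        if op in applicable:
            return op
    if random.random() < p_greedy:
        # Greedy choice: highest q-value; ties broken uniformly at random.
        best = max(q_table.get((state, op), 0.0) for op in applicable)
        ties = [op for op in applicable if q_table.get((state, op), 0.0) == best]
        return random.choice(ties)
    return random.choice(applicable)
```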

  7. Performance Measures
• Bank account of the agent
• Rewards received divided by the number of operators applied, either over the whole time window or over the last 40 operator applications
• Blocks delivered divided by the number of operators applied, either over the whole time window or over the last 40 operator applications
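One way to compute these measures incrementally is sketched below, assuming the bank account is simply the cumulative reward; PerformanceTracker and its methods are hypothetical names.

```python
from collections import deque

class PerformanceTracker:
    """Tracks the bank account plus reward/delivery rates, overall and over a recent window."""

    def __init__(self, window=40):
        self.bank_account = 0.0              # cumulative reward received
        self.blocks_delivered = 0
        self.operators_applied = 0
        self.recent = deque(maxlen=window)   # (reward, delivered) per operator application

    def record(self, reward, delivered):
        """Call once per operator application; delivered is True when a dropoff succeeded."""
        self.bank_account += reward
        self.blocks_delivered += int(delivered)
        self.operators_applied += 1
        self.recent.append((reward, int(delivered)))

    def reward_rate(self, last_window=False):
        """Rewards received per operator applied."""
        if last_window:
            return sum(r for r, _ in self.recent) / max(len(self.recent), 1)
        return self.bank_account / max(self.operators_applied, 1)

    def delivery_rate(self, last_window=False):
        """Blocks delivered per operator applied."""
        if last_window:
            return sum(d for _, d in self.recent) / max(len(self.recent), 1)
        return self.blocks_delivered / max(self.operators_applied, 1)
```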

  8. Reinforcement Learning Search Space 1
Reinforcement learning states have the form (i,j,x,s,t,u) where
• (i,j) is the position of the agent
• x is 1 if the agent carries a block; otherwise, 0
• s, t, u are Boolean variables whose meaning depends on whether the agent carries a block:
• Case 1: x=0 (the agent does not carry a block): s is 1 if cell (1,1) contains at least one block; t is 1 if cell (3,3) contains at least one block; u is 1 if cell (5,5) contains at least one block
• Case 2: x=1 (the agent carries a block): s is 1 if cell (5,1) contains less than 5 blocks; t is 1 if cell (5,3) contains less than 5 blocks; u is 1 if cell (4,5) contains less than 5 blocks
There are 400 states in total in reinforcement learning state space 1 (25 positions × 2 values of x × 8 combinations of s, t, u).
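The case distinction above amounts to a small projection from the full state (i,j,x,a,b,c,d,e,f) of slide 4 onto the reduced state; rl_state1 below is an illustrative sketch of that mapping, not code prescribed by the project.

```python
def rl_state1(i, j, x, a, b, c, d, e, f):
    """Project the full state (i,j,x,a,b,c,d,e,f) onto the reduced (i,j,x,s,t,u) state."""
    if x == 0:
        # Agent is empty-handed: flags mark pickup cells that still hold blocks.
        s, t, u = int(a >= 1), int(b >= 1), int(c >= 1)
    else:
        # Agent carries a block: flags mark dropoff cells that can still accept one.
        s, t, u = int(d < 5), int(e < 5), int(f < 5)
    return (i, j, x, s, t, u)
```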

  9. Alternative Reinforcement Learning Search Space 2
In this approach reinforcement learning states have the form (i,j,x) where
• (i,j) is the position of the agent
• x is 1 if the agent carries a block; otherwise, 0.
That is, RL state space 2 contains only 50 states.
Discussion:
• The problem with state space 2 is that the algorithm initially learns paths between pickup cells and dropoff cells, but the q-values decrease as soon as a pickup cell runs out of blocks or a dropoff cell becomes full and cannot receive any further blocks, since it is no longer attractive to visit these cells. Therefore, these paths have to be relearned when an agent is restarted to solve the same problem again using the final Q-table of the previous run. This problem does not exist when RL state space 1 is used.
• On the other hand, when using the recommended search space, if one of the variables s, t, u switches from 1 to 0, all paths need to be relearned.

  10. Analysis of Attractive Paths
• …
• …
Demo: http://www2.hawaii.edu/~chenx/ics699rl/grid/gridworld.html

  11. TD Q-Learning for the PD-World
Remark: This is the Q-learning approach you must use in Project 1!
Goal: Measure the utility of using action a in state s, denoted by Q(a,s). The following update formula is used every time the agent reaches state s' from s using action a:
Q(a,s) ← (1-α)·Q(a,s) + α·[R(s',a,s) + γ·max_a' Q(a',s')]
• α is the learning rate; γ is the discount factor
• R(s',a,s) is the reward of reaching s' from s by applying a; e.g., -1 if moving, +12 if picking up or dropping off blocks in the PD-World.
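A minimal sketch of this TD update with a dictionary-based Q-table keyed by (state, operator); q_update and its default alpha and gamma values are illustrative placeholders, not the parameter settings required by the project.

```python
def q_update(q_table, s, a, reward, s_next, applicable_next, alpha=0.3, gamma=0.5):
    """TD Q-learning update:
    Q(a,s) <- (1-alpha)*Q(a,s) + alpha*(R + gamma * max over a' of Q(a',s'))."""
    old = q_table.get((s, a), 0.0)
    best_next = max((q_table.get((s_next, a2), 0.0) for a2 in applicable_next),
                    default=0.0)
    q_table[(s, a)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
```

One learning step then consists of choosing an applicable operator with the current policy, applying it, observing the reward and resulting state, and calling q_update with that transition.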
