W-Learning: Competition Among Selfish Q-Learners

Presented by Alp Sardağ.

Autonomous Mobile Robots
  • Behaviour-based AI: emphasizes intelligence as emerging from ongoing interaction with the world.
    • Subsumption Architecture: proposed by Brooks.
Ideas of Subsumption Architecture
  • Default behaviour: the ‘avoid all things’ layer (layer 1) takes control of the robot whenever the ‘look for food’ layer (layer 2) is idle.
  • Multiple parallel goals: which one should be given control?
The Action Selection Problem
  • Brooks gives the modules full sensing and acting powers, but action selection is the job of the programmer.
  • In W-learning, the modules compete for control. In this kind of robot, action selection is not designed but learnt.
Competition Among Selfish Agents
  • Make the layers peers.
  • Layers compete for control.
Definition and Terms
  • The agents $A_1,\dots,A_n$ in the collection are characterized by:
    • Selfish agents
    • No cooperation
    • No knowledge of others
  • Each agent $A_i$ suggests an action $a_i(x)$, where $x$ is the world state.
  • The robot chooses one of these actions, $a_k(x)$, and executes it.
How the Robot Works
  • Some way of resolving the competition is needed.
  • The idea: an agent always has an action to suggest, but it cares more at some times than at others.
  • Example agents: ‘avoid the predator’ and ‘wander around looking for food’.

How to Resolve the Competition
  • Each agent suggests some action $a_i(x)$ with weight $W_i(x)$; the robot executes action $a_k(x)$, where:

$$W_k(x) = \max_{i \in \{1,\dots,n\}} W_i(x)$$

  • $A_k$ is the leader of the competition for state $x$.
  • No agent is explicitly aware of the existence of any other.
  • An agent can still ‘use’ another agent by ceding control to it.
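The winner-take-all rule can be sketched in a few lines of Python. This is only an illustration: the action names and the weight values are hypothetical, and ties are broken in favour of the lowest-numbered agent, which the slides do not specify.

```python
def select_leader(weights):
    """Return the index k of the agent whose W-value is highest in the current state."""
    # max() returns the first maximal index, so ties go to the lowest-numbered agent.
    return max(range(len(weights)), key=lambda i: weights[i])


# Hypothetical suggestions a_i(x) and weights W_i(x) for one state x.
suggested_actions = ["flee", "forage", "groom"]
weights = [0.2, 1.5, 0.7]

k = select_leader(weights)
executed_action = suggested_actions[k]  # only the leader's action is executed
```

No agent inspects any other agent's suggestion here; the robot only compares the weights.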
W-Values as Action-Selection
  • As opposed to schemes in which agents share information and agree on a compromise action, this is a winner-take-all action-selection scheme.
  • The division of control is state-based rather than time-based.
    • Blumberg points out that animals sometimes appear to engage in a form of time-sharing.
    • The same effect can be achieved by a suitable state representation $x$.
  • Let $x=(e,i)$ be the state.
  • $e$: information from external sensors.
  • $i=(f,c)$: information from internal sensors.
  • $f$: very hungry (2), hungry (1), not hungry (0)
  • $c$: very dirty (2), dirty (1), clean (0)
  • The weights (subscript $f$ for the agent concerned with hunger, $c$ for the agent concerned with dirt) may be ordered as:

$$W_f(e,(2,c)) > W_f(e,(1,c)) > W_c(e,(f,2)) > W_f(e,(0,c)) > W_c(e,(f,1)) > W_c(e,(f,0))$$

Engage Opportunistic Behaviour
  • Hungry and thirsty animal example: food is found only in the north, water only in the south. The animal treks north and eats, and as soon as its hunger is only partially satisfied, its thirst is the stronger drive. Even before it gets south, it will be starving again.
    • 1st solution: time-based agents, which get control for some minimum amount of time.
    • 2nd solution: the agents can tell the difference between immediate and distant likely payoff, and present W-values accordingly.
  • Assigning W-values to actions:
    • Previous work: treated as a design problem.
    • Here: using learning methods that automatically assign values to actions.
Reinforcement Learning
  • By trial-and-error, the agent learns to take the actions which maximise its rewards.
Q-Learning

[Figure: (a) Simple stochastic environment. (b) $M_{ij}$ is provided in PL, $M^a_{ij}$ is provided in AL.]

NOTE: Transitions are probabilistic. $P_{xa}(y)$ is the probability that doing action $a$ in state $x$ will lead to state $y$.

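Q-Learning (continued)
  • The agent is interested not in immediate rewards, but in the total discounted reward:

$$R = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \qquad \text{where } 0 \le \gamma < 1$$

  • The expected total discounted reward:

$$\begin{aligned}
V(x_t) = E(R) &= E(r_t) + \gamma E(r_{t+1}) + \gamma^2 E(r_{t+2}) + \dots \\
&= E(r_t) + \gamma \left[ E(r_{t+1}) + \gamma E(r_{t+2}) + \dots \right] \\
&= E(r_t) + \gamma V(x_{t+1}) \\
&= \sum_r r\, P_{xa}(r) + \gamma \sum_y V(y)\, P_{xa}(y)
\end{aligned}$$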
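As a small numerical illustration of the total discounted reward, the sketch below sums a finite reward sequence backwards, using the same recursion as the derivation (the value now is the immediate reward plus the discounted value of what follows). The reward list and the discount value are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Work backwards: value_now = reward_now + gamma * value_of_the_rest.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: three rewards of 1 with discount 0.5.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```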

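Q-Learning (continued)
  • In the learning phase, the agent tries to build up Q-values for each pair $(x,a)$.
    • Temporal difference learning:

$$Q(x,a) \leftarrow Q(x,a) + \alpha \left( r + \gamma \max_b Q(y,b) - Q(x,a) \right)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount rate.

  • Convergence of Q-learning:

$$Q(x,a) \leftarrow (1-\alpha)\,Q(x,a) + \alpha \left( r + \gamma \max_b Q(y,b) \right)$$

where $\alpha$ takes decreasing successive values $\alpha_1, \alpha_2, \dots$

Let $n(x,a) = 1, 2, \dots$ be the number of times the pair $(x,a)$ has been visited.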
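A minimal sketch of the Q-value update in its weighted-average form, with Q stored in a dictionary keyed by (state, action). The state and action names are hypothetical, and the learning rate is passed in by the caller; any particular schedule tied to the visit count $n(x,a)$ would be an additional assumption.

```python
from collections import defaultdict


def q_update(Q, x, a, r, y, actions, alpha, gamma):
    """Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (r + gamma * max_b Q(y,b))."""
    target = r + gamma * max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target
    return Q[(x, a)]


# Hypothetical two-action problem; all Q-values start at 0.
Q = defaultdict(float)
actions = ["left", "right"]
first = q_update(Q, "x0", "right", r=1.0, y="x1", actions=actions, alpha=0.5, gamma=0.9)
```

The same function covers the temporal-difference form, since the two update rules are algebraically identical.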



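Q-Learning (continued)
  • The optimal policy:

$\pi^*(x) = a^*(x)$, where $a^*(x)$ is an action that maximizes $Q^*(x,b)$ over all actions $b$.

  • Exploration problem:
  • A new approach to the exploration problem:

$$U(i) \leftarrow R(i) + \max_a F\Big( \sum_j M^a_{ij}\, U(j),\; N(a,i) \Big)$$

where

$$F(u,n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}$$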
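The exploration update can be sketched as follows, assuming the usual reading of the exploration function: F returns an optimistic reward estimate R+ until an action has been tried in a state at least N_e times, and the ordinary utility estimate u afterwards. The parameter names `r_plus` and `n_e`, the data layout and the numbers are all illustrative assumptions.

```python
def exploration_value(u, n, r_plus, n_e):
    """F(u, n): be optimistic about pairs tried fewer than n_e times."""
    return r_plus if n < n_e else u


def explore_update(i, R, M, U, N, actions, r_plus, n_e):
    """U(i) <- R(i) + max_a F(sum_j M[a][i][j] * U[j], N[(a, i)])."""
    best = max(
        exploration_value(
            sum(M[a][i][j] * U[j] for j in range(len(U))),  # expected utility of a in i
            N[(a, i)],                                       # times a was tried in i
            r_plus,
            n_e,
        )
        for a in actions
    )
    return R[i] + best


# Hypothetical 2-state model: action "a" leads from state 0 to 1, "b" stays in 0.
M = {"a": [[0.0, 1.0], [0.0, 1.0]], "b": [[1.0, 0.0], [0.0, 1.0]]}
U = [0.0, 2.0]
R = [0.5, 0.0]
N = {("a", 0): 5, ("b", 0): 0}  # "b" has never been tried in state 0
u0 = explore_update(0, R, M, U, N, ["a", "b"], r_plus=10.0, n_e=3)
```

Because "b" is still under-explored in state 0, its optimistic value dominates even though the model says "a" is better.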
Multi-Module RL
  • Most work in RL has focused on single agents. In theory, any problem can be seen as just another input-output mapping to be learnt by a single agent. In practice, the scalability problem leads to combining simple agents to solve a complex task. Some approaches are:
    • Top-down: identify the task and decompose it into subtasks. Moore decomposes by hand; Tham learns the decomposition, where subtasks combine sequentially to solve the main task.
    • Bottom-up: study the behaviour that emerges when multiple RL agents are combined in different ways. Tan studies the benefits of cooperation among agents, like ants.
Selfish Q-learners
  • Each agent is a Q-learning agent, with its own reward function and Q-values.
  • Co-operation is involuntary and emerges from competition among agents.
  • Let the agents be $A_1,\dots,A_n$.

The robot works as follows:

observe x

for (all agents):
    get suggested action ai with strength Wi(x)

find k such that Wk(x) = max Wi(x)

execute ak

observe y

for (all agents):
    get reward ri
    update Q and/or W

W-values
  • For updating W, use the numerical Q-values.
    • Static W-values: the agent promotes its action with the same strength no matter what its competition is.


    • W = importance: W could be the difference between the Q-value of the suggested action and that of the worst possible action: $W_i(x) = Q_i(x,a_i) - \min_b Q_i(x,b)$
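A minimal sketch of the "W = importance" idea for a single agent in a single state. It assumes that the suggested action is the agent's greedy (highest-Q) action; the action names and Q-values are made up.

```python
def importance_w(q_values):
    """Return (suggested action, W) where W = Q(suggested) - Q(worst action).

    q_values maps each action to this agent's Q-value in the current state.
    """
    suggested = max(q_values, key=q_values.get)  # assumption: greedy suggestion
    worst = min(q_values.values())
    return suggested, q_values[suggested] - worst


# Hypothetical Q-values of one agent in one state.
action, w = importance_w({"north": 4.0, "south": 1.0, "stay": 2.5})
```

A state where all actions look alike to the agent gets W near 0, so the agent barely competes for it.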


Dynamic (learnt) W-values
  • The preceding W-value schemes fail to take into account what the other agents are doing.
    • Examples:
      • The suggested actions may be the same.
      • Another agent might be suggesting an action which would be disastrous for this agent.
  • There are two types of state that an agent $A_i$ need not compete for:
    • A state which is relatively unimportant to it.
    • A state which is important, but in which some agent $A_k$ is already suggesting an action which is good for $A_i$.
Meaning of W-values
  • W = (P − A): the difference between P, the predicted reward (what is predicted if we are listened to), and A, the actual reward (what actually happened).
    • An agent does not need explicit knowledge about who it is competing with. It becomes aware of the others when they stop its action from being obeyed, and it observes the y and r that result.
    • The agents set their own W-values in an incremental way using Q-values.
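W-learning
  • Q-learning process:

$$P := (1-\alpha_Q)\,P + \alpha_Q\,A$$

  • W-learning process:

$$W := (1-\alpha_W)\,W + \alpha_W\,(P - A)$$

  • For updating Q-values:

$$Q_i(x,a_k) := (1-\alpha_Q)\,Q_i(x,a_k) + \alpha_Q \left( r_i + \gamma \max_b Q_i(y,b) \right)$$

  • For updating W-values:

$$W_i(x) := (1-\alpha_W)\,W_i(x) + \alpha_W \left( Q_i(x,a_i) - \left( r_i + \gamma \max_b Q_i(y,b) \right) \right)$$

NOTE: only the W-values of agents that were not obeyed are updated.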
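The W-value update for one agent that was not obeyed can be written as a small function. P is the agent's Q-value for the action it suggested and A is what it actually experienced; the argument names and the numbers in the example are illustrative.

```python
def w_update(W_i, x, q_suggested, r_i, best_next_q, alpha_w, gamma):
    """W_i(x) := (1 - alpha_w) * W_i(x) + alpha_w * (P - A).

    q_suggested -- P = Q_i(x, a_i), the value the agent predicted for its own action
    best_next_q -- max_b Q_i(y, b) in the state y that was actually reached
    """
    predicted = q_suggested
    actual = r_i + gamma * best_next_q
    W_i[x] = (1 - alpha_w) * W_i.get(x, 0.0) + alpha_w * (predicted - actual)
    return W_i[x]


# Hypothetical numbers: the agent expected 5 but experienced 1 + 0.5 * 2 = 2.
W = {}
w1 = w_update(W, "x", q_suggested=5.0, r_i=1.0, best_next_q=2.0, alpha_w=0.5, gamma=0.5)
w2 = w_update(W, "x", q_suggested=5.0, r_i=1.0, best_next_q=2.0, alpha_w=0.5, gamma=0.5)
```

Repeated losses of the same size pull W towards P − A (here 3), so an agent that keeps suffering in a state competes harder for it.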

W-learning pseudo-code

State x := observe();

For ( all i )
    a[i] := A[i].suggestAction(x);

Find k such that Wk(x) = max Wi(x);

Execute ( a[k] );

State y := observe();

For ( all i )
{
    r[i] := A[i].reward(x,y);
    A[i].updateQ ( x , a[k] , y , r[i] );
    if (i != k)
        A[i].updateW ( x , a[k] , y , r[i] );
}
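The pseudo-code can be turned into a runnable Python sketch. Everything beyond the pseudo-code is an assumption made for illustration: the `Agent` class layout, greedy action suggestion with no exploration, the fixed learning rates, and the toy one-dimensional world with food at one end and water at the other. The W update is given the agent's own suggestion a[i], since the update formula uses Q_i(x, a_i).

```python
from collections import defaultdict


class Agent:
    """One selfish Q-learner with its own reward function, Q-values and W-values."""

    def __init__(self, actions, reward_fn, alpha_q=0.2, alpha_w=0.2, gamma=0.9):
        self.actions = actions
        self.reward_fn = reward_fn
        self.alpha_q, self.alpha_w, self.gamma = alpha_q, alpha_w, gamma
        self.Q = defaultdict(float)  # Q[(state, action)]
        self.W = defaultdict(float)  # W[state]

    def suggest_action(self, x):
        # Greedy suggestion; ties go to the first action listed.
        return max(self.actions, key=lambda a: self.Q[(x, a)])

    def reward(self, x, y):
        return self.reward_fn(x, y)

    def _actual(self, y, r):
        # A = r_i + gamma * max_b Q_i(y, b)
        return r + self.gamma * max(self.Q[(y, b)] for b in self.actions)

    def update_q(self, x, a_k, y, r):
        actual = self._actual(y, r)
        self.Q[(x, a_k)] = (1 - self.alpha_q) * self.Q[(x, a_k)] + self.alpha_q * actual

    def update_w(self, x, a_i, y, r):
        predicted = self.Q[(x, a_i)]  # P = Q_i(x, a_i)
        loss = predicted - self._actual(y, r)
        self.W[x] = (1 - self.alpha_w) * self.W[x] + self.alpha_w * loss


def w_learning_step(agents, x, step_world):
    """One pass of the loop: suggest, pick the leader, act, then update."""
    a = [agent.suggest_action(x) for agent in agents]
    k = max(range(len(agents)), key=lambda i: agents[i].W[x])  # leader
    y = step_world(x, a[k])
    for i, agent in enumerate(agents):
        r = agent.reward(x, y)
        agent.update_q(x, a[k], y, r)
        if i != k:  # only agents that were not obeyed update W
            agent.update_w(x, a[i], y, r)
    return y, k


# Hypothetical world: positions 0..4 on a line, food at 4, water at 0.
def step_world(x, action):
    return max(0, min(4, x + action))


agents = [
    Agent([-1, +1], lambda x, y: 1.0 if y == 4 else 0.0),  # food seeker
    Agent([-1, +1], lambda x, y: 1.0 if y == 0 else 0.0),  # water seeker
]
y, k = w_learning_step(agents, 1, step_world)
```

In this first step both agents happen to suggest the same move, so the water seeker is rewarded without being the leader and its W-value for the state drops below zero: it has no reason to compete there.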


Learning Q (somewhat) Before Learning W
  • Ideally: ‘Learn Q first, then W’.
  • It is impossible to learn Q completely in finite time.
  • Alternatively, learn W while Q is still being learnt:

$$W_i(x) := (1-\alpha_W)\,W_i(x) + \alpha_W\,(1-\alpha_Q)^T \left( Q_i(x,a_i) - \left( r_i + \gamma \max_b Q_i(y,b) \right) \right)$$

where $T > 0$ is the delaying rate.
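The effect of the delaying factor can be checked numerically. The sketch below applies the update to plain numbers; the argument names and the values are illustrative only.

```python
def delayed_w_update(w, predicted, actual, alpha_w, alpha_q, T):
    """W := (1 - alpha_w) * W + alpha_w * (1 - alpha_q)**T * (P - A)."""
    delay = (1 - alpha_q) ** T  # 0 when alpha_q = 1, 1 when alpha_q = 0
    return (1 - alpha_w) * w + alpha_w * delay * (predicted - actual)


# Hypothetical values: W = 2, P = 5, A = 2, alpha_w = 0.5, T = 2.
early = delayed_w_update(2.0, 5.0, 2.0, alpha_w=0.5, alpha_q=1.0, T=2)   # Q barely learnt
middle = delayed_w_update(2.0, 5.0, 2.0, alpha_w=0.5, alpha_q=0.5, T=2)
late = delayed_w_update(2.0, 5.0, 2.0, alpha_w=0.5, alpha_q=0.0, T=2)    # Q settled
```

While the Q learning rate is still high, the observed difference P − A contributes little to W; as the Q learning rate falls towards 0, the update approaches the undelayed W-learning rule. A larger T keeps the contribution small for longer.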

After Q Has Been Learnt
  • Imagine a dynamically changing collection, with agents being continually created and destroyed over time, and the surviving agents adjusting their W-values as the nature of their competition changes. Q is learnt once, whereas W is relearnt again and again.
  • Edelman’s biological theory of Neural Darwinism.
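Self-modifying W-values
  • The update of W for $A_i$ when $a_k$ is the chosen action:

$$W_i(x) := (1-\alpha_W)\,W_i(x) + \alpha_W\, d_{ki}(x)$$

where $d_{ki}(x)$ is the difference between P and A experienced by $A_i$ when $A_k$ is obeyed.

  • If $A_k$ leads from the start to infinity:

$$W_i(x) \to E(d_{ki}(x))$$

This is why we do not update $W_k(x)$: since $E(d_{kk}(x))=0$, the leader's own W-value would be driven to 0.

  • Benefit:
    • The W-learning algorithm can handle any number of switches of leader.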
Will competition ever be resolved?
  • What we need to show is that the leader will not keep changing forever.
Convergence of W-learning

This process will terminate within $n^2$ steps, resolving the competition with a winner $A_k$ such that:

$$W_k(x) \ge E(d_{ki}(x)) \quad \forall i,\ i \neq k$$

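Remark 1: More than one possible winner

Expected differences $E(d_{ki}(x))$, with the row giving the agent $i$ that is not obeyed and the column giving the leader $k$ (the reading consistent with the outcomes stated here):

$$\begin{pmatrix} 0 & 3 & 0 \\ 0 & 0 & 9 \\ 0 & 0 & 0 \end{pmatrix}$$

Start with all $W_i(x)=0$. Choose $A_2$’s action: $W_1(x)$ rises towards 3 while $W_3(x)$ stays at 0.

Now $A_1$ is the leader.

Start with all $W_i(x)=0$. Choose $A_3$’s action: $W_2(x)$ rises towards 9 while $W_1(x)$ stays at 0.

Now $A_2$ is the leader.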
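The two outcomes can be reproduced with a small simulation of the 3×3 example. This is a sketch under stated assumptions: the matrix entry in row i, column k is read as the expected difference suffered by agent i when agent k leads, expected values are used in place of sampled differences, the W learning rate is fixed at 0.5, and the leader changes only when another W-value strictly exceeds its own.

```python
def run_competition(d, start_leader, alpha_w=0.5, steps=50):
    """Simulate W-learning in one state; d[i][k] is agent i's loss when k leads."""
    n = len(d)
    w = [0.0] * n
    leader = start_leader
    for _ in range(steps):
        for i in range(n):
            if i != leader:  # the leader's own W is never updated
                w[i] = (1 - alpha_w) * w[i] + alpha_w * d[i][leader]
        best = max(range(n), key=lambda i: w[i])
        if w[best] > w[leader]:
            leader = best  # a strictly higher W takes over
    return leader, w


# The matrix from the remark (agents A1, A2, A3 are indices 0, 1, 2).
D = [
    [0, 3, 0],
    [0, 0, 9],
    [0, 0, 0],
]
leader_after_a2, _ = run_competition(D, start_leader=1)  # start by obeying A2
leader_after_a3, _ = run_competition(D, start_leader=2)  # start by obeying A3
```

Starting from A2 the competition settles on A1, and starting from A3 it settles on A2, so the eventual winner depends on who happened to lead first.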

Remark 2: Should we score the winner’s W?
  • $W_k(x) \to E(d_{kk}(x)) = 0$

If the leader’s W were also updated, it would converge to 0. Hence there would be back-and-forth competition forever under any such system.

Remark 3: Scaling, peers and unequal agents
  • An agent with high rewards will end up with high W-values.
  • The agents are peers because they compete on the same basis.
  • Not all concerns may be of equal importance.