W-Learning: Competition Among Selfish Q-Learners



Presented by Alp Sardağ


Autonomous Mobile Robots:

  • Behaviour Based AI: emphasizing intelligence as emerging from ongoing interaction with the world.

    • Subsumption Architecture: By Brooks


Ideas of Subsumption Architecture

  • Default behaviour: the ‘avoid all things’ layer (layer 1) takes control of the robot whenever the ‘look for food’ layer (layer 2) is idle.

  • Multiple parallel goals: which one should be given control?


The Action Selection Problem

  • Brooks gives the modules full sensing and acting powers, but action selection remains the job of the programmer.

  • In W-learning the modules compete for control: in this kind of robot, action selection is not designed but learnt.


Competition Among Selfish Agents

  • Make the layers peers.

  • Let the layers compete for control.


Definition and Terms

  • The agents $A_1, \dots, A_n$ are:

    • Selfish agents

    • No cooperation

    • No knowledge of others

  • Each agent $A_i$ suggests an action $a_i(x)$, where $x$ is the world state.

  • The robot chooses one of these actions, $a_k(x)$, and executes it.


How the Robot Works

  • The robot needs some way of resolving the competition.

  • The idea: an agent always has an action to suggest, but it cares more about being obeyed at some times than at others.

    Example:

    ‘avoid the predator’ and ‘wander around looking for food’


How to Resolve the Competition

  • Each agent suggests an action $a_i(x)$ with weight $W_i(x)$. The robot executes the action $a_k(x)$ for which

    $W_k(x) = \max_i W_i(x), \quad i \in \{1, 2, \dots, n\}$

  • $A_k$ is the leader of the competition for state $x$.
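The winner-take-all rule above can be sketched in a few lines of Python. This is only an illustration: the function name `select_leader`, the action labels and the numeric W-values are assumptions made up for the example, not taken from the slides.

```python
def select_leader(weights):
    """Return the index k of the agent whose W-value is highest in the current state.

    Ties are broken in favour of the lowest index (an arbitrary choice for this sketch).
    """
    return max(range(len(weights)), key=lambda i: weights[i])


# Hypothetical suggestions a_i(x) and weights W_i(x) for one state x.
suggested_actions = ["flee", "eat", "wander"]
weights = [0.9, 0.4, 0.1]

k = select_leader(weights)
executed_action = suggested_actions[k]   # the robot executes a_k(x)
```

Only the leader's action is executed; the other suggestions are discarded for this state.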


Example

  • No agent is explicitly aware of the existence of any other.

  • An agent can still ‘use’ another agent by ceding control to it.


Example Cont.


W-Values as Action Selection

  • Unlike schemes in which agents share information and agree on a compromise action, this is a winner-take-all action-selection scheme.

  • The division of control is state-based rather than time-based.

    • Blumberg points out that animals sometimes appear to engage in a form of time-sharing.

    • The same effect can be achieved with a suitable state representation $x$.


Example

  • Let $x = (e, i)$ be the state.

  • $e$: information from external sensors.

  • $i = (f, c)$: information from internal sensors.

  • $f$: very hungry (2), hungry (1), not hungry (0)

  • $c$: very dirty (2), dirty (1), clean (0)

  • The weights may be ordered as:

    $W_f((e,(2,c))) > W_f((e,(1,c))) > W_c((e,(f,2))) > W_f((e,(0,c))) > W_c((e,(f,1))) > W_c((e,(f,0)))$


Engage Opportunistic Behaviour

  • Hungry and thirsty animal example: food is found only in the north and water only in the south. The animal treks north and eats, but as soon as its hunger is only partially satisfied, its thirst becomes the stronger drive. Even before it gets south, it will be starving again.

    • 1st solution: time-based agents, which keep control for some minimum amount of time.

    • 2nd solution: agents that can tell the difference between immediate and distant likely payoff, and present W-values accordingly.

  • Assigning W-values to actions:

    • Previous work: treated as a design problem.

    • Here: learning methods that automatically assign values to actions.


Reinforcement Learning

  • By trial-and-error, the agent learns to take the actions which maximise its rewards.


Q-learning

[Figure: (a) Simple stochastic environment. (b) $M_{ij}$ is provided in PL, $M^a_{ij}$ is provided in AL.]

NOTE: Transitions are probabilistic. $P_{xa}(y)$ is the probability that doing $a$ in $x$ will lead to state $y$.


Q-learning

  • The agent is interested not in immediate rewards, but in the total discounted reward:

    $R = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$, where $0 \le \gamma < 1$

  • The expected total discounted reward:

    $V(x_t) = E(R) = E(r_t) + \gamma E(r_{t+1}) + \gamma^2 E(r_{t+2}) + \dots$

    $= E(r_t) + \gamma [E(r_{t+1}) + \gamma E(r_{t+2}) + \dots]$

    $= E(r_t) + \gamma V(x_{t+1})$

    $= \sum_r r P_{xa}(r) + \gamma \sum_y V(y) P_{xa}(y)$
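As a small illustration of the total discounted reward, the sketch below sums a finite reward sequence using the same recursive structure as the derivation (current reward plus the discount rate times the value of the remainder). The function name and the sample rewards are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies: value = reward + gamma * (value of the rest).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with discount rate 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount rate below 1, later rewards contribute less and less, which is what keeps the infinite sum finite.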


Q-learning

  • In the learning phase the agent tries to build up Q-values for each pair $(x, a)$.

    • Temporal difference learning:

      $Q(x,a) \leftarrow Q(x,a) + \alpha (r + \gamma \max_b Q(y,b) - Q(x,a))$

      where $\alpha$ is the learning rate and $\gamma$ is the discount rate.

  • Convergence of Q-learning:

    $Q(x,a) \leftarrow (1-\alpha) Q(x,a) + \alpha (r + \gamma \max_b Q(y,b))$

    where $\alpha$ takes decreasing successive values $\alpha_1, \alpha_2, \dots$

    Let $n(x,a) = 1, 2, \dots$ be the number of times $(x,a)$ has been visited. Then

    $\alpha(x,a) = \frac{1}{n(x,a)} = 1, \frac{1}{2}, \frac{1}{3}, \dots$
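A minimal tabular sketch of this update, using the learning rate 1/n(x,a), might look as follows. The class name `QLearner`, the dictionary-based table and the example states and actions are assumptions made for illustration only.

```python
from collections import defaultdict


class QLearner:
    """Tabular Q-learner whose learning rate for (x, a) is 1 / n(x, a)."""

    def __init__(self, actions, gamma=0.9):
        self.actions = list(actions)
        self.gamma = gamma                 # discount rate
        self.q = defaultdict(float)        # Q(x, a), initialised to 0
        self.visits = defaultdict(int)     # n(x, a)

    def update(self, x, a, r, y):
        """Apply Q(x,a) <- (1 - alpha) Q(x,a) + alpha (r + gamma max_b Q(y,b))."""
        self.visits[(x, a)] += 1
        alpha = 1.0 / self.visits[(x, a)]
        target = r + self.gamma * max(self.q[(y, b)] for b in self.actions)
        self.q[(x, a)] = (1 - alpha) * self.q[(x, a)] + alpha * target
        return self.q[(x, a)]


learner = QLearner(actions=["left", "right"], gamma=0.5)
first = learner.update("s", "left", r=1.0, y="t")    # alpha = 1
second = learner.update("s", "left", r=3.0, y="t")   # alpha = 1/2
```

On the first visit alpha is 1, so the old estimate is overwritten; afterwards each new sample is averaged in with a shrinking weight.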


Q-learning

  • The optimal policy:

    $\pi^*(x) = a^*(x)$, where $Q(x, a^*(x)) = \max_a Q(x,a)$

  • Exploration problem:

  • A new approach to the exploration problem:

    $U(i) \leftarrow R(i) + \max_a F\left(\sum_j M^a_{ij} U(j), N(a,i)\right)$, where

    $F(u,n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}$
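The exploration function F can be sketched directly. This assumes the usual reading of the symbols, which the slide does not spell out: R+ is an optimistic estimate of the best reward obtainable and Ne is the number of times an action should be tried before its real estimate is trusted. The parameter names are hypothetical.

```python
def exploration_value(u, n, optimistic_reward, min_tries):
    """F(u, n): return the optimistic value R+ while the action has been tried
    fewer than N_e times, and the actual utility estimate u afterwards."""
    return optimistic_reward if n < min_tries else u


# An action tried once still looks maximally attractive ...
under_explored = exploration_value(u=0.2, n=1, optimistic_reward=10.0, min_tries=5)
# ... while one tried five times is judged on its estimated utility.
well_explored = exploration_value(u=0.2, n=5, optimistic_reward=10.0, min_tries=5)
```

Under this reading, rarely tried actions are pushed to the front of the max over actions until they have been sampled often enough.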


Multi-Module RL

  • Most work in RL has focused on single agents. In theory, any problem can be seen as just another input-output mapping to be learnt by a single agent, but this scales poorly, which motivates combining simple agents to solve a complex task. Some approaches are:

    • Top-down: identify the task and decompose it into subtasks. Moore does this by hand; Tham learns the decomposition, where subtasks combine sequentially to solve the main task.

    • Bottom-up: study the behaviour that emerges when multiple RL agents are combined in different ways. Tan studies the benefits of cooperation among agents like ants.


Selfish Q-learners:

  • Each agent is a Q-learning agent, with its own reward function and Q-values.

  • Co-operation is involuntary and emerges from competition among agents.

  • Let the agents be A1,…,An. The robot works as follows:

    observe x

    for (all agents):

        get suggested action ai with strength Wi(x)

    find k such that Wk(x)=max Wi(x)

    execute ak

    observe y

    for (all agents):

        get reward ri

        update Q and/or W


W-values

  • For updating W, use the numerical Q-values.

    • Static W-values: the agent promotes its action with the same strength no matter what its competition is.

      $W(x) = Q(x,a)$

    • W = importance: W could be the difference between the suggested action and the worst possible action:

      $W(x) = Q(x,a) - \min_b Q(x,b)$
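The two static schemes can be compared on a single row of Q-values. The sketch below is illustrative only: the function names and the Q-values for the hypothetical state are made up.

```python
def w_static(q_row, action):
    """Static W-value: W(x) = Q(x, a) for the suggested action a."""
    return q_row[action]


def w_importance(q_row, action):
    """W = importance: W(x) = Q(x, a) - min_b Q(x, b)."""
    return q_row[action] - min(q_row.values())


# Hypothetical Q-values of one agent in one state x.
q_row = {"eat": 5.0, "flee": 1.0, "wander": 2.0}
suggested = max(q_row, key=q_row.get)          # the agent suggests its best action

static_w = w_static(q_row, suggested)          # 5.0
importance_w = w_importance(q_row, suggested)  # 5.0 - 1.0 = 4.0
```

The importance variant is small when all actions are about equally good for the agent, so the agent bids weakly in states where it has little at stake.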


Example


Dynamic (learnt) W-values

  • The previous W-values fail to take into account what the other agents are doing.

    • Examples:

      • The suggested actions may be the same.

      • Another agent might be suggesting an action that would be disastrous for this agent.

  • There are two types of state that $A_i$ need not compete for:

    • A state which is relatively unimportant to it.

    • A state which is important, but in which some agent $A_k$ is suggesting an action that is also good for $A_i$.


Meaning of W-values

  • W = (P − A): the difference between the predicted reward P (what is predicted if we are listened to) and the actual reward A (what actually happened).

    • An agent will not need explicit knowledge about who it is competing with. It will be aware of them when they stop its action being obeyed, and will be aware of the y and r caused as a result.

    • The agents will set their own W-values in an incremental way using Q-values.


W-learning

  • Q-learning process:

    $P := (1-\alpha_Q) P + \alpha_Q A$

  • W-learning process:

    $W := (1-\alpha_W) W + \alpha_W (P - A)$

  • For updating Q-values:

    $Q_i(x,a_k) := (1-\alpha_Q) Q_i(x,a_k) + \alpha_Q (r_i + \gamma \max_b Q_i(y,b))$

  • For updating W-values:

    $W_i(x) := (1-\alpha_W) W_i(x) + \alpha_W (Q_i(x,a_i) - (r_i + \gamma \max_b Q_i(y,b)))$

    NOTE: only agents that were not obeyed are updated.
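A single W-update for an agent that was not obeyed can be worked through numerically. The sketch below assumes plain floating-point arguments; the function name and all the numbers are hypothetical.

```python
def update_w(w, alpha_w, q_own, reward, gamma, max_q_next):
    """One W-learning step for a non-leader agent.

    q_own      : Q_i(x, a_i), what the agent predicted if it had been obeyed (P)
    reward     : r_i, the reward it actually received after the leader's action
    max_q_next : max_b Q_i(y, b) in the state y that was actually reached
    """
    actual = reward + gamma * max_q_next          # A
    return (1 - alpha_w) * w + alpha_w * (q_own - actual)


# The agent expected 10 from its own action but only got 1 + 0.5 * 4 = 3.
new_w = update_w(w=0.0, alpha_w=0.5, q_own=10.0, reward=1.0, gamma=0.5, max_q_next=4.0)
```

A large positive result means that being disobeyed was costly, so the agent will bid more strongly for this state next time.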


W-learning pseudo-code

State x := observe();

For ( all i )
    a[i] := A[i].suggestAction(x);

Find k such that W[k](x) is the highest W-value;

Execute ( a[k] );

State y := observe();

For ( all i )
{
    r[i] := A[i].reward(x,y);
    A[i].updateQ ( x , a[k] , y , r[i] );
    if (i!=k)
        A[i].updateW ( x , a[k] , y , r[i] );
}
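The pseudo-code can be turned into a small runnable Python sketch. Everything below is an assumption layered on the slides: the `Agent` class, the dictionary tables, the fixed learning rates, the tie-breaking by lowest index and the `execute` callback standing in for the robot and its environment.

```python
from collections import defaultdict


class Agent:
    """One selfish Q-learner with its own reward function, Q-values and W-values."""

    def __init__(self, actions, reward_fn, gamma=0.9, alpha_q=0.5, alpha_w=0.5):
        self.actions = list(actions)
        self.reward_fn = reward_fn      # r_i = reward_fn(x, y)
        self.gamma, self.alpha_q, self.alpha_w = gamma, alpha_q, alpha_w
        self.q = defaultdict(float)     # Q_i(x, a)
        self.w = defaultdict(float)     # W_i(x)

    def suggest_action(self, x):
        return max(self.actions, key=lambda a: self.q[(x, a)])

    def _actual(self, y, r):
        # A = r_i + gamma * max_b Q_i(y, b)
        return r + self.gamma * max(self.q[(y, b)] for b in self.actions)

    def update_q(self, x, a_k, y, r):
        self.q[(x, a_k)] = ((1 - self.alpha_q) * self.q[(x, a_k)]
                            + self.alpha_q * self._actual(y, r))

    def update_w(self, x, a_i, y, r):
        loss = self.q[(x, a_i)] - self._actual(y, r)    # P - A
        self.w[x] = (1 - self.alpha_w) * self.w[x] + self.alpha_w * loss


def w_learning_step(agents, x, execute):
    """Run one cycle: collect suggestions, obey the leader, update every agent."""
    suggestions = [agent.suggest_action(x) for agent in agents]
    k = max(range(len(agents)), key=lambda i: agents[i].w[x])   # leader for x
    y = execute(x, suggestions[k])
    for i, agent in enumerate(agents):
        r = agent.reward_fn(x, y)
        agent.update_q(x, suggestions[k], y, r)     # Q first, as in the pseudo-code
        if i != k:                                  # only non-leaders update W
            agent.update_w(x, suggestions[i], y, r)
    return k, y


# Toy world: the next state is simply the name of the executed action.
actions = ["left", "right"]
agents = [Agent(actions, lambda x, y: 1.0 if y == "left" else 0.0),
          Agent(actions, lambda x, y: 1.0 if y == "right" else 0.0)]
agents[1].q[("s", "right")] = 2.0                   # agent 2 already values "right"

k1, y1 = w_learning_step(agents, "s", lambda x, a: a)
k2, y2 = w_learning_step(agents, "s", lambda x, a: a)
```

In this toy run the first agent leads by default, the second agent loses out and raises its W-value for state "s", and it then takes over as leader on the next visit.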


Learning Q (somewhat) Before Learning W

  • Ideally: ‘learn Q first, then W’.

  • It is impossible to learn Q completely in finite time.

  • Alternatively, learn W while Q is still being learnt:

    $W_i(x) := (1-\alpha_W) W_i(x) + \alpha_W (1-\alpha_Q)^T (Q_i(x,a_i) - (r_i + \gamma \max_b Q_i(y,b)))$

    where $T > 0$ is the delaying rate.
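The effect of the delaying factor can be seen by evaluating the update at the two extremes of the Q learning rate. This sketch assumes the factor is the Q learning rate's complement raised to the power T, and it passes the difference between predicted and actual reward in as a single number; the function name and values are hypothetical.

```python
def delayed_w_update(w, alpha_w, alpha_q, delay, loss):
    """W update scaled by (1 - alpha_Q)^T, where loss = Q_i(x,a_i) - (r_i + gamma max_b Q_i(y,b))."""
    return (1 - alpha_w) * w + alpha_w * (1 - alpha_q) ** delay * loss


# Early on (alpha_Q = 1) the Q-values are untrustworthy, so the loss is ignored.
early = delayed_w_update(w=2.0, alpha_w=0.5, alpha_q=1.0, delay=2, loss=8.0)
# Once alpha_Q has decayed to 0 the loss counts in full.
late = delayed_w_update(w=2.0, alpha_w=0.5, alpha_q=0.0, delay=2, loss=8.0)
```

A larger T keeps the factor close to zero for longer, postponing W-learning until the Q-values have settled further.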


After Q has been learnt

  • Imagine a dynamically changing collection of agents, with agents being continually created and destroyed over time, and the surviving agents adjusting their W-values as the nature of their competition changes. Q is learnt once, whereas W is relearnt again and again.

  • Edelman’s biological theory of Neural Darwinism.


Self-modifying W-values

  • The update of W for $A_i$ if $a_k$ is chosen:

    $W_i(x) := (1-\alpha_W) W_i(x) + \alpha_W d_{ki}(x)$

    where $d_{ki}(x)$ is the difference between P and A.

  • If $A_k$ leads from the start to infinity:

    $W_i(x) \to E(d_{ki}(x))$

    This is why we don’t update $W_k(x)$: $E(d_{kk}(x)) = 0$.

  • Benefit:

    • The W-learning algorithm can handle any number of switches of leader.


Will competition ever be resolved?

  • What we need to show is that the leader will not keep changing forever.


Convergence of W-learning

This process will terminate within $n^2$ steps, resolving competition with a winner $A_k$:

$W_k(x) \ge E(d_{ki}(x)) \quad \forall i, \ i \ne k$


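Remark 1: More than one possible winner

The differences $d_{ki}(x)$ (row $i$, column $k$, so that $d_{21} = 3$ and $d_{32} = 9$):

$\begin{pmatrix} 0 & 3 & 0 \\ 0 & 0 & 9 \\ 0 & 0 & 0 \end{pmatrix}$

Start with all $W_i(x) = 0$. Choose $A_2$’s action:

$W_1(x) = (1-1) \times 0 + 1 \cdot d_{21} = 3$

Now $A_1$ is the leader.

Start with all $W_i(x) = 0$. Choose $A_3$’s action:

$W_2(x) = (1-1) \times 0 + 1 \cdot d_{32} = 9$

Now $A_2$ is the leader.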
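The two runs in this remark can be replayed with a short simulation. It assumes deterministic differences, a W learning rate of 1, and a table laid out so that `d[i][k]` is the loss of agent i when agent k is obeyed (which reproduces the values 3 and 9 used above). The function name and the tie-breaking rule (lowest index wins) are choices made for this sketch.

```python
def resolve_competition(d, first_leader):
    """Run W-learning for one state with alpha_W = 1 until the leader stops changing.

    d[i][k] is the loss d_ki that agent i suffers when agent k is obeyed.
    Returns (index of the final leader, final W-values).
    """
    n = len(d)
    w = [0.0] * n
    k = first_leader
    for _ in range(n * n + 1):              # the slides bound the process by n^2 steps
        for i in range(n):
            if i != k:                      # the leader's own W is never updated
                w[i] = float(d[i][k])       # (1 - 1) * W_i + 1 * d_ki
        new_k = max(range(n), key=lambda i: w[i])
        if new_k == k:
            return k, w
        k = new_k
    return k, w


D = [[0, 3, 0],
     [0, 0, 9],
     [0, 0, 0]]

winner_a, w_a = resolve_competition(D, first_leader=1)   # start by obeying A2 (index 1)
winner_b, w_b = resolve_competition(D, first_leader=2)   # start by obeying A3 (index 2)
```

Starting from A2 the competition settles on A1 (index 0), while starting from A3 it settles on A2 (index 1), so the winner depends on who happens to lead first.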


Remark 2: Should we score the winner’s W?

  • $W_k(x) \to E(d_{kk}(x)) = 0$

    The leader’s W would converge to 0, so under any such system the competition would go back and forth forever.


Remark 3: Scaling, peers and unequal agents

  • An agent with high rewards will end up with high W-values.

  • The agents are peers because they compete on the same basis.

  • Not all concerns need be of equal importance.

