slide1

Multi-Agent Systems, Lecture 10
University "Politehnica" of Bucharest, 2005-2006
Adina Magda Florea
[email protected]
http://turing.cs.pub.ro/blia_06

machine learning lecture outline
Machine Learning: Lecture outline

1 Learning in AI (machine learning)

2 Reinforcement learning

3 Learning in multi-agent systems

3.1 Learning action coordination

3.2 Learning individual performance

3.3 Learning to communicate

3.4 Layered learning

4 Conclusions

1 learning in ai
1 Learning in AI
  • What is machine learning?

Herbert Simon defines learning as:

“any change in a system that allows it to perform better the second time on repetition of the same task or another task drawn from the same population (Simon, 1983).”

In ML the agent learns:

  • knowledge representation of the problem domain
  • problem solving rules, inferences
  • problem solving strategies

3

classifying learning
Classifying learning

In MAS learning the agents should learn:

  • what an agent learns in ML but in the context of MAS - both cooperative and self-interested agents
  • how to cooperate for problem solving - cooperative agents
  • how to communicate - both cooperative and self-interested agents
  • how to negotiate - self-interested agents

Different dimensions

  • explicitly represented domain knowledge
  • how the critic component (performance evaluation) of a learning agent works
  • the use of knowledge of the domain/environment

4

slide5

Single agent learning

[Diagram: single-agent learning, with components: Teacher, Environment (data, feedback), Learning Process, Learning results, Problem Solving (K & B, Inferences, Strategy), Results, and Performance Evaluation (feedback).]

5

slide6

Self-interested learning agent

[Diagram: a self-interested learning agent among other agents, with components: Environment (data, actions, feedback), Communication with other agents, Learning Process, Learning results, Problem Solving (K & B about self and other agents, Inferences, Strategy), Results, and Performance Evaluation (feedback).]

NB: Both in this diagram and the next, not all components or flow arrows are always present - it depends on the type of agent (cognitive, reactive), the type of learning, etc.

6

slide7

Cooperative learning agents

[Diagram: two cooperative learning agents, each with its own Learning Process, Learning results, Problem Solving (K & B about self and other agents, Inferences, Strategy), Results, and Performance Evaluation (feedback); the agents are linked by Communication and interact with the shared Environment through data and actions.]

7

2 reinforcement learning
2 Reinforcement learning
  • Combines dynamic programming and AI machine learning techniques
  • Trial-and-error interactions with a dynamic environment
  • The feedback of the environment – reward or reinforcement

  • Two main approaches:
    • search in the space of behaviors (e.g., genetic algorithms)
    • learn utility functions, based on statistical techniques and dynamic programming methods

8

2 1 a reinforcement learning model

2.1 A reinforcement-learning model

[Diagram: the standard RL model: an agent with behavior B receives, through its input function I, the input i (the current state s of the environment E) and a reinforcement signal r, and acts on E with actions a; T is the environment's transition model.]

B – agent's behavior

i – input = current state of the environment

r – value of reinforcement (reinforcement signal)

T – model of the world

The model consists of:

- a discrete set of environment states S (s ∈ S)
- a discrete set of agent actions A (a ∈ A)
- a set of scalar reinforcement signals, typically {0, 1} or real numbers
- the transition model of the world, T

  • the environment is nondeterministic

T : S x A → P(S) – T = transition model

T(s, a, s’)

Environment history = a sequence of states that leads to a terminal state

9

slide10

A 4 x 3 environment

[Figure: a 4 x 3 grid world whose two terminal states have rewards +1 and –1.]

  • The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1, 0.1) the agent moves at right angles to the intended direction.
  • The two terminal states have reward +1 and –1, all other states have a reward of –0.04

[Figure: the transition model: the intended move succeeds with probability 0.8; with probability 0.1 each, the agent slips to one of the two perpendicular directions.]

Example: starting from (1,1), the action sequence Up, Up, Right, Right, Right reaches the +1 state (4,3) if every move succeeds, i.e. with probability 0.8^5 = 0.32768.

10

slide11
2.2 Features along which RL methods vary
  • accessible / inaccessible environment
  • has a model of the environment (T known) / does not have a model
  • learn behavior / learn behavior + model
  • reward received only in terminal states or in any state
  • passive/active learner:
    • learn utilities of states
    • active learner – learn also what to do
  • how does the agent represent B, namely its behavior:
    • utility functions on states or state histories (T is known)
    • action-value functions (T is not necessarily known) - assigns an expected utility to taking a given action in a given state

11

agents
Agents

State and goals

goal : E → {0, 1}

Utilities

utility : E → R

env : E x A → P(E)

Expected utility of an action a in a state e

Maximum Expected Utility (MEU)
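
The expected-utility and MEU definitions are shown as images on the original slide; a standard reading, writing P(e' | e, a) for the probability that env produces state e' when action a is executed in e, is:

```latex
EU(a \mid e) = \sum_{e' \in E} P(e' \mid e, a)\,\mathrm{utility}(e')
\qquad
a_{\mathrm{MEU}} = \arg\max_{a \in A} EU(a \mid e)
```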

12

slide13
2.3 The RL problem
  • the agent has to find a policy π = a function which maps states to actions and which maximizes some long-run measure of reinforcement.
  • The agent has to learn an optimal behavior = an optimal policy = a policy which yields the highest expected utility – π*

The utility function depends on the environment history (a sequence of states)

In each state s the agents receives a reward - R(s)

Uh([s0, s1, …, sn]) – utility function on histories

13

slide14
Models of behavior
  • Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps

E(Σt=0,h R(st))

rt represents the reward received t steps into the future.

  • Infinite-horizon model: optimize the long-run reward

E(Σt=0,∞ R(st))

  • Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor γ:

E(Σt=0,∞ γ^t R(st))

0 ≤ γ < 1.

γ can be interpreted in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound an infinite sum.

14

slide15
2.4 Markov systems

Discounted rewards

An AP gets paid 20/year:

20 + 20 + 20 + ...

With discounting, the value is:

(reward now) + γ(reward at time 1) + γ^2(reward at time 2) + …

A Markov System with rewards consists of:

- a set of states (S1, S2, …, Sn)
- a transition probability matrix Pij = Prob(Next = Sj | This = Si)
- a reward for each state: r1, r2, …, rn
- a discount factor γ in [0,1]

On each time step:

- assume the current state is Si
- get reward ri
- randomly move to another state Sj, with probability Pij
- all future rewards are discounted by γ

15

slide16
U*(Si)=expected discounted sum of future rewards starting in state Si

U*(Si) = ri + γ(Pi1U*(S1) + Pi2U*(S2) + … + PinU*(Sn)), i = 1,…,n

Solving these equations gives an exact answer, but with 100 000 states you must solve a 100 000 by 100 000 system of equations.

Value iteration to solve a Markov system:

U1(Si) = ri

U2(Si) = ri + γ Σj=1,N Pij U1(Sj)

Compute U1(Si) for each state

Compute U2(Si) for each state, etc.

Stop when |Uk+1(Si) – Uk(Si)| < eps
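
A minimal sketch of this value-iteration loop for a Markov system with rewards; the two-state system at the bottom is made up purely for illustration:

```python
import numpy as np

def value_iteration_markov_system(P, r, gamma, eps=1e-6):
    """Iterate U_{k+1}(S_i) = r_i + gamma * sum_j P_ij * U_k(S_j) until convergence."""
    U = r.copy()                      # U_1(S_i) = r_i
    while True:
        U_next = r + gamma * P @ U    # one synchronous backup for every state
        if np.max(np.abs(U_next - U)) < eps:
            return U_next
        U = U_next

# Tiny illustrative example: two states with rewards 0 and 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([0.0, 1.0])
print(value_iteration_markov_system(P, r, gamma=0.9))
```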

16

slide17
2.5 Markov Decision Problem (MDP)

consists of:

<S, A, P, R>

S - a set of states

A - a set of actions

R – reward function, R : S x A → R

T : S x A → Π(S), with Π(S) the probability distribution over the states S

On each time step:

Assume the state is Si

Get reward Ri

Choose action a (from a1…ak)

Move to another state Sj, with probability given by T(Si, a)

All future rewards are discounted by γ

We shall use the notation T(s, a, s’)

Pass’ = Prob(Next = s’ | This = s and I use action a)

17

slide18
Markov Decision Problem (MDP)
  • The model is Markov if the state transitions are independent of any previous environment states or agent actions.
  • MDPs may have finite or infinite state and action spaces; here the focus is on finite-state, finite-action MDPs
  • For every MDP there exists an optimal policy
  • It’s a policy such that for every possible start state there is no better option than to follow the policy
  • Finding the optimal policy given a model T = calculate the utility of each state U(state) and use state utilities to select an optimal action in each state.

18

slide19
Value iteration to solve a MDP

U1(s)=R(s)

U2(s) = maxa(R(s) + γ Σs’ T(s,a,s’) * U1(s’))

….

Uk+1(s) = maxa(R(s) + γ Σs’ T(s,a,s’) * Uk(s’))

Compute U1(si) for each state, s = si

Compute U2(si) for each state, etc.

Stop when maxi |Uk+1(si) – Uk(si)| < eps

convergence (dynamic programming)

Value iteration for a MS, for comparison:

Uk+1(Si) = ri + γ Σj=1,N Pij Uk(Sj)
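
A compact sketch of this MDP value iteration; T is assumed to be a NumPy array indexed as T[s, a, s'] and R a vector of per-state rewards (these conventions are illustrative, not prescribed by the slides):

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-6):
    """Return U*(s) by iterating U_{k+1}(s) = max_a ( R(s) + gamma * sum_s' T(s,a,s') U_k(s') )."""
    U = R.copy()                                                 # U_1(s) = R(s)
    while True:
        Q = R[:, None] + gamma * np.einsum("sap,p->sa", T, U)    # Q[s, a]
        U_next = Q.max(axis=1)
        if np.max(np.abs(U_next - U)) < eps:
            return U_next
        U = U_next
```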

19

slide20
The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action
  • U(s) = R(s) + γ maxa Σs’ T(s,a,s’) * U(s’)
  • Bellman equation – U(s) is its unique solution
  • The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle

π*(s) = argmaxa (R(s) + γ Σs’ T(s,a,s’) * U(s’))

π* = the optimal policy

20
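
Once U has been computed, extracting the optimal policy from the argmax above is a one-liner under the same T/R array conventions as the value-iteration sketch (illustrative only):

```python
import numpy as np

def extract_policy(T, R, U, gamma):
    """pi*(s) = argmax_a ( R(s) + gamma * sum_s' T(s,a,s') * U(s') )."""
    Q = R[:, None] + gamma * np.einsum("sap,p->sa", T, U)
    return Q.argmax(axis=1)          # index of the best action in each state
```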

slide21


A 4 x 3 environment

  • The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1, 0.1) the agent moves at right angles to the intended direction.
  • The two terminal states have reward +1 and –1, all other states have a reward of –0.04; γ = 1

[Figure: the transition model (0.8 intended direction, 0.1 each perpendicular direction) and the utilities of the states computed with γ = 1:

0.812   0.868   0.918   +1
0.762    ###    0.660   –1
0.705   0.655   0.611   0.388

Rows are y = 3, 2, 1 from top; columns are x = 1..4; ### marks the obstacle; +1 and –1 are the terminal states.]

21

slide22

Bellman equation for the 4x3 world

Equation for the state (1,1):

U(1,1) = –0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),    (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                 (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                 (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }   (Right)

Up is the best action

[Figure: the same 4 x 3 grid of utilities as on the previous slide.]
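
A quick numeric check of the equation above with the utilities from the previous slide and γ = 1; the numbers are the slide's own, only the arithmetic is added here:

```python
U = {(1, 1): 0.705, (1, 2): 0.762, (2, 1): 0.655}   # utilities read off the 4 x 3 figure
gamma, R = 1.0, -0.04

expected = {
    "Up":    0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)],
    "Left":  0.9 * U[(1, 1)] + 0.1 * U[(1, 2)],
    "Down":  0.9 * U[(1, 1)] + 0.1 * U[(2, 1)],
    "Right": 0.8 * U[(2, 1)] + 0.1 * U[(1, 2)] + 0.1 * U[(1, 1)],
}
best = max(expected, key=expected.get)       # "Up", with expected utility ~0.7456
print(best, R + gamma * expected[best])      # ~0.7056, matching U(1,1) = 0.705 in the figure
```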

slide23

π*(s) defines the best action in state s

Value Iteration

  • Given the maximal expected utility, the optimal policy is:

π*(s) = arg maxa(R(s) + γ Σs’ T(s,a,s’) * U(s’))

  • Compute U*(s) using an iterative approach: Value Iteration

U0(s) = R(s)

Ut+1(s) = R(s) + γ maxa(Σs’ T(s,a,s’) * Ut(s’))    (compute for all s)

t → ∞ … the utility values converge to the optimal values

23

slide24
Policy iteration

Manipulate the policy directly, rather than finding it indirectly via the optimal value function

  • choose an arbitrary policy π (randomly)
  • at each time t, compute the long-run reward starting in s when using πt, i.e. solve the equations

Ut(s) = R(s) + γ Σs’ (T(s, πt(s), s’) * Ut(s’))

  • improve the policy at each state

πt+1(s) ← arg maxa (R(s) + γ Σs’ T(s,a,s’) * Ut(s’))

Involves all next states - complex
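
A sketch of this loop, reusing the T/R array conventions from the value-iteration sketch; policy evaluation is done here by solving the linear system exactly (an iterative evaluation would work as well), assuming γ < 1:

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # Evaluate: solve U = R + gamma * T_pi U for the current policy.
        T_pi = T[np.arange(n_states), pi]              # T_pi[s, s'] = T(s, pi(s), s')
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Improve: pi(s) <- argmax_a ( R(s) + gamma * sum_s' T(s,a,s') U(s') )
        Q = R[:, None] + gamma * np.einsum("sap,p->sa", T, U)
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):
            return pi, U
        pi = pi_next
```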

24

slide25
2.6 RL learning
  • Use observed rewards to learn an optimal (or near optimal) policy for the environment

Ex: play 100 moves, you lose

  • In an MDP the agent has a complete model of the environment
  • Now the agent does not have such a model
  • Passive learning – the agent's policy is fixed. The task is to learn the utilities of states (or state-action pairs)
  • Active learning – the agent must also learn what to do: exploitation/exploration

25

slide26
(a) Passive reinforcement learning
  • Policy is fixed = in state s always execute π(s)
  • Goal – learn how good the policy is = learn Uπ(s)
  • Does not know T(s,a,s’) and does not know R(s) in advance
  • ADP (Adaptive Dynamic Programming) learning

The problem of calculating an optimal policy in an accessible, stochastic environment.

ADP = plug the learned T(s, π(s), s’) and the observed rewards R(s) into the Bellman equations to calculate the utility of states

Supervised learning – input: state-action pairs

output: resulting state

Estimate the transition probabilities T(s,a,s’) from the frequencies with which s’ is reached after executing a in s

Example: Right is executed 3 times in (1,3); twice the agent reaches (2,3) and once it stays in (1,3) =>

T((1,3), Right, (2,3)) = 2/3
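
A small sketch of this frequency-based estimation of T; the dictionaries and function names are illustrative:

```python
from collections import defaultdict

N_sa = defaultdict(int)       # how many times action a was executed in state s
N_sas = defaultdict(int)      # how many times that execution led to state s2

def record_transition(s, a, s2):
    N_sa[(s, a)] += 1
    N_sas[(s, a, s2)] += 1

def T_hat(s, a, s2):
    """Estimated T(s, a, s') = Nsas'[s, a, s'] / Nsa[s, a]."""
    return N_sas[(s, a, s2)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0

# The slide's example: Right executed 3 times in (1,3), reaching (2,3) twice.
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (1, 3))
print(T_hat((1, 3), "Right", (2, 3)))        # 0.666...
```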

26

slide27
ADP (Adaptive Dynamic Programming) learning

function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s’ and reward signal r’
  variables: π, a fixed policy
             mdp, an MDP with model T, rewards R, discount γ
             U, a table of utilities, initially empty
             Nsa, a table of frequencies for state-action pairs, initially zero
             Nsas’, a table of frequencies of state-action-state triples, initially zero
             s, a, the previous state and action, initially null

  if s’ is new then U[s’] ← r’; R[s’] ← r’
  if s is not null then
      increment Nsa[s,a] and Nsas’[s,a,s’]
      for each t such that Nsas’[s,a,t] ≠ 0 do
          T[s,a,t] ← Nsas’[s,a,t] / Nsa[s,a]
  U ← Value-Determination(π, U, mdp)    (solved on the learned MDP, by value iteration or policy iteration)
  if Terminal[s’] then s, a ← null else s, a ← s’, π[s’]
  return a
end

27

slide28
Temporal difference learning

(TD learning)

The value function is no longer implemented by solving a set of linear equations, but it is computed iteratively.

  • Uses observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

Uπ(s) ← Uπ(s) + α(R(s) + γ Uπ(s’) – Uπ(s))

α is the learning rate.

  • Whatever state is visited, its estimated value is updated to be closer to R(s) + γ Uπ(s’),

since R(s) is the instantaneous reward received and

Uπ(s’) is the estimated value of the actually occurring next state.

  • simpler, involves only next states
  • α decreases as the number of times the state is visited increases
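
A minimal sketch of this TD(0) update; the decaying learning rate α(s) = 1/N(s) is one common choice, assumed here rather than taken from the slide:

```python
from collections import defaultdict

U = defaultdict(float)    # estimated utilities U_pi(s)
N = defaultdict(int)      # visit counts, used to decay the learning rate

def td_update(s, r, s2, gamma=0.9):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)), applied per observed transition."""
    N[s] += 1
    alpha = 1.0 / N[s]
    U[s] += alpha * (r + gamma * U[s2] - U[s])
```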

28

slide29
Temporal difference learning

function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s’ and reward signal r’
  variables: π, a fixed policy
             U, a table of utilities, initially empty
             Ns, a table of frequencies for states, initially zero
             s, a, r, the previous state, action, and reward, initially null

  if s’ is new then U[s’] ← r’
  if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s’] – U[s])
  if Terminal[s’] then s, a, r ← null else s, a, r ← s’, π[s’], r’
  return a
end

29

slide30
Temporal difference learning
  • Does not need a model to perform its updates
  • The environment supplies the connections between neighboring states in the form of observed transitions.

ADP and TD comparison

  • ADP and TD both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
  • TD adjusts a state to agree with the observed successor
  • ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities

30

slide31
(b) Active reinforcement learning
  • Passive learning agent – has a fixed policy that determines its behavior
  • An active learning agent must decide what action to take
  • The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
  • Compute/learn the utilities that obey the Bellman equation

U(s) = R(s) + γ maxa Σs’ (T(s,a,s’) * U(s’))

using value iteration or policy iteration

- If value iteration is used, look for the action that maximizes the utility

- If policy iteration is used, you already have the action

- Exploration/exploitation

- The representative problem is the n-armed bandit problem

Solutions

  • for a fraction 1/t of the time choose a random action, the rest of the time follow π
  • give weights to actions that have not been explored much, avoid actions with low utilities
  • Exploration function – f(u,n) – determines how greedy (preferring high utility values u) or exploratory (preferring rarely tried actions, count n) the agent is
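
One concrete choice of such an exploration function is the optimistic form used in AIMA-style treatments (an assumption here; the slide does not fix f):

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (illustrative value)
N_E = 5        # try each state-action pair at least N_E times before trusting its estimate

def exploration_f(u, n):
    """f(u, n): act optimistically while (s, a) is under-explored, otherwise use the learned value u."""
    return R_PLUS if n < N_E else u
```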

31

slide32
Q-learning

Active learning of action-value functions

action-value function = assigns an expected utility to taking a given action in a given state, Q-values

Q(a, s)– the value of doing action a in state s (expected utility)

Q-values are related to utility values by the equation:

U(s) = maxaQ(a, s)

  • Approach 1

Q(a,s) = R(s) + γ Σs’ (T(s,a,s’) * maxa’ Q(a’,s’))

This requires a model

  • Approach 2

Use TD

The agent does not need to learn a model – model free

32

slide33
Q-learning

TD learning, unknown environment

Q(a,s) ← Q(a,s) + α(R(s) + γ maxa’ Q(a’,s’) – Q(a,s))

calculated after each transition from state s to s’.

  • Is it better to learn a model and a utility function or to learn an action-value function with no model?
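
A compact sketch of the model-free update, paired with ε-greedy action selection (ε-greedy is one standard exploration choice; the exploration function f used in the pseudocode on the next slide could be substituted):

```python
import random
from collections import defaultdict

Q = defaultdict(float)    # Q[(s, a)] action-value estimates

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Q(a,s) <- Q(a,s) + alpha * (r + gamma * max_a' Q(a',s') - Q(a,s))."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def choose_action(s, actions, eps=0.1):
    """Epsilon-greedy: explore with probability eps, otherwise exploit the current Q-values."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```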

33

slide34
Q-learning

function Q-Learning-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s’ and reward signal r’
  variables: Q, a table of action values indexed by state and action
             Nsa, a table of frequencies for state-action pairs
             s, a, r, the previous state, action, and reward, initially null

  if s is not null then
      increment Nsa[s,a]
      Q[a,s] ← Q[a,s] + α(Nsa[s,a]) (r + γ maxa’ Q[a’,s’] – Q[a,s])
  if Terminal[s’] then s, a, r ← null
  else s, a, r ← s’, argmaxa’ f(Q[a’,s’], Nsa[a’,s’]), r’
  return a
end

(without the exploration function f, the greedy choice would be: s, a, r ← s’, argmaxa’ Q[a’,s’], r’)

34

slide35
Generalization of RL
  • The problem of learning in large spaces – large no. of states
  • Generalization techniques - allow compact storage of learned information and transfer of knowledge between "similar" states and actions.
  • Neural nets
  • Decision trees
  • U(state) = U(most similar state in memory)
  • U(state) = average U(most similar states in memory)

35

3 learning in mas
3 Learning in MAS
  • The credit-assignment problem (CAP) = the problem of assigning feed-back (credit or blame) for an overall performance of the MAS (increase, decrease) to each agent that contributed to that change
  • inter-agent CAP = assigns credit or blame to the external actions of agents
  • intra-agent CAP = assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
  • the distinction is not always obvious
  • approaches typically address one or the other

36

3 1 learning action coordination
3.1 Learning action coordination
  • s – current environment state
  • Agent i – determines the set of actions it can do in s: Ai(s) = {Aij(s)}
  • Computes the goal relevance of each action: Eij(s)
  • Agent i announces a bid for each action with

Eij(s) > threshold

  • Bij(s) = (α + β) Eij(s)
  • α - risk factor (small); β - noise term (to prevent convergence to local minima)

37

slide38
The action with the highest bid is selected
  • Incompatible actions are eliminated
  • Repeat process until all actions in bids are either selected or eliminated
  • A – selected actions = activity context
  • Execute selected actions
  • Update goal relevance for actions in A

Eij(s) ← Eij(s) – Bij(s) + (R / |A|)

R –external reward received

  • Update goal relevance for actions in the previous activity context Ap (actions Akl)

Ekl(sp) ← Ekl(sp) + (ΣAij∈A Bij(s) / |Ap|)
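
A rough sketch of the two update rules above for one cycle; the dictionary layout and function signature are illustrative:

```python
def update_goal_relevance(E, selected, bids, s, R, prev_context, prev_state):
    """E maps (action, state) -> goal relevance Eij(s); 'selected' is the current activity context A."""
    # Executed actions pay their bid and share the external reward R equally.
    for act in selected:
        E[(act, s)] = E.get((act, s), 0.0) - bids[act] + R / len(selected)
    # Actions of the previous activity context Ap receive the bids paid now, split equally.
    paid = sum(bids[act] for act in selected)
    for act in prev_context:
        E[(act, prev_state)] = E.get((act, prev_state), 0.0) + paid / len(prev_context)
```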

38

3 2 learning individual performance
3.2 Learning individual performance

The agent learns how to improve its individual performance in a multi-agent setting

Examples

  • Cooperative agents - learning organizational roles
  • Competitive agents - learning from market conditions

39

3 2 1 learning organizational roles nagendra e a
3.2.1 Learning organizational roles(Nagendra, e.a.)
  • Agents learn to adopt a specific role in a particular situation (state) in a cooperative MAS.
  • Aim = to increase utility of final states
  • Each agent may play several roles in a situation
  • The agents learn to select the most appropriate role
  • Use reinforcement learning
  • Utility, Probability, and Cost (UPC) estimates of a role in a situation
  • Utility - the agent's estimate of the worth of a final state for a specific role in a situation – world states are mapped to a smaller set of situations

S = {s0,…,sf}

Urs = U(sf), s0 → … → sf

40

slide41
Probability - the likelihood of reaching a final state for a specific role in a situation

Prs = p(sf), s0 → … → sf

  • Cost - the computational cost of reaching a final state for a specific role in a situation
  • Potential of a role - estimates the usefulness of a role in discovering pertinent global information and constraints (orthogonal to utilities)
  • Representation:
  • Sk - vector of situations for agent k, Sk1,…,Skn
  • Rk - vector of roles for agent k, Rk1,…,Rkm
  • |Sk| x |Rk| x 4 values to describe UPC and Potential

41

slide42
Functioning

Phase I: Learning

Several learning cycles; in each cycle:

  • each agent goes from s0 to sf and selects its role as the one with the highest probability

Probability of selecting a role r in a situation s

f - objective function used to rate the roles

(e.g., f(U,P,C,Pot) = U*P*C + Pot)

- depends on the domain

42

slide43
Use reinforcement learning to update UPC and the potential of a role
  • For every s ∈ [s0,…,sf] and chosen role r in s

Ursi+1 = (1-α)Ursi + αUsf

i - the learning cycle

Usf - the utility of a final state

0 ≤ α ≤ 1 - the learning rate

Prsi+1 = (1-α)Prsi + αO(sf)

O(sf) = 1 if sf is successful, 0 otherwise

43

slide44
Potrsi+1 = (1-α)Potrsi + αConf(Path)

Path = [s0,…,sf]

Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise

  • The update rules for cost are domain dependent
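
A sketch of these reinforcement updates for one learning cycle; the table layout and parameter names are illustrative:

```python
def update_upc(tables, trajectory, alpha, U_final, success, conflict_free):
    """trajectory = [(situation, chosen_role), ...] for one learning cycle s0 ... sf;
    tables['U'|'P'|'Pot'] map (role, situation) -> current UPC / Potential estimates."""
    for s, r in trajectory:
        for name, target in (("U", U_final),
                             ("P", 1.0 if success else 0.0),
                             ("Pot", 1.0 if conflict_free else 0.0)):
            old = tables[name].get((r, s), 0.0)
            tables[name][(r, s)] = (1 - alpha) * old + alpha * target
    # Cost updates are domain dependent (as noted above) and therefore omitted.
```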

Phase II: Performing

In a situation s the role r is chosen such that:
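
The selection rule itself is an image in the original slide; given the objective function f used in the learning phase, it is presumably the role maximizing f:

```latex
r^{*} = \arg\max_{r} \; f\!\left(U_{r}^{s},\, P_{r}^{s},\, C_{r}^{s},\, Pot_{r}^{s}\right)
```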

44

3 2 2 learning in market environments vidal durfee
3.2.2 Learning in market environments(Vidal & Durfee)

Agents use past experience and evolved models of other agents to better sell and buy goods

  • Environment = a market in which agents buy and sell information (electronic marketplace)
  • Open environment
  • The agents are self-interested (max local utility)

{g} - a set of goods

P - set of possible prices for goods

Qg - set of possible qualities for a good g

45

slide46
information has a cost for the seller and a value for the buyer
  • information is sold at a certain price
  • a buyer announces a good it needs
  • sellers bid their prices for delivering the good
  • the buyer selects from these bids and pays the corresponding price
  • the buyer assesses the quality of information after it receives it from the seller
  • Profit of a seller s for selling the good g at price p

Profitsg(p) = p - csg

csg - the cost of producing the good g by s; p - the price

  • Value of a good g for a buyer b

Vbg(p,q) p - price b paid for g

q - quality of good g

Goal seller - maximize profit in a transaction

buyer - maximize value in a transaction

46

slide47
3 types of agents

0-level agents

  • they set their buying and selling prices based on their own past experience
  • they do not model the behavior of other agents

1-level agents

  • model other agents based on previous interactions
  • they set their buying and selling prices based on these models and on past experience
  • they model the other agents as 0-level agents

2-level agents

  • same as 1-level agents but they model the other agents as 1-level agents

47

slide48
Strategy of 0-level agents

0-level buyer

- learns the expected value function, fg(p), of buying g at price p

- uses reinforcement learning

fgi+1(p) = (1-α)fgi(p) + αVbg(p,q), αmin ≤ α ≤ 1, for i=0, α = 1

- chooses the seller s* for supplying a good g

0-level seller

- learns the expected profit function, hg(p), if it offers good g at price p

- uses reinforcement learning

hgi+1(p) = (1-α)hgi(p) + αProfitsg(p)

where Profitsg(p) = p - csg if it wins the auction, 0 otherwise

- chooses the price ps* to sell the good g so as to maximize profit
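
A rough sketch of these 0-level update rules over a discrete price grid; the dictionaries and the handling of α are illustrative:

```python
def update_expected_value(f, p, value, alpha):
    """0-level buyer: f_g^{i+1}(p) = (1 - alpha) * f_g^i(p) + alpha * V_b^g(p, q)."""
    f[p] = (1 - alpha) * f.get(p, 0.0) + alpha * value

def update_expected_profit(h, p, profit, alpha):
    """0-level seller: h_g^{i+1}(p) = (1 - alpha) * h_g^i(p) + alpha * Profit_s^g(p)."""
    h[p] = (1 - alpha) * h.get(p, 0.0) + alpha * profit

def best_selling_price(h):
    """The seller bids the price that currently maximizes its expected profit."""
    return max(h, key=h.get)
```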

48

slide49
Strategy of 1-level agents

1-level buyer

- models sellers for good g

- does not model other buyers

- uses a probability distribution function qsg(x) over the qualities x of a good g

- computes expected utility, Esg, of buying good g from seller s

- chooses the seller s* for supplying a good g that maximizes this expected utility

49

slide50
1-level seller

- models buyers for good g

- models the other sellers s' for good g

  • Buyer's modeling

- uses a probability distribution function mbg(p) - the probability that b will choose price p for good g

  • Seller's modeling

- uses a probability distribution function ns'g(y) - the probability that s' will bid price y for good g

- computes the probability of bidding lower than a given seller s' with the price p

Prob_of_bidding_lower_than_s' =

Σp' (Prob of a bid of s' with p' for which s wins) =

Σp' N(g,b,s; s',p,p')

N(g,b,s; s',p,p') = ns'g(p') if mbg(p') ≤ mbg(p), 0 otherwise

50

slide51
- computes the probability of bidding lower than all other sellers with the price p

Prob_of_bidding_lower_with_p = Πs'∈S–{s} (Prob_of_bidding_lower_than_s')

- chooses the best price p* to bid so as to maximize profit

51

3 3 learning to communicate
3.3 Learning to communicate
  • What to communicate (e.g., what information is of interest to the others)
  • When to communicate (e.g., when to try doing something by itself and when to look for help)
  • With which agents to communicate
  • How to communicate (e.g., language, protocol, ontology)

52

learning with which agents to communicate ohko e a
Learning with which agents to communicate (Ohko, e.a.)
  • Learning which agents to ask to perform a task
  • Used in a contract net protocol for task allocation, to reduce the communication required for task announcement
  • Goal = acquire and refine knowledge about other agents' task-solving abilities
  • Case-based reasoning used for knowledge acquisition and refinement

A case consists of:

(1) A task specification

(2) Information about which agents solved a task or similar tasks in the past and the quality of the provided solution

53

slide54
(1)Task specification

Ti = {Ai1 Vi1, …, Aimi Vimi}

Aij - task attribute, Vij - value of attribute

  • Similar tasks

Sim(Ti, Tj) = Σr Σs Dist(Air, Ajs)

Air ∈ Ti, Ajs ∈ Tj

Dist(Air, Ajs) = Sim_Attr(Air, Ajs) * Sim_Vals(Vir, Vjs)

  • Set of similar tasks

S(T) = {Tj : Sim(T, Tj) ≥ 0.85}
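
A direct transcription of these definitions; Sim_Attr and Sim_Vals are domain specific and only stubbed here:

```python
def sim_attr(a1, a2):
    """Domain-specific attribute similarity (trivial stand-in)."""
    return 1.0 if a1 == a2 else 0.0

def sim_vals(v1, v2):
    """Domain-specific value similarity (trivial stand-in)."""
    return 1.0 if v1 == v2 else 0.0

def task_similarity(t1, t2):
    """Sim(Ti, Tj) = sum_r sum_s Sim_Attr(Air, Ajs) * Sim_Vals(Vir, Vjs); tasks are dicts attr -> value."""
    return sum(sim_attr(a1, a2) * sim_vals(v1, v2)
               for a1, v1 in t1.items() for a2, v2 in t2.items())

def similar_tasks(t, case_base, threshold=0.85):
    """S(T) = { Tj : Sim(T, Tj) >= threshold }."""
    return [tj for tj in case_base if task_similarity(t, tj) >= threshold]
```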

54

slide55
(2)Which agents performed T or similar tasks in the past

Suitability of Agent k

Perform(Ak, Tj) - quality of solution for Tj assured by agent Ak performing Tj in the past

  • The agent computes

{ Suit(Ak, T), Suit(Ak, T)>0 } and selects the agent k* such that

or the first m agents with best suitability

  • After each encounter, the agent stores the tasks performed by other agents and the solution quality
  • Tradeoff between exploitation and exploration

55

3 4 layered learning
3.4 Layered learning

(Stone & Veloso)

  • A hierarchical machine learning paradigm in MAS
  • Used simulated robotic soccer – RoboCup

Learning

Input → Output – intractable

  • Decompose the learning task L into subtasks: L1, …, Ln
  • Characteristics of the environment:
    • Cooperative MAS
    • Teammates and adversaries
    • Hidden states – agents have a partial world view at any given moment
    • Agents have noisy sensory data and actuators
    • Perception and action cycles are asynchronous
    • Agents must make their decisions in real-time

56

slide57
Problem: the agent receives a moving ball and must decide what to do with it: dribble, pass to a teammate, shoot towards the goal
  • Decompose the problem into 3 subtasks:

Layer | Behavior type | Example
L1    | Individual    | Ball interception
L2    | Multiagent    | Pass evaluation
L3    | Team          | Pass selection

  • The decomposition into subtasks enables the learning of more complex behaviors
  • The hierarchical task decomposition is constructed bottom-up, in a domain dependent fashion
  • Learning methods are chosen to suit the task
  • Learning in one layer feeds into the next layer either by providing a portion of the behavior used for training (ball interception – pass evaluation) or by creating the input representation and pruning the action space (pass evaluation – pass selection)

57

slide58
L1 – Ball interception

behavior = individual

  • Aim:
    • Blocks or intercepts opponents shots or passes or
    • Receive passes from teammates
  • Learning method: a fully connected backpropagation NN
  • Training: the ball is repeatedly shot towards a defender in front of a goal. The defender collects training examples by acting randomly and noticing when it successfully stops the ball
  • Classification:
    • Saves = successful interceptions
    • Goals = unsuccessful attempts
    • Misses = shots that went wide of the goal

58

slide59
L2 – Pass evaluation

behavior = multiagent

  • Uses its learned ball-interception skills as part of the behavior for training MAS behavior
  • Aim: the agent must decide
    • To pass (or not) the ball to a teammate and
    • If the teammate will successfully receive the ball (based on positions + abilities of the teammate to receive or intercept a pass)
  • Learning method: decision trees (C4.5)
  • Training: kick the ball towards randomly placed teammates interspersed with randomly placed opponents
  • The intended pass recipient and the opponents all use the learned ball-interception behavior
  • Classification of a potential pass to a receiver:
    • Success, with a confidence factor ∈ (0,1]
    • Failure, with a confidence factor ∈ [-1,0)
    • Miss (confidence factor = 0)

59

slide60
L3 – Pass selection

behavior = team

  • Uses its learned pass-evaluation capabilities to create the input and output set for learning pass selection
  • Aim: the agent has the ball and must decide
    • To which teammate to pass the ball or
    • Shoot on goal
  • Learning method: Q-learning of a function that depends on the agent’s position on the field
  • Training: simulate 2 teams playing with identical behaviors other than their pass-selection policies
  • Reinforcement = total goals scored
  • Learns:
    • when to shoot on goal
    • to which teammate to pass

60

4 conclusions
4 Conclusions
  • There is no unique method or set of methods for learning in MAS
  • Many approaches are based on extending ML techniques in a MAS setting
  • Many approaches use reinforcement learning, but also NN or genetic algorithms

61

slide62
References
  • S. Sen, G. Weiss. Learning in Multiagent systems. In Multiagent Systems - A Modern Approach to Distributed Artificial Intelligence, G. Weiss (Ed.), The MIT Press, 2001, p.257-298.
  • T. Ohko, e.a. - Addressee learning and message interception for communication load reduction in multiple robot environment. In Distributed Artificial Intelligence Meets Machine Learning, G. Weiss, Ed., Lecture Notes in Artificial Intelligence, Vol. 1221, Springer-Verlag, 1997, p.242-258.
  • M.V. Nagendra, e.a. Learning organizational roles in a heterogeneous multi-agent systems. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p.291-298.
  • J.M. Vidal, E.H. Durfee. The impact of nested agent models in an information economy. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p.377-384.
  • P. Stone, M. Veloso. Layered Learning, Eleventh European Conference on Machine Learning, ECML-2000.

62

slide63
Web References
  • An interesting set of training examples and the connection between decision trees and rules.

http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html

  • Decision trees construction

http://www.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html

  • Building Classification Models: ID3 and C4.5

http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/

  • Introduction to Reinforcement Learning

http://www.cs.indiana.edu/~gasser/Salsa/rl.html

  • On-line book on Reinforcement Learning

http://www-anw.cs.umass.edu/~rich/book/the-book.html

63
