
Learning in Multiagent Systems

This seminar explores learning in multiagent systems, including its general characterization, activity coordination, learning about and from other agents, and communication. It also discusses the credit-assignment problem and different approaches to learning. Presented by Michael Weinberg of The Hebrew University of Jerusalem, Israel, in March 2003.


Presentation Transcript


  1. Learning in Multiagent Systems Advanced AI Seminar Michael Weinberg The Hebrew University of Jerusalem, Israel March 2003

  2. Agenda • What is learning in MAS? • General Characterization • Learning and Activity Coordination • Learning about and from Other Agents • Learning and Communication • Conclusions Advanced AI Seminar, March 2003

  3. What is Learning • Learning can be informally defined as: • The acquisition of new knowledge and motor or cognitive skills and the incorporation of the acquired knowledge and skills in future system activities, provided that this acquisition and incorporation is conducted by the system itself and leads to an improvement in its performance Advanced AI Seminar, March 2003

  4. Learning in Multiagent Systems • Intersection of DAI (Distributed Artificial Intelligence) and ML (Machine Learning) • Why bring them together? • There is a strong need to equip multiagent systems with learning abilities • The extended view of ML as multiagent learning is qualitatively different from traditional ML and can lead to novel ML techniques and algorithms Advanced AI Seminar, March 2003

  5. Agenda • What is learning in MAS? • General Characterization • Learning and Activity Coordination • Learning about and from Other Agents • Learning and Communication • Conclusions Advanced AI Seminar, March 2003

  6. General Characterization • Principal categories of learning • The features in which learning approaches may differ • The fundamental learning problem known as the credit-assignment problem Advanced AI Seminar, March 2003

  7. Principal Categories • Centralized Learning (isolated learning) • Learning executed by a single agent, no interaction with other agents • Several centralized learners may try to obtain different or identical goals at the same time Advanced AI Seminar, March 2003

  8. Principal Categories • Decentralized Learning (interactive learning) • Several agents are engaged in the same learning process • Several groups of agents may try to obtain different or identical learning goals at the same time • Single agent may be involved in several centralized/decentralized learning processes at the same time Advanced AI Seminar, March 2003

  9. Differencing Features: The degree of decentralization • The degree of decentralization has two dimensions: • Distributedness • Parallelism Advanced AI Seminar, March 2003

  10. Differencing Features: Interaction-specific features • Classification of the interactions required for realizing a decentralized learning process: • The level of interaction • The persistence of interaction • The frequency of interaction • The variability of interaction Advanced AI Seminar, March 2003

  11. Differencing Features: Involvement-specific features • Features that characterize the involvement of an agent in a learning process: • The relevance of involvement • The role played during involvement Advanced AI Seminar, March 2003

  12. Differencing Features: Goal-specific features • Features that characterize the learning goal: • The type of improvement that learning is intended to achieve • The compatibility of the learning goals pursued by the agents Advanced AI Seminar, March 2003

  13. Differencing Features: The learning method • The following learning methods are distinguished: • Rote learning • Learning from instruction and by advice taking • Learning from examples and by practice • Learning by analogy • Learning by discovery • The main difference between them is the amount of learning effort required Advanced AI Seminar, March 2003

  14. Differencing Features: The learning feedback • The learning feedback indicates the performance level achieved so far • The following types of learning feedback are distinguished: • Supervised learning (teacher) • Reinforcement learning (critic) • Unsupervised learning (observer) Advanced AI Seminar, March 2003

  15. The Credit-Assignment Problem • The problem of properly assigning feedback for an overall performance change to each of the system activities that contributed to that change • Can be usefully decomposed into two sub-problems: • The inter-agent CAP • The intra-agent CAP Advanced AI Seminar, March 2003

  16. The inter-agent CAP • Assignment of credit or blame for an overall performance change to the external actions of the agents Advanced AI Seminar, March 2003

  17. The intra-agent CAP • Assignment of credit or blame for a particular external action of an agent to its underlying internal inferences and decisions Advanced AI Seminar, March 2003

  18. Agenda • What is learning in MAS? • General Characterization • Learning and Activity Coordination • Learning about and from Other Agents • Learning and Communication • Conclusions Advanced AI Seminar, March 2003

  19. Learning and Activity Coordination • Previous research on coordination focused on off-line design of behavioral rules, negotiation protocols, etc… • Agents operating in open, dynamic environments must be able to adapt to changing demands and opportunities • How can agents learn to appropriately coordinate their activities? Advanced AI Seminar, March 2003

  20. Reinforcement Learning • Agents choose the next action so as to maximize a scalar reinforcement or feedback received after each action • The learner’s environment can be modeled by a discrete time, finite state, Markov Decision Process (MDP) Advanced AI Seminar, March 2003

  21. Markov Decision Process (MDP) • MDP – a reinforcement learning task that satisfies the Markov state property • A Markov state satisfies P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0), i.e., the next state depends only on the current state and action, not on the earlier history Advanced AI Seminar, March 2003

  22. Reinforcement Learning (cont) • In an MDP the environment is represented by a 4-tuple <S, A, P, r> • S is a set of states • A is a set of actions • P gives the state-transition probabilities P_xy(a) • r gives the expected immediate reward r(x, a) • Each agent maintains a policy π that maps states into desirable actions Advanced AI Seminar, March 2003
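
A minimal sketch of how such a 4-tuple could be encoded (Python; the names MDP, State, Action and the tiny two-state example are illustrative assumptions, not part of the talk):

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]                                # S
    actions: List[Action]                              # A
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P_xy(a): transition probabilities
    r: Callable[[State, Action], float]                # r(x, a): expected immediate reward

# Tiny two-state example: the action "go" moves between the two states,
# and staying in x1 yields a reward of 1.
example = MDP(
    states=["x0", "x1"],
    actions=["stay", "go"],
    P={
        ("x0", "stay"): {"x0": 1.0},
        ("x0", "go"):   {"x1": 1.0},
        ("x1", "stay"): {"x1": 1.0},
        ("x1", "go"):   {"x0": 1.0},
    },
    r=lambda x, a: 1.0 if (x, a) == ("x1", "stay") else 0.0,
)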

  23. Q-Learning Algorithm • Reinforcement Learning algorithm • Maintains a table of Q-values • Q(x,a) – “how good is action a in state x?” • Converges to the optimal Q-values with probability 1 Advanced AI Seminar, March 2003

  24. Q-Learning Algorithm (cont) • At step n the agent performs the following steps: • Observe its current state x_n • Select and perform an action a_n • Observe the subsequent state y_n • Receive an immediate payoff r_n • Adjust its Q_{n-1} values Advanced AI Seminar, March 2003

  25. Discounted Sum of Future Rewards • Q-Learning finds an optimal policy that maximizes the total discounted expected reward • Discounted reward – a reward received s steps hence is worth less than a reward received now by a factor of γ^s Advanced AI Seminar, March 2003
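
Written out as a formula (standard notation, assumed rather than copied from the slides), the quantity Q-Learning maximizes is the expected discounted return

R_t = E\left[ \sum_{s=0}^{\infty} \gamma^{s} \, r_{t+s} \right], \qquad 0 \le \gamma < 1,

so a reward arriving s steps in the future is weighted by γ^s, matching the factor above.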

  26. Evaluating the Policy • Under a policy π the value of state x is: V^π(x) = r(x, π(x)) + γ Σ_y P_xy(π(x)) V^π(y) • The optimal policy π* satisfies: V^π*(x) = max_a [ r(x, a) + γ Σ_y P_xy(a) V^π*(y) ] Advanced AI Seminar, March 2003

  27. Q-Values • Under a policy π define the Q-values as: Q^π(x, a) = r(x, a) + γ Σ_y P_xy(a) V^π(y) • That is, the expected return of executing action a in state x and following policy π thereafter Advanced AI Seminar, March 2003

  28. Adjusting Q-Values • Update the Q-values as follows: • Q_n(x, a) = (1 - α_n) Q_{n-1}(x, a) + α_n [ r_n + γ V_{n-1}(y_n) ] if x = x_n and a = a_n • Q_n(x, a) = Q_{n-1}(x, a) otherwise • Where V_{n-1}(y) = max_b Q_{n-1}(y, b) and α_n is the learning rate Advanced AI Seminar, March 2003
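
A compact sketch of this update loop in Python (the environment interface env.reset()/env.step() and the epsilon-greedy action choice are assumptions for illustration, not part of the original algorithm description):

import random
from collections import defaultdict

# Q-table: Q[(x, a)] is the current estimate of how good action a is in state x.
Q = defaultdict(float)

def q_update(x, a, r, y, actions, alpha=0.1, gamma=0.9):
    # Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (r + gamma * max_b Q(y,b));
    # all other table entries are left unchanged, as on the slide above.
    v_y = max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * v_y)

def choose_action(x, actions, epsilon=0.1):
    # Epsilon-greedy exploration: one common choice, the slides do not fix one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x, a)])

def run_episode(env, actions):
    # One learning episode against a hypothetical environment with
    # env.reset() -> state and env.step(x, a) -> (next_state, payoff, done).
    x = env.reset()
    done = False
    while not done:
        a = choose_action(x, actions)   # select and perform an action
        y, r, done = env.step(x, a)     # observe next state, receive payoff
        q_update(x, a, r, y, actions)   # adjust the Q-values
        x = y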

  29. Isolated, Concurrent Reinforcement Learners • Reinforcement learners develop action selection policies that optimize environmental feedback • Can be used in domains • With no pre-existing domain expertise • With no information about other agents • RL can provide new coordination techniques where currently available coordination schemes are ineffective Advanced AI Seminar, March 2003

  30. Isolated, Concurrent Reinforcement Learners • Each agent learns to optimize its reinforcement from the environment • Other agents are not explicitly modeled • An interesting research question is whether it is feasible for such an agent to use the same learning mechanism in both cooperative and non-cooperative environments Advanced AI Seminar, March 2003

  31. Isolated, Concurrent Reinforcement Learners • An assumption of most RL techniques is that the dynamics of the environment are not affected by other agents • This assumption is invalid in domains with multiple, concurrent learners • Standard RL is probably not adequate for concurrent, isolated learning of coordination Advanced AI Seminar, March 2003

  32. Isolated, Concurrent Reinforcement Learners • The following dimensions were identified to characterize domains amenable to CIRL (concurrent, isolated reinforcement learning): • Agent coupling (tightly/loosely) • Agent relationships (cooperative/adversarial) • Feedback timing (immediate/delayed) • Optimal behavior combinations Advanced AI Seminar, March 2003

  33. Experiments with CIRL • Conclusions: • Through CIRL both friends and foes can concurrently acquire useful coordination information • No prior knowledge of the domain is needed • No explicit model of the capabilities of other agents is required • Limitations: • Inability to develop effective coordination when agents are strongly coupled, feedback is delayed, and there are only a few optimal behavior combinations Advanced AI Seminar, March 2003

  34. Experiments with CIRL • A possible fix to the last limitation is “lock-step learning”: • Two agents synchronize their behavior so that one is learning while the other is following a fixed policy and vice versa Advanced AI Seminar, March 2003
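
A rough sketch of the lock-step idea in Python (the agent and environment interfaces here are assumed for illustration, not taken from the experiments): the two agents alternate fixed-length phases in which exactly one of them updates its policy while the other keeps its policy frozen.

def lock_step_training(agent_a, agent_b, env, phases=10, episodes_per_phase=100):
    # Alternate learning phases: while one agent learns, the other follows a
    # fixed policy; the roles are swapped at the end of each phase.
    # `agent.learning` is a hypothetical flag read by the agent's update code.
    learner, follower = agent_a, agent_b
    for _ in range(phases):
        learner.learning, follower.learning = True, False
        for _ in range(episodes_per_phase):
            env.run_episode(learner, follower)   # hypothetical joint episode
        learner, follower = follower, learner    # swap roles for the next phase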

  35. Interactive Reinforcement Learning of Coordination • Agents can explicitly communicate to decide on individual and group actions • A few algorithms for interactive RL: • Action Estimation Algorithm • Action Group Estimation Algorithm Advanced AI Seminar, March 2003

  36. Agenda • What is learning in MAS? • General Characterization • Learning and Activity Coordination • Learning about and from Other Agents • Learning and Communication • Conclusions Advanced AI Seminar, March 2003

  37. Learning about and from Other Agents • Agents learn to improve their individual performance • Better capitalize on available opportunities by predicting the behavior of other agents (preferences, strategies, intentions, etc…) Advanced AI Seminar, March 2003

  38. Learning Organizational Roles • Assume agents have the capability of playing one of several roles in a situation • Agents need to learn role assignments to effectively complement each other Advanced AI Seminar, March 2003

  39. Learning Organizational Roles • The framework includes Utility, Probability and Cost (UPC) estimates, plus a Potential measure, of a role adopted in a particular situation • Utility – the worth of the desired final state if the agent adopted the given role in the current situation • Probability – the likelihood of reaching a successful final state (given role/situation) • Cost – the associated computational cost incurred • Potential – the usefulness of a role in discovering pertinent global information Advanced AI Seminar, March 2003

  40. Learning Organizational Roles: Theoretical Framework • S_k and R_k – the sets of situations and roles available to agent k • An agent maintains vectors of UPC estimates, one for every situation-role pair • During the learning phase: • the agent rates a role by combining the component measures Advanced AI Seminar, March 2003

  41. Learning Organizational Roles: Theoretical Framework • After the learning phase is over, the role to be played in situation s is the one in R_k with the highest combined UPC rating • UPC values are learned using reinforcement learning • The UPC estimates after n updates are denoted with superscript n (e.g., U^n_{s,r}) Advanced AI Seminar, March 2003

  42. Learning Organizational Roles: Updating the Utility • S – the situations encountered between the time of adopting role r in situation s and reaching a final state F with utility U_F • The utility values for all roles chosen in each of the situations in S are updated toward U_F: U^{n+1}_{s,r} = (1 - α) U^n_{s,r} + α U_F Advanced AI Seminar, March 2003

  43. Learning Organizational Roles: Updating the Probability • Let O(F) be 1 if the final state F is successful and 0 otherwise • The update rule for probability: P^{n+1}_{s,r} = (1 - α) P^n_{s,r} + α O(F) Advanced AI Seminar, March 2003

  44. Learning Organizational Roles: Updating the Potential • Let I(F) be 1 if, on the path to the final state, conflicts are detected and resolved by information exchange, and 0 otherwise • The update rule for potential: Potential^{n+1}_{s,r} = (1 - α) Potential^n_{s,r} + α I(F) Advanced AI Seminar, March 2003
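
To make the UPC bookkeeping concrete, here is a hedged Python sketch (the particular rating function and the simple weighted updates are assumptions chosen to mirror the update rules above, not the paper's exact formulas): each agent keeps per-(situation, role) estimates, rates roles by combining the components, and nudges each estimate toward the observed outcome after every episode.

from collections import defaultdict

# Per-(situation, role) estimates: Utility, Probability, Cost, Potential.
upc = defaultdict(lambda: {"U": 0.0, "P": 0.5, "C": 0.0, "Pot": 0.0})

def rate(s, r):
    # Combine the component measures into one score; this particular
    # combination (U*P - C + Pot) is only an illustrative choice.
    e = upc[(s, r)]
    return e["U"] * e["P"] - e["C"] + e["Pot"]

def choose_role(s, roles):
    # After the learning phase, play the highest-rated role for situation s.
    return max(roles, key=lambda r: rate(s, r))

def update(path, final_utility, success, resolved_conflict, alpha=0.1):
    # Reinforcement-style updates for every (situation, role) pair on the
    # path to the final state, moving each estimate toward the outcome.
    # (Cost updates are domain specific and omitted here.)
    for (s, r) in path:
        e = upc[(s, r)]
        e["U"] = (1 - alpha) * e["U"] + alpha * final_utility
        e["P"] = (1 - alpha) * e["P"] + alpha * (1.0 if success else 0.0)
        e["Pot"] = (1 - alpha) * e["Pot"] + alpha * (1.0 if resolved_conflict else 0.0)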

  45. Learning Organizational Roles: Robotic Soccer Game • Most implementations of robotic soccer teams use the approach of learning organizational roles • Use a layered learning methodology: • Low-level skills (e.g. shoot the ball) • High-level decision making (e.g. whom to pass to) Advanced AI Seminar, March 2003

  46. Learning in Market Environments • Buyers and sellers trade in electronic marketplaces • Three types of agents: • 0-level agents: don't model the behavior of others • 1-level agents: model others as 0-level agents • 2-level agents: model others as 1-level agents Advanced AI Seminar, March 2003

  47. Learning to Exploit an Opponent: Model-Based Approach • The most prominent approach in AI for developing playing strategies is the minimax algorithm • It assumes that the opponent will always choose the move that is worst for the player • An accurate model of the opponent can be used to develop better strategies Advanced AI Seminar, March 2003

  48. Learning to Exploit an Opponent: Model-Based Approach • The main problem of RL is its slow convergence • The model-based approach tries to reduce the number of interaction examples needed for learning • It performs a deeper analysis of past interaction experience Advanced AI Seminar, March 2003

  49. Model-Based Approach • The learning process is split into two separate stages: • Infer a model of the other agent based on past experience • Utilize the learned model to design an effective interaction strategy for the future Advanced AI Seminar, March 2003

  50. Inferring a Best-Response Strategy • Represent the opponent’s model as a DFA (deterministic finite automaton) • Example: the TFT (Tit-for-Tat) strategy for the IPD (Iterated Prisoner’s Dilemma) game • Theorem: given a DFA opponent model, there exists a best-response DFA that can be computed in time polynomial in the size (number of states) of the model Advanced AI Seminar, March 2003
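
As a small illustration of a DFA opponent model (an assumed encoding for this page, not code from the talk), Tit-for-Tat in the Iterated Prisoner's Dilemma is a two-state automaton whose next move simply echoes the last move it observed from us:

# Tit-for-Tat as a DFA over our moves {"C", "D"}: the state records the last
# move it observed, and in a given state TFT plays that same move.
TFT = {
    "start": "C",                   # TFT cooperates on the first round
    "transition": {                 # (state, our move) -> next state
        ("C", "C"): "C",
        ("C", "D"): "D",
        ("D", "C"): "C",
        ("D", "D"): "D",
    },
    "output": {"C": "C", "D": "D"}, # in state s, TFT plays s
}

def tft_move(state):
    # The move TFT makes in its current state.
    return TFT["output"][state]

def tft_next(state, our_move):
    # The DFA's next state after observing our move.
    return TFT["transition"][(state, our_move)]

Against such a model, finding a best response reduces to planning over the model's finitely many states, which is what the polynomial-time theorem above refers to.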
