Towards Self-Adaptive MAS Fault-Tolerant MAS

Zahia Guessoum • OASIS (Objects and Agents for Simulation and Information Systems) • LIP6 (Laboratoire d'Informatique de Paris 6) • Zahia.Guessoum@lip6.fr • http://www-poleia.lip6/~guessoum

Outline • Towards Self-Adaptive MAS • Fault-Tolerant MAS • Motivations • Multi-agent Architecture • Agent Criticality • Resources Management • Experiments • Conclusions and Future Work

Self-Adaptive MAS • Design and control of dynamic and complex systems • Characteristics • Open, distributed, large scale • Dynamic environment • Limited resources • Several users • … • Existing solutions: adaptive multi-agent systems • adaptation of the internal agent structure or behavior

Self-Adaptive MAS • Limits of Adaptive Multi-Agent Systems • emergence of global undesirable behaviors • Our Solution: Self-Adaptive Multi-Agent Systems [JFSMA’00] • An intelligent system must be able to observe its own behavior [Pitrat 90] • Monitor the system to detect, and when possible to anticipate, undesirable behaviors • Reification of the needed system’s aspects to detect or anticipate undesirable conditions • Examples: Interdependence graph, roles

Self-Adaptive MAS • Adaptation • Micro level (agent) • How to adapt the structure and behavior of an agent according to the evolution of its environment? • Macro level (organization) • How to detect, and when possible to anticipate, global undesirable behaviors? • Models and Architectures: Adaptive Agents and Adaptive Multi-Agent Systems

Self-Adaptive MAS • Adaptive Multi-Agent Models and Architectures • Software Engineering • Reflective Architectures • MDA • Distributed Systems • Replication • Artificial Intelligence • Learning techniques • Ontologies

Main projects • Research projects • Adaptive Agent architecture [Objet’98] [IEEE’99] [AAMAS-AISB’04] [ALAMAS’05] • Self-Adaptive MAS [JFIADSMA’00] [IEEE DS’04] [AAMAS’04] • Fault-Tolerant MAS [SELMAS’03] [AAMAS’03] [AAMAS’04] [SELMAS’05] • Meta-DIMA: MDA-based multi-agent engineering methodology [JFIADSMA’03] • … • Applications • Simulation of Economics Models (Firms and Organizational Forms) [AAMAS’04bis] [ALAMAS’05] [CEEMAS’05] [EA’05] • COGents: Agent-Based Architecture For Numerical Simulation [E-work’02] [ICAP’03] • … • PhD Students • Lilia Rejeb, Ana B. Gonzalez, Othmane Nadjemi, Nora Faci, Tarek Jarraya, Beiting Zhu, David Julien (PhD, Dec. 2004)

Outline • Fault-Tolerant MAS • Motivations • Multi-agent Architecture • Agent Criticality • Resources Management • Implementation and Experiments • Conclusions and Future Work • Team • MAS: Z. Guessoum (LIP6), S. Aknine (LIP6), J-P Briot (CNRS, LIP6), N. Faci (CReSTIC, Reims), A. Suna-Elmeida (LIP6), J. Malenfant (LIP6) • Distributed Systems: P. Sens (LIP6 – INRIA), M. Bertier (LIP6), O. Marin (LIP6)

Fault-Tolerant MAS • Large-scale multi-agent systems • Physically distributed • Limited resources • Dynamic environment • Types of failures • Software (bugs, deadlocks, ...) • Hardware (network links, machines, ...) • How to avoid failures?

Fault, Error, Failure A failure occurs when a running system deviates from its specified behavior. The cause of a failure is called an error. An error is an invalid system state, one that the system behavior specification does not allow. The error is itself the result of a defect in the system, called a fault. A fault is therefore the root cause of a failure, and an error is merely the symptom of a fault. A fault does not necessarily result in an error.

Fault Classifications • Based on how a failed component behaves once it has failed, faults can be classified into four categories: crash, omission, timing, or Byzantine. • Crash faults: the component either completely stops operating or never returns to a valid state; • Omission faults: the component completely fails to perform its service; • Timing faults: the component does not complete its service on time; • Byzantine faults: these are faults of an arbitrary nature.

Replication • Existing solution: Replication strategies • Replication of data and/or computation is an effective way to achieve fault tolerance in distributed systems. • A replicated software component is defined as a software component that possesses a representation on two or more hosts. • Distributed applications: • Small number of components • Component criticality is static • … • The number of replicas and the replication strategy are explicitly and statically defined by the designer before run time

Agent Replication • Simple MAS: • Small number of agents • Static organizational structures • … The agent criticality may be statically defined by the designer before run time • Complex MAS: • Adaptive agents • Large scale • Dynamic and adaptive organizational structures • … The agent criticality (and hence the number of replicas and the replication strategy) cannot be explicitly and statically defined by the designer before run time

Dynamically and automatically apply replication mechanisms where (to which agents) and when they are most needed.

Dynamic Replication • DarX: a new replication framework • http://www-src.lip6.fr/darx/ • Large-scale distributed systems • Replication mechanisms • Several replication strategies (active, passive, hybrid…) • Dynamic replication: the number of replicas and the replication strategy can be changed at run time • Observation mechanisms • Fault detection/recovery mechanisms • Encapsulation of the system tasks into the replication group • Replication is transparent to the other agents • Replication mechanisms are attached to the replication groups, not to the DarX servers • …

Dynamic Replication • Replication Group (RG) • [Figure: replication group A with members A.l, A.r1, and A.r2 spread over three DARX locations; labels: "Consistency information + Replication policy for group A", "Strategy s1", "Strategy s2".]

Automatic Replication • Adaptive Replication Mechanism • Which agents need to be replicated, and when? • How many replicas? • Where?

Adaptive Control of Replication • Hypothesis and principles • Automatic mechanisms • Some prior inputs from the designer of the application • Agents can be either reactive or deliberative • Agents can be heterogeneous • Agents communicate with some ACL (FIPA, …) • Agent criticality relies on semantic-level information • Roles [Selmas’03] [AAMAS’02] • Interdependence graph [AAMAS’04] [Selmas’05] • …

Multi-Agent Architecture [AAMAS’04] • Micro component (agents) + Macro component (Interdependence graph)

Interdependence Graph • Analyzing an agent's dependences makes it possible to define its importance and the influence of its failure on the behavior and reliability of the multi-agent system. • The arcs are labeled with any information that may help detect or anticipate undesirable behaviors (such as agent failures). • [Figure: interdependence graph over Agent_i, Agent_j, and Agent_k, with a weighted arc w12 between nodes 1 and 2 and the annotation "More critical".]

Multi-Agent Architecture • Micro component (agents) + Macro component (Interdependence graph) + Distributed Monitors

Multi-Agent Architecture • [Figure: two-level architecture. Observation level: Monitor 1 to Monitor 4, one Host-Monitor per host (Host_i, Host_j), and the adaptation algorithm. Agents level: Agent 1 to Agent 4.]

Multi-Agent Architecture • Domain Agents • Represent the knowledge of the application domain • May have a perception of the interdependence graph • … • Monitor_i • Observe the domain agent Agent_i • Read the messages received from the host-monitor • Build/update the interdependences of Agent_i (Node_i) • Compute the criticality of Agent_i: $w_i = \mathrm{aggregation}(w_{ji},\ j = 1, \dots, m)$ • $w_{ji}$: its interdependences with agent_j • Inform the host-monitor of important local changes • …
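The criticality computation performed by each monitor can be sketched as below. The slide does not say which aggregation function is used, so the sum (default) and the mean shown here are assumptions, and `agent_criticality` is a hypothetical name, not part of DimaX.

```python
from statistics import mean
from typing import Callable, Dict, Iterable


def agent_criticality(
    interdependences: Dict[str, float],
    aggregate: Callable[[Iterable[float]], float] = sum,
) -> float:
    """Aggregate the interdependence weights w_ji of one agent into w_i.

    `interdependences` maps the name of agent_j to the weight w_ji.
    The aggregation function is an assumption (sum by default).
    """
    if not interdependences:
        return 0.0  # an agent with no known interdependences is not critical
    return float(aggregate(interdependences.values()))


# Hypothetical weights observed by Monitor_1 for Agent_1.
weights_for_agent_1 = {"agent_2": 0.5, "agent_3": 1.5, "agent_4": 1.0}
w_sum = agent_criticality(weights_for_agent_1)
w_mean = agent_criticality(weights_for_agent_1, aggregate=mean)
```

Passing the aggregation as a parameter keeps the sketch neutral about how interdependences are combined.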

Multi-Agent Architecture • Interdependences Adaptation • Algorithm 1: number of messages (or communication load) • Let $NbM_{ij}(\Delta t)$ be the number of messages sent by agent_i to agent_j during some interval of time $\Delta t$ • Let $NbM(\Delta t)$ be the average number of messages between couples of agents (i, j) • $w_{ij}(t + \Delta t) = w_{ij}(t) + \dfrac{NbM_{ij}(\Delta t) - NbM(\Delta t)}{NbM(\Delta t)}$
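A minimal sketch of the Algorithm 1 update follows. It assumes that the average is taken over the agent pairs observed during the interval and that weights stay unchanged when no message was exchanged; neither detail is stated on the slide, and the function name is hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (sender agent_i, receiver agent_j)


def update_weights_by_message_count(
    weights: Dict[Pair, float],
    message_counts: Dict[Pair, int],
) -> Dict[Pair, float]:
    """Apply w_ij <- w_ij + (NbM_ij - NbM) / NbM for one observation interval."""
    if not message_counts:
        return dict(weights)
    # NbM: average number of messages over the observed pairs (assumption).
    average = sum(message_counts.values()) / len(message_counts)
    if average == 0:
        return dict(weights)  # no traffic: leave the weights unchanged (assumption)
    updated = dict(weights)
    for pair, count in message_counts.items():
        updated[pair] = updated.get(pair, 0.0) + (count - average) / average
    return updated


weights = {("a1", "a2"): 1.0, ("a2", "a3"): 1.0}
counts = {("a1", "a2"): 30, ("a2", "a3"): 10}
new_weights = update_weights_by_message_count(weights, counts)
```

With an average of 20 messages, the pair that sent 30 gains 0.5 and the pair that sent 10 loses 0.5, so arcs with above-average traffic grow heavier.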

Interdependences Adaptation • Algorithm 2: performatives of messages [M. Colombetti and M. Verdicchio] • M. Colombetti and M. Verdicchio proposed six classes • class 1 = request, query-if, query-ref, subscribe • class 2 = inform, inform-done, inform-ref • class 3 = cfp, propose • class 4 = reject-proposal, refuse, cancel • class 5 = accept-proposal, agree • class 6 = not-understood, failure • Let $S_{ij}(\Delta t)$ be the set of messages sent by agent_i to agent_j during some interval of time $\Delta t$ • Let $WM_{ij}(\Delta t) = \sum_{m \in S_{ij}(\Delta t)} \mathrm{weight}(m)$ • Let $WM(\Delta t)$ be the average sum of the weights of messages between couples of agents (i, j) • $w_{ij}(t + \Delta t) = w_{ij}(t) + \dfrac{WM_{ij}(\Delta t) - WM(\Delta t)}{WM(\Delta t)}$
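The performative-based update can be sketched in the same way. The class membership comes from the slide, but the numeric weight of each class is not given there, so the `CLASS_WEIGHT` values below are purely hypothetical, as are the function names.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (sender agent_i, receiver agent_j)

# Performative classes as listed on the slide.
PERFORMATIVE_CLASS = {
    "request": 1, "query-if": 1, "query-ref": 1, "subscribe": 1,
    "inform": 2, "inform-done": 2, "inform-ref": 2,
    "cfp": 3, "propose": 3,
    "reject-proposal": 4, "refuse": 4, "cancel": 4,
    "accept-proposal": 5, "agree": 5,
    "not-understood": 6, "failure": 6,
}

# Hypothetical weight per class; the slide does not give these values.
CLASS_WEIGHT = {1: 1.0, 2: 0.5, 3: 1.0, 4: 0.25, 5: 2.0, 6: 0.25}


def weight(performative: str) -> float:
    """weight(m) for a message m, looked up through its performative class."""
    return CLASS_WEIGHT[PERFORMATIVE_CLASS[performative]]


def update_weights_by_performatives(
    weights: Dict[Pair, float],
    messages: Dict[Pair, List[str]],
) -> Dict[Pair, float]:
    """Apply w_ij <- w_ij + (WM_ij - WM) / WM for one observation interval."""
    if not messages:
        return dict(weights)
    sums = {pair: sum(weight(p) for p in perfs) for pair, perfs in messages.items()}
    average = sum(sums.values()) / len(sums)  # WM over observed pairs (assumption)
    if average == 0:
        return dict(weights)
    updated = dict(weights)
    for pair, total in sums.items():
        updated[pair] = updated.get(pair, 0.0) + (total - average) / average
    return updated


observed = {
    ("a1", "a2"): ["request", "accept-proposal"],
    ("a2", "a3"): ["inform", "inform"],
}
by_performative = update_weights_by_performatives(
    {("a1", "a2"): 1.0, ("a2", "a3"): 1.0}, observed
)
```

Compared with counting messages, this variant lets two pairs with the same traffic end up with different weights when their messages carry different performatives.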

Multi-Agent Architecture • Monitors • [Figure: monitor structure, with the labels System Statistics, Activity Analysis, Interdependence Analysis & Role, Interaction Events, Observation, Agent’s Criticality, Replication, Domain Agents, DarX Server.]

Multi-Agent Architecture • Host-Monitors • Build global information • Read messages received from the monitors • Update the local statistics, which aggregate the parameters reported to the host-monitor • Send the new parameters to the agent monitors of the local host • Send to the other host-monitors the observed parameters that have changed significantly • Resource management • Allocate replicas to agents

Resource Management • Number of replicas • An agent is replicated according to: • $w_i$: its criticality • $W$: the sum of the domain agents' criticality • $r_m$: the minimum number of replicas, set by the designer • $R_m$: the maximum number of possible simultaneous replicas • $nb_i = \mathrm{round}\left(r_m + \dfrac{w_i \cdot R_m}{W}\right)$ • Problem: • All the resources are considered as similar. For instance, the failure rate of a host is not considered. • The hosts are not easy to choose
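The replica-count formula can be sketched as follows. The slide does not specify the rounding mode; Python's built-in `round` rounds ties to the nearest even integer, which is an assumption here, as is returning the minimum when the total criticality is zero. The function name and the sample values are hypothetical.

```python
from typing import Dict


def replica_count(w_i: float, total_w: float, r_min: int, r_max: int) -> int:
    """nb_i = round(r_m + w_i * R_m / W)."""
    if total_w <= 0:
        return r_min  # no criticality information: fall back to the minimum (assumption)
    return round(r_min + w_i * r_max / total_w)


# Hypothetical criticalities for three domain agents.
criticality: Dict[str, float] = {"a1": 6.0, "a2": 3.0, "a3": 1.0}
total = sum(criticality.values())
replicas = {
    name: replica_count(w, total, r_min=1, r_max=4)
    for name, w in criticality.items()
}
```

Every agent receives at least `r_min` replicas, and the remaining `R_m` are shared roughly in proportion to criticality.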

Resource Management • Our Solution: Economic Model • Resource cost, budget, and negotiation behaviors of the host-monitors to allocate resources. • Cost of a resource • $CM_i(t) = CM_i(t_0) \cdot (1 - pp_i(t))$ • $pp_i(t)$ is the failure probability of host_i at time t • $CM_i(t_0)$ is the initial cost of host_i • $pp_i(t_0) = 0$ • The budget is based on the criticality • $B_j(t) = W_j(t) \cdot CM(t) / W(t)$ • $W(t) = \sum_{i=1}^{n} W_i(t)$ • $CM(t) = \sum_{i=1}^{m} CM_i(t) \cdot Nb_i$ • where n is the number of agents, m is the number of hosts, and $Nb_i$ is the number of resources of host_i • How many replicas, and where (on which hosts)? • If $B_j(t + \Delta t) > B_j(t)$ then allocate new resources • Simple negotiation between Host-Monitors • If $B_j(t + \Delta t) < B_j(t)$ then cancel some allocated resources
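The cost and budget formulas can be illustrated with a short sketch. It assumes that the allocation rule compares the newly computed budget of an agent with its previous one; all function names, the `Host` type, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    initial_cost: float         # CM_i(t0)
    failure_probability: float  # pp_i(t), with pp_i(t0) = 0
    resources: int              # Nb_i


def resource_cost(host: Host) -> float:
    """CM_i(t) = CM_i(t0) * (1 - pp_i(t)): less reliable hosts are cheaper."""
    return host.initial_cost * (1.0 - host.failure_probability)


def total_cost(hosts: List[Host]) -> float:
    """CM(t) = sum over hosts of CM_i(t) * Nb_i."""
    return sum(resource_cost(h) * h.resources for h in hosts)


def budget(w_j: float, total_w: float, cm: float) -> float:
    """B_j(t) = W_j(t) * CM(t) / W(t)."""
    return w_j * cm / total_w


def decision(previous_budget: float, new_budget: float) -> str:
    """Assumed rule: a growing budget allocates, a shrinking one releases."""
    if new_budget > previous_budget:
        return "allocate"
    if new_budget < previous_budget:
        return "release"
    return "keep"


hosts = [Host(10.0, 0.2, 2), Host(10.0, 0.0, 1)]
cm = total_cost(hosts)             # 8.0 * 2 + 10.0 * 1
b1 = budget(3.0, 4.0, cm)          # agent with criticality 3 out of W = 4
b1_later = budget(1.0, 4.0, cm)    # same agent after its criticality dropped
```

The budgets of all agents sum to `CM(t)`, so the available resources are shared in proportion to criticality.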

Resource Management • Contract Net Protocol • Initiator: Host-Monitor • Participants: other Host-Monitors • Evaluation criteria • Communication time between the two hosts • Resource cost • [Figure: sequence diagram between an Agent-Monitor and two Host-Monitors, with the messages request, call for proposal, propose, accept proposal, reject proposal.]
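The initiator's evaluation step can be sketched as below. The slide names the two criteria but not how they are combined, so the weighted sum, the budget filter, and every name here are assumptions rather than the DimaX implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Proposal:
    host: str                  # participant Host-Monitor making the offer
    communication_time: float  # time between the initiator's host and this host
    resource_cost: float       # cost of one resource on this host


def evaluate(
    proposals: List[Proposal],
    budget: float,
    time_weight: float = 1.0,
    cost_weight: float = 1.0,
) -> Tuple[Optional[Proposal], List[Proposal]]:
    """Return (accepted proposal or None, rejected proposals).

    Proposals costing more than the budget are rejected outright; among the
    rest, the lowest weighted sum of the two criteria wins (assumption).
    """
    affordable = [p for p in proposals if p.resource_cost <= budget]
    if not affordable:
        return None, list(proposals)
    best = min(
        affordable,
        key=lambda p: time_weight * p.communication_time + cost_weight * p.resource_cost,
    )
    rejected = [p for p in proposals if p is not best]
    return best, rejected


received = [
    Proposal("host_b", communication_time=2.0, resource_cost=8.0),
    Proposal("host_c", communication_time=1.0, resource_cost=12.0),
    Proposal("host_d", communication_time=5.0, resource_cost=4.0),
]
accepted, rejected = evaluate(received, budget=10.0)
```

The initiator would then send accept-proposal to the chosen Host-Monitor and reject-proposal to the others.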

Implementation • DimaX: A Fault-Tolerant Multi-Agent Platform • Various services (naming service, fault detection, replication, …) • Agent monitors and host-monitors • … • [Figure: DimaX architecture, with the labels Adaptive Replication Control, Observation, DIMA Agents, Adaptor, DarX, Replication, Naming/Localization, Failure Detection (FD).]

Experiments • Example: Personal assistant agents • Interact with the user to receive meeting requests and associated information (a title, a description, possible dates, participants, priority, etc.) • Interact with the other agents of the system to schedule meetings

Experiments • Monitoring cost • N (100, ... 250) agents • N/20 hosts • Two kinds of experiments • without monitoring • with monitoring • With Algorithm 1 • With Algorithm 2

Experiments • Monitoring cost

Experiments • Experiments • Previous protocol • Periods of monitoring (500, 1500, 2500)

Experiments • Robustness • 100 agents on 10 machines • Failure simulator: randomly stops the thread of an agent • Scenario • 50 meetings • Goal of the MAS: Schedule the 50 meetings • Rate of successful simulations • Number of simulations which did not fail / total number of simulations • 4 replication approaches • Random • Roles • Algorithm 1: Number of messages • Algorithm 2: Performatives

Experiments Robustness

Conclusions and Future Work • A new fault-tolerant multi-agent platform (DimaX) • Based on DIMA and DarX • A new approach to evaluate dynamically the criticality of agents • Small applications have been developed (meetings scheduling …) • Algorithms to define interdependence • Messages • ACL messages • Domain task dependences • Other categories of faults • Timing, Byzantine (Master of Parjineh) • More experiments • To validate the proposed approach • To better identify: • the potential target application domains (load balancing …) • the domains for which the approach is not suited

Related publications (see www-poleia.lip6.fr/~guessoum/Papers.html) • Z. Guessoum, N. Faci and J.-P. Briot. Adaptive Replication of Large-Scale Multi-Agent Systems - Towards a Fault-Tolerant Multi-Agent Platform. In Proc. ICSE'05, 4th International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'05), to appear in ACM, Saint-Louis (US), May 2005. • Z. Guessoum, M. Ziane and N. Faci. Monitoring and Organizational-Level Adaptation of Multi-Agent Systems. Third International Joint Conference on Autonomous Agents and Multi-Agents Systems (AAMAS’04), ACM, pp. 514-522, New York City, July 2004. • Z. Guessoum, J.-P. Briot, O. Marin, A. Hamel and P. Sens. Dynamic and Adaptative Replication for Large-Scale Reliable Multi-Agent Systems. In Software Engineering for Large-Scale Multi-Agent Systems, Alessandro Fabricio Garcia (ed.), LNCS 2603, May 2003. • Z. Guessoum, J.-P. Briot, S. Charpentier, O. Marin and P. Sens. A Fault-Tolerant MultiAgent Framework. AAMAS 2002, July 15-19, 2002, Proceedings, pp. 672-673, ACM, 2002. • O. Marin, P. Sens, J.-P. Briot and Z. Guessoum. Towards Adaptive Fault-Tolerance for Distributed Multi-Agent Systems. ERSADS'2001, Bertinoro, Italy, May 2001.
