
Fault Tolerance in an Event Rule Framework for Distributed Systems



Presentation Transcript


  1. Fault Tolerance in an Event Rule Framework for Distributed Systems Hillary Caituiro Monge

  2. Contents • Introduction • Related Works • Overview of the Event Rule Framework (ERF) • Overview of Fault-Tolerant CORBA • Design of the Fault-Tolerant ERF (FT-ERF) • Performance Analysis • Conclusions

  3. Introduction • Justification • Distributed Systems (DS) • Fault Tolerance (FT) • Reactive Components (RC) • The Event Rule Framework (ERF) • Motivation • Objectives

  4. Distributed Systems (DS) • A DS is a • Collection of software components distributed among processors of heterogeneous platforms. • The purposes of a DS are: • Sharing resources and workload, and • Maximizing availability. • The design goals of DSs are: • Transparency, • Scalability, • Reliability, and • Performance.

  5. Fault Tolerance (FT) • FT is the ability of a system to continue operating as expected, despite internal or external failures. • DSs are prone to failures. • Some faults can be detected. • Others cannot. • The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.

  6. Reactive Components (RC) • Reactive components • React to external stimuli (i.e., events) • Initiate actions • An RC can be • Asynchronous or synchronous • Non-deterministic or deterministic • A reactive component that is both asynchronous and non-deterministic is an ANDRC.

  7. The Event Rule Framework (ERF) • An example of a DS framework having ANDRCs is ERF (Event/Rule Framework). • ERF • Was developed at the Center for Computing Research and Development of the University of Puerto Rico – Mayagüez Campus. • It is an event-rule framework for developing distributed systems. • In ERF, events and rules are used as abstractions for specifying system behavior.

  8. Motivation (1/2) • Achieving fault tolerance in ANDRCs is challenging. • In non-deterministic components: • The output can differ, • Even if the same sequence of stimuli is input with the same initial state. • Since the component is asynchronous: • Timing assumptions are not valid. • Moreover, ANDRC behavior is subject to an analogue of Heisenberg's uncertainty principle.

  9. Motivation (2/2) • Existing fault-tolerance techniques rely on: • Failure detectors • Timing assumptions • Synchronous or semi-synchronous systems • State-transfer protocols • Deterministic systems • Very intrusive mechanisms • Duplicate detection and suppression mechanisms • Sequencers

  10. Objectives (1/2) • This research addresses the use of active and semi-active replication techniques for achieving fault tolerance in ERF, a framework that uses ANDRCs. • Active replication technique • All replicated components accept third-party incoming events. • A middle-tier component is in charge of • Event multicasting, • Detecting and suppressing duplicated events.

  11. Objectives (2/2) • Semi-active replication technique • All replicated components accept third-party incoming events. • Only one (“the leader”) is able to post events. • Backup replicas listen to the leader so as to produce events consistently. • Each replicated component is in charge of the detection and suppression of duplicated events.

  12. Related Works • Generic support of FT in DSs • FT Event-Based DSs

  13. Generic support of FT in DSs • OMG Fault-Tolerant CORBA Standard

  14. FT Event-Based DSs

  15. Overview of the Event Rule Framework (ERF) • Model • Event Model • Rule Model • Behavioral Model • Components • Event Channel • RUBIES • Architecture of ERF-CORBA

  16. Event Model • ERF provides the event abstraction to represent significant occurrences in a distributed system (e.g., a flood alert system). • The base class Event defines the structure and behavior applicable to all types of events.

package erf;
import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
    /* Attributes */
    public String id = "";
    public TimeValue ttl;
    public TimeValue daytime;
    public DistributedObject producer;
    /* Methods */
    public TimeValue t() {...}
    public TimeValue ts() {...}
    public TimeValue ttl() {...}
    public void setttl(long tv) {...}
    public DistributedObject getProducer() {...}
    public void setProducer(DistributedObject producer) {...}
    public void sett(long tv) {...}
    public String pName() {...}
    public boolean isDead() {...}
    public String getTypeName() {...}
}

Figure 3.2 Java definition of the class Event
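To make the abstraction concrete, here is a minimal sketch of how an application-level event type might extend the Event base class. TimeValue is stubbed and the GageLevelReport payload field is invented for illustration; in ERF these types live in the erf and erf.lang packages, and the real Event also carries daytime and producer.

```java
// Minimal, self-contained sketch; not the actual ERF implementation.
import java.io.Serializable;

class TimeValue implements Serializable {        // stand-in for erf.lang.TimeValue
    final long millis;
    TimeValue(long millis) { this.millis = millis; }
}

class Event implements Serializable {            // simplified stand-in for erf.Event
    public String id = "";
    public TimeValue ttl;
    public String getTypeName() { return getClass().getSimpleName(); }
    public boolean isDead() { return ttl != null && ttl.millis < System.currentTimeMillis(); }
}

// Hypothetical concrete event, named after the GageLevelReport events used
// by the test application later in this presentation.
class GageLevelReport extends Event {
    public double levelMeters;                   // invented payload field
    GageLevelReport(String id, double levelMeters) {
        this.id = id;
        this.levelMeters = levelMeters;
    }
}

public class EventSketch {
    public static void main(String[] args) {
        GageLevelReport e = new GageLevelReport("gage-42", 3.7);
        System.out.println(e.getTypeName() + " " + e.id); // GageLevelReport gage-42
        System.out.println(e.isDead());                   // false: no ttl set
    }
}
```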

  17. Rule Model • In ERF, the behavior of a DS is defined in terms of rules. • A rule is an algorithm that is triggered when events in the event set match the rule's event pattern.

[package <package_specification>]
rule <rule_id> [priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]

Figure 3.5 Syntax of rule definition language (RDL)
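As an illustration of this syntax, a hypothetical rule might look like the following. The rule name, event types, condition, and action are invented for this sketch and are not taken from ERF's actual rule base:

```
rule FloodWarning priority 1
on GageLevelReport g1, GageLevelReport g2
if g1.level > 10 and g2.level > 10
then post(FloodAlert)
```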

  18. Model • Behavioral Model • Defines how rules are triggered and evaluated upon the occurrence of events. • Rules must be evaluated periodically because RUBIES receives events constantly. • The evaluation of rules is performed based on rule priority.

  19. Components (1/2) • Event Channel • Is a distributed middleware component. • It delivers events to consumers. • It receives events from producers. • Events are treated as objects.

  20. Components (2/2) • Rule Based Intelligent Event Service (RUBIES) • Is the main component of ERF. • It is an engine that handles events through the evaluation of rules. • RUBIES is a distributed component • It is registered to the event channel both as a consumer and as a producer.

  21. Architecture of ERF-CORBA Figure 3.8 Architecture of ERF-CORBA

  22. Overview of Fault-Tolerant CORBA (FT-CORBA) • Fault Tolerant CORBA (FT-CORBA) • Replication Management • Fault Management • Logging and Recovery Management

  23. Fault Tolerant CORBA • Adopted by the OMG in 2000. • It specifies commitments rather than a single solution. • It aims for full interoperability among different products. • It provides support for applications that require • High levels of reliability • With minimal modifications. • This research was designed to be compliant with this standard.

  24. Replication Management • Replication management covers a Fault Tolerant Domain. • It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory components. Figure 4.3 Hierarchy of the Replication Management

  25. Fault Management • It includes the Fault Notification, Fault Detection, and Fault Analysis services. • The Fault Notifier sends fault reports to its consumers. • The Fault Detectors are attached to replicas or hosts and report faults to the Fault Notifier. • The Fault Analyzer analyzes faults and produces reports for the Fault Notifier. Figure 4.8 Architecture of Fault Management

  26. Logging and Recovery Management • Logging mechanism • Logs the state of the primary member. • Recovery mechanism • Acts on failures or when new members join. • Recovers state from the log to the new primary. • Consistency must be controlled by the infrastructure.

  27. Design of the Fault-Tolerant ERF (FT-ERF) • Scalability and Fault Tolerance Problems in ERF CORBA • Architecture of Scalable and Fault-Tolerant ERF • Architecture of Fault-Tolerant ERF-CORBA • EID Uniqueness • Event and Pattern Equality Rules • Pattern Management • Active Replication • Semi-Active Replication

  28. Scalability and Fault Tolerance Problems in ERF CORBA Figure 5.1 Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database; (b) a crash of RUBIES.

  29. Figure 5.3 Architecture of scalable and fault-tolerant ERF: replicas RUBIES(γij, δi) arranged along a distribution dimension (i = 1…N) and a replication dimension (j = 1…M).

  30. Figure 5.3 Architecture of FT ERF-CORBA

  31. EID Uniqueness (1/2) • Each event in the system needs to be uniquely identified by an event identifier (EID). • EID uniqueness must be guaranteed in different contexts: • Local, replication group, system. • The use of sequencers is one option for achieving EID uniqueness: • Each replica starts a sequencer. • But this is only valid with deterministic components.

  32. EID Uniqueness (2/2) • Events can be identified by their history. • Each event is produced due to an event pattern. • Such a history includes: • The list of previous events that triggered the event. • The function or rule that caused its production. Figure 5.5 Conceptual View of the Event Unique Identification

  33. EVENT EQUALITY RULE • Two events are equal if: • Both are of the same Type. • Both were produced by the same Rule. • Both have the same Order of production at the time the Rule was triggered. • Both have the same Pattern.

  34. PATTERN EQUALITY RULE • Two event patterns are equal if: • Both have the same number of events. • Both have events in the same order. • Two events at the same position are equal if the Event Equality Rule, as previously defined, holds.
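The two equality rules above can be sketched in code over a minimal assumed event record. The fields (type, rule, order, pattern) mirror the four criteria; ERF's real event classes are richer than this.

```java
// Sketch of the Event Equality Rule and Pattern Equality Rule; the Ev record
// is an assumption for illustration, not ERF's actual Event class.
import java.util.*;

class Ev {
    final String type;      // event type
    final String rule;      // rule that produced the event
    final int order;        // order of production when the rule was triggered
    final List<Ev> pattern; // source events that triggered the rule

    Ev(String type, String rule, int order, List<Ev> pattern) {
        this.type = type; this.rule = rule; this.order = order; this.pattern = pattern;
    }

    // Event Equality Rule: same type, same rule, same order, equal patterns.
    boolean sameAs(Ev o) {
        return type.equals(o.type) && rule.equals(o.rule)
            && order == o.order && patternsEqual(pattern, o.pattern);
    }

    // Pattern Equality Rule: same length, pairwise-equal events in the same order.
    static boolean patternsEqual(List<Ev> a, List<Ev> b) {
        if (a.size() != b.size()) return false;
        for (int i = 0; i < a.size(); i++)
            if (!a.get(i).sameAs(b.get(i))) return false;
        return true;
    }
}

public class EqualitySketch {
    public static void main(String[] args) {
        Ev src = new Ev("GageLevelReport", "external", 0, List.of());
        Ev a = new Ev("FloodAlert", "R1", 1, List.of(src));
        Ev b = new Ev("FloodAlert", "R1", 1, List.of(src));
        Ev c = new Ev("FloodAlert", "R1", 2, List.of(src));
        System.out.println(a.sameAs(b)); // true
        System.out.println(a.sameAs(c)); // false: different production order
    }
}
```

Note that the comparison recurses through the pattern lists, so two events produced independently on different replicas compare equal exactly when their entire histories match.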

  35. Pattern Management (1/2) • Rules use a pattern management framework • To prevent events from being produced more than once for a given event pattern. • In this framework, patterns are defined in terms of: • Source events (i.e., events that cause rules to trigger) and • Target events (i.e., events that are produced by rules).

  36. Pattern Management (2/2) • The framework has three main components for pattern management: • Pattern Manager to manage patterns of events. • Pattern to store patterns of events. • Indexer to organize patterns of events. Figure 5.6 Architecture of Pattern Management

  37. Active Replication (AR) • For systems with tight time constraints. • All replicas run at the same time. • All accept events. • All send events. • So duplicated events circulate in the system. • Therefore, it is crucial • To detect and suppress duplicated events. • To deliver a unique reply. • To keep consistency. • To keep fault tolerance transparent.

  38. AR: Pattern Naming • For duplicated-event detection and suppression. • A centralized mid-tier component that, • Through an analysis of an event's history, • Detects whether the event has already been delivered. • It relies on two primitives: • Event binding • Registers an event. • Pattern solving • Resolves whether an equivalent event was already delivered.
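The two primitives can be sketched as a centralized deduplication table keyed by an event's history. The string key scheme below is an assumption for illustration; ERF's pattern naming operates on its richer event and pattern structures.

```java
// Sketch of event binding + pattern solving as one atomic table operation;
// the key(...) scheme is invented for this illustration.
import java.util.*;

class PatternNaming {
    // history key -> id of the first (and only delivered) equivalent event
    private final Map<String, String> delivered = new HashMap<>();

    // Canonical key built from the producing rule, the production order,
    // and the keys of the source events in the triggering pattern.
    static String key(String rule, int order, List<String> sourceKeys) {
        return rule + "#" + order + "#" + String.join(",", sourceKeys);
    }

    // Event binding: register the event under its history key.
    // Pattern solving: if an equivalent event was already delivered, return
    // its id so the caller can suppress this duplicate.
    synchronized String bindOrSolve(String eventId, String historyKey) {
        String existing = delivered.putIfAbsent(historyKey, eventId);
        return existing == null ? eventId : existing;
    }
}

public class PatternNamingSketch {
    public static void main(String[] args) {
        PatternNaming pn = new PatternNaming();
        String k = PatternNaming.key("R1", 1, List.of("e1", "e2"));
        System.out.println(pn.bindOrSolve("replicaA-7", k)); // replicaA-7: delivered
        System.out.println(pn.bindOrSolve("replicaB-3", k)); // replicaA-7: duplicate suppressed
    }
}
```

Because every active replica produces an equivalent event for the same pattern, only the first binding wins; later replicas learn the already-delivered id and suppress their copy.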

  39. AR: Pattern Naming Figure 5.9 Architecture of the Pattern Naming

  40. Semi-Active Replication (SAR) • For systems with relatively loose time constraints. • All replicas run at the same time. • Only the primary is able to reply to clients. • When the primary fails, a new primary is selected. • When a backup member fails, it is released from the group. • Failure detectors are used to detect failures in group members. • A time delay (in seconds) is applied before the selection of a new primary.

  41. SAR: Production Controller • For Duplicated-Events Detection and Suppression • It is distributed within each replica. • The following algorithm is executed on backup members.

On incoming event P from the primary:
• If queue BQ contains an event B equivalent to P, then
  • Update B.id with P.id across the entire system
  • Remove P
• Else
  • Enqueue P in PQ

On event B produced by the backup:
• If queue PQ contains an event P equivalent to B, then
  • Update B.id with P.id across the entire system
  • Remove P
• Else
  • Enqueue B in BQ

On failure, if the backup replica is elected as the new primary:
• Post all events in queue BQ
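The backup-side algorithm above can be sketched as follows. Events are reduced to an id plus a history key that stands in for the event/pattern equality test; the real ERF events and the system-wide id update are richer than this local model.

```java
// Sketch of the Production Controller run on each backup replica; class and
// field names are invented for this illustration.
import java.util.*;

public class ProductionControllerSketch {
    static final class Ev {
        String id;
        final String historyKey;   // stands in for the event's history
        Ev(String id, String historyKey) { this.id = id; this.historyKey = historyKey; }
    }

    final Deque<Ev> pq = new ArrayDeque<>(); // PQ: primary events with no local match yet
    final Deque<Ev> bq = new ArrayDeque<>(); // BQ: backup events not yet confirmed by primary

    // On incoming event P from the primary.
    void onPrimaryEvent(Ev p) {
        Ev b = removeEquivalent(bq, p);
        if (b != null) b.id = p.id;   // update B.id with P.id, discard P
        else pq.add(p);               // else enqueue P in PQ
    }

    // On event B produced by this backup.
    void onBackupEvent(Ev b) {
        Ev p = removeEquivalent(pq, b);
        if (p != null) b.id = p.id;   // update B.id with P.id, discard P
        else bq.add(b);               // else enqueue B in BQ
    }

    // On failover, the backup elected as new primary posts everything in BQ.
    List<Ev> onElectedPrimary() {
        List<Ev> toPost = new ArrayList<>(bq);
        bq.clear();
        return toPost;
    }

    static Ev removeEquivalent(Deque<Ev> q, Ev e) {
        for (Iterator<Ev> it = q.iterator(); it.hasNext(); ) {
            Ev x = it.next();
            if (x.historyKey.equals(e.historyKey)) { it.remove(); return x; }
        }
        return null;
    }

    public static void main(String[] args) {
        ProductionControllerSketch pc = new ProductionControllerSketch();
        Ev b1 = new Ev("backup-1", "R1#1");             // backup produces first
        pc.onBackupEvent(b1);
        pc.onPrimaryEvent(new Ev("primary-9", "R1#1")); // equivalent primary event arrives
        System.out.println(b1.id);                      // backup adopted the primary's id
    }
}
```

The two queues make the matching symmetric: whichever side produces the equivalent event first parks it, and the other side's arrival resolves the pair by adopting the primary's id.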

  42. SAR: Production Controller Figure 5.14 Architecture of the Production Controller

  43. Performance Analysis • Objectives • Methodology • Test Scenarios • Test Procedure • Test Results

  44. Objectives • Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for: • An increasing number of replicas. • An increasing number of failures. • An increasing workload. • Compare the execution time of: • Active versus semi-active replication techniques. • Failure-free versus failure execution scenarios. • Fault-tolerant versus non-fault-tolerant execution.

  45. Test Scenarios: Services distribution Figure 7.1 UML deployment diagram of the test environment. (The domain for all computers is ece.uprm.edu)

  46. Test Scenarios: Failure schedule: First scenario • Six workstations, • 3 to 8 replicas, • 193 rules. • Failure schedule defined by the power set F, where: • n is the number of replicas, • f(p = n) = ∞, • f(p = 1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic average of the execution times of ten failure-free runs with n replicas.

  47. Test Scenarios: Failure schedule: Second scenario • Ten workstations, • Ten replicas, • 193 rules. • Failure schedule defined by the set G, where: • n is the number of replicas, • g(p = n) = ∞, • g(p = 1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic average of the execution times of ten failure-free runs with n replicas.
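The failure-time function used in both scenarios can be sketched directly; the n and T values below are illustrative only, not the thesis measurements.

```java
// Sketch of the failure schedule f(p)/g(p) from the two scenarios above:
// p*T/n for p = 1..n-1, and no scheduled failure (infinity) for p = n,
// where T is the average failure-free execution time with n replicas.
public class FailureSchedule {
    static double failureTime(int p, int n, double avgRunSeconds) {
        if (p == n) return Double.POSITIVE_INFINITY; // last replica never fails
        return p * avgRunSeconds / n;                // failures evenly spaced over a run
    }

    public static void main(String[] args) {
        int n = 4;        // number of replicas (illustrative)
        double T = 120.0; // average failure-free run time in seconds (illustrative)
        for (int p = 1; p <= n; p++)
            System.out.println("replica " + p + " fails at t = " + failureTime(p, n, T));
    }
}
```

With these values the scheduled failures fall at 30, 60, and 90 seconds, i.e. evenly spread across an average run, while the last replica is never killed.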

  48. Test Scenarios: Failure schedule: Third scenario • Ten workstations, • Ten replicas. • Six rule sets of 6, 12, 24, 48, 96, and 193 rules. • The failure schedule was given by the function G(n) defined for the second scenario.

  49. Test Scenarios: Test application • A client that is both consumer and producer of the event channel. • It starts the test by sending two events of the GageLevelReport type. • It ends its execution when an event of the TestEventEnd type arrives. • It measures the execution time • Starting just after the second event is posted, and • Ending just after an event of the TestEventEnd type arrives.

  50. Methodology: Test Procedure • The procedure consisted of three major steps: • Clear the environment; • Launch the infrastructure; and • Run the test application. • The results are • The arithmetic mean of 10 runs of each test case. • The arithmetic mean of the standard deviations was 1.46%.
