By: Ashraf Amrou, Emad Ramadan, Subodh Keskar

CS775/875 Distributed Systems • Fall 2001 Course Project • DISRET: A Distributed System to Support a Retail-Chain • By: Ashraf Amrou, Emad Ramadan, Subodh Keskar

Overview • Introduction. • Goals. • System architecture. • Implementation. • Distributed Issues: Caching. • Distributed Issues: Replication. • Distributed Issues: Fault-tolerance. • Distributed Issues: Transactions. • Distributed Issues: Concurrency & Consistency. • Distributed Issues: Scalability. • Conclusion.

Introduction • DISRET is a distributed system to support a Retail-Store chain. • The system is hierarchically organized. • Hierarchy levels: • National Office • Regional Office • Zonal Office • Retail-Store

System Architecture • [Figure: hierarchy diagram with the labels Nation, Region, Zone, and RS (Retail-Store), plus Admin and Front End components.] • System architecture: 4 levels + Management + Clients

Architecture/Implementation: Components • Implementation using Java + Java RMI. • Main classes: • Node.java: - implements the functionality of a node at any level of the hierarchy. - provides the user interface for available operations. • Management.java: - manages unique IDs for node instances. - maintains the hierarchy. - provides a user interface for both management and application queries/updates.

Architecture/Implementation: Components Continued • Supporting classes: • Item.java: implements the Item. • ItemVector.java: a vector of Items. • Employee.java: implements the Employee. • EmployeeVector.java: a vector of Employees. • FileIO.java: file I/O operations that support the system. • UpdatesMissed.java: a vector of updates missed by children or replicas (can be used to help handle the problem of Partial Execution).
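As a rough illustration of how one node class can serve every level of the hierarchy over Java RMI, the sketch below has each office answer a query by summing the answers of its children. The names NodeService, StoreNode, and OfficeNode are hypothetical and are not taken from Node.java; the sketch also assumes that offices have no employees of their own and omits RMI export and registry lookup.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical remote interface; method names are illustrative only.
interface NodeService extends Remote {
    int employeeCount() throws RemoteException;      // Q2: number of employees
    double monthlyPayroll() throws RemoteException;  // Q3: total monthly payroll
}

// A Retail-Store answers from its own data.
class StoreNode implements NodeService {
    private final int employees;
    private final double payroll;

    StoreNode(int employees, double payroll) {
        this.employees = employees;
        this.payroll = payroll;
    }

    public int employeeCount() { return employees; }
    public double monthlyPayroll() { return payroll; }
}

// A Zonal, Regional, or National office aggregates the answers of its children.
// Assumption: an office contributes no employees or payroll of its own.
class OfficeNode implements NodeService {
    private final List<NodeService> children = new ArrayList<>();

    void addChild(NodeService child) { children.add(child); }

    public int employeeCount() throws RemoteException {
        int total = 0;
        for (NodeService child : children) {
            total += child.employeeCount();
        }
        return total;
    }

    public double monthlyPayroll() throws RemoteException {
        double total = 0.0;
        for (NodeService child : children) {
            total += child.monthlyPayroll();
        }
        return total;
    }
}
```

Because both classes implement the same interface, a parent does not need to know whether a child is a store or another office, which is what lets the same query travel down all four levels.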

Distributed Issues: Caching • Goal: caching should be utilized whenever possible to enhance performance. • Implemented: we implemented caching for Q2 (# employees) and Q3 (total monthly payroll). • Freshness of the cached data: cached data expires after periods that depend on the frequency of updates. • Extensions: other queries can utilize caching in the same way. • Comments: In our implementation, caching is used only when the nodes are not available. In real life, however, caching may also be used while the node is available, to improve performance further. This is especially beneficial in high-workload environments.
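The caching behaviour described above (results that expire, used only when the node cannot be reached) might look roughly like the following. ExpiringCache and its methods are hypothetical names, the expiry period is passed in by the caller, and the clock is injectable only so that expiry can be exercised without waiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of the DISRET sources.
class ExpiringCache {
    private static final class Entry {
        final double value;
        final long expiresAtMillis;

        Entry(double value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final LongSupplier clock; // e.g. System::currentTimeMillis

    ExpiringCache(LongSupplier clock) {
        this.clock = clock;
    }

    // Store a result that stays fresh for ttlMillis.
    synchronized void put(String key, double value, long ttlMillis) {
        entries.put(key, new Entry(value, clock.getAsLong() + ttlMillis));
    }

    // Return the cached value, or null if it is absent or has expired.
    synchronized Double get(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }

    // Ask the live node first; fall back to the cache only if the call fails.
    Double queryOrCached(String key, Callable<Double> liveQuery, long ttlMillis) {
        try {
            double fresh = liveQuery.call();
            put(key, fresh, ttlMillis);
            return fresh;
        } catch (Exception nodeUnavailable) {
            return get(key);
        }
    }
}
```

Serving from the cache even when the node is up, as the comment on this slide suggests for high-workload settings, would amount to checking get(key) before calling the live node.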

Distributed Issues: Replication • Goal: Replication should be utilized whenever possible. The main purpose is fault tolerance, along with higher availability. • Implemented: Passive Replication: 1) Supported for all levels. 2) Updates are sent to all replicas (eager approach). 3) A replica can be started for any node at any time. 4) Should the primary fail, the replica with the lowest ID is promoted (similar to the Bully algorithm). • Comments: - A high degree of replication causes more overhead. - It is highly recommended that replicas run on failure-independent machines (although this is not enforced).
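The promotion rule in point 4 can be sketched as follows. ReplicaGroup is a hypothetical class that only tracks which replica IDs are alive; failure detection and the eager propagation of updates are left out, and the sketch assumes the first replica added starts as the primary.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical bookkeeping for one replicated node; not taken from the DISRET sources.
class ReplicaGroup {
    // Replica ID -> alive flag, kept sorted so the lowest live ID is found first.
    private final TreeMap<Integer, Boolean> alive = new TreeMap<>();
    private Integer primaryId = null;

    // A replica can join at any time; it becomes primary only if there is none.
    synchronized void add(int id) {
        alive.put(id, true);
        if (primaryId == null) {
            primaryId = id;
        }
    }

    // Record a failure; if the primary failed, promote the lowest-ID live replica.
    synchronized void markFailed(int id) {
        if (!alive.containsKey(id)) {
            return;
        }
        alive.put(id, false);
        if (primaryId != null && primaryId == id) {
            primaryId = lowestAliveId();
        }
    }

    private Integer lowestAliveId() {
        for (Map.Entry<Integer, Boolean> e : alive.entrySet()) {
            if (e.getValue()) {
                return e.getKey();
            }
        }
        return null; // every replica is down
    }

    synchronized Optional<Integer> primary() {
        return Optional.ofNullable(primaryId);
    }
}
```

For example, with replicas 5, 3, and 9 added in that order, 5 is the initial primary; if 5 fails, 3 is promoted because it has the lowest ID among the survivors.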

Distributed Issues: Fault-tolerance • Only fail-stop failures are handled (no Byzantine fault tolerance). • Supported in part by replication. • Depends on the actual configuration of the system (e.g., # replicas for the nodes). • To tolerate n failures of a node you need n + 1 replicas. • Also, cached data is used whenever there is a (completely) dead node. This is done transparently. • An extreme case: if all the replicas of a node are down at a given time, the one that failed last should be brought back before the others. This ensures that the last state of the node is captured.

Problem of Partial Execution • Definition: an update that involves multiple nodes (e.g., Add Item at levels above the Retail-Store) completes at some of the nodes and fails at the others. • Handling: 1) Missed updates can be queued for the failed node (by its parent). 2) Another solution is to implement the whole update as a distributed transaction (requires transaction support). This can be implemented using Jini or CORBA.
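The first handling option, a parent queuing the updates a child has missed, could be sketched like this. MissedUpdateQueue is a hypothetical stand-in and is not the actual UpdatesMissed.java; updates are represented as plain strings, and the deliver callback (returning false when the child cannot be reached) is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical per-parent queue of updates that children have not yet received.
class MissedUpdateQueue {
    private final Map<String, Deque<String>> pending = new HashMap<>();

    // Queue the update behind any earlier missed ones, then try to drain the queue.
    // Returns the number of updates actually delivered.
    synchronized int send(String childId, String update,
                          BiPredicate<String, String> deliver) {
        pending.computeIfAbsent(childId, k -> new ArrayDeque<>()).addLast(update);
        return replay(childId, deliver);
    }

    // Deliver queued updates in their original order; stop at the first failure.
    synchronized int replay(String childId, BiPredicate<String, String> deliver) {
        Deque<String> queue = pending.get(childId);
        int delivered = 0;
        while (queue != null && !queue.isEmpty()
                && deliver.test(childId, queue.peekFirst())) {
            queue.removeFirst();
            delivered++;
        }
        return delivered;
    }

    synchronized int pendingCount(String childId) {
        Deque<String> queue = pending.get(childId);
        return queue == null ? 0 : queue.size();
    }
}
```

Routing every update through the queue keeps the original order: a child that comes back receives the updates it missed before any newer one.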

The essence of Transactions: “A note on: Inter-Store Transfer” • Inter-Store transfer is an operation that involves two Retail-Stores. It needs to be executed atomically. • In our implementation, we do our best to enforce atomicity by relying on the high availability and fault tolerance provided by replication. This does not, however, guarantee atomicity. • Implementation:
    if (both stores are available) {
        update both of them;
        if (both failed)
            retry later;
        else if (one failed) {
            cancel the update at the other;
            if (cancellation failed)
                report to the administrator;
        }
    }
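A compilable rendering of this best-effort logic might look like the sketch below. The Store interface, the Outcome values, and the MemoryStore test double are all hypothetical; treating an unavailable store as "retry later" is also an assumption, since the slide does not say what happens in that case.

```java
// Hypothetical sketch of the best-effort transfer; not the authors' actual code.
class InterStoreTransfer {
    interface Store {
        boolean isAvailable();
        // Applies the stock change and returns true, or returns false if it failed.
        boolean adjustStock(String itemId, int delta);
    }

    enum Outcome { COMPLETED, RETRY_LATER, CANCELLED, REPORT_TO_ADMIN }

    static Outcome transfer(Store from, Store to, String itemId, int quantity) {
        if (!from.isAvailable() || !to.isAvailable()) {
            return Outcome.RETRY_LATER; // assumption: defer when a store is unreachable
        }
        boolean removed = from.adjustStock(itemId, -quantity);
        boolean added = to.adjustStock(itemId, quantity);
        if (removed && added) {
            return Outcome.COMPLETED;
        }
        if (!removed && !added) {
            return Outcome.RETRY_LATER;
        }
        // Exactly one side was applied: cancel it with a compensating update.
        boolean undone = removed
                ? from.adjustStock(itemId, quantity)
                : to.adjustStock(itemId, -quantity);
        return undone ? Outcome.CANCELLED : Outcome.REPORT_TO_ADMIN;
    }

    // In-memory store for demonstration; fails once its update budget is used up.
    static final class MemoryStore implements Store {
        int stock;
        int updatesUntilFailure;

        MemoryStore(int stock, int updatesUntilFailure) {
            this.stock = stock;
            this.updatesUntilFailure = updatesUntilFailure;
        }

        public boolean isAvailable() { return true; }

        public boolean adjustStock(String itemId, int delta) {
            if (updatesUntilFailure <= 0) {
                return false;
            }
            updatesUntilFailure--;
            stock += delta;
            return true;
        }
    }
}
```

The REPORT_TO_ADMIN outcome is exactly the gap noted on this slide: when the compensating update also fails, the two stores disagree, and only a real distributed transaction would rule that out.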

Distributed Issues: Concurrency & Consistency • Concurrency is supported through the multi-threading of Java RMI. • Consistency is enforced in the face of concurrent invocations from clients using Java synchronization. • To guarantee consistency of data in case of frequent failures, if all replicas of a node fail, the one that failed last should be brought back before the others. • This is an extreme case and is very unlikely to happen in real life.
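Since Java RMI may run remote calls on several threads at once, shared node state has to be guarded. The sketch below shows the general idea with synchronized methods; Inventory and runConcurrentAdds are hypothetical names, and plain threads stand in for concurrent RMI invocations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stock table guarded by Java synchronization.
class Inventory {
    private final Map<String, Integer> stock = new HashMap<>();

    // synchronized makes the read-modify-write atomic with respect to other callers.
    synchronized void add(String itemId, int delta) {
        stock.merge(itemId, delta, Integer::sum);
    }

    synchronized int count(String itemId) {
        return stock.getOrDefault(itemId, 0);
    }

    // Demo helper: several threads update the same item concurrently.
    static int runConcurrentAdds(int threads, int addsPerThread) {
        Inventory inventory = new Inventory();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < addsPerThread; j++) {
                    inventory.add("item-1", 1);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return inventory.count("item-1");
    }
}
```

Without the synchronized keyword, concurrent calls on the unsynchronized HashMap could lose updates or corrupt the map, so the final count would not be reliable.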

Distributed Issues: Scalability (& Flexibility of configuration) • Our system offers high flexibility. • Nodes can be added or removed at any time while the system is running. • Replicas can be added at any time while the system is running. • Parent-Child relationships can be reconfigured at any time while the system is running. • Our design and data structures allow the system to grow, even to millions of nodes.

Conclusion and Future work • Our system: • Offers fault tolerance and high availability through replication and caching of query results. • Does not support transactions. Given more time, this would be an interesting issue to target. • Tolerates only crash failures. We think it would be exciting to investigate tolerance of Byzantine failures using active replication.

Comments • This is a real-life problem. In the design and implementation, you face the majority of the distributed system concepts and issues discussed in class. • It was a very interesting experience. If we had more time, we would spend it enhancing the system.
