
Self-Managing Federated Services


Presentation Transcript


  1. Self-Managing Federated Services Francisco Matias Cuenca-Acuna and Thu D. Nguyen Department of Computer Science, Rutgers University

  2. Federated Computing • Rising Internet connectivity is driving a new model of federated computing • Computing systems that span multiple organizations • Sharing of computing resources, data, and services • Federated computing appearing at every level • Peer-to-peer: e.g., Netnews, Gnutella, KaZaA • Business-to-business: e.g., federated web services • Scientific computing: e.g., grids (http://www.gpds.org/), PlanetLab • Federated applications are hard to build and manage because they must execute in environments that are • Inherently decentralized • Widely distributed • Widely heterogeneous • Highly volatile PANIC Lab, Rutgers University

  3. PlanetP • Infrastructure support for federated computing • Data Management • Collaborative organization of shared data • Content search, rank, and retrieval • Automatic replication for predictable availability (SRDS 2003) • Application management • Automatic (self-) management of #components and their placements • Provide a common application container • PlanetP approach • Strongly consistent protocols scale poorly in wide-area environments [Gray et al. 1996, Gupta et al. 2002] • Build on probabilistic and weakly consistent protocols • Weakly consistent shared information • Autonomous actions • Randomized algorithms • Let’s deal with system sizes of up to several thousands first PANIC Lab, Rutgers University
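
The PlanetP approach above builds on weakly consistent, gossip-based information sharing rather than strongly consistent protocols. As a rough illustration of that style only (not PlanetP's actual protocol or API; all names here are hypothetical), the sketch below shows nodes that publish entries locally and lazily spread them by merging views pairwise with randomly chosen peers.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of weakly consistent, gossip-based sharing.
// Each node keeps a local map of published entries and periodically exchanges
// its view with a randomly chosen peer; conflicting entries are merged by
// keeping the one with the newer timestamp (last-writer-wins).
public class GossipNode {
    // key -> (value, timestamp); hypothetical record format for illustration.
    static final class Entry {
        final String value;
        final long timestamp;
        Entry(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    private final Map<String, Entry> store = new ConcurrentHashMap<>();
    private final Random random = new Random();

    // Publish or update a local entry; it spreads lazily via gossip.
    public void publish(String key, String value) {
        store.put(key, new Entry(value, System.currentTimeMillis()));
    }

    // Called once per gossip interval (e.g., 1 second in the talk's case study).
    public void gossipOnce(List<GossipNode> peers) {
        if (peers.isEmpty()) return;
        GossipNode peer = peers.get(random.nextInt(peers.size()));
        peer.merge(store);       // push our view to the peer
        this.merge(peer.store);  // pull the peer's view (push-pull gossip)
    }

    // Keep the newer of the two versions for every key.
    private void merge(Map<String, Entry> remote) {
        remote.forEach((key, theirs) ->
            store.merge(key, theirs,
                (mine, other) -> mine.timestamp >= other.timestamp ? mine : other));
    }
}
```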

  4. Application Management: Goal • Reduce the deployment and management of an application to • Defining an application “fitness” model • Specify QoS targets such as service availability, max throughput • Self-management of number of components and their placements to achieve QoS targets PANIC Lab, Rutgers University

  5. Application Management [diagram: application components (App) deployed on nodes across a network] PANIC Lab, Rutgers University

  6. Our Approach: Infrastructure [diagram: application replicas (App) on nodes across the community, coordinated through the P2P Information Sharing Infrastructure (PlanetP Data Store); each node runs a management agent and a monitoring agent] • Management agent publishes: a replica is running here, current app load • Monitoring agent publishes: node description, current node load, availability PANIC Lab, Rutgers University
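
To make the two publish arrows in the diagram concrete, here is a hypothetical sketch of the records each agent might place into the shared data store. The field names and types are assumptions for illustration, not PlanetP's actual schema.

```java
// Illustrative records for what each agent publishes into the shared data store.
public final class PublishedRecords {
    // Published by the monitoring agent on every node.
    public static final class NodeDescription {
        public final String nodeId;
        public final double bogomips;        // node capacity rating
        public final double cpuIdleFraction; // current node load, expressed as idleness
        public final double availability;    // measured fraction of time the node is online
        public NodeDescription(String nodeId, double bogomips,
                               double cpuIdleFraction, double availability) {
            this.nodeId = nodeId;
            this.bogomips = bogomips;
            this.cpuIdleFraction = cpuIdleFraction;
            this.availability = availability;
        }
    }

    // Published by the management agent wherever a replica is running.
    public static final class ReplicaInfo {
        public final String appName;
        public final String nodeId;       // "a replica is running here"
        public final double currentLoad;  // current application load on this replica
        public ReplicaInfo(String appName, String nodeId, double currentLoad) {
            this.appName = appName;
            this.nodeId = nodeId;
            this.currentLoad = currentLoad;
        }
    }
}
```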

  7. Our Approach: Algorithm • Periodically, each management agent autonomously searches for a “more fit” configuration • Configuration: #components and their placements • Genetic algorithm (directed random search) • If a configuration is found that improves on the current one by more than a threshold value, deploy it • However, wait for a random delay before deploying the new configuration, to avoid collisions • Publish the new configuration after it has been successfully deployed PANIC Lab, Rutgers University
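
A minimal sketch of the per-agent loop just described, under a few assumptions: Configuration, fitness(), deploy(), and publish() are placeholder interfaces, and the genetic algorithm is abstracted behind searchForBetterConfiguration(); none of these names come from the actual PlanetP code.

```java
import java.util.Random;

// Sketch of the management agent's periodic evaluation step. Each agent acts
// autonomously on its local (weakly consistent) view of the community.
public class ManagementAgent {
    interface Configuration { }                          // #components + their placements
    interface AppModel { double fitness(Configuration c); }

    private final AppModel model;
    private final double improvementThreshold;           // required fitness gain before acting
    private final long maxRandomDelayMs;                  // T from the next slide
    private final Random random = new Random();
    private Configuration current;

    ManagementAgent(AppModel model, Configuration initial,
                    double improvementThreshold, long maxRandomDelayMs) {
        this.model = model;
        this.current = initial;
        this.improvementThreshold = improvementThreshold;
        this.maxRandomDelayMs = maxRandomDelayMs;
    }

    // Invoked once per evaluation period.
    void evaluatePeriodically() throws InterruptedException {
        Configuration candidate = searchForBetterConfiguration();   // directed random search
        if (candidate == null) return;

        double gain = model.fitness(candidate) - model.fitness(current);
        if (gain <= improvementThreshold) return;        // improvement too small to act on

        // Wait a random delay before deploying, to avoid colliding with peers
        // that may have found an improvement at the same time.
        Thread.sleep((long) (random.nextDouble() * maxRandomDelayMs));

        // If another agent's new configuration arrived while we slept, back off.
        if (newerConfigurationObserved()) return;

        deploy(candidate);                                // start/stop/move replicas
        publish(candidate);                               // only after successful deployment
        current = candidate;
    }

    // Placeholders for the pieces the talk leaves to PlanetP and the GA.
    private Configuration searchForBetterConfiguration() { return null; }
    private boolean newerConfigurationObserved() { return false; }
    private void deploy(Configuration c) { }
    private void publish(Configuration c) { }
}
```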

  8. Example Application Model • Fitness is defined as a function f(capacity, #replicas) • Expected capacity of a configuration is computed using node availability and CPU idleness • Function designed to achieve • Sufficient capacity to meet current load • Minimize number of replicas to minimize replication overheads • Spare capacity to gracefully handle load spikes • Two QoS parameters • Availability • Maximum load PANIC Lab, Rutgers University
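
The exact fitness function from the paper is not reproduced here; the sketch below only illustrates the ingredients the slide lists: expected capacity weighted by node availability and CPU idleness, a capacity target with spare headroom, and a penalty on the number of replicas. The spare factor and penalty weight are assumed tuning knobs, not values from the talk.

```java
import java.util.List;

// Illustrative fitness computation in the spirit of f(capacity, #replicas).
public class UddiFitnessModel {
    static final class Node {
        final double bogomips, availability, cpuIdle;   // reported by the monitoring agents
        Node(double bogomips, double availability, double cpuIdle) {
            this.bogomips = bogomips; this.availability = availability; this.cpuIdle = cpuIdle;
        }
    }

    private final double targetCapacity;   // QoS: maximum load to support (in bogomips)
    private final double spareFactor;      // e.g., 1.2 = 20% headroom for load spikes (assumed)
    private final double replicaPenalty;   // per-replica cost to discourage over-replication (assumed)

    UddiFitnessModel(double targetCapacity, double spareFactor, double replicaPenalty) {
        this.targetCapacity = targetCapacity;
        this.spareFactor = spareFactor;
        this.replicaPenalty = replicaPenalty;
    }

    // Expected capacity of a configuration: each replica contributes its node's
    // capacity scaled by that node's availability and CPU idleness.
    double expectedCapacity(List<Node> replicas) {
        return replicas.stream()
                .mapToDouble(n -> n.bogomips * n.availability * n.cpuIdle)
                .sum();
    }

    // Higher is better: approaches 1 as capacity reaches the target (plus spare),
    // then each additional replica subtracts a small penalty.
    double fitness(List<Node> replicas) {
        double capacity = expectedCapacity(replicas);
        double coverage = Math.min(1.0, capacity / (targetCapacity * spareFactor));
        return coverage - replicaPenalty * replicas.size();
    }
}
```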

  9. Random Delay [timeline: deployment time, then V = probabilistic bound on inconsistency in shared data; the random delay is drawn from 0 to T] • Choose T to achieve a target probability of collision • Case study: • 200 nodes • Gossiping interval = 1 sec • Deployment time small • V = 8 secs • Pcollision = 0.1 PANIC Lab, Rutgers University
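
The talk chooses T analytically from the gossip propagation bound; the Monte Carlo sketch below only illustrates the same trade-off. It assumes n contending agents discover the same improvement simultaneously, each waits a uniform delay in [0, T], a deployment becomes visible to everyone within V seconds, and deployment time is negligible; the number of contenders used here is an assumption, not a figure from the talk.

```java
import java.util.Random;

// Estimate the probability that a second agent deploys before it can observe the
// first deployment (i.e., its delay expires within V seconds of the earliest delay).
public class RandomDelayModel {
    static double estimateCollisionProbability(int n, double T, double V, int trials) {
        Random random = new Random();
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            double first = Double.MAX_VALUE, second = Double.MAX_VALUE;
            for (int i = 0; i < n; i++) {
                double delay = random.nextDouble() * T;   // uniform delay in [0, T]
                if (delay < first) { second = first; first = delay; }
                else if (delay < second) { second = delay; }
            }
            if (second < first + V) collisions++;         // second deployer didn't hear in time
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        // V = 8 s as in the slide's case study; scan T for an assumed 5 contenders.
        for (double T = 50; T <= 800; T += 50) {
            double p = estimateCollisionProbability(5, T, 8.0, 100_000);
            System.out.printf("T = %4.0f s  ->  estimated Pcollision = %.3f%n", T, p);
        }
    }
}
```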

  10. Evaluation: Environments • Study responsiveness and stability in three environments • Unloaded cluster • 44 PCs interconnected with 100 Mb/s & 1 Gb/s Ethernet • 22 2GHz P4 PCs rated at 4000 bogomips • 22 2.8GHz Pentium Xeon with hyper-threading PCs rated at ~11200 bogomips • Same cluster loaded by other researchers (large simulation jobs) • PlanetLab • Federated system for distributed systems and networking research • Widely heterogeneous x86 PC nodes: 800 – 5600 bogomips PANIC Lab, Rutgers University

  11. Evaluation: UDDI Service [diagram: a PlanetP community spanning Site 1 – Site 4; each node runs the jUDDI registry over HSQLDB/R in the Tomcat web server on a Java runtime, alongside the PlanetP manager agent and monitoring agent] • Base gossiping interval = 1 sec PANIC Lab, Rutgers University

  12. Evaluation: UDDI Service • Populated UDDI service with data from Xmethods • Web site listing publicly available web services • 400 services → 3MB registry database • Clients issue random findBusiness queries • Fitness function for UDDI service set to: • Max desired CPU resources: 60,000 bogomips • ~20 P4 PCs • Max of 4 to 20 replicas PANIC Lab, Rutgers University
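
Below is a hypothetical encoding of the QoS targets listed on this slide; the class name, constant names, and the interpretation of "4 to 20 replicas" as hard bounds on the replica count are assumptions for illustration only.

```java
// QoS targets for the UDDI evaluation, as stated on the slide.
public class UddiQosTargets {
    static final double MAX_DESIRED_CAPACITY_BOGOMIPS = 60_000;  // roughly 20 P4-class PCs
    static final int MIN_REPLICAS = 4;
    static final int MAX_REPLICAS = 20;

    // Candidate configurations outside the replica bounds are rejected before
    // their fitness is evaluated (an assumed way of enforcing the bounds).
    static boolean withinReplicaBounds(int replicas) {
        return replicas >= MIN_REPLICAS && replicas <= MAX_REPLICAS;
    }

    public static void main(String[] args) {
        System.out.println("3 replicas allowed?  " + withinReplicaBounds(3));   // false
        System.out.println("12 replicas allowed? " + withinReplicaBounds(12));  // true
    }
}
```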

  13. Cluster: No. Replicas vs. Load PANIC Lab, Rutgers University

  14. Cluster: Acquired Capacity vs. Load PANIC Lab, Rutgers University

  15. Cluster: Throughput vs. Load PANIC Lab, Rutgers University

  16. PlanetLab: No. Replicas vs. Load PANIC Lab, Rutgers University

  17. PlanetLab: Acquired Capacity vs. Load PANIC Lab, Rutgers University

  18. PlanetLab: Throughput vs. Load PANIC Lab, Rutgers University

  19. Summary and Future Work • Described a decentralized management framework for federated applications • Autonomous actions based on weakly consistent shared data for robustness and scalability • Probabilistic serialization to avoid collisions • Responsive and stable when managing an example UDDI service in 3 test-beds • More sophisticated application models? • Predictive model for resources • Account for the cost of changing application configurations • Applications with multiple types of components PANIC Lab, Rutgers University
