
Wide-Area Service Composition: Evaluation of Availability and Scalability



Presentation Transcript


  1. Wide-Area Service Composition: Evaluation of Availability and Scalability. Bhaskaran Raman, SAHARA, EECS, U.C. Berkeley. (Title-slide diagram: a composed service-level path involving a video-on-demand server, an email repository, a transcoder, and a text-to-audio service, spread across Providers A, B, Q, and R, delivering to a cellular phone and a thin client.)

  2. Problem Statement and Goals
  (Slide diagram: a service-level path from a video-on-demand server, through a transcoder, to a thin client, with Provider A and Provider B each appearing as candidate hosts.)
  • Problem statement:
  • The path could stretch across multiple service providers and multiple network domains
  • Inter-domain Internet paths have poor availability [Labovitz'99] and poor time-to-recovery [Labovitz'00]
  • Take advantage of service replicas
  • Goals:
  • Performance: choose the set of service instances
  • Availability: detect and handle failures quickly
  • Scalability: Internet-scale operation
  • Related work:
  • TACC: composition within a cluster
  • Web-server choice: SPAND, Harvest
  • Routing around failures: Tapestry, RON
  • We address wide-area network performance and failure issues for long-lived composed sessions

  3. Is "quick" failure detection possible?
  • What is a "failure" on an Internet path? Outage periods occur with varying durations
  • Study of outage periods using traces:
  • 12 pairs of hosts: Berkeley, Stanford, UIUC, UNSW (Australia), TU-Berlin (Germany)
  • Caveat: results could be skewed by the Internet2 backbone
  • Periodic UDP heartbeat every 300 ms; study the "gaps" between receive times (see the sketch after this slide)
  • Results:
  • A short outage (1.2-1.8 sec) often turns into a long outage (> 30 sec); sometimes this is true over 50% of the time
  • False positives are rare: O(once an hour) at most
  • Similar results in a ping-based study using ping servers
  • Take-away: it is okay to react to short outage periods by switching the service-level path
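The outage study boils down to watching inter-arrival gaps in a periodic UDP heartbeat. The following is a minimal sketch of such a heartbeat sender and gap-based monitor, not the code used to collect the traces; the 300 ms interval and the 1.2-1.8 s reaction threshold come from the slide, while the port number, payload, and socket handling are assumptions.

```python
import socket
import time

HEARTBEAT_INTERVAL = 0.3   # 300 ms heartbeat period (from the slide)
OUTAGE_THRESHOLD = 1.8     # react once a gap reaches ~1.2-1.8 s (from the slide)

def send_heartbeats(dest_host, dest_port=9000):
    """Send a small UDP heartbeat every 300 ms (payload content is irrelevant)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        sock.sendto(seq.to_bytes(4, "big"), (dest_host, dest_port))
        seq += 1
        time.sleep(HEARTBEAT_INTERVAL)

def monitor_heartbeats(listen_port=9000):
    """Flag an outage whenever the gap between received heartbeats exceeds the threshold."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    sock.settimeout(OUTAGE_THRESHOLD)
    last_rx = time.monotonic()
    while True:
        try:
            sock.recvfrom(64)
            gap = time.monotonic() - last_rx
            if gap > OUTAGE_THRESHOLD:
                print(f"outage ended after a gap of {gap:.1f} s")
            last_rx = time.monotonic()
        except socket.timeout:
            # No heartbeat within the threshold: treat the path as failed and
            # trigger a switch to an alternate service-level path.
            print(f"suspected failure: {time.monotonic() - last_rx:.1f} s without a heartbeat")
```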

  4. UDP-based keep-alive stream. Acknowledgements: Mary Baker, Mema Roussopoulos, Jayant Mysore, Roberto Barnes, Venkatesh Pranesh, Vijaykumar Krishnaswamy, Holger Karl, Yun-Shen Chang, Sebastien Ardon, Binh Thai

  5. Architecture
  (Slide diagram: the hardware platform is the Internet with service clusters, where a service cluster is a compute cluster capable of running services; the logical platform is an overlay network of cluster-managers, with peering relations used to exchange performance information; the application plane carries composed services from a source to a destination.)
  • Functionalities at the cluster-manager:
  • Peering relations, overlay network
  • Service-level path creation, maintenance, and recovery
  • Link-state propagation (a flooding sketch follows this slide)
  • Finding overlay entry/exit
  • Location of service replicas
  • At-least-once UDP
  • Performance measurement
  • Liveness detection
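One of the cluster-manager functionalities above is link-state propagation over the peering overlay. The sketch below shows one common way to do duplicate-suppressed flooding of link-state advertisements between peered cluster-managers; the class names, fields, and sequence-number scheme are assumptions for illustration, not the system's actual wire format.

```python
from dataclasses import dataclass

@dataclass
class LinkState:
    """One overlay link's liveness/performance report (fields are illustrative)."""
    origin: str        # cluster-manager that measured the link
    neighbor: str      # peer at the other end of the overlay link
    latency_ms: float
    alive: bool
    seq: int           # sequence number used to suppress duplicate floods

class ClusterManager:
    def __init__(self, name, peers=None):
        self.name = name
        self.peers = peers if peers is not None else []  # peered cluster-managers
        self.seen = {}       # (origin, neighbor) -> highest sequence number accepted
        self.topology = {}   # (origin, neighbor) -> latest LinkState

    def flood(self, ls, from_peer=None):
        """Accept a link-state advertisement once, then re-flood it to other peers."""
        key = (ls.origin, ls.neighbor)
        if self.seen.get(key, -1) >= ls.seq:
            return           # stale or duplicate advertisement: drop it
        self.seen[key] = ls.seq
        self.topology[key] = ls
        for peer in self.peers:
            if peer is not from_peer:
                peer.flood(ls, from_peer=self)
```

With every cluster-manager holding the flooded topology locally, path selection and recovery can run on local state, which is consistent with the later observation that failure information does not have to propagate across the network before recovery starts.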

  6. Evaluation
  (Slide diagram: a two-leg text-to-audio path, leg-1 between the text source and the text-to-audio service and leg-2 between the text-to-audio service and the end-client, with a request-response protocol, data as text or RTP audio, keep-alive soft-state refresh, and application soft-state for restart on failure; a sketch of such soft-state handling follows this slide.)
  1. What is the effect of the recovery mechanism on the application?
  • Text-to-speech application, with two possible places of failure
  • 20-node overlay network, one service instance for each service
  • Deterministic failure for 10 sec during the session
  • Metric: gap between arrivals of successive audio packets at the client
  2. What is the scaling bottleneck?
  • Parameter: number of client sessions across peering clusters, as a measure of instantaneous load when the failure occurs
  • 5,000 client sessions in the 20-node overlay network
  • Deterministic failure of 12 different links (12 data points in the graph)
  • Metric: average time-to-recovery
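The slide mentions keep-alive soft-state refresh along the path and application soft-state used to restart on failure. The sketch below shows one plausible shape for that per-session soft-state, with a refresh-or-expire lifetime and a restore step on the replacement service instance; the TTL value, class names, and the idea of storing "the current sentence" as the application state are assumptions based on the text-to-speech example.

```python
import time

SOFT_STATE_TTL = 2.0   # assumed lifetime; refreshed by keep-alives along the path

class SessionSoftState:
    """Per-session state at a service instance, kept alive by periodic refreshes."""
    def __init__(self, session_id, app_state):
        self.session_id = session_id
        self.app_state = app_state        # e.g., the current sentence being synthesized
        self.expires_at = time.monotonic() + SOFT_STATE_TTL

    def refresh(self):
        """Called whenever a keep-alive for this session arrives on the path."""
        self.expires_at = time.monotonic() + SOFT_STATE_TTL

    def expired(self):
        """State for dead sessions simply times out instead of needing explicit teardown."""
        return time.monotonic() > self.expires_at

def restore_on_new_path(old, new_instance_sessions):
    """On recovery, re-install the application soft-state at the replacement
    service instance so processing restarts from the current sentence."""
    new_instance_sessions[old.session_id] = SessionSoftState(
        old.session_id, old.app_state)
```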

  7. Recovery of Application Session: CDF of gaps > 100 ms (evaluation question 1)
  (Figure: CDF of inter-packet gaps at the client for three cases.)
  • Recovery time of 822 ms (quicker than leg-2 due to the buffer at the text-to-audio service)
  • Recovery time of 2,963 ms (leg-2)
  • Recovery time of 10,000 ms (the full failure duration, i.e., without the recovery mechanism)
  • Jump at 350-400 ms: due to synchronous text-to-audio processing (an implementation artefact)

  8. Average Time-to-Recovery vs. Instantaneous Load (evaluation question 2)
  • End-to-end recovery algorithm; high variance due to varying path length
  • Two services in each path, two replicas per service
  • Each data point is a separate run
  • At a load of 1,480 paths on the failed link, the average path recovery time is 614 ms

  9. Results: Discussion
  • Evaluation question 1:
  • Recovery after failure (leg-2): 2,963 ms = 1,800 + O(700) + O(450)
  • 1,800 ms: timeout to conclude failure
  • 700 ms: signaling to set up the alternate path
  • 450 ms: recovery of application soft-state (re-process the current sentence); this three-phase sequence is summarized in the sketch after this slide
  • Without the recovery algorithm, recovery takes as long as the failure duration
  • The O(3 sec) recovery can be completely masked with buffering; for interactive apps it is still much better than no recovery
  • Quick recovery is possible since failure information does not have to propagate across the network
  • Evaluation question 2:
  • The 12th data point (instantaneous load of 1,480) stresses the emulator's limits
  • 1,480 translates to about 700 simultaneous paths per cluster-manager; in comparison, our text-to-speech implementation can support O(15) clients per machine
  • Other scaling limits? Link-state floods? Graph computation?
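The measured leg-2 recovery time is essentially the sum of three sequential phases. The small worked example below just composes the numbers from the slide; the constant names and the function are illustrative, not part of the system.

```python
# Components of leg-2 recovery, in milliseconds, as reported on the slide.
DETECTION_TIMEOUT_MS = 1800    # heartbeat loss before the failure is concluded
PATH_SIGNALING_MS = 700        # approximate: set up the alternate service-level path
SOFT_STATE_RESTORE_MS = 450    # approximate: re-process the current sentence

def end_to_end_recovery_ms(detection=DETECTION_TIMEOUT_MS,
                           signaling=PATH_SIGNALING_MS,
                           soft_state=SOFT_STATE_RESTORE_MS):
    """The client-visible recovery time is the sum of the three sequential phases."""
    return detection + signaling + soft_state

print(end_to_end_recovery_ms())   # ~2,950 ms, matching the ~2,963 ms measured for leg-2
```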

  10. Summary
  • Service composition: flexible service creation; we address performance, availability, and scalability
  • Analysis: for failure detection, it is meaningful to time out in O(1.2-1.8 sec)
  • Design: an overlay network of service clusters
  • Evaluation results so far:
  • Good recovery time for real-time applications: O(3 sec)
  • Good scalability: minimal additional provisioning for cluster-managers
  • Ongoing work:
  • Overlay topology issues: how many nodes, peering
  • Stability issues
  Feedback, Questions? (Presentation made using VMWare)

  11. Rule for 12 App Emulator Node 1 Rule for 13 Lib Rule for 34 Node 2 Rule for 43 Node 3 Node 4 Emulation Testbed Operational limits of emulator: 20,000 pkts/sec, for upto 500 byte pkts, 1.5GHz Pentium-4
