Resource Overbooking and Application Profiling in Shared Hosting Platforms

Bhuvan Urgaonkar, Prashant Shenoy, Timothy Roscoe†
UMASS Amherst and †Intel Research
Fifth USENIX OSDI, Boston, Dec 2002

Introduction
• Proliferation of Internet applications
  • E-commerce, streaming media, online games, …
• Commonly hosted on clusters of servers
  • Cheaper alternative to large multiprocessors
[Figure: clients, the Internet, and a cluster hosting e-commerce, streaming, and game applications]

Hosting Platforms
• Hosting platform: server cluster that runs third-party applications
• Applications pay for server resources
  • CPU, network bandwidth, memory, disk
• Platform provider guarantees resource availability
• Challenge: Maximize # hosted applications while providing resource guarantees

Design Challenges
• How to determine an application’s resource needs?
• How to provision resources to meet these needs?
• How to map applications to servers in the platform?
• How to handle dynamic variations in load?

Talk Outline
• Introduction
• Inferring Resource Requirements
• Provisioning Resources
• Mapping Applications to Servers
• Experimental Evaluation
• Related Work

Terminology
• Hosting platform models
  • Dedicated: Applications get integral # nodes
  • Shared: Applications may get fractional # nodes
• Capsule: component of an application running on a node
[Figure: applications made of HTTP, application server, and DB server capsules, mapped onto platform nodes]

Provisioning By Overbooking
• Worst-case provisioning is wasteful
  • Low utilization of resources
• Applications may be tolerant to occasional violations
  • E.g., CPU guarantees should be met 99% of the time
• Possible to provide useful guarantees even after provisioning less than worst-case needs
• Overbook resources to improve utilization
  • E.g., airline reservations

Application Profiling
• Profiling: process of determining resource usage
• Run the application on an isolated set of nodes
• Subject the application to a real workload
• Model CPU and network usage as ON-OFF processes
• Use the Linux Trace Toolkit [Yaghmour00]
[Figure: timeline in which each ON period runs from "Begin CPU quantum" to "End CPU quantum", separated by OFF periods]

Resource Usage Distribution
[Figure: the ON-OFF process is sampled over measurement intervals to give a PDF (probability vs. fractional usage) and a CDF (cumulative probability vs. fractional usage, with the 0.99 level and points A and B marked)]
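As a rough illustration of how an ON-OFF trace turns into a usage distribution, the sketch below bins ON periods into fixed measurement intervals and reads off a high percentile. The trace format, the function names, and the nearest-rank percentile rule are assumptions made for this example; they are not taken from the talk or from the Linux Trace Toolkit.

```python
import math


def usage_per_interval(on_periods, interval, horizon):
    """Fraction of each measurement interval that the resource was ON.

    on_periods: list of (start, end) times (hypothetical trace format).
    interval:   length of one measurement interval.
    horizon:    total trace length; only whole intervals are reported.
    """
    n = int(horizon // interval)
    busy = [0.0] * n
    for start, end in on_periods:
        first = int(start // interval)
        last = min(int(end // interval), n - 1)
        for i in range(first, last + 1):
            # Clip the ON period to the boundaries of interval i.
            lo = max(start, i * interval)
            hi = min(end, (i + 1) * interval)
            if hi > lo:
                busy[i] += hi - lo
    return [b / interval for b in busy]


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p of the mass at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


trace = [(0.0, 0.5), (1.0, 2.0), (2.25, 2.5)]
usage = usage_per_interval(trace, interval=1.0, horizon=4.0)
print(usage)                    # fractional usage in each interval
print(percentile(usage, 0.99))  # usage level covering 99% of intervals
```

A histogram of `usage` approximates the PDF, and `percentile` reads a point off the corresponding CDF.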

Profiles of Server Applications
• Applications exhibit different degrees of burstiness
• Need to capture variability in resource usage
[Figure: usage PDFs for three profiles: Streaming Media Server, 20 clients (probability vs. fraction of NW bandwidth); Apache Web Server, 50% cgi-bin (probability vs. fraction of CPU); Postgres Server, 10 clients (probability vs. fraction of CPU)]

Capturing Burstiness: Token Bucket
• Token Bucket (σ, ρ)
  • Resource usage over an interval of length t ≤ σ·t + ρ
• Choose (σ, ρ) based on a high percentile
[Figure: usage vs. time for the ON-OFF process, bounded by the lines σ1·t + ρ1 and σ2·t + ρ2]
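One way to read "choose (σ, ρ) based on a high percentile" is sketched below: fix a rate σ, then pick the smallest burst ρ such that, for every window length t, the chosen percentile of observed usage over windows of that length stays under σ·t + ρ. This fitting procedure, the per-interval sample format, and all names are assumptions for illustration, not the authors' method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def window_sums(samples, w):
    """Sum of every run of w consecutive samples (sliding window)."""
    total = sum(samples[:w])
    sums = [total]
    for i in range(w, len(samples)):
        total += samples[i] - samples[i - w]
        sums.append(total)
    return sums


def fit_burst(samples, dt, sigma, p=0.99, max_window=None):
    """Smallest rho with percentile_p(usage over t) <= sigma * t + rho for all t.

    samples: fractional usage per measurement interval (assumed input format).
    dt:      length of one measurement interval.
    sigma:   candidate token rate, as a fraction of the resource.
    """
    if max_window is None:
        max_window = len(samples)
    rho = 0.0
    for w in range(1, max_window + 1):
        t = w * dt
        used = percentile(window_sums(samples, w), p) * dt
        rho = max(rho, used - sigma * t)
    return rho


bursty = [1.0, 1.0, 0.0, 0.0]
steady = [0.5, 0.5, 0.5, 0.5]
print(fit_burst(bursty, dt=1.0, sigma=0.5, p=1.0))
print(fit_burst(steady, dt=1.0, sigma=0.5, p=1.0))
```

Both traces have the same mean usage, but the bursty one needs a larger ρ at the same σ, which is the variability the token bucket is meant to capture.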

Resource Overbooking Mechanism
• Applications specify overbooking tolerance $O_i$
  • Probability with which capsule needs may be violated
• Controlled overbooking via admission control:
  • Resource requirements of all capsules are met
    $\sum_{k=1}^{K} (\sigma_k \cdot T_{min} + \rho_k) \cdot (1 - O_k) \le C \cdot T_{min}$
  • Overbooking tolerances of all capsules are met
    $\Pr\left(\sum_{k=1}^{K} U_k > C\right) \le \min(O_1, \ldots, O_K)$
• A node that has sufficient resources for a capsule is feasible for it
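The two admission-control conditions can be checked mechanically once each capsule has a token bucket and a usage distribution. The sketch below reads C as the node's capacity and U_k as capsule k's usage, and it assumes capsule usages are independent so that the distribution of their sum is a convolution of discrete PMFs. The `Capsule` class, its field names, and the independence assumption are illustrative choices, not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class Capsule:
    sigma: float      # token-bucket rate (fraction of the resource)
    rho: float        # token-bucket burst
    tolerance: float  # overbooking tolerance O_k
    pmf: dict         # usage level -> probability (hypothetical profile format)


def requirements_fit(capsules, capacity, t_min):
    """First condition: sum of (sigma_k*T_min + rho_k)*(1 - O_k) <= C*T_min."""
    demand = sum((c.sigma * t_min + c.rho) * (1 - c.tolerance) for c in capsules)
    return demand <= capacity * t_min


def overload_probability(capsules, capacity):
    """Pr(sum of U_k > C), assuming independent capsule usages."""
    total = {0.0: 1.0}
    for c in capsules:
        nxt = {}
        for u, pu in total.items():
            for v, pv in c.pmf.items():
                key = round(u + v, 9)  # merge sums that differ only by rounding
                nxt[key] = nxt.get(key, 0.0) + pu * pv
        total = nxt
    return sum(p for u, p in total.items() if u > capacity)


def admissible(capsules, capacity, t_min):
    """Both conditions must hold for the node to be feasible."""
    if not capsules:
        return True
    strictest = min(c.tolerance for c in capsules)
    return (requirements_fit(capsules, capacity, t_min)
            and overload_probability(capsules, capacity) <= strictest)


profile = {0.25: 0.9, 0.75: 0.1}
two = [Capsule(0.4, 0.05, 0.05, profile) for _ in range(2)]
three = two + [Capsule(0.4, 0.05, 0.05, profile)]
print(admissible(two, capacity=1.0, t_min=1.0))
print(admissible(three, capacity=1.0, t_min=1.0))
```

In this toy setup two capsules fit on a unit-capacity node while a third is rejected by the first condition; tightening the tolerance to 0.001 makes even two capsules fail the second condition.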

Mapping Capsules to Nodes
• A bipartite graph of capsules and feasible nodes
• Greedy mapping: consider capsules in non-decreasing order of degree
• Multiple feasible nodes => random, best fit, worst fit, …
[Figure: bipartite graph of capsules and nodes, and the final mapping]
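The greedy mapping can be sketched as follows. The example takes the feasibility edges as given and assumes that each node receives at most one capsule of the application being placed; that restriction, the dictionary format, and the pluggable `choose` policy are assumptions of this sketch rather than details stated on the slide.

```python
def greedy_map(feasible, choose=min):
    """Map capsules to nodes, least-flexible capsule first.

    feasible: dict capsule -> set of feasible nodes (the bipartite graph).
    choose:   policy for picking one node among several candidates;
              the default picks the lowest node id, standing in for
              random, best-fit, or worst-fit selection.
    Returns a dict capsule -> node, or None if some capsule cannot be placed.
    """
    mapping = {}
    used = set()
    # Non-decreasing order of degree: capsules with fewer options go first.
    for capsule in sorted(feasible, key=lambda c: len(feasible[c])):
        candidates = feasible[capsule] - used
        if not candidates:
            return None
        node = choose(candidates)
        mapping[capsule] = node
        used.add(node)
    return mapping


graph = {"c1": {1, 2, 3}, "c2": {1}, "c3": {1, 2}}
print(greedy_map(graph))
print(greedy_map({"a": {1}, "b": {1}}))
```

In the first graph, placing `c1` before `c2` with the same policy would take node 1 and strand `c2`, which is the situation the degree ordering is meant to avoid. A best-fit policy could be passed as, for example, `choose=lambda nodes: min(nodes, key=spare.get)` with a hypothetical `spare` capacity table.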

The SHARC Prototype
• A Linux-based Shared Hosting Platform
  • 6 Dell PowerEdge 1550 servers
  • Gigabit Ethernet link
• Software Components
  • Profiling
    • Vanilla Linux + Linux Trace Toolkit
  • Control plane
    • Overbooking, placement
  • QoS-enhanced Linux kernel
    • HSFQ schedulers

Experimental Setup
• Prototype running on a 5-node cluster
  • Each server: 1 GHz PIII with 512 MB RAM and Gigabit Ethernet
  • Control plane runs on a dedicated node
  • Applications run on the other four nodes
• Workload: mix of server applications
  • Apache web server with SPECWeb99 (static & dynamic HTTP)
  • PostgreSQL database server with pgbench (TPC-B) benchmark
  • MPEG streaming server with 1.5 Mb/s VBR MPEG-1 clients
  • Quake I game server with “terminator” bots

Resource Overbooking Benefits
• Small amounts of overbooking can yield large gains
• Bursty applications yield larger benefits
[Figure: two plots, "Placement of Streaming Media Servers" (media servers placed vs. number of nodes) and "Placement of Apache Web Servers" (web servers placed vs. number of nodes), each with curves for No Ovb, Ovb=1%, and Ovb=5%]

Performance with Overbooking
• Performance degradation is within specified overbooking tolerance
[Figure: two plots, "Performance of Apache" (throughput in req/s) and "Performance of Postgres" (throughput in trans/s), each vs. CPU provisioning level: Isolated, 100th, 99th, 95th, Average]

Handling Flash Crowds
• Detect overloads by online profiling
• Reacting to overloads (ongoing work)
  • Compute new allocations
  • Change allocations, move capsules, add servers
[Figure: Apache Web Server usage PDFs (probability vs. fraction of CPU) for three cases: offline profile, expected workload, and overload]
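The slide does not say how an online profile is compared with the offline one. A minimal sketch of one possible detector is given below: it flags an overload when a high percentile of recently observed usage exceeds the same percentile of the offline profile by more than a slack margin. The comparison rule, the `slack` parameter, and the function names are all assumptions made for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a list of usage samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def overloaded(offline, online, p=0.99, slack=0.05):
    """True if the online high-percentile usage exceeds the offline one by more than slack.

    offline: fractional usage samples from the offline profile.
    online:  fractional usage samples from a recent measurement window.
    """
    return percentile(online, p) > percentile(offline, p) + slack


offline_profile = [0.1, 0.2, 0.2, 0.3]
expected = [0.1, 0.2, 0.3, 0.3]
flash_crowd = [0.4, 0.5, 0.6, 0.6]
print(overloaded(offline_profile, expected))
print(overloaded(offline_profile, flash_crowd))
```

The slack margin keeps small sampling differences between the two profiles from triggering a reaction.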

Related Work
• Single node resource management
  • Proportional share schedulers: WFQ, SFQ, BVT, …
  • Reservation based schedulers: Nemesis, Rialto, …
• Cluster-based resource management
  • Cluster Reserves [Aron00]
  • MUSE [Chase01]: economic approach
  • Oceano [IBM], Planetary computing [HP]
• Clusters for high availability: Porcupine [Saito99]
• Grid computing [Globus]

Concluding Remarks
• Resource management in shared hosting platforms
  • Application profiling to determine resource usage
  • Controlled overbooking to improve utilization
  • Mapping applications to servers
• Future work
  • Handling dynamic workloads
  • Managing memory and disk bandwidth
• URL: http://lass.cs.umass.edu
