Systems Support for End-to-End Performance Management
Sandip Agarwala
PhD Advisor: Karsten Schwan
College of Computing, Georgia Tech
Complexity, complexity, complexity… Source: Gartner (December 2005)
Reasons for Complexity
• Application diversity
• Interdependencies
• Heterogeneous components
• Too many different technologies and platforms
• Too few “hints” from the system to the administrators
• Legacy issues; application-specific solutions
• Insufficient information about the system to drive self-management
Lack of Automation
Online System Management
[Figure: management loop with Monitor, Analyze, Control, and Execute stages around the Workload]
• Scheduling
• Capacity and SLA management
• Design evaluation and tuning
• Bottleneck detection
• Resource provisioning, accounting, etc.
Proposed Approach: Service Path
Service Path
[Figure: multi-tier application. Labels: Database, Back-end, Application Logic (EJBs, etc.), Middle-tier, Servlet Server, Front-end, Web Servers, Proxy Server, Internet]
• System abstractions that describe the dynamic dependencies between the different distributed application components
• Service Class: application-level request class, e.g. SLA class
Service Path Characteristics
• End-to-end analysis
• Online
• Non-intrusive
• Application-generic
Outline
• Background
• Motivation
• Service path
• Discovery with E2EProf
• Refinement with SysProf
• Automated SLA Enforcement
• Related Work
• Future Plans
E2EProf
[Figure: nodes A, B, C, D, and X with delays D1 and D2; per-edge signals (AB) and (BC) plotted against time]
• Black-box approach
• Correlate per-edge time series signals
• Monitor network packet traces (source, destination, timestamps)
• Model traces as per-edge time series signals or density functions
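The modeling step above, turning a packet trace into a per-edge time series, can be sketched as follows. This is a minimal illustration and not E2EProf's implementation: the function names, the 10 ms quantum, and the sample timestamps are all assumptions made for the example.

```python
def to_time_series(timestamps_ms, quantum_ms, start_ms, end_ms):
    """Count packets seen on one edge in each time quantum of [start_ms, end_ms)."""
    n_bins = (end_ms - start_ms + quantum_ms - 1) // quantum_ms
    series = [0] * n_bins
    for t in timestamps_ms:
        if start_ms <= t < end_ms:
            series[(t - start_ms) // quantum_ms] += 1
    return series


def to_density(series):
    """Normalize the counts so they sum to 1 (a discrete density function)."""
    total = sum(series)
    return [count / total for count in series] if total else list(series)


# Hypothetical packet timestamps (in ms) observed on edge A->B.
edge_ab = to_time_series([3, 7, 12, 31, 33, 58], quantum_ms=10, start_ms=0, end_ms=60)
density_ab = to_density(edge_ab)
print(edge_ab)  # [2, 1, 0, 2, 0, 1]
```

Only source, destination, and timestamp are needed per packet, which is what keeps the approach black-box.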
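Basic Approach
[Figure: nodes A, B, C, D, and X with delays D1 and D2; cross-correlation plots for edge pairs (AB)/(BC) and (AB)/(BD); labels: Delay at B, No spike]
• Compute cross-correlation between per-edge signals
• A spike implies causality
• The spike’s position gives the delay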
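The cross-correlation step can be illustrated with a small pure-Python sketch. It is not the E2EProf code: the signals are synthetic, and the spike rule (a value more than k standard deviations above the mean, with k = 2) is an assumption chosen for this example.

```python
from statistics import mean, pstdev


def cross_correlation(x, y, max_lag):
    """r[d] = sum over t of x[t] * y[t + d], for lags d = 0..max_lag."""
    n = min(len(x), len(y))
    return [sum(x[t] * y[t + d] for t in range(n - d)) for d in range(max_lag + 1)]


def find_spikes(corr, k=2.0):
    """Lags whose correlation exceeds mean + k * stddev (assumed threshold)."""
    mu, sigma = mean(corr), pstdev(corr)
    if sigma == 0:
        return []
    return [lag for lag, value in enumerate(corr) if value > mu + k * sigma]


QUANTUM_MS = 10  # assumed width of one time-series bin

# Synthetic per-edge signals: B forwards A's requests to C three bins later,
# while edge B->D carries steady traffic unrelated to A->B.
edge_ab = [0, 5, 0, 0, 2, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 4, 0, 0, 0, 0]
edge_bc = [0, 0, 0] + edge_ab[:-3]
edge_bd = [1] * 20

spikes_bc = find_spikes(cross_correlation(edge_ab, edge_bc, max_lag=8))
spikes_bd = find_spikes(cross_correlation(edge_ab, edge_bd, max_lag=8))
delay_at_b_ms = spikes_bc[0] * QUANTUM_MS
print(spikes_bc, spikes_bd, delay_at_b_ms)  # [3] [] 30
```

The pair (AB, BC) yields a spike at lag 3, read as a 30 ms delay at B; the pair (AB, BD) yields no spike, so no dependency is inferred.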
Evaluation with 4-tier RUBiS¹
[Figure: testbed with Clients, Apache Web Server, Tomcat Servers 1 and 2, EJB Servers 1 and 2, and MySQL Server; request classes: bidding, comment; labels: I/O bound, CPU bound]
¹ http://rubis.objectweb.org/
Service Path Detection in RUBiS
[Figure: detected service paths under two configurations, "Round-robin load balancer" and "Static server assignment", with the highest-delay nodes marked]
Change Detection in RUBiS
[Figure: annotation: Injected delay]
Delta Air Lines’ Application: Revenue Pipeline
Total traffic: 1.34 million / day (56k / hour)
[Figure: pipeline stages with logs TACSIN & TACSOUT, APEXIN & APEXOUT, XIN & XOUT, and Error/Warning (Tivoli) logs]
Delta Air Lines’ Application
[Figure: latency (sec) versus time of the day for client requests and TACS; nodes S1, S2, S3, S7, S8; annotation: Huge request burst]
Beyond dependency and latency…
[Figure: service graph with clients C1, C2 and servers S1 to S6]
• Solution: zoom into the service path with SysProf
• No application hints or instrumentation
• Monitor resource usage on a per-class basis
SysProf Methodology
• Track request context
• Work done for processing a request class
• May span user level and kernel level
• Executes in more than one context (e.g. processes, threads, softirqs)
• Is visible through system events (e.g. system calls)
[Figure: user level with applications A1, A2, AN (system call parameters, PID, app functions); kernel level with System Call, Scheduler, Net softirq, Network Stack, FS/VM/etc., Context Switches, Disk I/O, eth driver, BDD; Init CID on the path from client; instrumentation points marked on the path to client]
Class ID Propagation
[Figure: Front-Tier, Middle-Tier, and End-Tier, each split into User and Kernel; labels: Init CID (from client), Packet CID, Process CID, Msg CID, Inherits CID, To client]
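The propagation idea in the figure (a class ID is initialized where the request enters, a process inherits the CID of the message it handles, and outgoing messages carry the process CID) can be mimicked with a toy user-level model. SysProf does this inside the kernel; every name below, including the `init_cid` classifier and the URL-to-class table, is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical mapping from request type to service class ID.
CLASS_IDS = {"/bid": 1, "/comment": 2}


def init_cid(url: str) -> Optional[int]:
    """Assign a class ID when the request first enters the front tier."""
    return CLASS_IDS.get(url)


@dataclass
class Message:
    payload: str
    cid: Optional[int] = None


@dataclass
class Process:
    name: str
    cid: Optional[int] = None
    cpu_ms_by_cid: Dict[Optional[int], int] = field(default_factory=dict)

    def receive(self, msg: Message) -> None:
        self.cid = msg.cid  # process inherits the CID of the incoming message

    def work(self, cpu_ms: int) -> None:
        # Charge resource usage to the class currently being served.
        self.cpu_ms_by_cid[self.cid] = self.cpu_ms_by_cid.get(self.cid, 0) + cpu_ms

    def send(self, payload: str) -> Message:
        return Message(payload, cid=self.cid)  # outgoing message carries the CID


front, middle, end = Process("front"), Process("middle"), Process("end")
msg = Message("GET /bid", cid=init_cid("/bid"))
for tier, cost_ms in ((front, 2), (middle, 5), (end, 7)):
    tier.receive(msg)
    tier.work(cost_ms)
    msg = tier.send(msg.payload)
print(end.cpu_ms_by_cid)  # {1: 7}
```

Because the CID travels with the message, each tier can account for resources per class without any application hints.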
Applications of SysProf
• Resource accounting
• Utility billing
• Bottleneck detection
• Capacity estimation
• Root-cause analysis
• Black-box SLA management
Resource-Aware Adaptive Control
• Separate queue/controller for each cluster
• Cluster workloads contending for the same resources
[Figure: Front-end with Controller + Scheduler and queues for Class 1, Class 2, Class 3; Tomcat Servers 1 and 2, EJB Servers 1 and 2, MySQL Server]
Resource-Aware Adaptive Control
Capacity = 80 req/s per server
[Figure: two panels, "No SysProf" and "With SysProf"]
Summary
• Service Path
  • System abstractions to represent dependencies and request path
• E2EProf and Pathmap
  • Dependency and latency analysis
• SysProf
  • Service-based resource analysis
• Aid human operator and automate end-to-end performance management
Thank You! Questions? Email: sandip@cc.gatech.edu
Pathmap Optimizations
[Figure: processing of a packet timestamp trace with bursty traffic. Labels: Sliding window (W), Run-length compression, Time-series signal or density function, Upper bound on latency, Cross-correlation series]
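Two of the optimizations named in the figure, run-length compression and the sliding window, can be sketched as below. This is an illustrative reading of those labels rather than Pathmap's actual code; the function names, window size, and step are assumptions.

```python
from itertools import groupby


def rle_encode(series):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(series)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the full series."""
    return [value for value, length in runs for _ in range(length)]


def sliding_windows(series, window, step):
    """Yield successive windows of length `window`, advancing by `step` bins."""
    for start in range(0, len(series) - window + 1, step):
        yield series[start:start + window]


# Bursty traffic leaves long idle runs, which compress well.
signal = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 2, 0, 0]
runs = rle_encode(signal)
windows = list(sliding_windows(signal, window=6, step=4))
print(runs)  # [(0, 4), (3, 1), (0, 5), (2, 2), (0, 2)]
```

An upper bound on latency would plausibly serve as the maximum lag when computing the cross-correlation series, so that only lags up to that bound are evaluated.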