Instrumentation Strategies for Response Time Management of Distributed Systems

Instrumentation Strategies for Response Time Management of Distributed Systems Greg Rogers, MACP Consulting

Personal Introduction • Architect for three response time measurement & analysis system projects (project manager for one) • MACP Consulting, 2007 (Measurement, Analysis, Capacity & Performance) • ~20 years in the commercial field: operating system internals; measurement & performance analysis; capacity planning/analytical & statistical modeling; databases • Digital Equipment Corp (Digital, DEC); Compaq Computer Corp; Hewlett-Packard; early career @ Grumman Aerospace • B.S. Statistics, Minor Computer Science, California Polytechnic State University

Time • We have all had an intuitive feel for what time is since we were children • “Are we there yet?” • Next, our parents and teachers made sure our intuitive notions graduated to the quantitative • Early measurement: Learning to read the clock and “tell time”; calculate elapsed time • Time is a sequence of events, one after the other (Einstein?)

Response Time (Rt): Definition; Terminology; Viewpoint • Time measured from the initiation of some action, event or request until completion of the action or event, or initial receipt of the response • Terminals • GUIs • Service time (St) is a subset of Response time (Rt) • Queue analysis • Specifics depend on viewpoint – Where and What • Not philosophical • Viewpoint: The system or component(s) of interest • Where, what part of “the system” is to be examined, to be measured? • In fact, what is the system? • Rt vs. Residence time “in the literature” • System vs. component of system • Lazowska (1984); Gunther (2000, 2005); Menasce (1993)
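The statement that St is a subset of Rt can be made concrete with a small sketch. It assumes a single visit to one component for which three timestamps are available; the `Visit` class and its field names are illustrative, not part of any product or standard.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """One visit of a request to a component (all times in seconds)."""
    arrival: float        # request arrives at the component
    service_start: float  # component starts working on the request
    completion: float     # response leaves the component

    @property
    def response_time(self) -> float:
        # Rt: arrival until completion, waiting included
        return self.completion - self.arrival

    @property
    def service_time(self) -> float:
        # St: only the time the component actually spends on the request
        return self.completion - self.service_start

    @property
    def wait_time(self) -> float:
        # Queueing delay: the part of Rt that is not St
        return self.service_start - self.arrival


v = Visit(arrival=10.0, service_start=10.5, completion=10.75)
print(v.response_time, v.service_time, v.wait_time)
```

With these numbers Rt is 0.75 s, of which 0.25 s is service and 0.5 s is waiting. Which timestamps can actually be captured depends on the viewpoint chosen, as the slide notes.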

Why Measure Response Time? • THE Quality Measurement • Primary business perception of IT • SLAs • Management bragging rights?

Original Flavor, a.k.a. The Good Old Days: Measurement on Monolithic Systems (do they even exist anymore?) • Users connected through ye olde character cell terminals • a.k.a. green screens • User’s transaction normally executed within context of a single system • 2008: Single system = operating system instance = “image”

Monolithic Host [Diagram: users on serial terminals connected to the host; instrumented terminal driver & interactive process]

Distributed Architecture Response Time • Client/Server 2-tier, to 3-, 4-tier, multi-tier distributed systems • End-to-end Rt • Normally viewed from (business) user perspective • Rt of user’s web form entry; click corresponding to some business transaction; etc. • Total time from click or carriage return (initial request) to first character, packet, data item of the response • In other words, sum of time for all visits across architectural tiers by the user’s transaction • Can be measured at client, or just before or at the first tier of the infrastructure
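The "sum of time for all visits" view of end-to-end Rt can be sketched as follows. The visit list is made-up illustration data, and the sketch assumes the visits are sequential and non-overlapping, with network time already folded into the per-visit figures.

```python
from collections import defaultdict


def end_to_end_rt(visits):
    """Sum the elapsed time of every visit the transaction made to any tier."""
    return sum(elapsed for _tier, elapsed in visits)


def time_per_tier(visits):
    """Total time per tier; a tier may be visited more than once."""
    totals = defaultdict(float)
    for tier, elapsed in visits:
        totals[tier] += elapsed
    return dict(totals)


# Hypothetical transaction: web -> app -> db -> app -> web (seconds per visit)
visits = [("web", 0.125), ("app", 0.25), ("db", 0.5), ("app", 0.0625), ("web", 0.0625)]

print(end_to_end_rt(visits))
print(time_per_tier(visits))
```

The per-tier totals are exactly the breakdown that is hard to get in practice when the tiers are not instrumented in an integrated fashion.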

Two-tier Client/Server: Hint of the Explosion (and Troubleshooting Difficulty) To Come… [Diagram: Client sends Request to Server; Server returns Response]

Four-Tier Web Architecture (Three-Tier Measured Rt): The Explosion Is Here! [Diagram: Client, Web, App and DB tiers, with End-to-End Rt marked]

Multi-tier Web End-to-End Rt [Diagram: Client, Web, App and DB tiers plus External System(s)]

Issues With Multi-Tier Distributed Systems • “Sum of time for all visits across architectural tiers by the transaction” (previous slide) • If a transaction is slow, where is the slowdown occurring? • Distributed systems not instrumented in an integrated fashion; i.e., no standards “easily implemented” (development impact) to provide this data to operations/performance/capacity planning • Mythology pervades troubleshooting distributed environments due to lack of essential cross-tier Rt data • Typical Approach: “Guilt by Correlation” • Look at each server (or all servers within a functional tier if one is lucky) • Visually correlate in time, high activity on server(s) in one tier with high activity on server(s) in the next tier

Clocks and Time Measurement for Multiple-tier Distributed Architectures • Time synchronization across systems is critical • Standard: Network Time Protocol (NTP) • One-second to sub-second accuracy across systems on a LAN • Storage subsystems often do not support NTP • Specialized, high accuracy, non-distributed server-attached clocks • Cellular telephone tower clock signals • Global Positioning System (GPS) clock signals • Accuracy to ~tens of microseconds
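To show why synchronization matters for cross-tier timestamps, here is a minimal sketch of the standard NTP clock-offset and round-trip-delay calculation from four timestamps. The function name and the sample values are illustrative; a real NTP client also filters and smooths many such samples.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """
    t1: client clock when the request is sent
    t2: server clock when the request is received
    t3: server clock when the reply is sent
    t4: client clock when the reply is received
    Returns (offset of server clock relative to client, round-trip delay).
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Illustration: server clock is 0.5 s ahead of the client, each network
# direction takes 0.25 s, and the server spends 0.125 s before replying.
offset, delay = ntp_offset_and_delay(t1=100.0, t2=100.75, t3=100.875, t4=100.625)
print(offset, delay)
```

An uncorrected offset of this size would swamp most per-tier response times if timestamps taken on different hosts were subtracted directly.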

Categories of Rt Instrumentation • Active • Host-based, on systems executing business applications • Passive • No software on host systems • Hybrid • Uses both techniques • These definitions tend to be from a server-centric viewpoint • A network-centric viewpoint of active vs. passive might be whether or not traffic is injected onto the network – Krishnamurthy (2001) • Server-centric, since our goal is to provide a breakdown of a large end-to-end response time into each individual tier’s response time • The fact that network and server response time components are measured is incidental to this goal

Active Rt Monitoring Techniques • Host-based - Most common & familiar • Synchronous sampling of event-driven Rt accumulators • Asynchronous (event-driven; i.e., when specific event occurs) • Web server Logging • Rich data source often mined and written to multi-dimensional data warehouses for customer behavior pattern analysis • Can be great source of distributed Rt data but requires custom development to process into usable data • Middleware Logging • Transaction processing, transaction reformat/redirect systems • Also a very rich source, also needs custom development to process • Application-level Logging • Custom routines or standard Application Programming Interfaces (APIs)
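As a small illustration of the custom processing that web server logs need, the sketch below computes mean server-side Rt per URL path. It assumes an Apache-style access log whose last field is the request duration in microseconds (the `%D` format directive appended to the common log format); the sample lines are made up, and other log layouts would need a different pattern.

```python
import re
from collections import defaultdict

# Request line, status, bytes, then the assumed trailing duration in microseconds
LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ (?P<usec>\d+)$'
)


def mean_rt_by_path(lines):
    """Return {path: mean response time in seconds}, skipping unparsable lines."""
    totals = defaultdict(lambda: [0, 0])  # path -> [sum of usec, request count]
    for line in lines:
        m = LINE.search(line.strip())
        if m is None:
            continue
        entry = totals[m["path"]]
        entry[0] += int(m["usec"])
        entry[1] += 1
    return {path: usec / count / 1_000_000 for path, (usec, count) in totals.items()}


sample = [
    '10.0.0.1 - - [01/Jan/2008:10:00:00 -0500] "GET /order HTTP/1.1" 200 512 250000',
    '10.0.0.2 - - [01/Jan/2008:10:00:01 -0500] "GET /order HTTP/1.1" 200 512 750000',
    '10.0.0.3 - - [01/Jan/2008:10:00:02 -0500] "POST /login HTTP/1.1" 302 0 125000',
]
print(mean_rt_by_path(sample))
```

This is the time the web server spent on each request, not the user's end-to-end Rt; it covers only what that tier can see.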

Active Rt Monitoring Techniques, cont’d • Application Response Measurement (ARM) API • Standardization effort initiated by HP & Tivoli, adopted by The Open Group in 1999 • CMG has a dated Q&A still useful as introductory info (ignore links) • http://regions.cmg.org/regions/cmgarmw/armfaq.html • Callable routines in C & Java for developers to instrument their code for collecting Rt data • Current version ARM 4.0 v2 • http://www.opengroup.org/management/arm/ • Brief history: • http://findarticles.com/p/articles/mi_m0EIN/is_1999_Jan_26/ai_53640469 • ARM is a moderately successful standard • SAS implements ARM in their products and exposes it through macros • http://support.sas.com/rnd/scalability/tools/arm/armapi.html • Siebel CRM implements ARM in its “SARM” logging levels, some of which can impact the server and hence are oriented toward debugging, not routine data collection

Active Rt Monitoring Techniques, cont’d • Middleware-dependent APIs • Java Management Extensions (JMX) • http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/ • Java Virtual Machine (JVM) Bytecode Instrumentation • Commercial products used to profile or analyze Java app performance • The best way to peer into that little JVM black box, but I digress… • Dynamically loaded at run-time • Useful source of distributed Rt data • Java method calls (“methods”) to remote systems • In this case, the method Rt is a distributed Rt! • Methods making remote calls can be discovered via sorting method names by Rt • Not a bad way to go in situations where little is known about deeper levels of the application • Make a map of remote call methods • Might be a way to filter further and characterize different transaction classes by a given remote method if the data regularly show high Rt variability
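The idea of discovering remote-call methods by sorting on Rt can be sketched like this. The input is assumed to be (method name, Rt in seconds) samples exported from some profiler; the method names and numbers are invented for illustration, and no particular product's export format is implied.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_methods(samples):
    """Return (method, mean Rt, Rt std deviation) rows, slowest mean first."""
    by_method = defaultdict(list)
    for name, rt in samples:
        by_method[name].append(rt)
    rows = [(name, mean(rts), pstdev(rts)) for name, rts in by_method.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical profiler samples
samples = [
    ("Cart.addItem", 0.002),
    ("InventoryClient.lookup", 0.5),   # suspected remote call
    ("Cart.addItem", 0.004),
    ("InventoryClient.lookup", 1.5),
    ("Price.format", 0.001),
]

for name, avg, spread in rank_methods(samples):
    print(f"{name:25s} mean={avg:.3f}s  stdev={spread:.3f}s")
```

Methods at the top of the list are candidates for the map of remote calls, and a large spread for a given method hints that it may serve more than one transaction class.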

Active Rt Monitoring Techniques, cont’d • Synthetic Sampling • Injecting synthetic requests onto the network from PC-based “robots” and measuring response time of “representative” user transactions • Similar idea to how Keynote Systems measures the top Internet web sites • Has been a popular technique implemented by a number of large vendors for products that sample end-to-end Rt measurements on corporate LANs/WANs • More and more customers are demanding that all their business transactions be measured, not just a small sample of synthetic requests • Applications are typically intolerant of destructive (write) synthetic transactions (e.g., accounting). Workarounds usually circumvent quality of measurement (e.g., hitting the same “dummy” accounts) • Used in isolation, synthetic sampling can and will miss causes of long response times • If already in place, well-implemented synthetic sampling is a “known load” for passive monitoring until the latter can be used to more fully characterize Rt of the real business transaction load
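A synthetic-sampling "robot" reduces to timing a scripted transaction repeatedly. In the sketch below the transaction is a stand-in function that just sleeps; in a real robot it would issue the scripted requests. The function names are illustrative only.

```python
import time
from statistics import median


def sample_rt(transaction, samples=5, pause=0.0):
    """Run the synthetic transaction several times and record each Rt in seconds."""
    rts = []
    for _ in range(samples):
        start = time.perf_counter()
        transaction()
        rts.append(time.perf_counter() - start)
        time.sleep(pause)  # spacing between probes, to limit the injected load
    return rts


def fake_transaction():
    """Placeholder for a scripted, read-only business transaction."""
    time.sleep(0.01)


rts = sample_rt(fake_transaction, samples=3)
print(f"median Rt: {median(rts):.3f} s over {len(rts)} samples")
```

The result describes only the scripted transaction at the moments it ran, which is why sampling used in isolation can miss problems that real user transactions hit.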

Active Rt Monitoring Techniques, cont’d • Insertion of tags, markers, IDs into protocol headers • Relatively recent on the commercial landscape • Custom implementations exist in end-business development organizations • Strong, forward-looking architects and management team seeing business benefit • Commercial: • Agent on host inserts tag into outbound protocol/message header • Agent at next tier reads tag • Logs either locally or centrally for post-processing • Tracks transactions across tiers, calculates time spent on each tier • “The bouncing ball” • May measure all traffic but only some of the time, not always “on” • Custom: • Application instrumented to insert tags in its own requests and responses • Local logging of raw data, post-processing on business system or central system, insertion into central DB with custom-built visualization and reporting software
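The post-processing step, tracking a tagged transaction across tiers and calculating time spent on each, might look like the sketch below. It assumes each tier logs one record per transaction with the tag, the tier name, and local arrival and departure timestamps, and that the tiers form a simple web, app, db chain; the record layout and the `TIER_ORDER` list are assumptions for illustration.

```python
TIER_ORDER = ["web", "app", "db"]  # assumed call chain, front to back


def tier_breakdown(records):
    """
    records: iterable of (tag, tier, t_in, t_out), timestamps in seconds.
    Returns {tag: {tier: time attributable to that tier alone}}.
    """
    # Inclusive time: everything between arrival and departure at a tier,
    # including time spent waiting on the tiers behind it.
    inclusive = {}
    for tag, tier, t_in, t_out in records:
        inclusive.setdefault(tag, {})[tier] = t_out - t_in

    result = {}
    for tag, tiers in inclusive.items():
        exclusive = {}
        for i, tier in enumerate(TIER_ORDER):
            if tier not in tiers:
                continue
            downstream = TIER_ORDER[i + 1] if i + 1 < len(TIER_ORDER) else None
            # Subtract the next tier's inclusive time; what remains is this
            # tier's own time plus the network hop to the next tier.
            exclusive[tier] = tiers[tier] - tiers.get(downstream, 0.0)
        result[tag] = exclusive
    return result


records = [
    ("tx-1", "web", 0.0, 1.0),
    ("tx-1", "app", 0.125, 0.875),
    ("tx-1", "db", 0.25, 0.75),
]
print(tier_breakdown(records))
```

Because each duration is computed from two timestamps taken on the same host, this particular breakdown does not depend on clock synchronization between tiers; ordering events across hosts would.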

Difficulties With Some Active Techniques • Logging and log processing require development resources • Developers may view instrumentation as another potential source of bugs • Can be viewed as a delay factor in time-to-market • Custom implementation requires strong architects and management to make the case and see it through in each development release • Perceived to impact another limited resource: Testing cycles • Logging levels (degree of detail, types of data logged) can be implemented to limit impact, but sometimes the logging level needed for “useful” data imposes significant resource utilization overhead for over-taxed servers, or worse, increases application service time (execution time) overhead, affecting throughput scalability

Passive Rt Monitoring Techniques • Recent technology-driven innovation & economics make passive instrumentation possible • Processors; NICs; PCI-express I/O; Serial Attached SCSI (SAS) disks; open source software • Deeper innovation makes passive instrumentation a reality • Multi-threaded, efficient software design in particular • Passive monitoring may be commonly referred to as network sniffing, but this is misleading – the technology is far more capable and sophisticated than a “network sniffer” implies – This is real time processing of the complete traffic stream • Widely deployed in network security monitoring • Though most solutions do not process the entire packet

Passive Rt Monitoring Techniques, cont’d • Passive Rt measurement techniques read all network packets at strategic points in the network • Either part or all of each packet is processed • Answers the question, “The response time of what?” [component] • Is it a technical item, or a business transaction whose response time a manager would care about? • The more of each packet processed: • The more business value can be delivered – The business context of the transaction is typically at the deepest layer (see slides) • Measured Rt of business-critical transactions (not only technical IT items); time series counts (throughput); per-transaction or per-transaction class resource profiles (network) • The heavier the load on the probe – f(λ) (packet arrival rate) • Some points in a network of multi-tier distributed systems can be real fire hoses! • Major challenge for passive monitoring vendors who try to add value beyond IP, TCP or HTTP headers – do they report dropped packets?
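At its simplest, passive Rt measurement pairs each request seen on the wire with the first packet of its response. The sketch below works on already-captured packet records rather than a live tap. It assumes the server port is known, only payload-carrying packets are present, and each connection has one outstanding request at a time; the record layout is invented for illustration.

```python
def passive_rt(packets, server_port=80):
    """
    packets: iterable of (timestamp, src, sport, dst, dport), in capture order.
    Returns [(connection key, Rt)] where Rt runs from the first request packet
    to the first response packet on that connection.
    """
    pending = {}  # connection key -> timestamp of first request packet
    results = []
    for ts, src, sport, dst, dport in packets:
        if dport == server_port:       # client -> server: part of a request
            key = (src, sport, dst, dport)
            pending.setdefault(key, ts)  # keep only the first request packet
        elif sport == server_port:     # server -> client: part of a response
            key = (dst, dport, src, sport)
            start = pending.pop(key, None)
            if start is not None:      # first response packet closes the measurement
                results.append((key, ts - start))
    return results


packets = [
    (1.0,   "client-a", 40000, "web-1", 80),   # request, first segment
    (1.125, "client-a", 40000, "web-1", 80),   # request, second segment
    (1.5,   "web-1", 80, "client-a", 40000),   # response, first segment
    (1.625, "web-1", 80, "client-a", 40000),   # response, later segment (ignored)
]
print(passive_rt(packets))
```

This stops at the transport level and says nothing about which business transaction was involved; identifying that means parsing deeper into each packet, which is where the probe load and the vendor differences discussed on this slide come in.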

Passive Rt Monitoring Techniques, cont’d • Beware of vendor-speak • Know your terms and exactly what the “client” and “server” are at any point in a logical infrastructure when capabilities are being discussed • Use diagrams and take your time to first understand what is being measured in your infrastructure by the vendor’s solution • Reports, graphs, etc. come afterward • Reporting adds tremendous value but understanding how the fundamental measurements relate to your infrastructure is crucial for determining whether it can solve your business & IT challenges • Assumptions are often unspoken. Ask more than enough questions and get the answers necessary for everyone to be clearly on the same page • Is all of the Rt data acquired passively? Are there points in the infrastructure where the response time measurement solution is active, not passive? At what logging level; i.e., at what level of impact to the [business-critical] server? Trust but verify… • Does the solution measure all traffic, all the time, or only part of the time? Anything less than all traffic is sampling, and can miss causes of long response times • At what part of the packet does the passive solution stop reading? Does it stop at the TCP header and call the rest of it “the application”? • Some are skilled at leaving people with the impression or belief that their solution can do things that it in fact cannot do

Passive Rt Monitoring Across Tiers (Logical Measurement Points In Between Tiers) [Diagram: Client, Web, App and Database tiers, with measured spans labeled User’s End-to-End Rt, Web tier Rt, Web-App tier Rt and App-DB tier Rt] • Complete end-to-end Rt measurement for remote clients may require a single passive measurement point at each client location or active client measurement software

Network-Centric View of Packet: Ethernet Frame | IP Packet Header | TCP Message Header | “Application” (Telnet; FTP; SMTP; DNS; NNTP; HTTP…) | Ethernet Frame CRC

Business/Performance/CP/Application-Centric View of Packet (Business context is often deep inside message body of last protocol) • Example 1: Ethernet Frame | IP Packet Header | TCP Message Header | HTTP Header | XML Message Header | XML Message Body | Ethernet Frame CRC • Example 2: Ethernet Frame | IP Packet Header | TCP Message Header | Proprietary Middleware Message Header | Proprietary Middleware Message Body | Ethernet Frame CRC

Passive Benefits • Development of passive measurement solutions easily proceeds without impact to application development & test cycles • One-time scheduled downtime, connect taps or configure span ports • Everything afterward is, well,… Passive! • Taps electrically prevent probes from inadvertently writing into the production network path • Quality measurements of any and all business transactions • Much additional business data can be filtered, persisted and reported • Aid to testing & development: Visibility • Aid to production: Visibility (knowledge), myth-busting quantitative performance data, improves time to problem resolution • And delicious capacity planning data… • Detailed traces and very precise timing
