The Mystery Machine: End-to-end performance analysis of large-scale Internet services



  1. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch

  2. Internet services are complex. [Figure: many tasks unfolding over time] Scale and heterogeneity make Internet services complex.

  3. Internet services are complex. [Figure: many tasks unfolding over time] Scale and heterogeneity make Internet services complex.

  4. Analysis Pipeline

  5. Step 1: Identify segments

  6. Step 2: Infer causal model

  7. Step 3: Analyze individual requests

  8. Step 4: Aggregate results

  9. Challenges. Previous methods derive a causal model by instrumenting the scheduler and communication, or by building the model through human knowledge. We need a method that works at scale with heterogeneous components.

  10. Opportunities. Component-level logging is ubiquitous, giving tremendous detail about each request's execution, and handling a large number of requests provides coverage of a wide range of behaviors.

  11. The Mystery Machine. 1) Infer a causal model from a large corpus of traces: identify segments, hypothesize all possible causal relationships, and reject hypotheses with observed counterexamples (see the sketch below). 2) Analysis: critical path, slack, anomaly detection, what-if.
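
A minimal sketch of the hypothesis-rejection step in Python. This is not the paper's production pipeline; the trace representation (one dict per trace, mapping segment name to start/end timestamps) is an assumption for illustration.

```python
from itertools import permutations

def infer_causal_model(traces):
    """Hypothesize every pairwise 'A happens before B' relationship,
    then reject any hypothesis with an observed counterexample.

    traces: iterable of dicts, each mapping segment name -> (start, end).
    Returns the surviving set of (predecessor, successor) pairs.
    """
    segments = set()
    for trace in traces:
        segments.update(trace)

    # Step 1: hypothesize all possible causal relationships.
    hypotheses = set(permutations(segments, 2))

    # Step 2: reject hypotheses contradicted by any trace.
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                _, a_end = trace[a]
                b_start, _ = trace[b]
                # If B ever starts before A finishes, A cannot be
                # a prerequisite of B; drop the hypothesis.
                if b_start < a_end:
                    hypotheses.discard((a, b))
    return hypotheses
```

The surviving edges form the causal model that the later analyses (critical path, slack, what-if) treat as a dependency graph over segments.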

  12. Step 1: Identify segments

  13. Define a minimal schema. [Figure: a task divided into Segment 1 and Segment 2, delimited by logged events]

  14. Define a minimal schema. Each log record carries a request identifier, machine identifier, timestamp, task, and event. Aggregate existing logs using this minimal schema.
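
A sketch of the minimal schema as a Python record, plus a helper that turns consecutive events within a task into segments. The field types and the segment-building rule are assumptions inferred from the slide, not the paper's exact code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    """One aggregated log record in the minimal schema."""
    request_id: str   # ties the record to an end-to-end request
    machine_id: str   # host that emitted the record
    timestamp: float  # when the event occurred
    task: str         # e.g. "server", "network", "client"
    event: str        # named point that delimits segment boundaries

def segments_for_task(events):
    """Consecutive events within one task bound its segments."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [
        (a.event, b.event, a.timestamp, b.timestamp)
        for a, b in zip(ordered, ordered[1:])
    ]
```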

  15. Step 2: Infer causal model

  16. Types of causal relationships. [Figure: example orderings of segments A and B across traces at times t1 and t2, including A before B, B before A, either order (OR), and corresponding segments A/A', B/B', C/C' across requests]
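
To make the idea concrete, here is a hedged sketch of classifying how two segments A and B relate, given their observed intervals across many traces. The category names are illustrative and need not match the paper's exact taxonomy.

```python
def classify_pair(observations):
    """Classify the relationship between segments A and B.

    observations: iterable of ((a_start, a_end), (b_start, b_end)),
    one entry per trace in which both segments appear.
    """
    a_first = b_first = overlap = 0
    for (a_start, a_end), (b_start, b_end) in observations:
        if a_end <= b_start:
            a_first += 1
        elif b_end <= a_start:
            b_first += 1
        else:
            overlap += 1

    if overlap:
        return "no ordering found"       # the segments can run concurrently
    if a_first and b_first:
        return "mutually exclusive"      # never overlap, but either order
    if a_first:
        return "A happens before B"
    if b_first:
        return "B happens before A"
    return "no ordering found"           # no observations at all
```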

  17. Producing causal model. [Figure: candidate causal model over segments S1, N1, C1 and S2, N2, C2]

  18. Producing causal model. [Figure: the causal model checked against Trace 1, segments S1, N1, C1, S2, N2, C2 on a timeline]

  19. Producing causal model. [Figure: the causal model updated after Trace 1]

  20. Producing causal model. [Figure: the causal model checked against Trace 2]

  21. Producing causal model. [Figure: the causal model checked against Trace 2, continued]

  22. Producing causal model. [Figure: the causal model updated after Trace 2]

  23. Step 3: Analyze individual requests

  24. Critical path using causal model. [Figure: the causal model and Trace 1 over segments S1, N1, C1, S2, N2, C2 on a timeline]

  25. Critical path using causal model. [Figure: the causal model applied to Trace 1, continued]

  26. Critical path using causal model. [Figure: the critical path through Trace 1]
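
A sketch of critical-path extraction over the inferred model. It assumes the causal model is a DAG whose predecessor lists only name segments measured in the trace, and it uses the standard longest-path formulation; the paper's actual implementation may differ.

```python
from functools import lru_cache

def critical_path(durations, predecessors):
    """Longest (latency-weighted) path through a DAG of segments.

    durations: dict segment -> measured duration in this trace.
    predecessors: dict segment -> list of segments that must finish first
                  (taken from the inferred causal model).
    Returns (total_latency, path as a list of segments).
    """
    @lru_cache(maxsize=None)
    def finish(seg):
        preds = predecessors.get(seg, [])
        if not preds:
            return durations[seg], (seg,)
        # Latest-finishing predecessor determines when this segment starts.
        best_time, best_path = max(finish(p) for p in preds)
        return best_time + durations[seg], best_path + (seg,)

    total, path = max(finish(s) for s in durations)
    return total, list(path)
```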

  27. Step 4: Aggregate results

  28. Inaccuracies of Naïve Aggregation

  29. Inaccuracies of Naïve Aggregation

  30. Inaccuracies of Naïve Aggregation. We need a causal model to correctly understand latency.

  31. High variance in critical path. [Figure: CDFs of percent of critical path for client, server, and network] The critical-path breakdown shifts drastically; the server, network, or client can each dominate latency.

  32. High variance in critical path. [Figure: CDFs of percent of critical path; in 20% of requests the server contributes 10% of latency] The critical-path breakdown shifts drastically; the server, network, or client can each dominate latency.

  33. High variance in critical path. [Figure: CDFs of percent of critical path; in 20% of requests the server contributes 50% or more of latency] The critical-path breakdown shifts drastically; the server, network, or client can each dominate latency.
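
One way such CDFs could be produced from per-request critical-path breakdowns; the breakdown format here is an assumption, not the paper's data layout.

```python
def breakdown_cdf(breakdowns, component):
    """Empirical CDF of one component's share of the critical path.

    breakdowns: list of dicts, one per request, mapping
                "client" / "server" / "network" -> time on the
                critical path for that request.
    Returns sorted (percent_of_critical_path, cumulative_fraction) pairs.
    """
    shares = sorted(
        100.0 * b[component] / sum(b.values()) for b in breakdowns
    )
    n = len(shares)
    return [(share, (i + 1) / n) for i, share in enumerate(shares)]
```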

  34. Diverse clients and networks. [Figure: client, network, and server critical-path breakdowns for three example requests]

  35. Diverse clients and networks. [Figure: client, network, and server critical-path breakdowns for three example requests]

  36. Diverse clients and networks. [Figure: client, network, and server critical-path breakdowns for three example requests]

  37. Differentiated service. When there is slack in server generation time, the server can produce data more slowly and end-to-end latency stays the same; when there is no slack, producing data faster decreases end-to-end latency. Deliver data when it is needed and reduce average response time.

  38. Additional analysis techniques. [Figure: segments S1, N1, C1, S2, N2, C2 on a timeline] Slack analysis; what-if analysis; use the natural variation in a large data set (see the slack sketch below).
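
A sketch of slack analysis under the same DAG assumptions as the critical-path sketch above: a segment's slack is how much longer it could run without lengthening the end-to-end path. This is the textbook earliest/latest-finish formulation, shown to illustrate the idea rather than to reproduce the paper's method.

```python
def slack(durations, predecessors, successors):
    """Per-segment slack in a single trace.

    durations: dict segment -> duration.
    predecessors / successors: adjacency of the inferred causal model,
    restricted to segments present in this trace.
    """
    # Forward pass: earliest finish time of each segment (memoized).
    earliest = {}
    def ef(seg):
        if seg not in earliest:
            start = max((ef(p) for p in predecessors.get(seg, [])), default=0.0)
            earliest[seg] = start + durations[seg]
        return earliest[seg]

    end_to_end = max(ef(s) for s in durations)

    # Backward pass: latest finish time that does not delay the request.
    latest = {}
    def lf(seg):
        if seg not in latest:
            succs = successors.get(seg, [])
            latest[seg] = (
                end_to_end if not succs
                else min(lf(s) - durations[s] for s in succs)
            )
        return latest[seg]

    # Segments on the critical path have zero slack.
    return {s: lf(s) - ef(s) for s in durations}
```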

  39. What-if questions. Does server generation time affect end-to-end latency? Can we predict which connections exhibit server slack?

  40. Server slack analysis. [Figure: end-to-end latency vs. server generation time for requests with slack < 25 ms and with slack > 2.5 s]

  41. Server slack analysis. [Figure: end-to-end latency vs. server generation time for requests with slack < 25 ms and with slack > 2.5 s] With slack under 25 ms, end-to-end latency increases as server generation time increases; with slack above 2.5 s, server generation time has little effect on end-to-end latency.

  42. Predicting server slack. [Figure: second slack (ms) vs. first slack (ms)] Predict slack at the receipt of a request; past slack is representative of future slack.

  43. Predicting server slack. [Figure: second slack (ms) vs. first slack (ms)] The predictor classifies 83% of requests correctly, with a Type I error of 8% and a Type II error of 9%.
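
The slides predict slack at the receipt of a request on the premise that past slack is representative of future slack. A sketch of one such predictor: remember the last slack observed per connection and compare it to a threshold. The 25 ms threshold, the connection-keyed state, and the method names are assumptions; the slides report 83% accuracy with 8% Type I and 9% Type II error for their version.

```python
class SlackPredictor:
    """Predict whether a connection's next request will have server slack,
    using only the slack observed on its previous request."""

    def __init__(self, threshold_ms=25.0):
        self.threshold_ms = threshold_ms   # "has slack" cutoff (assumed)
        self.last_slack = {}               # connection id -> last slack (ms)

    def predict(self, connection_id):
        """True if we expect slack on the next request; defaults to False
        for connections we have not seen before."""
        previous = self.last_slack.get(connection_id)
        return previous is not None and previous >= self.threshold_ms

    def observe(self, connection_id, measured_slack_ms):
        """Record the slack actually measured for this request."""
        self.last_slack[connection_id] = measured_slack_ms
```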

  44. Conclusion

  45. Questions
