Pinpoint: Problem Determination in Large, Dynamic Internet Services


Presentation Transcript


  1. Pinpoint: Problem Determination in Large, Dynamic Internet Services M. Chen, E. Kıcıman, E. Fratkin, A. Fox and E. Brewer Presenter Spiros Xanthos

  2. Motivation • “Traditional” root cause analysis techniques are based on static dependency models • It is difficult to generate and maintain an accurate model of a constantly evolving Internet service • Application-level knowledge of the systems being monitored is required in order to identify faults • Pinpoint instead proposes a dynamic problem determination technique

  3. Outline • Data Clustering Approach • Request Tracing and Failure Detection • Statistical Analysis • Pinpoint Implementation • Evaluation • Limitations • Conclusions

  4. Data Clustering Approach • Dynamically trace real client requests through the system • For each request, record its believed success or failure in each of the components it uses • Perform data clustering to correlate the failures with the components that are most likely to have caused them

  5. Client Request Tracing • As a client request travels through the system, we record all the components it uses • Each request is assigned a unique ID • The system can then record an <ID, Component> pair every time a request arrives at a component (a minimal sketch follows)
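A minimal sketch of this recording step in Java, assuming a hypothetical TraceLog sink; the paper's actual tracing framework is built into the modified J2EE middleware rather than into application code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace-log sink for <ID, Component> pairs; in the paper
// this recording happens inside the middleware, not application code.
final class TraceLog {
    // One entry per component touched by a request.
    record Entry(long requestId, String component) {}

    private final List<Entry> entries = new ArrayList<>();

    // Called whenever request `requestId` enters `component`.
    synchronized void record(long requestId, String component) {
        entries.add(new Entry(requestId, component));
    }

    // Snapshot handed to the statistical analysis stage.
    synchronized List<Entry> snapshot() {
        return List.copyOf(entries);
    }
}
```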

  6. Client Request Tracing [Diagram: requests #1 and #2 flow through components A, B, and C; the internal tracing framework records the pairs <1,A>, <1,B>, <2,B>, <2,C> in a trace log, which feeds the statistical analysis.]

  7. Failure Detection • A failure detection system attempts to determine whether requests are completing successfully • Internal failure detection (sketched below) • Failed assertions • Exceptions • External failure detection • Complete service failures, such as system crashes
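A hedged sketch of the internal case: a wrapper that runs one component invocation on behalf of a request and records its believed success or failure. FaultLog and MonitoredCall are illustrative names, not the paper's API.

```java
// Hypothetical fault-log sink; the entry format mirrors the slides
// ("1, Success", "2, Fail", ...).
final class FaultLog {
    void markSuccess(long requestId) {
        System.out.println(requestId + ", Success");
    }
    void markFailure(long requestId, Throwable cause) {
        System.out.println(requestId + ", Fail: " + cause);
    }
}

final class MonitoredCall {
    private final FaultLog faults = new FaultLog();

    // Runs one component invocation for a request, recording its
    // believed success or failure (exceptions, failed assertions).
    void invoke(long requestId, Runnable componentCall) {
        try {
            componentCall.run();
            faults.markSuccess(requestId);
        } catch (RuntimeException | AssertionError e) {
            faults.markFailure(requestId, e);
            throw e; // still propagate the fault to the caller
        }
    }
}
```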

  8. Client Request Tracing with Failure Detection [Diagram: as in slide 6, requests #1 and #2 flow through components A, B, and C, and the tracing framework records <1,A>, <1,B>, <2,B>, <2,C> in the trace log; a failure-detection (F/D) system additionally writes a fault log (1, Success; 2, Fail; …), and both logs feed the statistical analysis.]

  9. Data Analysis - Clustering • Perform data clustering to correlate the failures of requests with the components most likely to have caused them • Attempt to find the components, or combinations of components, that are most highly correlated with the failures of requests (one plausible scoring step is sketched below)
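A minimal sketch of one plausible correlation score, assuming per-request component traces and failure verdicts are already available: each component is scored by the Jaccard similarity between "requests that used it" and "requests that failed". The paper's full analysis clusters components and failures together rather than ranking components one by one.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FailureCorrelation {
    // Jaccard similarity: |intersection| / |union| of the two sets.
    static double jaccardScore(Set<Long> usedBy, Set<Long> failed) {
        Set<Long> intersection = new HashSet<>(usedBy);
        intersection.retainAll(failed);
        Set<Long> union = new HashSet<>(usedBy);
        union.addAll(failed);
        return union.isEmpty() ? 0.0
                               : (double) intersection.size() / union.size();
    }

    // usedBy: component -> IDs of requests that touched it;
    // failed: IDs of requests the failure detector flagged.
    // Higher score = more correlated with the observed failures.
    static void rank(Map<String, Set<Long>> usedBy, Set<Long> failed) {
        usedBy.forEach((component, requests) ->
            System.out.printf("%s -> %.2f%n",
                component, jaccardScore(requests, failed)));
    }
}
```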

  10. Pinpoint Implementation • Modified the J2EE Reference Implementation with: • client request tracing, and • simple fault detection • Every client HTTP request gets a unique ID (a filter-based sketch follows) • The internal fault detection mechanism logs exceptions • An external sniffer detects TCP errors, such as resets and timeouts, and HTTP errors (404, etc.)
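A sketch of how a unique per-request ID could be attached in a servlet container, using a ThreadLocal so downstream components can log against it. The paper instruments the J2EE middleware itself; this servlet-filter variant is an illustrative assumption, not the actual implementation.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import javax.servlet.*;

// Illustrative filter: tags each incoming HTTP request with a unique
// ID that downstream components can read when logging trace entries.
public final class RequestIdFilter implements Filter {
    private static final AtomicLong NEXT_ID = new AtomicLong();
    public static final ThreadLocal<Long> CURRENT_ID = new ThreadLocal<>();

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain)
            throws IOException, ServletException {
        CURRENT_ID.set(NEXT_ID.incrementAndGet()); // assign the request ID
        try {
            chain.doFilter(req, res); // components read CURRENT_ID.get()
        } finally {
            CURRENT_ID.remove(); // don't leak IDs across pooled threads
        }
    }
}
```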

  11. Evaluation • Used the J2EE PetStore demo application • Injected 4 different types of faults: • Declared exceptions, such as Java RemoteExceptions or IOExceptions • Undeclared exceptions, such as runtime exceptions • Infinite loops, where the called component never returns control to the caller • Null calls, where the called component is never actually executed (each type is sketched below)
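Illustrative sketches of the four fault types, written as if injected into application methods; in the paper the faults are injected through the modified middleware, so these bodies are assumptions about the shape of each fault, not the actual injection mechanism.

```java
import java.rmi.RemoteException;
import java.util.function.Supplier;

final class InjectedFaults {
    // Declared exception: checked, visible in the throws clause.
    static void declaredException() throws RemoteException {
        throw new RemoteException("injected declared fault");
    }

    // Undeclared exception: a runtime exception callers don't expect.
    static void undeclaredException() {
        throw new IllegalStateException("injected runtime fault");
    }

    // Infinite loop: the called component never returns to the caller.
    static void infiniteLoop() {
        while (true) { Thread.onSpinWait(); }
    }

    // Null call: skip the real component entirely and hand back null.
    static Object nullCall(Supplier<Object> realComponent) {
        return null; // realComponent.get() is never invoked
    }
}
```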

  12. Metrics • Accuracy: • Suppose two components, A and B, interact to cause a failure • Identifying only one of them, or neither, is not accurate • Accuracy measures how often the results include all of the actually faulty components • Precision: • If A and B interact to cause a failure and the system predicts the set {A, B, C, D, E}, the precision is 2/5 = 40% (a worked sketch follows)
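A worked sketch of both metrics for a single prediction, using the slide's own example; reading accuracy as "the predicted set contains all actually faulty components" is my interpretation of the slide.

```java
import java.util.Set;

final class Metrics {
    // Accurate only if every actually faulty component was identified.
    static boolean accurate(Set<String> predicted, Set<String> actual) {
        return predicted.containsAll(actual);
    }

    // Fraction of the predicted set that is actually faulty.
    static double precision(Set<String> predicted, Set<String> actual) {
        long correct = predicted.stream().filter(actual::contains).count();
        return predicted.isEmpty() ? 0.0
                                   : (double) correct / predicted.size();
    }

    public static void main(String[] args) {
        Set<String> actual = Set.of("A", "B");
        Set<String> predicted = Set.of("A", "B", "C", "D", "E");
        System.out.println(accurate(predicted, actual));  // true
        System.out.println(precision(predicted, actual)); // 0.4
    }
}
```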

  13. Comparison Techniques • Detection returns the set of components recorded by the internal fault detection framework • Dependency checking returns the components that the failed requests use • A component is nominated as a potential fault if it occurs in more than some percentage of the failed requests • With sensitivity set to 0%, dependency checking returns only components that occurred in 100% of the failed requests (sketched below)
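A minimal sketch of the dependency-checking baseline as described above; the fractional threshold parameter and the map layout are assumptions made for illustration.

```java
import java.util.*;

final class DependencyChecking {
    // failedTraces: failed request ID -> components that request used.
    // threshold is a fraction; 1.0 corresponds to the 0%-sensitivity
    // setting, nominating only components in 100% of failed requests.
    static Set<String> nominate(Map<Long, Set<String>> failedTraces,
                                double threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> components : failedTraces.values())
            for (String c : components)
                counts.merge(c, 1, Integer::sum);

        int total = failedTraces.size();
        Set<String> nominated = new TreeSet<>();
        counts.forEach((component, n) -> {
            if (total > 0 && (double) n / total >= threshold)
                nominated.add(component);
        });
        return nominated;
    }
}
```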

  14. Results • Pinpoint achieves an accuracy of 80-90% with a relatively high precision of 50-60% [Chart: Accuracy vs. False Positive Rate, Overall]

  15. Results [Chart: results for single-component faults]

  16. Limitations • Cannot distinguish between sets of components that are tightly coupled and used together • It does not work with faults that corrupt state and affect subsequent requests. • Deterministic failures due to pathological inputs cannot be distinguished from other failures

  17. More Limitations… • Cannot capture “fail-stutter” faults, where components mask faults internally and exhibit only a decrease in performance • Internal fault detectors are difficult to create; Java forces developers to write exception-handling code • More…

  18. Conclusions • Pinpoint is an effective problem determination technique • Better than previous methods because it does not rely on static dependency models • Pinpoint requires no application-level knowledge of the systems being monitored, nor any knowledge of the requests • Still has some limitations: • tightly coupled components, state-corrupting faults
