
Pip: Detecting the Unexpected in Distributed Systems



Presentation Transcript


  1. Pip: Detecting the Unexpected in Distributed Systems
     • Patrick Reynolds (reynolds@cs.duke.edu)
     • Collaborators: Charles Killian, Amin Vahdat (UCSD)
     • Janet Wiener, Jeffrey Mogul, Mehul Shah (HP Labs)
     • http://issg.cs.duke.edu/pip/

  2. Introduction
     • Distributed systems are essential to the Internet
     • Distributed systems are complex and harder to debug than centralized systems
     • Pip uses causal paths to find bugs
       • Causal paths characterize system behavior
       • Deviations from expected behavior often indicate bugs
     • Experimental results show that Pip is effective: it found 19 bugs in 6 test systems
     NSDI - May 8th, 2006

  3. Challenges of debugging distributed systems
     • Distributed systems have more bugs
       • Parallelism is hard
       • New sources of failure: more components mean more faults, network errors, security breaches
     • Finding bugs is harder
       • More nodes, events, and messages to keep track of
       • Applications may cross administrative domains
       • Bugs on one node may be caused by events on another

  4. Causal path analysis
     [Figure: a request's causal path through a Web server, App server, and Database, annotated with a 500ms delay and 2000 page faults]
     • Programmers wish to examine and check system-wide behaviors
     • Causal paths capture:
       • Components of end-to-end delay
       • Attribution of resource consumption
     • Unexpected behavior might indicate a bug
     • Pip compares actual behavior to expected behavior

  5. What kinds of bugs?
     [Figure: the same Web server / App server / Database path, with the 500ms delay and 2000 page faults highlighted]
     • Structure bugs: incorrect placement or timing of processing and communication
     • Performance bugs: throughput bottlenecks; over- or under-consumption of resources
     • Pip finds bugs present in an actual run of the system
       • Selecting input for path coverage is an orthogonal problem

  6. Three target audiences
     • Primary programmer: debugging or optimizing their own system
     • Secondary programmer: inheriting a project or joining a programming team
       • Discovery: learning how the system behaves
     • Operator: monitoring a running system for unexpected behavior; performing regression tests after a change

  7. Outline
     • Introduction
     • Pip
     • Expressing expected behavior
     • Exploring application behavior
     • Results
     • Conclusions

  8. Workflow
     [Diagram: Application → behavior model + expectations → Pip checker → unexpected structure and resource violations → Pip explorer (visualization GUI)]
     Pip:
     • Captures events from a running system
     • Reconstructs behavior from events
     • Checks behavior against expectations
     • Displays unexpected behavior, both structure and resource violations
     Goal: help programmers locate and explain bugs

  9. Describing application behavior
     [Figure: timeline across WWW, App srv, and DB hosts showing the tasks Parse HTTP, Run application, Query, and Send response]
     • Application behavior consists of paths
       • All events, on any node, related to one high-level operation
     • What constitutes a path is programmer-defined
     • A path is often causal, related to a user request

  10. Describing application behavior
     [Figure: the same timeline, annotated with the notices “Request = /cgi/…”, “2096 bytes in response”, and “done with request 12”]
     • Within paths are tasks, messages, and notices
       • Tasks: processing with start and end points
       • Messages: send and receive events for any communication, including network, synchronization (lock/unlock), and timers
       • Notices: time-stamped strings; essentially log entries
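The tasks, messages, and notices on this slide can be modeled as simple records. This is an illustrative sketch only; the class names and fields are my assumptions, not Pip's actual data structures:

```python
# Minimal sketch of Pip's path abstraction: a path groups the tasks,
# messages, and notices belonging to one high-level operation.
from dataclasses import dataclass, field

@dataclass
class Task:                  # processing with start and end points
    name: str
    start: float             # timestamps in seconds
    end: float

@dataclass
class Message:               # any communication: network, locks, timers
    src_host: str
    dst_host: str
    send_time: float
    recv_time: float

@dataclass
class Notice:                # time-stamped string; essentially a log entry
    timestamp: float
    text: str

@dataclass
class Path:                  # all events related to one high-level operation
    path_id: int
    tasks: list = field(default_factory=list)
    messages: list = field(default_factory=list)
    notices: list = field(default_factory=list)

# The slide's example request, reconstructed as one path:
p = Path(12)
p.tasks.append(Task("Parse HTTP", 0.000, 0.004))
p.notices.append(Notice(0.001, "Request = /cgi/..."))
```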

  11. Outline
     • Introduction
     • Pip
     • Expressing expected behavior
     • Exploring application behavior
     • Results
     • Conclusions

  12. Expectations
     [Figure: recognizers R1, R2, R3 classifying paths into sets]
     • Pip has two kinds of expectations
       • Recognizers classify paths into sets
       • Aggregates assert properties of sets

  13. Expectations: recognizers
     • Application behavior consists of paths
     • Each recognizer matches paths
       • A path can match more than one recognizer
     • A recognizer can be a validator, an invalidator, or neither
     • Any path matching zero validators or at least one invalidator is unexpected behavior (a possible bug)

     validator CGIRequest
       thread WebServer(*, 1)
         task("Parse HTTP") limit(CPU_TIME, 100ms);
         notice(m/Request URL: .*/);
         send(AppServer);
         recv(AppServer);

     invalidator DatabaseError
       notice(m/Database error: .*/);
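The classification rule above can be sketched in a few lines. Recognizers are modeled here as plain predicate functions; real Pip recognizers are compiled from the expectations language, and the two example predicates below are hypothetical stand-ins for the slide's CGIRequest and DatabaseError:

```python
# Pip's rule: a path is unexpected (a possible bug) if it matches
# zero validators or at least one invalidator.

def classify(path, validators, invalidators):
    """Return 'valid' or 'unexpected' for one reconstructed path."""
    matches_validator = any(r(path) for r in validators)
    matches_invalidator = any(r(path) for r in invalidators)
    if not matches_validator or matches_invalidator:
        return "unexpected"
    return "valid"

# Hypothetical recognizers mirroring the slide's example.
def cgi_request(path):
    return "Parse HTTP" in path["tasks"]

def database_error(path):
    return any(n.startswith("Database error:") for n in path["notices"])

good = {"tasks": ["Parse HTTP"], "notices": []}
bad  = {"tasks": ["Parse HTTP"], "notices": ["Database error: timeout"]}
print(classify(good, [cgi_request], [database_error]))  # valid
print(classify(bad,  [cgi_request], [database_error]))  # unexpected
```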

  14. Expectations: recognizers language
     • repeat: matches a ≤ n ≤ b copies of a block
         repeat between 1 and 3 { … }
     • xor: matches any one of several blocks
         xor { branch: … branch: … }
     • future: lets a block match now or later
     • done: forces the named block to match
         future F1 { … } … done(F1);

  15. Expectations: aggregate expectations
     • Recognizers categorize paths into sets
     • Aggregates make assertions about sets of paths
       • Instances, unique instances, resource constraints
     • Simple math and set operators

     assert(instances(CGIRequest) > 4);
     assert(max(CPU_TIME, CGIRequest) < 500ms);
     assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
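The three aggregate assertions above can be sketched over a set of matched paths. The operator names mirror the slide's expectations language; the sample data and millisecond units are assumptions for illustration, not Pip's implementation:

```python
# Aggregates assert properties of a set of paths that matched one
# recognizer (here, a hypothetical set of CGIRequest paths).

def instances(paths):
    return len(paths)

def agg_max(metric, paths):       # max() in the expectations language
    return max(p[metric] for p in paths)

def avg(metric, paths):
    return sum(p[metric] for p in paths) / len(paths)

cgi_requests = [                  # per-path resource metrics, in ms
    {"CPU_TIME": 120, "REAL_TIME": 300},
    {"CPU_TIME": 90,  "REAL_TIME": 250},
    {"CPU_TIME": 200, "REAL_TIME": 400},
    {"CPU_TIME": 150, "REAL_TIME": 350},
    {"CPU_TIME": 110, "REAL_TIME": 280},
]

# The slide's three assertions, checked against the sample set:
assert instances(cgi_requests) > 4
assert agg_max("CPU_TIME", cgi_requests) < 500
assert agg_max("REAL_TIME", cgi_requests) <= 3 * avg("REAL_TIME", cgi_requests)
```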

  16. Automating expectations and annotations
     [Diagram: Application → Annotations → behavior model → Expectations → Pip checker → Unexpected behavior]
     • Expectations can be generated from the behavior model
       • Create a recognizer for each actual path
       • Eliminate repetition
       • Strike a balance between over- and under-specification
     • Annotations can be generated by middleware, as in several of our test systems

  17. Exploring behavior
     • The expectations checker generates lists of valid and invalid paths
     • Explore both sets
       • Why did invalid paths occur?
       • Is any unexpected behavior misclassified as valid?
         • Insufficiently constrained expectations
         • Pip may be unable to express all expectations

  18. Visualization: causal paths
     [Screenshot: the explorer GUI showing a causal view of one path, timing and resource properties for one task, and the caused tasks, messages, and notices on that thread]

  19. Visualization: timelines
     • View one path instance as a timeline
     • Different colors = different hosts
     • Each bar = one task
     • Click a bar to see task details

  20. Visualization: performance graphs
     [Graph: per-path delay (ms) plotted over time (s)]
     • Plot per-task or per-path resource metrics
       • Cumulative distribution (CDF)
       • Probability density (PDF)
       • Changing values over time
     • Click on a point to see its value and the task/path represented

  21. Results
     • We have applied Pip to several distributed systems:
       • FAB: distributed block store
       • SplitStream: DHT-based multicast protocol
       • Others: RanSub, Bullet, SWORD, Oracle of Bacon
     • We have found unexpected behavior in each system
     • We have fixed bugs in some systems … and used Pip to verify that the behavior was fixed

  22. Results: SplitStream (DHT-based multicast protocol)
     13 bugs found, 12 fixed
     • 11 found using expectations, 2 found using the GUI
     • Structural bug: some nodes have up to 25 children when they should have at most 18
       • This bug was fixed and later reoccurred
       • Root cause #1: variable shadowing
       • Root cause #2: failed to register a callback
       • How discovered: first in the explorer GUI, then confirmed with automated checking

  23. Results: FAB (distributed block store)
     1 bug found and fixed
     • Four protocols checked: read, write, Paxos, membership
     • Performance bug: quorum operations call nodes in arbitrary order
       • Should overlap computation with communication when possible

  24. Results: FAB (distributed block store), continued
     • For disk-bound workloads, call self last
     • For cache-bound workloads, call self second (actually, last in the quorum)

  25. Conclusions
     • Causal paths are a useful abstraction of distributed system behavior
     • Expectations serve as a high-level description
       • Summary of inter-component behavior and timing
       • Regression test for structure and performance
     • Finding unexpected behavior can help us find bugs
       • Both structure and performance bugs
       • Visualization can expose additional bugs
     http://issg.cs.duke.edu/pip/

  26. Extra slides
     • Related work
     • Pip results: RanSub
     • Pip vs. printf
     • Pip resource metrics
     • Building a behavior model
     • Annotations
     • Checking expectations
     • Visualization: communication graph

  27. Related work
     • Causal paths
       • Pinpoint [Chen, 2004]
       • Magpie [Barham, 2004]
     • Black-box inference
       • Finding stepping stones [Zhang, 1997]
       • Wavelets to debug network performance [Huang, 2001]
     • Expectation-checking systems
       • PSpec [Perl, 1993]
       • Meta-level compilation [Engler, 2000]
       • Paradyn [Miller, 1995]
     • Model checking
       • MaceMC [Killian, 2006]
       • VeriSoft [Godefroid, 2005]

  28. Pip results: RanSub
     2 bugs found, 1 fixed
     • Structural bug: during the first round of communication, parent nodes send summary messages before hearing from all children
       • Root cause: uninitialized state variables
     • Performance bug: linear increase in end-to-end delay for the first ~2 minutes
       • Suspected root cause: a data structure listing all discovered nodes

  29. Pip vs. printf
     • Both record interesting events to check off-line
     • Pip imposes structure and automates checking
     • Pip generalizes ad hoc approaches

  30. Pip resource metrics
     • Real time
     • User time, system time
       • CPU time = user + system
       • Busy time = CPU time / real time
     • Major and minor page faults (paging and allocation)
     • Voluntary and involuntary context switches
     • Message size and latency
     • Number of messages sent
     • Causal depth of path
     • Number of threads and hosts in the path
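Several of these metrics can be collected directly from the operating system. The sketch below uses Python's standard-library resource module (Unix-only); mapping these counters onto Pip's metric names is my assumption, and the counters here are cumulative for the whole process rather than per-task as in Pip:

```python
# Collecting resource metrics similar to Pip's, via getrusage(2).
import resource
import time

t0 = time.monotonic()
sum(i * i for i in range(200_000))          # some work to measure
real_time = time.monotonic() - t0

ru = resource.getrusage(resource.RUSAGE_SELF)
user_time = ru.ru_utime                     # user CPU time (whole process)
system_time = ru.ru_stime                   # system CPU time
cpu_time = user_time + system_time          # CPU time = user + system
busy = cpu_time / real_time if real_time > 0 else 0.0  # busy time ratio

major_faults, minor_faults = ru.ru_majflt, ru.ru_minflt      # paging, allocation
voluntary_cs, involuntary_cs = ru.ru_nvcsw, ru.ru_nivcsw     # context switches
print(f"cpu={cpu_time:.3f}s faults={major_faults}/{minor_faults} "
      f"cs={voluntary_cs}/{involuntary_cs}")
```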

  31. Building a behavior model
     The model consists of paths constructed from events recorded by the running application. Sources of events:
     • Annotations in source code: the programmer inserts statements manually
     • Annotations in middleware: the middleware inserts annotations automatically (faster and less error-prone)
     • Passive tracing or interposition: easier, but less information
     • Or any combination of the above

  32. Annotations
     [Figure: the WWW / App srv / DB timeline, showing where each annotation type applies]
     • Set path ID
     • Start/end task
     • Send/receive message
     • Notice
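The four annotation types above can be sketched as a tiny event-recording API. Pip's real annotation library is a C API; the function names and event tuples below are illustrative only, and the send size comes from the earlier slide's example notices:

```python
# Hypothetical annotation API: each call appends a record to a local
# trace, which Pip later reconciles into paths.
events = []

def set_path_id(pid):           events.append(("path_id", pid))
def start_task(name):           events.append(("start_task", name))
def end_task(name):             events.append(("end_task", name))
def send_message(msg_id, size): events.append(("send", msg_id, size))
def recv_message(msg_id):       events.append(("recv", msg_id))
def notice(text):               events.append(("notice", text))

# Annotating the slide deck's example request:
set_path_id(12)
start_task("Parse HTTP")
notice("Request = /cgi/...")
end_task("Parse HTTP")
send_message(msg_id=7, size=2096)   # message to the app server
```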

  33. Checking expectations
     [Pipeline: Application → Traces → Reconciliation → Events database → Path construction → Paths → Expectation checking (with Expectations) → Categorized paths]
     • Reconciliation: match start/end task and send/receive message events
     • Path construction: organize events into causal paths
     • Expectation checking: for each path P and each recognizer R, does R match P? Then check each aggregate
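The reconciliation and path-construction steps can be sketched as follows. The trace record layout is an assumption for illustration, not Pip's on-disk format:

```python
# Reconciliation pairs send/receive records by message id; path
# construction then groups all events by path id into causal paths.
from collections import defaultdict

trace = [
    {"path": 12, "type": "send", "msg": 7, "t": 0.004},
    {"path": 12, "type": "recv", "msg": 7, "t": 0.006},
    {"path": 12, "type": "notice", "text": "done with request 12", "t": 0.009},
]

# Reconciliation: match each send with its receive to compute latency.
sends = {e["msg"]: e for e in trace if e["type"] == "send"}
message_latency = {}
for e in trace:
    if e["type"] == "recv":
        message_latency[e["msg"]] = e["t"] - sends[e["msg"]]["t"]

# Path construction: organize events into causal paths by path id.
paths = defaultdict(list)
for e in trace:
    paths[e["path"]].append(e)

print(round(message_latency[7], 3))   # 0.002 (2 ms latency for message 7)
```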

  34. Visualization: communication graph
     • Graph view of all host-to-host network traffic
     • Nodes = hosts; edges = network links in use
     • A directed edge indicates unidirectional communication
