
Distributed Debugging


Presentation Transcript


  1. Distributed Debugging Presenter: Chi-Hung Lu

  2. Problems • Distributed applications are hard to validate • Distribution of application state across many distinct execution environments • Protocols involve complex interactions among a collection of networked machines • Need to handle failures ranging from network problems to crashing nodes • Intricate sequences of events can trigger complex errors as a result of mishandled corner cases

  3. Approaches • Logging-based Debugging • X-Trace • Bi-directional Distributed BackTracker (BDB) • Pip • Deterministic Replay • WiDS • Friday • Jockey • Model Checking • MaceMC

  4. X-Trace: A Pervasive Network Tracing Framework R. Fonseca et al., NSDI '07

  5. Problem Description • It is difficult to diagnose the source of a problem in an Internet application • Current network diagnostic tools focus on only one particular protocol • Application information is not shared among the user, the service, and the network operators

  6. Examples • traceroute • Can locate IP connectivity problems • Cannot reveal proxy or DNS failures • HTTP monitoring suite • Can locate application problems • Cannot diagnose routing problems

  7.–10. Examples • Diagram slides stepping through a request among the user, the proxy, the web server, and the DNS server

  11. X-Trace • An integrated tracing framework • Record the network paths that were taken • Invoke X-Trace when initiating an application task • Insert X-Trace metadata carrying a task identifier in the request • Propagate the metadata down to lower layers through protocol interfaces
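To make the metadata handoff concrete, here is a minimal C++ sketch of embedding a task identifier in a request and passing it down through a protocol interface; the XTraceMetadata fields are simplified, and make_http_request and tcp_send are illustrative names, not the X-Trace API:

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

struct XTraceMetadata {
    uint64_t task_id;   // same value for every operation of one task
    uint64_t op_id;     // identifies this particular operation
};

// Application layer: embed the metadata in the request itself
// (e.g., an HTTP header), so the next hop can read it.
std::string make_http_request(const XTraceMetadata& md) {
    std::ostringstream req;
    req << "GET / HTTP/1.1\r\n"
        << "X-Trace: " << std::hex << md.task_id << ":" << md.op_id << "\r\n\r\n";
    return req.str();
}

// Lower layer: the metadata is handed down through the protocol
// interface so the transport can tag its own operations too.
void tcp_send(const std::string& payload, const XTraceMetadata& md) {
    std::cout << "[tcp] sending " << payload.size()
              << " bytes for task " << std::hex << md.task_id << "\n";
}

int main() {
    XTraceMetadata md{0x1234, 1};             // created when the task starts
    std::string req = make_http_request(md);  // same ID rides in the request
    tcp_send(req, md);                        // and is pushed down a layer
}
```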

  12. Task Tree • X-Trace tags all network operations resulting from a particular task with the same task identifier • The task tree is the set of network operations connected with an initial task • The task tree can be reconstructed offline from the collected trace reports

  13. An example of the task tree • A simple HTTP request through a proxy

  14. X-Trace Components • Data • X-Trace metadata • Network path • Task tree • Report • Reconstruct task tree
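As a rough illustration of report-based reconstruction, the following C++ sketch rebuilds a task tree from parent/child edges carried in reports; the Report fields here are assumptions (real reports also carry layer, host, and timing information):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Report {
    uint64_t task_id;
    uint64_t op_id;
    uint64_t parent_id;   // 0 marks the root operation
    std::string label;    // e.g. "HTTP GET", "DNS lookup"
};

using Children = std::multimap<uint64_t, const Report*>;  // parent -> ops

// Depth-first walk that prints the tree with indentation.
void walk(const Children& c, uint64_t parent, int depth) {
    auto range = c.equal_range(parent);
    for (auto it = range.first; it != range.second; ++it) {
        std::cout << std::string(depth * 2, ' ') << it->second->label << "\n";
        walk(c, it->second->op_id, depth + 1);
    }
}

int main() {
    // Reports arrive out of order from different machines.
    std::vector<Report> reports = {
        {7, 2, 1, "proxy: HTTP GET"},
        {7, 1, 0, "client: HTTP GET"},
        {7, 3, 2, "web server: HTTP GET"},
        {7, 4, 1, "client: DNS lookup"},
    };
    Children index;
    for (const auto& r : reports)
        if (r.task_id == 7) index.insert({r.parent_id, &r});
    walk(index, 0, 0);   // prints the reconstructed task tree
}
```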

  15.–16. Propagation of X-Trace Metadata • Diagram slides showing the propagation of X-Trace metadata through the task tree

  17. The X-Trace metadata

  18.–19. Operation of X-Trace Metadata • Diagram slides illustrating the operations that propagate the metadata
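The NSDI '07 paper names two propagation primitives: pushNext(), which tags the next operation in the same layer, and pushDown(), which tags the operation one layer below. The C++ sketch below approximates their effect on simplified metadata fields; it is not the actual X-Trace library API:

```cpp
#include <cstdint>
#include <random>

struct XTraceMetadata {
    uint64_t task_id;
    uint64_t op_id;       // identifies this operation in the tree
    uint64_t parent_id;   // edge back to the causing operation
};

// Fresh operation IDs; the real format uses fixed-width fields.
uint64_t fresh_op_id() {
    static std::mt19937_64 rng{std::random_device{}()};
    return rng();
}

// Same layer, next hop: the new operation's parent is the current one.
XTraceMetadata pushNext(const XTraceMetadata& cur) {
    return {cur.task_id, fresh_op_id(), cur.op_id};
}

// Down one layer: same mechanics, but invoked through the protocol
// interface when a lower layer takes over the request.
XTraceMetadata pushDown(const XTraceMetadata& cur) {
    return {cur.task_id, fresh_op_id(), cur.op_id};
}

int main() {
    XTraceMetadata root{0xCAFE, fresh_op_id(), 0};
    XTraceMetadata next  = pushNext(root);   // e.g., proxy forwards the request
    XTraceMetadata below = pushDown(next);   // e.g., HTTP hands off to TCP
    (void)below;
}
```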

  20.–22. X-Trace Report Architecture • Diagram slides of the report collection architecture

  23. Usage Scenario (1) • Web request and recursive DNS queries

  24. Usage Scenario (2) • A request fault annotated with user input

  25. Usage Scenario (3) • A client and a server communicate over the i3 overlay network

  26.–28. Usage Scenario (3) • Diagram slides introducing the Internet Indirection Infrastructure (i3)

  29. Usage Scenario (3) • Tree for normal operation

  30. Usage Scenario (3) • The receiver host fails

  31. Usage Scenario (3) • Middlebox process crash

  32. Usage Scenario (3) • The middlebox host fails

  33. Discussion • Report loss • Non-tree request structures • Partial deployment • Managing report traffic • Security considerations

  34. WiDS Checker: Combating Bugs in Distributed Systems X. Liu et al., NSDI '07

  35. Problem Description • Log mining is both labor-intensive and fragile • Latent bugs are often distributed across multiple nodes • Logs capture incomplete information about an execution • Distributed applications are non-deterministic

  36. Goals • Efficiently verify application properties • Provide fairly complete information about an execution • Reproduce the buggy runs deterministically and faithfully

  37. Approach • Log the actual execution of a distributed system • Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs • Output violation reports along with message traces • An execution is interpreted as a sequence of events, which are dispatched to corresponding handling routines

  38. Components • A versatile script language • Allows a developer to refine system properties into straightforward assertions • A checker • Inspects executions for violations
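The checker's actual script syntax is not shown in these slides; as a hedged C++ rendering of the kind of assertion such a script refines a property into, consider a mutual-exclusion predicate evaluated over a global snapshot (NodeState and the lock property are illustrative, not part of WiDS):

```cpp
#include <cassert>
#include <vector>

struct NodeState {
    bool holds_lock;   // observed from a replayed node's memory
};

// Safety predicate evaluated over the global snapshot the checker
// assembles after each event: at most one node may hold the lock.
bool lock_mutual_exclusion(const std::vector<NodeState>& nodes) {
    int holders = 0;
    for (const auto& n : nodes)
        if (n.holds_lock) ++holders;
    return holders <= 1;
}

int main() {
    std::vector<NodeState> snapshot = {{true}, {false}, {false}};
    // A failure here would become a violation report with traces.
    assert(lock_mutual_exclusion(snapshot));
}
```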

  39. Architecture • Components of WiDS Checker

  40. Architecture • Reproduce real runs • Log all non-deterministic events using Lamport's logical clock • Check user-defined predicates • A versatile scripting language specifies the system states being observed and the predicates for invariants and correctness • Screen out false alarms with auxiliary information for liveness properties • Trace root causes using a visualization tool

  41. Programming with WiDS • WiDS APIs are mostly member functions of the WiDSObject class • The WiDS runtime maintains an event queue to buffer pending events and dispatches them to corresponding handling routines
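A minimal C++ sketch of that event-queue model follows, assuming simplified Event and handler types; the real WiDSObject interface is richer than this:

```cpp
#include <functional>
#include <iostream>
#include <queue>

struct Event {
    int type;
    std::function<void()> handler;   // routine bound to this event
};

class WiDSObject {
public:
    void post(Event e) { queue_.push(std::move(e)); }

    // The runtime drains the queue, dispatching each event to its
    // handling routine in arrival order.
    void run() {
        while (!queue_.empty()) {
            Event e = std::move(queue_.front());
            queue_.pop();
            e.handler();
        }
    }

private:
    std::queue<Event> queue_;   // buffers pending events
};

int main() {
    WiDSObject node;
    node.post({1, [] { std::cout << "timer fired\n"; }});
    node.post({2, [] { std::cout << "message received\n"; }});
    node.run();
}
```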

  42. Enabling Replay • Logging • Log all WiDS nondeterminism • Redirect OS calls and log the results • Embed a Lamport clock in each outgoing message • Checkpoint • Support partial replay • Save the WiDS process context • Replay • Start from the beginning or from a checkpoint • Replay events in serialized Lamport order
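The Lamport-clock stamping that makes replay order deterministic can be sketched as follows; the class below is a textbook logical clock, not the WiDS logging API:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

struct LamportClock {
    uint64_t t = 0;
    uint64_t local_event() { return ++t; }     // tick before any local event
    uint64_t on_send()     { return ++t; }     // stamp embedded in the message
    uint64_t on_receive(uint64_t msg_t) {      // receipt advances past the sender
        t = std::max(t, msg_t) + 1;
        return t;
    }
};

int main() {
    LamportClock a, b;
    uint64_t stamp = a.on_send();   // A sends: the message carries the stamp
    b.on_receive(stamp);            // B receives: its clock jumps past A's
    std::cout << "replay order: send@" << stamp << " < recv@" << b.t << "\n";
    // During replay, logged events are re-executed in this serialized
    // Lamport order, so causally related events never invert.
}
```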

  43. Checker • Observe memory state • Define states and evaluate predicates • Refresh the database for each event • Maintain history • Re-evaluate modified predicates • Auxiliary information for violations • Liveness properties are only guaranteed to hold eventually
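A hedged sketch of that per-event check cycle, with all names illustrative; for simplicity it re-runs every predicate on each event, whereas the slide describes re-evaluating only the modified ones:

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>

using State     = std::map<std::string, int>;            // checker's state database
using Predicate = std::function<bool(const State&)>;

int main() {
    State state;
    std::map<std::string, Predicate> predicates = {
        {"non_negative_balance",
         [](const State& s) { return s.at("balance") >= 0; }},
    };

    // Each replayed event updates named state; the checker then
    // re-evaluates the predicates and reports any violation.
    auto apply_event = [&](const std::string& key, int value) {
        state[key] = value;
        for (const auto& [name, pred] : predicates)
            if (!pred(state))
                std::cout << "violation: " << name << "\n";  // would carry message traces
    };

    apply_event("balance", 10);
    apply_event("balance", -3);   // triggers a violation report
}
```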


  46. Visualization Tools • Message flow graph

  47. Evaluation • Benchmark and result summary

  48. Performance • Running time for evaluating predicates

  49. Logging Overhead • Percentage of logging time
