Fay

Extensible Distributed Tracing from Kernels to Clusters

Úlfar Erlingsson, Google Inc.

Marcus Peinado, Microsoft Research

Simon Peter, Systems Group, ETH Zurich

Mihai Budiu, Microsoft Research

Wouldn’t it be nice if…
  • We could know what our clusters were doing?
  • We could ask any question, … easily, using one simple-to-use system.
  • We could collect answers extremely efficiently … so cheaply we may even ask continuously.
Let’s imagine...
  • Applying data-mining to cluster tracing
  • Bag of words technique
    • Compare documents w/o structural knowledge
    • N-dimensional feature vectors
    • K-means clustering
  • Can apply to clusters, too! (see the sketch below)
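
To make the analogy concrete, here is a minimal C# sketch of bag-of-words feature extraction for traces (illustrative only; none of these names are Fay API): each machine's trace interval is a "document", each kernel function a "word", and the resulting count vector is one point for K-means.

using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: a trace interval's feature vector is the count of
// calls per kernel function, exactly like word counts in bag-of-words.
class FeatureVector
{
    readonly Dictionary<ulong, int> counts = new Dictionary<ulong, int>();

    // Record one observed call of the function at this address.
    public void Observe(ulong functionAddr)
    {
        counts.TryGetValue(functionAddr, out var n);
        counts[functionAddr] = n + 1;
    }

    // Project onto a fixed ordering of functions, so vectors from
    // different machines are comparable dimension-by-dimension.
    public double[] ToArray(IReadOnlyList<ulong> dimensions) =>
        dimensions.Select(d => counts.TryGetValue(d, out var n) ? (double)n : 0.0)
                  .ToArray();
}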
Cluster-mining with Fay
  • Automatically categorize cluster behavior, based on system call activity
    • Without measurable overhead on the execution
    • Without any special Fay data-mining support
Fay K-Means Behavior-Analysis Code

var kernelFunctionFrequencyVectors =
    cluster.Function(kernel, "syscalls!*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt,
             (k, g) => new { key = k, count = g.Count() });

Vector Nearest(Vector pt, Vectors centers) {
    var near = centers.First();
    foreach (var c in centers)
        if (Norm(pt - c) < Norm(pt - near))
            near = c;
    return near;
}

Vectors OneKMeansStep(Vectors vs, Vectors cs) {
    return vs.GroupBy(v => Nearest(v, cs))
             .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Vectors KMeans(Vectors vs, Vectors cs, int K) {
    for (int i = 0; i < K; ++i)
        cs = OneKMeansStep(vs, cs);
    return cs;
}
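
Norm is used by Nearest but never defined on the slides; a minimal sketch, assuming a Vector type with a length and an indexer (both assumptions here):

// Assumed helper for Nearest(): squared Euclidean norm. Comparing squared
// norms preserves the ordering Nearest needs, so the sqrt can be skipped.
static double Norm(Vector v)
{
    double sum = 0;
    for (int i = 0; i < v.Length; ++i)
        sum += v[i] * v[i];
    return sum;
}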

Fay vs. Specialized Tracing
  • Could’ve built a specialized tool for this
    • Automatic categorization of behavior (Fmeter)
  • Fay is general, but can efficiently do
    • Tracing across abstractions, systems (Magpie)
    • Predicated and windowed tracing (Streams)
    • Probabilistic tracing (Chopstix)
    • Flight recorders, performance counters, …
Key Takeaways

Fay: Flexible monitoring of distributed executions

    • Can be applied to existing, live Windows servers
  • Single query specifies both tracing & analysis
    • Easy to write & enables automatic optimizations
  • Pervasively data-parallel, scalable processing
    • Same model within machines & across clusters
  • Inline, safe machine-code at tracepoints
    • Allows us to do computation right at data source
K-Means: Single, Unified Fay Query

var kernelFunctionFrequencyVectors =
    cluster.Function(kernel, "*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt,
             (k, g) => new { key = k, count = g.Count() });

Vector Nearest(Vector pt, Vectors centers) {
    var near = centers.First();
    foreach (var c in centers)
        if (Norm(pt - c) < Norm(pt - near))
            near = c;
    return near;
}

Vectors OneKMeansStep(Vectors vs, Vectors cs) {
    return vs.GroupBy(v => Nearest(v, cs))
             .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Vectors KMeans(Vectors vs, Vectors cs, int K) {
    for (int i = 0; i < K; ++i)
        cs = OneKMeansStep(vs, cs);
    return cs;
}
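
Two details worth noting when the halves compose: in KMeans above, K counts refinement iterations (the number of clusters is fixed by the size of the initial-centers set), and the query's (key, count) rows must first be pivoted into one vector per machine and interval. A hypothetical glue sketch (ToVector, kernelFunctionDimensions, and initialCenters are assumed names, not Fay API):

// Hypothetical glue: pivot the query's grouped counts into one
// function-frequency vector per (machine, interval), then cluster.
var featureVectors =
    kernelFunctionFrequencyVectors
        .GroupBy(r => new { r.key.Machine, r.key.Interval })
        .Select(g => ToVector(g, kernelFunctionDimensions));

var behavior = KMeans(featureVectors, initialCenters, 10);  // 10 refinement steps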

Fay is Data-Parallel on Cluster
  • View trace query as distributed computation
  • Use cluster for analysis
Fay is Data-Parallel on Cluster
  • System call trace events
  • Fay does early aggregation & data reduction
  • Fay knows what’s needed for later analysis
  • K-Means analysis
  • Fay builds an efficient processing plan from query
Fay is Data-Parallel within Machines
  • Early aggregation
    • Inline, in OS kernel
    • Reduce dataflow & kernel/user transitions
  • Data-parallel per core/thread (see the sketch below)
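
A minimal sketch of the early-aggregation idea (hypothetical code, not Fay's actual runtime API): each core accumulates counts in a local table, and only the small per-core tables are merged and passed on, rather than a stream of raw events crossing the kernel/user boundary.

using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of per-core early aggregation: probes update a
// core-local table (no cross-core sharing), and a periodic flush merges
// the small tables instead of shipping every raw event up the pipeline.
class PerCoreAggregator
{
    readonly Dictionary<ulong, long>[] perCore;

    public PerCoreAggregator(int cores)
    {
        perCore = Enumerable.Range(0, cores)
                            .Select(_ => new Dictionary<ulong, long>())
                            .ToArray();
    }

    // Called from a probe; runs on the given core, so no locking is needed.
    public void Observe(int core, ulong functionAddr)
    {
        var t = perCore[core];
        t.TryGetValue(functionAddr, out var n);
        t[functionAddr] = n + 1;
    }

    // Merge per-core tables into one small summary for the next stage.
    public Dictionary<ulong, long> Flush() =>
        perCore.SelectMany(t => t)
               .GroupBy(kv => kv.Key, kv => kv.Value)
               .ToDictionary(g => g.Key, g => g.Sum());
}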
Processing w/o Fay Optimizations

[Diagram: K-Means pipeline, system-call stage and clustering stage, without Fay's optimizations.]

  • Collect data first (on disk)
    • Reduce later
    • Inefficient, can suffer data overload
Traditional Trace Processing

[Diagram: K-Means pipeline, system-call stage and clustering stage, as traditional trace processing.]

  • First log all data (a deluge)
  • Process later (centrally)
  • Compose tools via scripting
Takeaways so far

Fay: Flexible monitoring of distributed executions

  • Single query specifies both tracing & analysis
  • Pervasively data-parallel, scalable processing
Safety of Fay Tracing Probes
  • A variant of XFI used for safety [OSDI’06]
    • Works well in the kernel or any address space
    • Can safely use existing stacks, etc.
    • Used instead of a language interpreter (as in DTrace)
    • Allows arbitrary, efficient, stateful computation
  • Probes can access thread-local/global state
  • Probes can try to read any address
    • I/O registers are protected (see the sketch below)
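
A sketch of the guarded-read behavior described above (hypothetical code, not Fay's API; XFI enforces this with inline machine-code checks rather than a library call): a probe may attempt to read any address, but protected ranges are refused and bad reads fail cleanly instead of crashing the kernel.

static class GuardedMemory
{
    // Placeholder checks: a real implementation consults the platform's
    // I/O-region map and page tables. Both helpers are hypothetical.
    static bool IsProtectedRange(ulong addr) => false;   // e.g., I/O register pages
    static bool IsMappedReadable(ulong addr) => true;    // e.g., page-table lookup

    // Vetted reads either succeed or fail cleanly, never fault.
    public static unsafe bool TryReadWord(ulong addr, out ulong value)
    {
        value = 0;
        if (IsProtectedRange(addr) || !IsMappedReadable(addr))
            return false;
        value = *(ulong*)addr;
        return true;
    }
}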
Key Takeaways, Again

Fay: Flexible monitoring of distributed executions

  • Single query specifies both tracing & analysis
  • Pervasively data-parallel, scalable processing
  • Inline, safe machine-code at tracepoints
Installing and Executing Fay Tracing
  • Fay runtime on each machine
  • Fay module in each traced address space
  • Tracepoints at hotpatched function boundary

[Diagram: Fay architecture. A trace query reaches the per-machine tracing runtime, which creates XFI-verified probes; hotpatched tracepoints in the kernel or a target user-space process dispatch into a probe in about 200 cycles, and trace events flow out via ETW.]

Low-level Code Instrumentation

Module with a traced function Foo:

Caller:
        ...
        e8 ab 62 ff ff        call Foo
        ...

        ff 15 08 e7 06 00     call [Dispatcher]   ; written into the padding before Foo
Foo:    eb f8                 jmp Foo-6           ; patched 1st opcode
        cc cc cc
Foo2:   57                    push rdi            ; original function body
        ...
        c3                    ret
Fay platform module:

Dispatcher:
        t = lookup(return_addr)     /* find tracepoint state for Foo */
        ...
        call t.entry_probes
        ...
        call t.Foo2_trampoline      /* runs the displaced code, then Foo */
        ...
        call t.return_probes
        ...
        return                      /* to after call Foo */
[Diagram: XFI-verified Fay probe functions (PF3, PF4, PF5) invoked from the dispatcher.]
  • Replace the 1st opcode of traced functions
  • Fay dispatcher called via trampoline
  • Fay calls the function, and entry & exit probes
What’s Fay’s Performance & Scalability?
  • Fay adds 220 to 430 cycles per traced function
  • Fay adds 180% CPU to trace all kernel functions
  • Both approximately 10x faster than DTrace and SystemTap

[Chart: null-probe overhead, in cycles and as slowdown (x), comparing Fay with DTrace and SystemTap.]

Fay Scalability on a Cluster
  • Fay tracing memory allocations, in a loop:
    • Ran workload on a 128-node, 1024-core cluster
    • Spread work over 128 to 1,280,000 threads
    • 100% CPU utilization
  • Fay overhead was 1% to 11% (mean 7.8%)
More Fay Implementation Details
  • Details of query-plan optimizations
  • Case studies of different tracing strategies
  • Examples of using Fay for performance analysis
  • Fay is based on LINQ and Windows specifics
    • Could build on Linux using Ftrace, Hadoop, etc.
  • Some restrictions apply currently
    • E.g., skew towards batch processing due to Dryad
Conclusion
  • Fay: Flexible tracing of distributed executions
  • Both expressive and efficient
    • Unified trace queries
    • Pervasive data-parallelism
    • Safe machine-code probe processing
  • Often equally efficient as purpose-built tools
A Fay Trace Query

from io in cluster.Function("iolib!Read")
where io.time < Now.AddMinutes(5)
let size = io.Arg(2)   // request size in bytes
group io by size/1024 into g
select new { sizeInKilobytes = g.Key,
             countOfReadIOs = g.Count() };

  • Aggregates read activity in iolib module
    • Across cluster, both user-mode & kernel
    • Over 5 minutes

  • Specifies what to trace
    • 2nd argument of read function in iolib
  • And how to aggregate
    • Group into kb-size buckets and count