
The Dryad ecosystem

Presentation Transcript


  1. The Dryad ecosystem Rebecca Isaacs Microsoft Research Silicon Valley

  2. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  3. Data-parallel programming • Partition large data sets and process the pieces in parallel • Programming frameworks have made this easy • The execution environment (e.g. Dryad, Hadoop) deals with scheduling of tasks, movement of data, and fault tolerance • A high-level language (e.g. DryadLINQ, Pig Latin) allows the programmer to express the parallelism in a declarative fashion

  4. Dryad (Isard et al, EuroSys 07) • Generalized MapReduce • Programs are dataflow graphs (DAGs) • Vertices (nodes) connected by channels (edges) • Channels are implemented as shared-memory FIFOs, TCP streams, or files • The scheduler dispatches vertices onto machines to run the program
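
To make the graph abstraction concrete, the following is a hypothetical, stripped-down C# sketch of a Dryad-style job graph. It is not the real Dryad API; the type names and fields are assumptions made purely for illustration.

     // A hypothetical, stripped-down illustration of a Dryad-style job graph.
     // This is not the real Dryad API; it only shows the shape of the abstraction.
     using System.Collections.Generic;

     enum ChannelKind { SharedMemoryFifo, TcpStream, File }

     class Vertex
     {
         public string Program;                         // the code this vertex runs
         public List<Channel> Inputs = new List<Channel>();
         public List<Channel> Outputs = new List<Channel>();
     }

     class Channel
     {
         public Vertex Source;
         public Vertex Destination;
         public ChannelKind Kind;                       // chosen by the runtime at execution time
     }

     class JobGraph
     {
         public List<Vertex> Vertices = new List<Vertex>();
         public List<Channel> Channels = new List<Channel>();
         // The scheduler walks this DAG and dispatches runnable vertices
         // (those whose inputs are available) onto cluster machines.
     }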

  5. Dryad components [Figure: the job manager (JM) holds the job schedule and uses the control plane to dispatch vertices (V) to daemon processes (D) on the cluster machines; on the data plane, vertices exchange data over files, FIFOs, or the network.]

  6. Dryad computations [Figure: an example Dryad graph. Vertices (processes) are grouped into stages (here labelled R, X, and M) and connected by channels; the graph reads from input files and writes to output files.]

  7. DryadLINQ (Yu et al, OSDI 08) • LINQ is a set of .NET constructs for programming with datasets • Relational databases, XML, ... • Supported by new language features in C#, Visual Basic, F# • Lazy evaluation on the data source • DryadLINQ extends LINQ with • Partitioned datasets • Some additional operators • Compilation of LINQ expressions into data-parallel operations expressed as a Dryad dataflow graph
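
For readers unfamiliar with LINQ, a minimal plain-LINQ query in C# is shown below; the query is built lazily and only evaluated when the results are enumerated. This is standard LINQ, shown purely to illustrate the programming model that DryadLINQ builds on.

     using System;
     using System.Linq;

     class LinqExample
     {
         static void Main()
         {
             int[] data = { 5, 1, 8, 3, 9 };

             // Building the query executes nothing yet (lazy evaluation).
             var query = data.Where(x => x > 3)
                             .Select(x => x * 2);

             // The query runs only when the results are enumerated.
             foreach (var v in query)
                 Console.WriteLine(v);   // prints 10, 16, 18
         }
     }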

  8. DryadLINQ example • Join: find the lines in a file that start with one of the keywords in a second file

     DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(mydata);
     DryadTable<LineRecord> keywords = ddc.GetTable<LineRecord>(keys);
     IQueryable<LineRecord> matches = table.Join(keywords,
         l1 => l1.line.Split(' ').First(), /* first key */
         l2 => l2.line,                    /* second key */
         (l1, l2) => l1);                  /* keep first line */

  9. Dryad execution graph for join • Data file has 2 partitions • Keys file has 1 partition • Work is distributed: each word is sent to a machine chosen by hashing the word • 2 partitions for the output file

  10. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status Joint work with Paul Barham and Richard Black (MSR Cambridge/Silicon Valley) and Simon Peter and Timothy Roscoe (ETH Zurich)

  11. How do vertices behave? • Use the Performance Analyzer tool from Microsoft (search for xperf on MSDN) • Detailed view of one vertex • “Select” operation • Reads and writes 1 million 1 KB records on local disk

  12. Select vertex, version 1 • Hardware: • 2 quad-core processors, 2.66GHz • 2 disks configured as a striped volume

  13. [Figure: xperf trace of the Select vertex on the two-disk machine, showing Disk 1 utilization, Disk 2 utilization, and CPU utilization; reads in red, writes in blue.]

  14. Select vertex, version 2 • Hardware: • 1 quad-core processor, 2GHz • 1 disk

  15. [Figure: xperf trace of the Select vertex on the single-disk machine, showing disk utilization and CPU utilization; reads in red, writes in blue.]

  16. View of thread activity • Data is read and then written in batches • The reader thread issues the reads; other threads pick up the I/O completions, sometimes issuing writes • NB: the processors are 95% idle during execution of this vertex

  17. Observations • The bottleneck resource changes every few seconds • And may not be 100% utilized • Vertices are multi-threaded, consuming multiple resources simultaneously • Dryad is engineered for throughput • Sequential I/O • Batched in 256KB chunks • Requests are pipelined, typically 4+ deep • Most DryadLINQ vertices are standard data processing operators with predictable behaviour

  18. Factors affecting vertex execution times • Hardware: • CPU speed • Number of CPUs • Disk transfer rate • Network transfer rate • Workload: • I/O size • We assume file access patterns stay the same • Placement relative to parent(s): • Channels can be local (read from or write to disk) • Or remote (read from remote or local file via SMB)

  19. Key idea: identify vertex phases • Trace a reference execution of the vertex • Identify phases within which resource demands are consistent • Phase boundaries are when the resource demands change • E.g. start reading, stop reading, etc. • Phases that are similar in terms of resource consumption are grouped together
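
A minimal sketch of this phase-identification step is given below, assuming the reference trace has already been reduced to fixed-length samples of CPU and disk demand. The type names, tolerance, and boundary test are illustrative assumptions, not the actual analyser.

     using System;
     using System.Collections.Generic;

     // One sample of resource demand taken from the reference trace.
     record Sample(double CpuMs, double DiskMs);

     // A phase is a maximal run of samples with similar resource demands.
     record Phase(int Start, int End, double CpuMs, double DiskMs);

     static class PhaseFinder
     {
         // Split the trace at points where the demand profile changes noticeably.
         public static List<Phase> Find(IList<Sample> trace, double tolerance = 0.25)
         {
             var phases = new List<Phase>();
             int start = 0;
             double cpu = 0, disk = 0;

             for (int i = 0; i < trace.Count; i++)
             {
                 cpu += trace[i].CpuMs;
                 disk += trace[i].DiskMs;
                 bool boundary = i + 1 == trace.Count ||
                                 Differs(trace[i + 1], trace[i], tolerance);
                 if (boundary)
                 {
                     phases.Add(new Phase(start, i, cpu, disk));
                     start = i + 1;
                     cpu = disk = 0;
                 }
             }
             return phases;   // similar phases could then be grouped together
         }

         static bool Differs(Sample a, Sample b, double tol) =>
             Math.Abs(a.CpuMs - b.CpuMs) > tol * Math.Max(b.CpuMs, 1) ||
             Math.Abs(a.DiskMs - b.DiskMs) > tol * Math.Max(b.DiskMs, 1);
     }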

  20. Phases in the Select vertex

  21. Phases in the Select vertex [Figure: the Select vertex trace segmented into phases, each annotated with its resource demands, e.g. Dcpu = 40ms, Dcpu = 70ms, Ddisk = 30ms, Dcpu = 20ms, Ddisk = 40ms.]

  22. Predicting phase runtimes • Each phase has the attributes: • Type: read, write, both, compute, overhead • “Concurrency histogram” • File being read/written • Number of bytes read/written, i.e. demands on each resource • Simple operational laws can be applied to each phase individually • Can predict its runtime on different hardware
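
As an illustration of the kind of operational-law calculation involved, the sketch below scales a phase's measured demands by the relative speed of the target hardware and takes the slower resource as the bound on the phase's runtime. This is a hedged simplification: the machine speed factors and the assumption that CPU and disk work overlap fully within a phase are mine, not the model's exact form.

     using System;

     // Relative hardware characteristics (reference machine = 1.0).
     record Machine(double CpuSpeedFactor, double DiskRateFactor);

     static class PhasePredictor
     {
         // Demands are those measured on the reference machine, in milliseconds.
         public static double PredictPhaseMs(double cpuDemandMs, double diskDemandMs,
                                             Machine target)
         {
             // Scale demands to the target hardware.
             double cpu = cpuDemandMs / target.CpuSpeedFactor;
             double disk = diskDemandMs / target.DiskRateFactor;

             // CPU and disk work are assumed to overlap within a phase, so the
             // slower resource bounds the phase runtime (bottleneck analysis).
             return Math.Max(cpu, disk);
         }
     }

     // Example: a phase with Dcpu = 40ms and Ddisk = 30ms on the reference machine,
     // predicted for a machine with a CPU half as fast and the same disk:
     // PredictPhaseMs(40, 30, new Machine(0.5, 1.0)) == 80ms.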

  23. Expectations of accuracy • Inherent variability in running times: • Layout of file on disk • Inner or outer track • File fragmentation • Background processes • Logging and scanning services • Unanticipated network effects • Model deficiencies • Memory contention • Caching • Garbage collection • Prediction within 30% of actual would be good...

  24. Prediction accuracy evaluation • Merge vertex, 1 input and 1 output, averaged over 10 runs • Predict running time on different hardware

  25. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  26. The parallelism spectrum [Figure: the spectrum of parallel platforms, spanning shared-memory multiprocessors (cores sharing memory and disk), homogeneous clusters and data centres, and small heterogeneous clusters.]

  27. Ad-hoc clusters • Small, heterogeneous clusters are everywhere • In the workplace • In my house... and yours? • Could be pretty useful for data-parallel programming • Data mining • Video editing • Scientific applications • ...

  28. A data-parallel programming framework for ad-hoc clusters? • Why? • Exploit unused machines with no hardcoded assumptions about hardware and availability • “Easy” to write and run the code • Why not? • Heterogeneity: a poor schedule can hurt performance badly • Built-in assumptions about failure don’t apply • Our solution: • Construct vertex performance models • Apply a constraint-based search procedure to find a good assignment of vertices to the physical computers

  29. Default scheduling in Dryad • DryadLINQ compiler creates an XML description of the vertices and how they are connected • JobManager places the vertices on available nodes according to constraints specified in the XML file • Greedy scheduling approach • Programmer and/or DryadLINQ compiler can provide hints

  30. Heterogeneity can cause problems for greedy scheduling

  31. Add a performance-aware planner to the end-to-end picture [Figure: a logging service on each node produces CPU and I/O logs; a vertex phase analyser turns these into vertex phase summaries; the performance planner combines the summaries with the XML job graph to produce an updated XML graph.]

  32. Planning algorithm • Implemented with a constraint logic programming system (ECLiPSe) • Constraints prune the search space • Heuristics reduce search time • E.g. decide where to place the longest-running vertices first • Greedy schedule gives an upper bound
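
A toy sketch of this search strategy is shown below: vertices are considered longest-first, a greedy schedule seeds the bound, and branches whose partial makespan already exceeds the bound are pruned. The real planner is written in the ECLiPSe constraint logic programming system; this C# version, its cost matrix, and its makespan objective are illustrative assumptions only.

     using System.Linq;

     static class Planner
     {
         // cost[v, m] = predicted runtime of vertex v on machine m (from the phase models).
         public static int[] Plan(double[,] cost)
         {
             int nV = cost.GetLength(0), nM = cost.GetLength(1);

             // Heuristic: decide the placement of the longest-running vertices first.
             int[] order = Enumerable.Range(0, nV)
                 .OrderByDescending(v => Enumerable.Range(0, nM).Min(m => cost[v, m]))
                 .ToArray();

             // A simple greedy schedule gives an initial upper bound on makespan.
             int[] best = Greedy(cost, order, out double bound);

             var assign = new int[nV];
             var load = new double[nM];

             void Search(int depth)
             {
                 if (depth == nV)
                 {
                     double mk = load.Max();
                     if (mk < bound) { bound = mk; best = (int[])assign.Clone(); }
                     return;
                 }
                 int v = order[depth];
                 for (int m = 0; m < nM; m++)
                 {
                     load[m] += cost[v, m];
                     assign[v] = m;
                     if (load[m] < bound) Search(depth + 1);   // prune branches already worse than the bound
                     load[m] -= cost[v, m];
                 }
             }

             Search(0);
             return best;
         }

         static int[] Greedy(double[,] cost, int[] order, out double makespan)
         {
             int nM = cost.GetLength(1);
             var load = new double[nM];
             var assign = new int[cost.GetLength(0)];
             foreach (int v in order)
             {
                 int m = Enumerable.Range(0, nM).OrderBy(x => load[x] + cost[v, x]).First();
                 assign[v] = m;
                 load[m] += cost[v, m];
             }
             makespan = load.Max();
             return assign;
         }
     }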

  33. Contention between vertices [Figure: Gantt charts of the Merge, Hash, and Join vertices scheduled without and with the contention model; the time axes are marked at 50, 100, 150, and 200.]

  34. Workloads for experimental evaluation

  35. Physical configuration of the cluster • 3 machines, quite heterogeneous

  36. Overall speed-up vs greedy

  37. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  38. Edison • New project • Position paper to appear at HotOS 11 • Joint work with Moises Goldszmidt • Performance problems in Dryad clusters • Resource contention • Data or computation skew • Hardware issues • Often transient • Use active intervention • Re-run the vertex in a sandbox on the cluster • Construct experiments using its causal model • Systematically probe behavior: fix some variables while altering others

  39. Circuit blueprint • Given partial observations, lets us make inferences about the state of the circuit • If we intervene and fix some inputs, lets us make inferences about the state of the circuit [Figure: an example logic circuit with gates G1–G4 and inputs A and B.]

  40. Blueprint of a vertex [Figure: a causal model of a vertex, with nodes for disk congestion, network congestion, phase, data_in rate, data size, CPU, reading time, computing time, and running time.] • A blueprint for inferring state from both observations and interventions • Answer “what-if” questions • Root-cause analysis

  41. Overview • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  42. Quincy (Isard et al, SOSP 09) • Need to share the cluster between multiple, concurrently executing jobs • Goals are fairness and data locality • If job x takes t seconds when it is run exclusively on the cluster, then x should take no more than Jt seconds when the cluster has J jobs • Because very large datasets are stored on the cluster itself, unnecessary data movement is costly • These goals conflict • Optimal data locality => delay the job until resources are available • Fairness => allocate resources as soon as they are available

  43. Quincy (cont) • Strategies for fairness: • Sacrifice locality • Kill already running jobs • Admission control • Fairness is achieved at a cost to throughput

  44. Quincy (cont) • Quantify every scheduling decision • Data transfer cost • Cost in wasted time if a task is killed • Express scheduling problem as a flow network • Represent all worker tasks that are ready to run with preferred locations, and all currently running tasks • Edge weights and capacities encode the scheduling policy • Produces a set of scheduling assignments for all jobs at once that satisfy a global criterion • Solve online with standard min-cost flow algorithm • Graph is updated whenever anything changes
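
The sketch below shows, in rough form, the kind of flow network this encoding produces: each ready task pushes one unit of flow, machines drain into a sink, and edge costs favour machines that already hold the task's input data. It is heavily simplified relative to the paper (no per-job unscheduled nodes, rack aggregators, or running-task edges), and all names and cost values are illustrative assumptions.

     using System.Collections.Generic;

     class FlowNode
     {
         public string Name;
         public FlowNode(string name) { Name = name; }
     }

     class FlowEdge
     {
         public FlowNode From, To;
         public int Capacity;
         public double Cost;   // e.g. expected data transfer, or wasted work if a task is killed
     }

     static class QuincySketch
     {
         // A standard min-cost flow solver over this graph decides whether each task's
         // unit of flow reaches a specific machine (good locality) or the cluster
         // aggregator (run anywhere), and hence which assignments to make.
         public static List<FlowEdge> Build(List<string> tasks,
                                            List<string> machines,
                                            Dictionary<string, List<string>> preferred)
         {
             var edges = new List<FlowEdge>();
             var sink = new FlowNode("sink");
             var cluster = new FlowNode("cluster");   // aggregator node: "run on any machine"
             var machineNodes = new Dictionary<string, FlowNode>();

             foreach (var m in machines)
             {
                 var node = new FlowNode(m);
                 machineNodes[m] = node;
                 edges.Add(new FlowEdge { From = node, To = sink, Capacity = 1, Cost = 0 });
                 edges.Add(new FlowEdge { From = cluster, To = node, Capacity = 1, Cost = 0 });
             }

             foreach (var t in tasks)
             {
                 var task = new FlowNode(t);
                 // Cheap edges to machines that already hold this task's input data...
                 foreach (var m in preferred[t])
                     edges.Add(new FlowEdge { From = task, To = machineNodes[m], Capacity = 1, Cost = 1 });
                 // ...and a more expensive edge that lets the task run anywhere.
                 edges.Add(new FlowEdge { From = task, To = cluster, Capacity = 1, Cost = 10 });
             }
             return edges;
         }
     }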

  45. TidyFS (Fetterly et al, USENIX 11) • Simple distributed file system • Like HDFS or GFS • Highly optimized to perform well for data-parallel computations: • Data streams are striped across cluster nodes • Stream parts are read or written in parallel, each by a single process • I/O is sequential for high throughput • Streams are replaced rather than modified • In case of failure, missing parts of output streams can easily be regenerated

  46. TidyFS (cont) • Data streams contain parts • Parts are replicated lazily • Failure before replication is complete is handled by Dryad regenerating the missing part(s) • Parts are “native” files, e.g. NTFS files or a SQL Server database • Read and written using native APIs • Centralized meta-data server • Replicated for fault tolerance • Replicas synchronize using Paxos

  47. Artemis (Crețu-Ciocârlie et al, WASL 08) • Management and analysis of Dryad logs • Each vertex produces around 1 MB/s of log data per process • A single Dryad job can easily produce >1TB of log data • Runs and logs continuously on the cluster • To locate and collate log data for a particular job, Artemis itself runs a DryadLINQ computation on the cluster • Combines job manager and vertex logs with over 80 Windows performance counters • Sophisticated GUI for post-processing and visualization • Histograms and time series are especially helpful for performance debugging

  48. Nectar (Gunda et al, OSDI 10) • Key idea: Data and the computation that generates it are interchangeable • Datasets are uniquely identified by the programs that produce them • Automatic data management • Cluster-wide caching service • Re-use of common datasets to save computation and space • Garbage collection of obsolete datasets • Data can be transparently regenerated • Nectar client service interposes on the DryadLINQ compiler • Consults Nectar cache server and rewrites the program appropriately
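
The core idea, that a dataset is named by the computation that produced it, can be illustrated with a small sketch: fingerprint the program together with the fingerprints of its inputs, and use that as the cache key. The hashing scheme and names below are assumptions made for the example; this is not the actual Nectar API.

     using System;
     using System.Security.Cryptography;
     using System.Text;

     static class NectarSketch
     {
         // A dataset is identified by a fingerprint of the program that produced it
         // together with the fingerprints of that program's inputs.
         public static string Fingerprint(string programText, params string[] inputFingerprints)
         {
             using (var sha = SHA256.Create())
             {
                 var material = programText + "|" + string.Join("|", inputFingerprints);
                 var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(material));
                 return BitConverter.ToString(hash).Replace("-", "");
             }
         }
     }

     // Usage: before running a (sub)query, the client service would ask the cache
     // server whether a dataset with this fingerprint already exists; on a hit the
     // query is rewritten to read the cached dataset instead of recomputing it.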

  49. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  50. Current status • Imminent release of Dryad and DryadLINQ on Windows HPC • Uses the HPC scheduler and other cluster services • Includes TidyFS as DSC (Distributed Storage Catalog) • Can download and try the Technology Preview • Ongoing research… • Naiad: allowing cycles in Dryad graphs • Strongly connected components: more general programming model • Loops and convergence tests without the need for driver programs • Continuous queries on streaming data
