
The Dryad ecosystem

Presentation Transcript


  1. The Dryad ecosystem Rebecca Isaacs Microsoft Research Silicon Valley

  2. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  3. Data-parallel programming • Partition large data sets and process the pieces in parallel • Programming frameworks have made this easy • The execution environment (e.g. Dryad, Hadoop) deals with scheduling of tasks, movement of data, and fault tolerance • A high-level language (e.g. DryadLINQ, Pig Latin) allows the programmer to express the parallelism in a declarative fashion

  4. Dryad (Isard et al, EuroSys 07) • Generalized MapReduce • Programs are dataflow graphs (DAGs) • Vertices (nodes) connected by channels (edges) • Channels are implemented as shared-memory FIFOs, TCP streams, or files • The scheduler dispatches vertices onto machines to run the program
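
To make the graph abstraction concrete, the following is a hypothetical, stripped-down C# sketch of a Dryad-style job graph. It is not the real Dryad API; the type names and fields are assumptions made purely for illustration.

     // A hypothetical, stripped-down illustration of a Dryad-style job graph.
     // This is not the real Dryad API; it only shows the shape of the abstraction.
     using System.Collections.Generic;

     enum ChannelKind { SharedMemoryFifo, TcpStream, File }

     class Vertex
     {
         public string Program;                         // the code this vertex runs
         public List<Channel> Inputs = new List<Channel>();
         public List<Channel> Outputs = new List<Channel>();
     }

     class Channel
     {
         public Vertex Source;
         public Vertex Destination;
         public ChannelKind Kind;                       // chosen by the runtime at execution time
     }

     class JobGraph
     {
         public List<Vertex> Vertices = new List<Vertex>();
         public List<Channel> Channels = new List<Channel>();
         // The scheduler walks this DAG and dispatches runnable vertices
         // (those whose inputs are available) onto cluster machines.
     }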

  5. Dryad components [Figure: the job manager (JM) holds the job schedule and uses the control plane to dispatch vertices (V) to daemon processes (D) on the cluster machines; on the data plane, vertices exchange data over files, FIFOs, or the network.]

  6. Dryad computations [Figure: an example Dryad graph. Vertices (processes) are grouped into stages (here labelled R, X, and M) and connected by channels; the graph reads from input files and writes to output files.]

  7. DryadLINQ (Yu et al, OSDI 08) • LINQ is a set of .NET constructs for programming with datasets • Relational databases, XML, ... • Supported by new language features in C#, Visual Basic, F# • Lazy evaluation on the data source • DryadLINQ extends LINQ with • Partitioned datasets • Some additional operators • Compilation of LINQ expressions into data-parallel operations expressed as a Dryad dataflow graph
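
For readers unfamiliar with LINQ, a minimal plain-LINQ query in C# is shown below; the query is built lazily and only evaluated when the results are enumerated. This is standard LINQ, shown purely to illustrate the programming model that DryadLINQ builds on.

     using System;
     using System.Linq;

     class LinqExample
     {
         static void Main()
         {
             int[] data = { 5, 1, 8, 3, 9 };

             // Building the query executes nothing yet (lazy evaluation).
             var query = data.Where(x => x > 3)
                             .Select(x => x * 2);

             // The query runs only when the results are enumerated.
             foreach (var v in query)
                 Console.WriteLine(v);   // prints 10, 16, 18
         }
     }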

  8. DryadLINQ example • Join: find the lines in a file that start with one of the keywords in a second file

     DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(mydata);
     DryadTable<LineRecord> keywords = ddc.GetTable<LineRecord>(keys);
     IQueryable<LineRecord> matches = table.Join(keywords,
         l1 => l1.line.Split(' ').First(), /* first key */
         l2 => l2.line,                    /* second key */
         (l1, l2) => l1);                  /* keep first line */

  9. Dryad execution graph for join • Data file has 2 partitions • Keys file has 1 partition • Work is distributed: each word is sent to a machine chosen by hashing the word • 2 partitions for the output file

  10. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status Joint work with Paul Barham and Richard Black (MSR Cambridge/Silicon Valley) and Simon Peter and Timothy Roscoe (ETH Zurich)

  11. How do vertices behave? • Use the Performance Analyzer tool from Microsoft (search for xperf on MSDN) • Detailed view of one vertex • “Select” operation • Reads and writes 1 million 1 KB records on local disk

  12. Select vertex, version 1 • Hardware: • 2 quad-core processors, 2.66GHz • 2 disks configured as a striped volume

  13. [Figure: xperf trace of the Select vertex on the two-disk machine, showing Disk 1 utilization, Disk 2 utilization, and CPU utilization; reads in red, writes in blue.]

  14. Select vertex, version 2 • Hardware: • 1 quad-core processor, 2GHz • 1 disk

  15. [Figure: xperf trace of the Select vertex on the single-disk machine, showing disk utilization and CPU utilization; reads in red, writes in blue.]

  16. View of thread activity • Data is read and then written in batches • The reader thread issues the reads; other threads pick up the I/O completions, sometimes issuing writes • NB: the processors are 95% idle during execution of this vertex

  17. Observations • The bottleneck resource changes every few seconds • And may not be 100% utilized • Vertices are multi-threaded, consuming multiple resources simultaneously • Dryad is engineered for throughput • Sequential I/O • Batched in 256KB chunks • Requests are pipelined, typically 4+ deep • Most DryadLINQ vertices are standard data processing operators with predictable behaviour

  18. Factors affecting vertex execution times • Hardware: • CPU speed • Number of CPUs • Disk transfer rate • Network transfer rate • Workload: • I/O size • We assume file access patterns stay the same • Placement relative to parent(s): • Channels can be local (read from or write to disk) • Or remote (read from remote or local file via SMB)

  19. Key idea: identify vertex phases • Trace a reference execution of the vertex • Identify phases within which resource demands are consistent • Phase boundaries are when the resource demands change • E.g. start reading, stop reading, etc. • Phases that are similar in terms of resource consumption are grouped together
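
A minimal sketch of this phase-identification step is given below, assuming the reference trace has already been reduced to fixed-length samples of CPU and disk demand. The type names, tolerance, and boundary test are illustrative assumptions, not the actual analyser.

     using System;
     using System.Collections.Generic;

     // One sample of resource demand taken from the reference trace.
     record Sample(double CpuMs, double DiskMs);

     // A phase is a maximal run of samples with similar resource demands.
     record Phase(int Start, int End, double CpuMs, double DiskMs);

     static class PhaseFinder
     {
         // Split the trace at points where the demand profile changes noticeably.
         public static List<Phase> Find(IList<Sample> trace, double tolerance = 0.25)
         {
             var phases = new List<Phase>();
             int start = 0;
             double cpu = 0, disk = 0;

             for (int i = 0; i < trace.Count; i++)
             {
                 cpu += trace[i].CpuMs;
                 disk += trace[i].DiskMs;
                 bool boundary = i + 1 == trace.Count ||
                                 Differs(trace[i + 1], trace[i], tolerance);
                 if (boundary)
                 {
                     phases.Add(new Phase(start, i, cpu, disk));
                     start = i + 1;
                     cpu = disk = 0;
                 }
             }
             return phases;   // similar phases could then be grouped together
         }

         static bool Differs(Sample a, Sample b, double tol) =>
             Math.Abs(a.CpuMs - b.CpuMs) > tol * Math.Max(b.CpuMs, 1) ||
             Math.Abs(a.DiskMs - b.DiskMs) > tol * Math.Max(b.DiskMs, 1);
     }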

  20. Phases in the Select vertex

  21. Phases in the Select vertex [Figure: the Select vertex trace segmented into phases, each annotated with its resource demands, e.g. Dcpu = 40ms, Dcpu = 70ms, Ddisk = 30ms, Dcpu = 20ms, Ddisk = 40ms.]

  22. Predicting phase runtimes • Each phase has the attributes: • Type: read, write, both, compute, overhead • “Concurrency histogram” • File being read/written • Number of bytes read/written, i.e. demands on each resource • Simple operational laws can be applied to each phase individually • Can predict its runtime on different hardware
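
As an illustration of the kind of operational-law calculation involved, the sketch below scales a phase's measured demands by the relative speed of the target hardware and takes the slower resource as the bound on the phase's runtime. This is a hedged simplification: the machine speed factors and the assumption that CPU and disk work overlap fully within a phase are mine, not the model's exact form.

     using System;

     // Relative hardware characteristics (reference machine = 1.0).
     record Machine(double CpuSpeedFactor, double DiskRateFactor);

     static class PhasePredictor
     {
         // Demands are those measured on the reference machine, in milliseconds.
         public static double PredictPhaseMs(double cpuDemandMs, double diskDemandMs,
                                             Machine target)
         {
             // Scale demands to the target hardware.
             double cpu = cpuDemandMs / target.CpuSpeedFactor;
             double disk = diskDemandMs / target.DiskRateFactor;

             // CPU and disk work are assumed to overlap within a phase, so the
             // slower resource bounds the phase runtime (bottleneck analysis).
             return Math.Max(cpu, disk);
         }
     }

     // Example: a phase with Dcpu = 40ms and Ddisk = 30ms on the reference machine,
     // predicted for a machine with a CPU half as fast and the same disk:
     // PredictPhaseMs(40, 30, new Machine(0.5, 1.0)) == 80ms.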

  23. Expectations of accuracy • Inherent variability in running times: • Layout of file on disk • Inner or outer track • File fragmentation • Background processes • Logging and scanning services • Unanticipated network effects • Model deficiencies • Memory contention • Caching • Garbage collection • Prediction within 30% of actual would be good...

  24. Prediction accuracy evaluation • Merge vertex, 1 input and 1 output, averaged over 10 runs • Predict running time on different hardware

  25. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  26. The parallelism spectrum [Figure: the spectrum of parallel platforms, spanning shared-memory multiprocessors (cores sharing memory and disk), homogeneous clusters and data centres, and small heterogeneous clusters.]

  27. Ad-hoc clusters • Small, heterogeneous clusters are everywhere • In the workplace • In my house... and yours? • Could be pretty useful for data-parallel programming • Data mining • Video editing • Scientific applications • ...

  28. A data-parallel programming framework for ad-hoc clusters? • Why? • Exploit unused machines with no hardcoded assumptions about hardware and availability • “Easy” to write and run the code • Why not? • Heterogeneity: a poor schedule can hurt performance badly • Built-in assumptions about failure don’t apply • Our solution: • Construct vertex performance models • Apply a constraint-based search procedure to find a good assignment of vertices to the physical computers

  29. Default scheduling in Dryad • DryadLINQ compiler creates an XML description of the vertices and how they are connected • JobManager places the vertices on available nodes according to constraints specified in the XML file • Greedy scheduling approach • Programmer and/or DryadLINQ compiler can provide hints

  30. Heterogeneity can cause problems for greedy scheduling

  31. Add a performance-aware planner to the end-to-end picture [Figure: a logging service on each node produces CPU and I/O logs; a vertex phase analyser turns these into vertex phase summaries; the performance planner combines the summaries with the XML job graph to produce an updated XML graph.]

  32. Planning algorithm • Implemented with a constraint logic programming system (ECLiPSe) • Constraints prune the search space • Heuristics reduce search time • E.g. decide where to place the longest-running vertices first • Greedy schedule gives an upper bound
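
A toy sketch of this search strategy is shown below: vertices are considered longest-first, a greedy schedule seeds the bound, and branches whose partial makespan already exceeds the bound are pruned. The real planner is written in the ECLiPSe constraint logic programming system; this C# version, its cost matrix, and its makespan objective are illustrative assumptions only.

     using System.Linq;

     static class Planner
     {
         // cost[v, m] = predicted runtime of vertex v on machine m (from the phase models).
         public static int[] Plan(double[,] cost)
         {
             int nV = cost.GetLength(0), nM = cost.GetLength(1);

             // Heuristic: decide the placement of the longest-running vertices first.
             int[] order = Enumerable.Range(0, nV)
                 .OrderByDescending(v => Enumerable.Range(0, nM).Min(m => cost[v, m]))
                 .ToArray();

             // A simple greedy schedule gives an initial upper bound on makespan.
             int[] best = Greedy(cost, order, out double bound);

             var assign = new int[nV];
             var load = new double[nM];

             void Search(int depth)
             {
                 if (depth == nV)
                 {
                     double mk = load.Max();
                     if (mk < bound) { bound = mk; best = (int[])assign.Clone(); }
                     return;
                 }
                 int v = order[depth];
                 for (int m = 0; m < nM; m++)
                 {
                     load[m] += cost[v, m];
                     assign[v] = m;
                     if (load[m] < bound) Search(depth + 1);   // prune branches already worse than the bound
                     load[m] -= cost[v, m];
                 }
             }

             Search(0);
             return best;
         }

         static int[] Greedy(double[,] cost, int[] order, out double makespan)
         {
             int nM = cost.GetLength(1);
             var load = new double[nM];
             var assign = new int[cost.GetLength(0)];
             foreach (int v in order)
             {
                 int m = Enumerable.Range(0, nM).OrderBy(x => load[x] + cost[v, x]).First();
                 assign[v] = m;
                 load[m] += cost[v, m];
             }
             makespan = load.Max();
             return assign;
         }
     }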

  33. Contention between vertices [Figure: Gantt charts of the Merge, Hash, and Join vertices scheduled without and with the contention model; the time axes are marked at 50, 100, 150, and 200.]

  34. Workloads for experimental evaluation

  35. Physical configuration of the cluster • 3 machines, quite heterogeneous

  36. Overall speed-up vs greedy

  37. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  38. Edison • New project • Position paper to appear at HotOS 11 • Joint work with Moises Goldszmidt • Performance problems in Dryad clusters • Resource contention • Data or computation skew • Hardware issues • Often transient • Use active intervention • Re-run the vertex in a sandbox on the cluster • Construct experiments using its causal model • Systematically probe behavior: fix some variables while altering others

  39. Circuit blueprint • Given partial observations, lets us make inferences about the state of the circuit • If we intervene and fix some inputs, lets us make inferences about the state of the circuit [Figure: an example logic circuit with gates G1–G4 and inputs A and B.]

  40. Blueprint of a vertex [Figure: a causal model of a vertex, with nodes for disk congestion, network congestion, phase, data_in rate, data size, CPU, reading time, computing time, and running time.] • A blueprint for inferring state from both observations and interventions • Answer “what-if” questions • Root-cause analysis

  41. Overview • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  42. Quincy (Isard et al, SOSP 09) • Need to share the cluster between multiple, concurrently executing jobs • Goals are fairness and data locality • If job x takes t seconds when it is run exclusively on the cluster, then x should take no more than Jt seconds when the cluster has J jobs • Because very large datasets are stored on the cluster itself, unnecessary data movement is costly • These goals conflict • Optimal data locality => delay the job until resources are available • Fairness => allocate resources as soon as they are available

  43. Quincy (cont) • Strategies for fairness: • Sacrifice locality • Kill already running jobs • Admission control • Fairness is achieved at a cost to throughput

  44. Quincy (cont) • Quantify every scheduling decision • Data transfer cost • Cost in wasted time if a task is killed • Express scheduling problem as a flow network • Represent all worker tasks that are ready to run with preferred locations, and all currently running tasks • Edge weights and capacities encode the scheduling policy • Produces a set of scheduling assignments for all jobs at once that satisfy a global criterion • Solve online with standard min-cost flow algorithm • Graph is updated whenever anything changes
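
The sketch below shows, in rough form, the kind of flow network this encoding produces: each ready task pushes one unit of flow, machines drain into a sink, and edge costs favour machines that already hold the task's input data. It is heavily simplified relative to the paper (no per-job unscheduled nodes, rack aggregators, or running-task edges), and all names and cost values are illustrative assumptions.

     using System.Collections.Generic;

     class FlowNode
     {
         public string Name;
         public FlowNode(string name) { Name = name; }
     }

     class FlowEdge
     {
         public FlowNode From, To;
         public int Capacity;
         public double Cost;   // e.g. expected data transfer, or wasted work if a task is killed
     }

     static class QuincySketch
     {
         // A standard min-cost flow solver over this graph decides whether each task's
         // unit of flow reaches a specific machine (good locality) or the cluster
         // aggregator (run anywhere), and hence which assignments to make.
         public static List<FlowEdge> Build(List<string> tasks,
                                            List<string> machines,
                                            Dictionary<string, List<string>> preferred)
         {
             var edges = new List<FlowEdge>();
             var sink = new FlowNode("sink");
             var cluster = new FlowNode("cluster");   // aggregator node: "run on any machine"
             var machineNodes = new Dictionary<string, FlowNode>();

             foreach (var m in machines)
             {
                 var node = new FlowNode(m);
                 machineNodes[m] = node;
                 edges.Add(new FlowEdge { From = node, To = sink, Capacity = 1, Cost = 0 });
                 edges.Add(new FlowEdge { From = cluster, To = node, Capacity = 1, Cost = 0 });
             }

             foreach (var t in tasks)
             {
                 var task = new FlowNode(t);
                 // Cheap edges to machines that already hold this task's input data...
                 foreach (var m in preferred[t])
                     edges.Add(new FlowEdge { From = task, To = machineNodes[m], Capacity = 1, Cost = 1 });
                 // ...and a more expensive edge that lets the task run anywhere.
                 edges.Add(new FlowEdge { From = task, To = cluster, Capacity = 1, Cost = 10 });
             }
             return edges;
         }
     }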

  45. TidyFS (Fetterly et al, USENIX 11) • Simple distributed file system • Like HDFS or GFS • Highly optimized to perform well for data-parallel computations: • Data streams are striped across cluster nodes • Stream parts are read or written in parallel, each by a single process • I/O is sequential for high throughput • Streams are replaced rather than modified • In case of failure, missing parts of output streams can easily be regenerated

  46. TidyFS (cont) • Data streams contain parts • Parts are replicated lazily • Failure before replication is complete is handled by Dryad regenerating the missing part(s) • Parts are “native” files, e.g. NTFS files or a SQL Server database • Read and written using native APIs • Centralized meta-data server • Replicated for fault tolerance • Replicas synchronize using Paxos

  47. Artemis (Crețu-Ciocârlie et al, WASL 08) • Management and analysis of Dryad logs • Each vertex produces around 1 MB/s of log data per process • A single Dryad job can easily produce >1TB of log data • Runs and logs continuously on the cluster • To locate and collate log data for a particular job, Artemis itself runs a DryadLINQ computation on the cluster • Combines job manager and vertex logs with over 80 Windows performance counters • Sophisticated GUI for post-processing and visualization • Histograms and time series are especially helpful for performance debugging

  48. Nectar (Gunda et al, OSDI 10) • Key idea: Data and the computation that generates it are interchangeable • Datasets are uniquely identified by the programs that produce them • Automatic data management • Cluster-wide caching service • Re-use of common datasets to save computation and space • Garbage collection of obsolete datasets • Data can be transparently regenerated • Nectar client service interposes on the DryadLINQ compiler • Consults Nectar cache server and rewrites the program appropriately
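
The core idea, that a dataset is named by the computation that produced it, can be illustrated with a small sketch: fingerprint the program together with the fingerprints of its inputs, and use that as the cache key. The hashing scheme and names below are assumptions made for the example; this is not the actual Nectar API.

     using System;
     using System.Security.Cryptography;
     using System.Text;

     static class NectarSketch
     {
         // A dataset is identified by a fingerprint of the program that produced it
         // together with the fingerprints of that program's inputs.
         public static string Fingerprint(string programText, params string[] inputFingerprints)
         {
             using (var sha = SHA256.Create())
             {
                 var material = programText + "|" + string.Join("|", inputFingerprints);
                 var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(material));
                 return BitConverter.ToString(hash).Replace("-", "");
             }
         }
     }

     // Usage: before running a (sub)query, the client service would ask the cache
     // server whether a dataset with this fingerprint already exists; on a hit the
     // query is rewritten to read the cached dataset instead of recomputing it.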

  49. Outline • Introduction • Dryad • DryadLINQ • Vertex performance prediction • Scheduling for heterogeneous clusters • Causal models for performance debugging • Support software • Quincy scheduler • TidyFS distributed filesystem • Artemis monitoring system • Nectar data management • Current status

  50. Current status • Imminent release of Dryad and DryadLINQ on Windows HPC • Uses the HPC scheduler and other cluster services • Includes TidyFS as DSC (Distributed Storage Catalog) • Can download and try the Technology Preview • Ongoing research… • Naiad: allowing cycles in Dryad graphs • Strongly connected components: more general programming model • Loops and convergence tests without the need for driver programs • Continuous queries on streaming data
