
Dryad and DryadLINQ

Dryad provides automatic distributed execution; DryadLINQ provides automatic query plan generation. Dryad is a general-purpose execution environment for distributed, data-parallel applications.


Presentation Transcript


  1. Dryad and DryadLINQ

  2. Dryad and DryadLINQ • Dryad provides automatic distributed execution • DryadLINQ provides automatic query plan generation

  3. Dryad • General-purpose execution environment for distributed, data-parallel applications • Focuses on simplicity, reliability, scalability, and efficiency; does not target low latency or unreliable networks • Automatic management of scheduling, distribution, and fault tolerance • Exploits data parallelism

  4. Dryad • Computations expressed as a Directed Acyclic Graph • Jobs executed on vertices • Edges are communication channels • Each vertex has several input and output edges • Data transport mechanisms: Files, TCP pipes, shared memory FIFOs
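The graph model on this slide can be illustrated with a minimal sketch. It is written in Python for brevity and is not Dryad's C++ API: `run_job`, the vertex names, and the use of in-memory values as channels are all assumptions made for the example. Each vertex is a function, each edge is a channel, and a vertex runs only after every vertex feeding it has finished.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_job(vertices, edges, inputs):
    """Run a DAG job sequentially (hypothetical helper, not Dryad's API).

    vertices: name -> function taking a list of input values
    edges:    list of (src, dst) pairs; each pair is one channel
    inputs:   name -> initial value for vertices with no incoming edge
    """
    preds = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)

    results = {}
    # static_order() yields a vertex only after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        if preds[name]:
            args = [results[p] for p in preds[name]]
        else:
            args = [inputs[name]]
        results[name] = vertices[name](args)
    return results


# Two vertices count words in their input; a third sums the counts.
vertices = {
    "count_a": lambda args: len(args[0].split()),
    "count_b": lambda args: len(args[0].split()),
    "total": lambda args: sum(args),
}
edges = [("count_a", "total"), ("count_b", "total")]
out = run_job(vertices, edges, {"count_a": "a b c", "count_b": "d e"})
print(out["total"])
```

A real Dryad job runs vertices on different machines and moves data over files, TCP pipes, or shared-memory FIFOs; the sketch only shows the dependency ordering that the acyclic graph implies.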

  5. Job = Directed Acyclic Graph [Figure labels: inputs, processing vertices, channels (file, pipe, shared memory), outputs]

  6. Dryad vs. MapReduce, Parallel DB • Gives the developer more control than MapReduce • MapReduce aims at simplicity at the expense of generality and performance • In a parallel DB the computation graph is implicit

  7. Dryad System Architecture • Job manager – coordinates jobs, constructs the graph • Name server – lists the available computers and exposes their position in the network topology • Daemons run on each computer in the cluster

  8. Communication

  9. Job (Graph) Construction • The graph is described by composing simpler subgraphs with graph operators implemented in C++

  10. Job Execution • Job manager is not currently fault tolerant • Vertices may be executed multiple times because of failures • Each execution is versioned • An execution record is kept, including the versions of the upstream vertices that supplied its inputs • Outputs are uniquely named (versioned) • Final outputs are selected if the job completes • Non-file communication (TCP pipe, shared-memory FIFO) may cascade failures • Vertices specify hard constraints or preferences for the set of computers they require • Scheduling is greedy and assumes only one job is running
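The versioning idea on this slide can be sketched as follows. This is an illustrative Python toy, not Dryad code: `run_with_retries`, the `name.vN` output naming, and the simulated failure are assumptions. Every attempt gets a fresh version number and a uniquely named output, and the execution record keeps the outcome of each attempt.

```python
def run_with_retries(name, fn, arg, max_attempts=3):
    """Run one vertex, re-executing it on failure (hypothetical helper)."""
    record = []  # execution record: one (version, status) pair per attempt
    for version in range(1, max_attempts + 1):
        output_name = f"{name}.v{version}"  # unique output name per attempt
        try:
            value = fn(arg)
        except Exception as exc:
            record.append((version, f"failed: {exc}"))
            continue
        record.append((version, "ok"))
        return output_name, value, record
    raise RuntimeError(f"{name} failed {max_attempts} times")


calls = {"n": 0}


def flaky_double(x):
    """Fails on its first call to simulate a machine failure."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated machine failure")
    return x * 2


output_name, value, record = run_with_retries("double", flaky_double, 21)
print(output_name, value, record)
```

Because each attempt writes to its own named output, a failed or duplicate execution cannot overwrite the output of the attempt that is finally selected.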

  11. Policy Managers [Figure labels: Stage R (vertices R), Stage X (vertices X), Connection R-X, R manager, X manager, R-X manager, Job Manager]

  12. Cluster network topology [Figure labels: top-level switch, top-of-rack switch, rack]

  13. Run-time Graph Refinement

  14. Dynamic Aggregation [Figure: a static graph in which the S vertices feed T directly, and a dynamic graph in which the S vertices are grouped by rack (#1, #2, #3) behind aggregator vertices A that feed T]
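A minimal sketch of the dynamic aggregation refinement, assuming the rack of each source vertex is known at run time. The function name, the `A@rack` naming, and the rack assignment below are made up for illustration. Instead of connecting every source vertex S directly to T, one aggregator A is inserted per rack, so that only aggregated data leaves each rack.

```python
from collections import defaultdict


def refine_with_rack_aggregators(sources, rack_of, sink="T"):
    """Return the edges of the refined graph (hypothetical helper)."""
    by_rack = defaultdict(list)
    for s in sources:
        by_rack[rack_of[s]].append(s)

    edges = []
    for rack in sorted(by_rack):
        aggregator = f"A@{rack}"  # one aggregator vertex per rack
        edges.extend((s, aggregator) for s in by_rack[rack])
        edges.append((aggregator, sink))
    return edges


# Six source vertices spread over three racks (made-up placement).
sources = ["S1", "S2", "S3", "S4", "S5", "S6"]
rack_of = {"S1": "#1", "S2": "#2", "S3": "#1",
           "S4": "#3", "S5": "#3", "S6": "#2"}
edges = refine_with_rack_aggregators(sources, rack_of)
into_sink = [src for src, dst in edges if dst == "T"]
print(into_sink)
```

In this toy placement the sink receives three inputs instead of six; the saving grows with the number of source vertices per rack.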

  15. Fault Tolerance

  16. SkyServer DB Query • 3-way join to find gravitational lens effect • Table U: (objId, color) 11.8GB • Table N: (objId, neighborId) 41.8GB • Find neighboring stars with similar colors: • Join U and N to find T = (U.color, N.neighborId) where U.objId = N.objId • Join U and T to find U.objId where U.objId = T.neighborId and U.color ≈ T.color
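The two joins can be worked through on a toy dataset. The sketch below is plain single-machine Python, not the Dryad implementation; the table contents and the threshold `d` are made-up values, and "U.color ≈ T.color" is read as |U.color - T.color| < d.

```python
def neighbors_with_similar_color(U, N, d):
    """U: list of (objId, color); N: list of (objId, neighborId)."""
    color = dict(U)  # objId -> color

    # Join 1: T = (U.color, N.neighborId) where U.objId = N.objId
    T = [(color[obj], nb) for obj, nb in N if obj in color]

    # Join 2: U.objId where U.objId = T.neighborId and the colors are close
    matches = {nb for c, nb in T if nb in color and abs(color[nb] - c) < d}
    return sorted(matches)  # distinct object ids


# Toy data (made up): objects 1 and 2 have similar colors, object 3 does not.
U = [(1, 0.10), (2, 0.12), (3, 0.90)]
N = [(1, 2), (2, 1), (1, 3), (3, 2)]
result = neighbors_with_similar_color(U, N, d=0.05)
print(result)
```

In the distributed version each join runs as many parallel vertices over partitions of U and N, which is why the intermediate result T has to be re-partitioned by neighbor id before the second join.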

  17. SkyServer DB query • Took SQL plan • Manually coded in Dryad • Manually partitioned data • [Figure: Dryad graph for the query. Inputs U (objid, color) and N (objid, neighborobjid), partitioned by objid. Stages from inputs to output: X (n), D (n), M (4n), S (4n), Y (n), H. Annotations: select u.color, n.neighborobjid from u join n where u.objid = n.objid; re-partition by n.neighborobjid; order by n.neighborobjid; select u.objid from u join <temp> where u.objid = <temp>.neighborobjid and |u.color - <temp>.color| < d; distinct; merge outputs]

  18. Optimization [Figure: the SkyServer query graph (stages X, D, M, S, Y, H) shown next to its optimized form]

  19. Optimization [Figure: same diagram as slide 18]

  20. [Figure: speed-up (0 to 16) versus number of computers (0 to 10) for Dryad In-Memory, Dryad Two-pass, and SQLServer 2005]

  21. High level Programming Languages • Nebula – limited to existing binaries • SSIS – SQLServer workflow engine, distributed • DryadLINQ – Supports both imperative and declarative operations on datasets

  22. Dryad/DryadLINQ • Decoupling of Dryad and DryadLINQ • Dryad: execution engine (given DAG, do scheduling and fault tolerance) • DryadLINQ: programming model (given query, generate DAG)

  23. DryadLINQ • Exploits LINQ (relational queries integrated in C#) to provide a hybrid of imperative and declarative programming • LINQ's design makes computations easy to express while leaving the runtime leeway in how to implement them • A program is sequential and composed of LINQ expressions • Performs side-effect-free transformations on datasets • Written and debugged using .NET development tools • More general than distributed SQL • Programs can be automatically optimized and efficiently executed on a large cluster

  24. DryadLINQ • Serialization for Dryad is provided by higher-level software layers such as DryadLINQ • DryadLINQ preserves the LINQ programming model and defines new operators and data types for data-parallel programming

  25. DryadLINQ Architecture

  26. DryadLINQ Data Model [Figure labels: .NET objects, partition, partitioned table] • The data model is a distributed implementation of LINQ collections • Each dataset is split into disjoint partitions distributed across the cluster • A partitioned table exposes metadata: type, partition, compression scheme, serialization, etc.

  27. DryadLINQ Constructs • Expressions must be side-effect free • Allows the programmer to specify annotations (hints) to guide optimization • Operators: Hash Partition; Range Partition; Apply (allows arbitrary streaming computations); Fork (takes a single input and generates multiple output datasets)
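The two partitioning operators can be illustrated on an in-memory list. This is a Python sketch of the general technique, not DryadLINQ's C# API; `hash_partition`, `range_partition`, and their parameters are hypothetical names. Integer keys are used because Python's hash of a small non-negative integer is the integer itself, which keeps the example deterministic.

```python
from bisect import bisect_right


def hash_partition(records, key, n):
    """Send each record to partition hash(key) mod n."""
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(key(r)) % n].append(r)
    return parts


def range_partition(records, key, boundaries):
    """Send each record to the key range it falls in.

    boundaries must be sorted; k boundaries produce k + 1 partitions.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[bisect_right(boundaries, key(r))].append(r)
    return parts


by_hash = hash_partition(range(6), key=lambda x: x, n=2)
by_range = range_partition([5, 1, 9, 3, 7], key=lambda x: x, boundaries=[4, 8])
print(by_hash)
print(by_range)
```

Hash partitioning spreads keys evenly but loses ordering, while range partitioning keeps every key in a partition below all keys in the next one, which is what a distributed sort needs.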

  28. System Implementation • Execution Plan Graph: starts by converting raw LINQ expressions into an execution plan graph (EPG) • DryadLINQ Optimizations: static optimizations and dynamic optimizations • Code Generation: uses dynamic code generation to automatically synthesize the LINQ code to be run at each Dryad vertex

  29. Conclusions • Goal: Use a compute cluster as if it is a single computer • Dryad/DryadLINQ represent a significant step • Requires close collaborations across many fields of computing, including • Distributed systems • Distributed and parallel databases • Programming language design and analysis

  30. References • Dryad: Distributed Data-parallel Programs from Sequential Building Blocks (Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, March 2007) • DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey, December 2008)

  31. Thank you
