
Distributed computing using Dryad


Presentation Transcript


  1. Distributed computing using Dryad Michael Isard Microsoft Research Silicon Valley

  2. Distributed Data-Parallel Computing • Cloud • Transparent scaling • Resource virtualization • Commodity clusters • Fault tolerance with good performance • Workloads beyond standard SQL, HPC • Data-mining, graph analysis, … • Semi-structured/unstructured data

  3. Execution layer • This talk: system-level middleware • Yuan Yu will describe DryadLINQ programming model on Saturday • Algorithm -> execution plan by magic

  4. Problem domain • Large inputs • Tens of GB “small test dataset” • Single job up to hundreds of TB • Semi-structured data is common • Not latency sensitive • Overhead of seconds for trivial job • Large job could take days • Batch computation not online queries • Simplifies fault tolerance, caching, etc.

  5. Talk overview • Some typical computations • DAG implementation choices • The Dryad execution engine • Comparison with MapReduce • Discussion

  6. Map • Independent transformation of dataset • for each x in S, output x’ = f(x) • E.g. simple grep for word w • output line x only if x contains w

  7. Map • Independent transformation of dataset • for each x in S, output x’ = f(x) • E.g. simple grep for word w • output line x only if x contains w [Diagram: S → f → S’]

  8. Map • Independent transformation of dataset • for each x in S, output x’ = f(x) • E.g. simple grep for word w • output line x only if x contains w [Diagram: partitions processed independently: S1 → f → S1’, S2 → f → S2’, S3 → f → S3’]
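To make the map stage concrete, here is a minimal Python sketch of slides 6–8: each partition is transformed independently by f, using the simple grep example. The function names (grep_w, map_partition) and the sample data are illustrative only, not part of Dryad.

```python
# Sketch of the "Map" pattern from slides 6-8: apply f independently to
# each record of each partition. All names here are illustrative.

def grep_w(w):
    """f for the grep example: keep line x only if it contains w."""
    def f(x):
        return [x] if w in x else []   # zero or one output record per input
    return f

def map_partition(partition, f):
    """Independently transform one partition S_i into S_i'."""
    out = []
    for x in partition:
        out.extend(f(x))
    return out

# Each partition S1, S2, S3 could be processed on a different machine.
S = [["a dryad job", "no match here"], ["another dryad line"]]
S_prime = [map_partition(p, grep_w("dryad")) for p in S]
```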

  9. Reduce • Grouping plus aggregation • 1) Group x in S according to key selector k(x) • 2) For each group g, output r(g) • E.g. simple word count • group by k(x) = x • for each group g output key (word) and count of g

  10. Reduce • Grouping plus aggregation • 1) Group x in S according to key selector k(x) • 2) For each group g, output r(g) • E.g. simple word count • group by k(x) = x • for each group g output key (word) and count of g [Diagram: S → G → r → S’]

  11. Reduce [Diagram: S → G → r → S’]

  12. Reduce • D is distribute, e.g. by hash or range [Diagram: S1 → D → G → r → S1’, S2 → D → G → r → S2’, S3 → D → G → r → S3’]

  13. Reduce • ir is initial reduce, e.g. compute a partial sum [Diagram: S1 → G → ir → D → G → r → S1’, S2 → G → ir → D → G → r → S2’, S3 → G → ir → D → …]
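A minimal Python sketch of the partitioned reduce in slides 9–13, using word count: group plus initial reduce (G, ir) produce partial counts per input partition, distribute (D) routes them by hash of the key, and a final group-and-reduce (G, r) sums them. All names and data here are illustrative.

```python
# Word count as group (G), initial reduce (ir), distribute (D), then a
# final group-and-reduce (G, r). Illustrative names and data only.
from collections import Counter

def initial_reduce(partition):
    """G + ir on one input partition: partial word counts."""
    return Counter(partition)

def distribute(partial_counts, n_reducers):
    """D: route each (word, count) pair to a reducer by hash of the key."""
    buckets = [[] for _ in range(n_reducers)]
    for word, count in partial_counts.items():
        buckets[hash(word) % n_reducers].append((word, count))
    return buckets

def final_reduce(pairs):
    """G + r on one output partition: sum the partial counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

partitions = [["dryad", "dag", "dryad"], ["dag", "map"]]
n = 2
routed = [distribute(initial_reduce(p), n) for p in partitions]
outputs = [final_reduce([pair for b in routed for pair in b[i]]) for i in range(n)]
```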

  14. K-means • Set of points P, initial set of cluster centres C • Iterate until convergence: • For each c in C • Initialize count_c, centre_c to 0 • For each p in P • Find c in C that minimizes dist(p,c) • Update: count_c += 1, centre_c += p • For each c in C • Replace c <- centre_c/count_c
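As a concrete rendering of the pseudocode above, here is a single-machine Python sketch of one k-means update step; the distributed version in the next two slides splits P into partitions and aggregates the per-partition sums. The helper names and the small example data are made up for illustration.

```python
# One k-means iteration as on slide 14: accumulate per-centre counts and
# sums, then recompute centres. Single machine; names are illustrative.

def dist2(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans_step(P, C):
    count = [0] * len(C)
    centre = [[0.0] * len(C[0]) for _ in C]
    for p in P:
        j = min(range(len(C)), key=lambda i: dist2(p, C[i]))  # nearest centre
        count[j] += 1
        centre[j] = [s + x for s, x in zip(centre[j], p)]
    # Replace c <- centre_c / count_c (keep old centre if its cluster is empty).
    return [[s / count[j] for s in centre[j]] if count[j] else list(C[j])
            for j in range(len(C))]

P = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
C = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(10):          # in practice, iterate until convergence
    C = kmeans_step(P, C)
```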

  15. K-means [Diagram: iteration chain C0 → ac → cc → C1 → ac → cc → C2 → ac → cc → C3, with the point set P feeding each ac]

  16. K-means [Diagram: as slide 15, with the point set split into partitions P1, P2, P3, each with its own ac vertex feeding the cc aggregation in every iteration]

  17. Graph algorithms • Set N of nodes with data (n,x) • Set E of directed edges (n,m) • Iterate until convergence: • For each node (n,x) in N • For each outgoing edge n->m in E, compute an update u_nm = f(x,n,m) • For each node (m,x) in N • Find the set of incoming updates i_m = {u_nm : n->m in E} • Replace (m,x) <- (m, r(i_m)) • E.g. power iteration (PageRank)
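Here is a small Python sketch of the per-edge update / per-node combine pattern from slide 17, instantiated as a PageRank-style power iteration (slide 18). The damping factor, the neglect of dangling nodes, and all names are assumptions for illustration only.

```python
# Per-edge update f followed by per-node combine r, as on slide 17,
# specialised to a simple PageRank power iteration. Illustrative only.

def pagerank_step(N, E, d=0.85):
    """N: dict node -> rank; E: dict node -> list of out-neighbours."""
    # f(x, n, m): each node sends rank / out_degree along every outgoing edge.
    incoming = {m: [] for m in N}
    for n, x in N.items():
        for m in E.get(n, []):
            incoming[m].append(x / len(E[n]))
    # r(i_m): combine the incoming updates into the new node value.
    return {m: (1 - d) / len(N) + d * sum(msgs) for m, msgs in incoming.items()}

N = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
E = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for _ in range(20):              # in practice, iterate until convergence
    N = pagerank_step(N, E)
```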

  18. PageRank [Diagram: iteration chain N0 → ae → de → N1 → ae → de → N2 → ae → de → N3, with the edge set E feeding each ae]

  19. PageRank [Diagram: as slide 18, with node and edge sets partitioned (N01…N33, E1…E3); each iteration runs ae and D per partition, followed by cc aggregation]

  20. DAG abstraction • Absence of cycles • Allows re-execution for fault-tolerance • Simplifies scheduling: no deadlock • Cycles can often be replaced by unrolling • Unsuitable for fine-grain inner loops • Very popular • Databases, functional languages, …

  21. Rewrite graph at runtime • Loop unrolling with convergence tests • Adapt partitioning scheme at run time • Choose #partitions based on runtime data volume • Broadcast Join vs. Hash Join, etc. • Adaptive aggregation and distribution trees • Based on data skew and network topology • Load balancing • Data/processing skew (cf work-stealing)

  22. Push vs Pull • Databases typically ‘pull’ using iterator model • Avoids buffering • Can prevent unnecessary computation • But DAG must be fully materialized • Complicates rewriting • Prevents resource virtualization in shared cluster [Diagram: S1 → D → G → r → S1’, S2 → D → G → r → S2’]
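To make the distinction concrete, here is a toy Python sketch contrasting a ‘pull’ (consumer-driven, iterator) map with a ‘push’ (producer-driven) map. This is a conceptual illustration only, not how Dryad channels are implemented.

```python
# Pull: the consumer drives execution by asking for the next record,
# so nothing is buffered and unneeded records are never computed.
def map_pull(source, f):
    for x in source:
        yield f(x)

# Push: the producer drives execution by handing records downstream,
# which makes it natural to buffer on edges for fault tolerance (slide 23).
def map_push(source, f, emit):
    for x in source:
        emit(f(x))

pulled = list(map_pull(range(5), lambda x: x * x))      # consumer pulls
pushed = []
map_push(range(5), lambda x: x * x, pushed.append)      # producer pushes
```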

  23. Fault tolerance • Buffer data in (some) edges • Re-execute on failure using buffered data • Speculatively re-execute for stragglers • ‘Push’ model makes this very simple

  24. Dryad • General-purpose execution engine • Batch processing on immutable datasets • Well-tested on large clusters • Automatically handles • Fault tolerance • Distribution of code and intermediate data • Scheduling of work to resources

  25. Dryad System Architecture [Diagram: Scheduler, R]

  26. Dryad System Architecture [Diagram: Scheduler, R, R]

  27. Dryad System Architecture [Diagram: Scheduler, R, R]

  28. Dryad Job Model • Directed acyclic graph (DAG) • Clean abstraction • Hides cluster services • Clients manipulate graphs • Flexible and expressive • General-purpose programs • Complicated execution plans
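As a purely hypothetical illustration of “clients manipulate graphs”, the Python sketch below builds a small job DAG out of vertex objects connected by input edges. The real Dryad graph-construction interface is different (a C++ library for composing graphs); nothing here is Dryad’s actual API.

```python
# Hypothetical client-side job graph: vertices run programs, edges are
# channels, and the graph must stay acyclic. NOT the real Dryad API;
# this only illustrates the DAG job model of slide 28.

class Vertex:
    def __init__(self, name, program):
        self.name = name            # e.g. "G", "ir", "D", "r"
        self.program = program      # what the compute vertex runs
        self.inputs = []            # upstream vertices (channel sources)

class JobGraph:
    def __init__(self):
        self.vertices = []

    def add(self, name, program, inputs=()):
        v = Vertex(name, program)
        v.inputs = list(inputs)     # only already-added vertices: no cycles
        self.vertices.append(v)
        return v

# Build S1 -> G -> r -> S1' for one partition.
job = JobGraph()
s1 = job.add("S1", "read input partition 1")
g = job.add("G", "group records by key", inputs=[s1])
r = job.add("r", "reduce each group", inputs=[g])
out = job.add("S1'", "write output partition 1", inputs=[r])
```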

  29. Dryad Inputs and Outputs • Partitioned data set • Records do not cross partition boundaries • Data on compute machines: NTFS, SQLServer, … • Optional semantics • Hash-partition, range-partition, sorted, etc. • Loading external data • Partitioning “automatic” • File system chooses sensible partition sizes • Or known partitioning from user

  30. Channel abstraction [Diagram: S1 → G → ir → D → G → r → S1’, S2 → G → ir → D → G → r → S2’, S3 → G → ir → D → …]

  31. Push vs Pull • Channel types define connected component • Shared-memory or TCP must be gang-scheduled • Pull within gang, push between gangs

  32. MapReduce (Hadoop) • MapReduce restricts • Topology of DAG • Semantics of function in compute vertex • Sequence of instances for non-trivial tasks [Diagram: S1 → f → G → ir → D → G → r → S1’, S2 → f → G → ir → D → G → r → S2’, S3 → f → G → ir → D → …]

  33. MapReduce complexity • Simple to describe MapReduce system • Can be hard to map algorithm to framework • cf k-means: combine C+P, broadcast C, iterate, … • HIVE, PigLatin etc. mitigate programming issues • Implementation not uniform • Different fault-tolerance for mappers, reducers • Add more special cases for performance • Hadoop introducing TCP channels, pipelines, … • Dryad has same state machine everywhere

  34. Discussion • DAG abstraction supports many computations • Can be targeted by high-level languages! • Run-time rewriting extends applicability • DAG-structured jobs scale to large clusters • Over 10k computers in large Dryad clusters • Transient failures common, disk failures daily • Trade off fault-tolerance against performance • Buffer vs TCP, still manual choice in Dryad system • Also external vs in-memory working set

  35. Conclusion • Dryad well-tested, scalable • Daily use supporting Bing for over 3 years • Applicable to large number of computations • 250 computer cluster at MSR SVC, Mar->Nov 09 • 47 distinct users (~50 lab members + interns) • 15k jobs (tens of millions of processes executed) • Hundreds of distinct programs • Network trace analysis, privacy-preserving inference, light-transport simulation, decision-tree training, deep belief network training, image feature extraction, …
