html5-img
1 / 71

Cluster Computing with DryadLINQ

Cluster Computing with DryadLINQ. Mihai Budiu, MSR-SVC PARC, May 8 2008. Aknowledgments. MSR SVC and ISRC SVC Michael Isard, Yuan Yu, Andrew Birrell , Dennis Fetterly Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. Computer Evolution. ?. 1961. 2008. 2040. Computer Evolution. ?.

byron-gay
Download Presentation

Cluster Computing with DryadLINQ

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cluster Computing with DryadLINQ Mihai Budiu, MSR-SVC PARC, May 8 2008

  2. Aknowledgments MSR SVC and ISRC SVC Michael Isard, Yuan Yu, Andrew Birrell, Dennis Fetterly Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

  3. Computer Evolution ? 1961 2008 2040

  4. Computer Evolution ? ENIAC 194330 tons 200kW Datacenter 2008500,000 ft2 40MW 2040

  5. 2040

  6. Layers Applications Programming Languages and APIs Resource Management Scheduling Distributed Execution Operating System Caching and Synchronization Storage Identity & Security Networking

  7. Pieces of the Global Computer

  8. This Work

  9. The Rest of This Talk Machine Learning Large Vectors DryadLINQ Dryad Distributed Filesystem CIFS/NTFS Cluster Services Windows Server Windows Server Windows Server Windows Server

  10. TeraSort • How fast can you sort 1010 100-byte records (1Tb)? • Sequential scan/disk = 4.6 hours • Current record: 435 seconds (7.2 min)cluster of 40 Itanium2, 2520 SAN disks • Code: 3300 lines of C • Our result: 349 seconds (5.8 min)cluster of 240 AMD64 (quad) machines, 920 disks • Code: 17 lines of LINQ

  11. Outline • Introduction • Dryad • DryadLINQ • Building on DryadLINQ

  12. Outline • Introduction • Dryad • deployed since 2006 • many thousands of machines • analyzes many petabytes of data/day • DryadLINQ • Building on DryadLINQ

  13. Goal

  14. Design Space Grid Internet Data- parallel Dryad Search Shared memory Private data center Transaction HPC Latency Throughput

  15. Data Partitioning DATA RAM DATA

  16. 2-D Piping • Unix Pipes: 1-D grep | sed | sort | awk | perl • Dryad: 2-D grep1000 | sed500 | sort1000 | awk500 | perl50

  17. Dryad = Execution Layer Job (application) Pipeline ≈ Dryad Shell Cluster Machine

  18. Virtualized 2-D Pipelines

  19. Virtualized 2-D Pipelines

  20. Virtualized 2-D Pipelines

  21. Virtualized 2-D Pipelines

  22. Virtualized 2-D Pipelines • 2D DAG • multi-machine • virtualized

  23. Dryad Job Structure Channels Inputfiles Stage Outputfiles sort grep awk sed perl sort grep awk sed grep sort Vertices (processes)

  24. Channels • Finite Streams of items • distributed filesystem files (persistent) • SMB/NTFS files (temporary) • TCP pipes (inter-machine) • memory FIFOs (intra-machine) X Items M

  25. Architecture data plane Files, TCP, FIFO, Network job schedule V V V NS PD PD PD control plane Job manager cluster

  26. Fault Tolerance

  27. Dynamic Graph Rewriting X[0] X[1] X[3] X[2] X’[2] Slow vertex Duplicatevertex Completed vertices Duplication Policy = f(running times, data volumes)

  28. Dynamic Aggregation S S S S S S T static S S S S S S # 1 # 2 # 1 # 3 # 3 # 2 rack # A A A # 1 # 2 # 3 T dynamic

  29. Data-Parallel Computation Parallel Databases Map-Reduce Application Dryad Execution GFSBigTable Storage

  30. Outline • Introduction • Dryad • DryadLINQ • Building on Dryad

  31. DryadLINQ Dryad

  32. LINQ Collection<T> collection; bool IsLegal(Key); string Hash(Key); var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

  33. DryadLINQ = LINQ + Dryad Collection<T> collection; boolIsLegal(Key k); string Hash(Key); var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value}; Vertexcode Queryplan (Dryad job) Data collection C# C# C# C# results

  34. Data Model C# objects Partition Collection

  35. Query Providers Client machine DryadLINQ C# Data center Distributed query plan Invoke Query Expr Query ToDryadTable Input Tables JM Dryad Execution Output DryadTable C# Objects Results Output Tables (11) foreach

  36. Demo

  37. Example: Histogram public static IQueryable<Pair> Histogram( IQueryable<LineRecord> input, int k) { var words = input.SelectMany(x => x.line.Split(' ')); vargroups = words.GroupBy(x => x); varcounts = groups.Select(x => new Pair(x.Key, x.Count())); varordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); return top; }

  38. Histogram Plan SelectMany HashDistribute Merge GroupBy Select OrderByDescendingTake MergeSort Take

  39. Map-Reduce in DryadLINQ public static IQueryable<S> MapReduce<T,M,K,S>( this IQueryable<T> input, Expression<Func<T, IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<IGrouping<K,M>,S>> reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result; }

  40. Map-Reduce Plan map M M M M M M M Q Q Q Q Q Q Q sort groupby M G1 G1 G1 G1 G1 G1 G1 map R R R R R R R reduce D distribute D D D D D D D G (1) (2) (3) R mergesort MS MS MS MS MS groupby partial aggregation X G2 G2 G2 G2 G2 reduce R R R R R X X X mergesort MS MS groupby G2 G2 reduce R R reduce S S S S S S consumer A A A X X T

  41. Distributed Sorting in DryadLINQ public static IQueryable<TSource> DSort<TSource, TKey>(this IQueryable<TSource> source,                                  Expression<Func<TSource, TKey>> keySelector, intpcount) { var samples = source.Apply(x => Sampling(x)); var keys = samples.Apply(x => ComputeKeys(x, pcount)); var parts = source.RangePartition(keySelector, keys);             return parts.OrderBy(keySelector); }

  42. Distributed Sorting Plan DS DS DS DS DS H H H O D D D D D (1) (2) (3) M M M M M S S S S S

  43. Outline • Introduction • Dryad • DryadLINQ • Building on DryadLINQ

  44. Machine Learning in DryadLINQ Data analysis Machine learning Large Vector DryadLINQ Dryad

  45. Operations on Large Vectors: Map 1 T f U f preserves partitioning T f U

  46. Map 2 (Pairwise) T f U V T U f V

  47. Map 3 (Vector-Scalar) T f U V T U f V 47

  48. Reduce (Fold) f U U U U f f f U U U f U

  49. Linear Algebra T T V = U , ,

  50. Linear Regression • Data • Find • S.t.

More Related