From LINQ to DryadLINQ

Presentation Transcript


  1. From LINQ to DryadLINQ • Michael Isard • Workshop on Data-Intensive Scientific Computing Using DryadLINQ

  2. Overview • From sequential code to parallel execution • Dryad fundamentals • Simple program example, plan for practicals

  3. Distributed computation • Single computer, shared memory • All objects always available for read and write • Cluster of workstations • Each computer sees a subset of objects • Writes on one computer must be explicitly shared • System automatically handles complexity • Needs some help

  4. Data-parallel computation • LINQ is high-level declarative specification • Same action on entire collection of objects • set.Select(x => f(x)) • Compute f(x) on each x in set, independently • set.GroupBy(x => key(x)) • Group by unique keys, independently • set.OrderBy(x => key(x)) • Sort whole set (system chooses how)
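
These operators have the same meaning on an in-memory collection. A minimal LINQ-to-Objects sketch (plain C#, not DryadLINQ; the integer data and key functions are illustrative) showing what each one computes:

using System;
using System.Linq;

class LinqBasics
{
    static void Main()
    {
        var set = new[] { 5, 3, 8, 3, 1 };

        // Select: apply f (here squaring) to each element independently.
        var squares = set.Select(x => x * x);

        // GroupBy: one group per distinct key (here, parity of x).
        var byParity = set.GroupBy(x => x % 2);

        // OrderBy: sort the whole collection by key.
        var sorted = set.OrderBy(x => x);

        Console.WriteLine(string.Join(",", squares));   // 25,9,64,9,1
        foreach (var g in byParity)
            Console.WriteLine("{0}: {1}", g.Key, string.Join(",", g));
        Console.WriteLine(string.Join(",", sorted));    // 1,3,3,5,8
    }
}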

  5. Distributed cluster computing • Dataset is stored on local disks of cluster [Diagram: the logical dataset set and its partitions set.0 through set.7, one per cluster machine]

  6. Distributed cluster computing • Dataset is stored on local disks of cluster [Diagram: the partitions set.0 through set.7 on their separate machines]

  7. Simple distributed computation • var set2 = set.Select(x => f(x)) [Diagram: logical input set and logical output set2]

  8. Simple distributed computation • var set2 = set.Select(x => f(x)) [Diagram: input partitions set.0 through set.7 and output partitions set2.0 through set2.7]

  9. Simple distributed computation • var set2 = set.Select(x => f(x)) [Diagram: f applied to each partition independently, mapping set.i to set2.i with no cross-partition communication]

  10. Simple distributed computation • var set2 = set.Select(x => f(x)) [Diagram: animation frame repeating the per-partition dataflow of slide 9]
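
The per-partition structure in slides 7 through 10 is what makes Select trivially parallel. A minimal local sketch (ordinary C#, not DryadLINQ; Parallel.For stands in for separate cluster machines, and the partition contents and f are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class PartitionedSelect
{
    // Stand-in for the user's function f.
    static int f(int x) => x + 1;

    static void Main()
    {
        // Partitions set.0, set.1, set.2; on a cluster each would live
        // on a different machine's local disk.
        var set = new List<int[]>
        {
            new[] { 1, 2 }, new[] { 3, 4 }, new[] { 5, 6 }
        };

        // Each partition maps to its output partition independently.
        var set2 = new int[set.Count][];
        Parallel.For(0, set.Count, i =>
        {
            set2[i] = set[i].Select(f).ToArray();
        });

        foreach (var part in set2)
            Console.WriteLine(string.Join(",", part));
    }
}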

  11. Distributed acyclic graph • Computation reads and writes along edges • Graph shows parallelism via independence • Goals of DryadLINQ optimizer • Extract parallelism (find independent work) • Control data skew (balance work across nodes) • Limit cross-computer data transfer

  12. Distributed grouping • var groups = set.GroupBy(x => x.key) • set is a collection of records, each with a key • Don’t know what keys are present • Or in which partitions • First, reorganize data • All records with the same key on the same computer • Then can do final grouping in parallel

  13. Distributed grouping • var groups = set.GroupBy(x => x.key) [Diagram: records with keys a, b, c, d scattered across input partitions; stages: set, hash partition by key, group locally, groups]

  14. Distributed grouping • var groups = set.GroupBy(x => x.key) [Diagram: after hash partitioning, all records with a given key are on one machine, so grouping locally produces the final groups]
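
A minimal local sketch of this two-stage plan (plain LINQ to Objects, not DryadLINQ; the records are the a/b/c/d keys from the diagram, and the partition count is illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class DistributedGroupBy
{
    static void Main()
    {
        // The records from the diagram; each record is its own key.
        string[] set = { "d", "b", "a", "d", "b", "a", "a", "c" };
        int parts = 2;   // illustrative partition count

        // Stage 1: hash partition by key, so equal keys land together.
        List<List<string>> partitions = set
            .GroupBy(x => Math.Abs(x.GetHashCode()) % parts)
            .Select(p => p.ToList())
            .ToList();

        // Stage 2: group locally. No key spans two partitions, so the
        // union of the local groupings is the global grouping.
        foreach (var partition in partitions)
            foreach (var g in partition.GroupBy(x => x))
                Console.WriteLine("{0}: {1} record(s)", g.Key, g.Count());
    }
}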

  15. Distributed sorting • var sorted = set.OrderBy(x => x.key) [Diagram: keys 1, 2, 3, 4, 100 spread across partitions; stages: set, sample, compute histogram, range partition by key, sort locally, sorted]

  16. Distributed sorting • var sorted = set.OrderBy(x => x.key) [Diagram: the sampled histogram yields the key ranges [1,1] and [2,100], and records are range-partitioned accordingly]

  17. Distributed sorting • var sorted = set.OrderBy(x => x.key) [Diagram: each range partition is sorted locally; reading the partitions in range order gives the fully sorted output 1 1 1 1 2 3 4 100]
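
A minimal local sketch of this sampling plan (plain C#; the input is the diagram's keys, and using the median of a full sort as the single split point is an illustrative simplification of the histogram step):

using System;
using System.Linq;

class DistributedOrderBy
{
    static void Main()
    {
        // The keys from the diagram.
        int[] set = { 1, 1, 100, 1, 4, 1, 2, 3 };

        // Sample and choose a split point. A real system samples sparsely
        // and picks boundaries that balance partition sizes.
        var sample = set.OrderBy(x => x).ToArray();
        int split = sample[sample.Length / 2];   // yields ranges [1,1] and [2,100]

        // Range partition: every key in 'low' precedes every key in 'high'.
        var low  = set.Where(x => x <  split).ToArray();
        var high = set.Where(x => x >= split).ToArray();

        // Sort each partition locally; concatenation is globally sorted.
        Array.Sort(low);
        Array.Sort(high);
        Console.WriteLine(string.Join(",", low.Concat(high)));  // 1,1,1,1,2,3,4,100
    }
}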

  18. Additional optimizations • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: naive plan: set, hash partition by key, group locally, count, histogram]

  19. Additional optimizations • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: every record crosses the network; groups are formed on each machine after hash partitioning]

  20. Additional optimizations • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: counting each group yields the histogram a,6 b,6 d,4]

  21. Additional optimizations (continued) • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: optimized plan: group locally and count first, producing partial pairs such as a,2 and b,2, then hash partition by key, group locally, combine counts, histogram]

  22. Additional optimizations (continued) • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: only the small (key, count) pairs cross the network; pairs with the same key are collected onto one machine]

  23. Additional optimizations (continued) • var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() }) [Diagram: partial counts are summed to give the final histogram a,6 b,6 d,4]
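
A minimal local sketch of this combiner optimization (plain C#; the four partitions below reproduce the diagram's 16 records, and SelectMany stands in for the network transfer):

using System;
using System.Linq;

class HistogramWithCombiner
{
    static void Main()
    {
        // The four input partitions from the diagram.
        var partitions = new[]
        {
            new[] { "b", "d", "b", "d" },
            new[] { "a", "a", "d", "d" },
            new[] { "a", "b", "b", "a" },
            new[] { "a", "b", "b", "a" },
        };

        // Stage 1, per partition: pre-aggregate to (key, count) pairs,
        // e.g. a,2 and b,2, before anything crosses the network.
        var partials = partitions.Select(p =>
            p.GroupBy(x => x)
             .Select(g => new { g.Key, Count = g.Count() }));

        // Stage 2: collect the small pairs by key (SelectMany stands in
        // for the hash-partitioned transfer), then combine the counts.
        var histogram = partials
            .SelectMany(p => p)
            .GroupBy(c => c.Key)
            .Select(g => new { g.Key, Count = g.Sum(c => c.Count) });

        foreach (var h in histogram.OrderBy(h2 => h2.Key))
            Console.WriteLine("{0},{1}", h.Key, h.Count);   // a,6 b,6 d,4
    }
}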

  24. What Dryad does • Abstracts cluster resources • Set of computers, network topology, etc. • Schedules the DAG: chooses cluster computers • Fairly among competing jobs • So computation is close to data • Recovers from transient failures • Reruns computations on machine or network fault • Speculatively runs duplicates of slow computations

  25. Resources are virtualized • Each graph node is a process • Writes outputs to disk • Reads inputs from upstream nodes’ output files • Graph is generally larger than the cluster • 1TB input, 250MB partitions, 4000 parts • Cluster is shared • Don’t size the program for the exact cluster • Use whatever share of resources is available

  26. What controls parallelism • Initially based on partitioning of inputs • After reorganization, system or user decides

  27. DryadLINQ-specific operators • set = PartitionedTable.Get<T>(uri) • set.ToPartitionedTable(uri) • set.HashPartition(x => f(x), numberOfParts) • set.AssumeHashPartition(x => f(x)) • [Associative] f(x) { … } • RangePartition(…), Apply(…), Fork(…) • [Decomposable], [Homomorphic], [Resource] • Field mappings, Multiple partitioned tables, …
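
A sketch composing a few of the operators listed above (hedged: it relies only on the signatures on this slide plus the LinqToDryad namespace and LineRecord type from the next slide; the URIs, key function, and partition count are illustrative placeholders, not exact API documentation):

using System.Linq;
using LinqToDryad;

namespace OperatorSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read a partitioned input table (placeholder URI).
            PartitionedTable<LineRecord> set =
                PartitionedTable.Get<LineRecord>("tidyfs://datasets/example/input.pt");

            // Explicitly repartition into 8 parts; hashing the whole
            // record is an arbitrary illustrative key function.
            var repartitioned = set.HashPartition(x => x.GetHashCode(), 8);

            // Write the result back as a partitioned table (placeholder URI).
            repartitioned.ToPartitionedTable("tidyfs://datasets/example/output.pt");
        }
    }
}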

  28.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            PartitionedTable<LineRecord> table =
                PartitionedTable.Get<LineRecord>(inputUri);
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}

  29. Form into groups • 9 groups, one MSRI member per group • Try to pick common interest for project later

  30. • Cluster machines: sherwood-246 through sherwood-253, plus sherwood-255 • Samples (Count, Points, Robots): d:\dryad\data\Workshop\DryadLINQ\samples • Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe • TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe
