
DryadLINQ: Computer Vision (among other things) on a cluster

ECCV AC workshop, 14th June, 2008. Michael Isard, Microsoft Research, Silicon Valley.


Presentation Transcript


  1. DryadLINQ: Computer Vision (among other things) on a cluster. ECCV AC workshop, 14th June, 2008. Michael Isard, Microsoft Research, Silicon Valley.

  2. Parallel programming, yada yada • Intel claims we will all have many-core, etc. • “This algorithm is easily parallelizable” • Not “we implemented a parallel version” • Historically, low-latency fine-grain parallelism • Shared-memory SMP (threads, locks, etc.) • MPI (finite-element analysis, etc.) • But also data-parallel! • We have lots of data now (video, the web) • But most people still use their laptops/toy data • Even “big” systems use tens of computers

  3. Why do people use Matlab? • Parallel programming tedious and complex • Distributed programming even worse • Perl scripts, manual management of data, … • Matlab is easy (or at least popular) • Relatively few high-level constructs • System “does the right thing” • Programmers willing to put up with a lot • We want similarly low barrier to entry • Familiar languages, legacy codebase, etc.

  4. What are we doing? • When single-computer processing runs out of steam • Web-scale processing of terabytes of data • Infeasible without a big cluster • Network log-mining, machine learning • Multi-week job → 4 hours on 250 computers • 1-hour iteration → 3.5 minutes on 4 computers

  5. A typical data-intensive query

          var logentries =
              from line in logs
              where !line.StartsWith("#")
              select new LogEntry(line);

          var user =
              from access in logentries
              where access.user.EndsWith(@"\ulfar")
              select access;

          var accesses =
              from access in user
              group access by access.page into pages
              select new UserPageCount("ulfar", pages.Key, pages.Count());

          var htmAccesses =
              from access in accesses
              where access.page.EndsWith(".htm")
              orderby access.count descending
              select access;

  6. Steps in the query (the query from slide 5, step by step) • logentries: Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object. • user: Go through logentries and keep only entries that are accesses by ulfar. • accesses: Group ulfar’s accesses according to what page they correspond to. For each page, count the occurrences. • htmAccesses: Keep only the .htm pages and sort the pages ulfar has accessed according to access frequency.

  7. Serial execution (the query from slide 5, as a single machine would run it) • logentries: For each line in logs, do… • user: For each entry in logentries, do… • accesses: Sort entries in user by page. Then iterate over the sorted list, counting the occurrences of each page as you go. • htmAccesses: Re-sort entries in accesses by page frequency.
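The serial plan on slide 7 can be sketched with explicit loops over an in-memory list. This is only an illustration: the sample log lines, the `DOMAIN\` prefix, and the two-field "user page" layout are assumptions made to keep the example short, and tuples stand in for the LogEntry and UserPageCount classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory log: "user page" per line, '#' marks a comment.
var logs = new List<string>
{
    "# sample log",
    @"DOMAIN\ulfar /index.htm",
    @"DOMAIN\ulfar /index.htm",
    @"DOMAIN\misard /index.htm",
    @"DOMAIN\ulfar /talk.ppt",
    @"DOMAIN\ulfar /about.htm",
};

// For each line in logs: drop comments and parse the rest.
var logentries = new List<(string User, string Page)>();
foreach (var line in logs)
{
    if (line.StartsWith("#")) continue;
    var fields = line.Split(' ');
    logentries.Add((fields[0], fields[1]));
}

// For each entry in logentries: keep only ulfar's accesses.
var user = new List<(string User, string Page)>();
foreach (var entry in logentries)
    if (entry.User.EndsWith(@"\ulfar")) user.Add(entry);

// Sort by page, then count runs of equal pages in a single pass.
user.Sort((a, b) => string.CompareOrdinal(a.Page, b.Page));
var accesses = new List<(string Page, int Count)>();
foreach (var entry in user)
{
    if (accesses.Count > 0 && accesses[^1].Page == entry.Page)
        accesses[^1] = (entry.Page, accesses[^1].Count + 1);
    else
        accesses.Add((entry.Page, 1));
}

// Keep the .htm pages and re-sort by descending count.
var htmAccesses = accesses.FindAll(a => a.Page.EndsWith(".htm"));
htmAccesses.Sort((a, b) => b.Count.CompareTo(a.Count));

foreach (var (page, count) in htmAccesses)
    Console.WriteLine($"{page} {count}");
```

Every stage here reads the complete output of the previous one, which is what makes the single-machine version run out of steam on large logs.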

  8. Parallel execution (the slide repeats the query from slide 5)
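One way to picture a data-parallel version of this query is as two stages: each partition of the log is filtered and counted independently, and the small partial results are then merged. The sketch below uses PLINQ's `AsParallel` on one machine as a stand-in for a cluster. The partitioning, the sample data, and the two-stage split are assumptions for illustration, not the plan DryadLINQ actually generates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input, already split into partitions (think: one per machine).
var partitions = new[]
{
    new[] { @"DOMAIN\ulfar /index.htm", @"DOMAIN\misard /index.htm" },
    new[] { @"DOMAIN\ulfar /index.htm", @"DOMAIN\ulfar /about.htm" },
    new[] { "# comment", @"DOMAIN\ulfar /talk.ppt" },
};

// Stage 1, independent per partition: filter, parse, and build
// partial per-page counts.
var partialCounts = partitions
    .AsParallel()
    .Select(part => part
        .Where(line => !line.StartsWith("#"))
        .Select(line => line.Split(' '))
        .Where(f => f[0].EndsWith(@"\ulfar"))
        .GroupBy(f => f[1])
        .ToDictionary(g => g.Key, g => g.Count()))
    .ToList();

// Stage 2: merge the partial counts, then filter and sort the small result.
var htmAccesses = partialCounts
    .SelectMany(d => d)
    .GroupBy(kv => kv.Key, kv => kv.Value)
    .Select(g => (Page: g.Key, Count: g.Sum()))
    .Where(a => a.Page.EndsWith(".htm"))
    .OrderByDescending(a => a.Count)
    .ToList();

foreach (var a in htmAccesses)
    Console.WriteLine($"{a.Page} {a.Count}");
```

Only the per-page partial counts cross partition boundaries, so the data moved in stage 2 is much smaller than the raw log.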

  9. Linear Regression

          Vectors x = input(0), y = input(1);

          Matrices xx = x.PairwiseOuterProduct(x);
          OneMatrix xxs = xx.Sum();

          Matrices yx = y.PairwiseOuterProduct(x);
          OneMatrix yxs = yx.Sum();

          OneMatrix xxinv = xxs.Map(a => a.Inverse());
          OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

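  10. Execution Graph. Figure: inputs X[0], X[1], X[2] and Y[0], Y[1], Y[2]; each partition computes X×Xᵀ and Y×Xᵀ; each set of partial products is summed (Σ); the X sum is inverted ([ ]⁻¹) and multiplied (*) with the Y sum to give A.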
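Read as math, the code on slide 9 appears to compute the standard least-squares estimate for a linear map y ≈ Ax. The derivation below is a sketch under two assumptions that the slides do not spell out: that PairwiseOuterProduct pairs the i-th element of one collection with the i-th element of the other, and that a.Multiply(b) is the matrix product ab.

```latex
\begin{aligned}
E(A) &= \sum_i \lVert y_i - A x_i \rVert^2 \\
\frac{\partial E}{\partial A} &= -2 \sum_i (y_i - A x_i)\, x_i^{T} = 0 \\
A \sum_i x_i x_i^{T} &= \sum_i y_i x_i^{T} \\
A &= \Big(\sum_i y_i x_i^{T}\Big) \Big(\sum_i x_i x_i^{T}\Big)^{-1}
\end{aligned}
```

In the code's names, xxs is the sum of the x xᵀ terms, yxs is the sum of the y xᵀ terms, and A is yxs times xxinv. Both sums decompose over partitions of the input, which is why the per-partition products in the execution graph can run in parallel before a single final inversion and multiply.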

  11. DryadLINQ • Programmer writes sequential C# code • Rich type system, libraries, modules, loops… • System can figure out data-parallelism • Sees declarative expression plans • Full control of high-level optimizations • Traditional parallel-database tricks

  12. Dryad execution engine (Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu) • General-purpose execution environment for distributed, data-parallel applications • Concentrates on throughput not latency • Assumes private data center • Automatic management of scheduling, distribution, fault tolerance, etc. • Well tested over two years on clusters of thousands of computers

  13. Job = Directed Acyclic Graph. Figure labels: inputs, processing vertices, channels (file, pipe, shared memory), outputs.

  14. Scheduler state machine • Scheduling a DAG • Vertex can run anywhere once all its inputs are ready • Constraints/hints place it near its inputs • Fault tolerance • If A fails, run it again • If A’s inputs are gone, run upstream vertices again (recursively) • If A is slow, run another copy elsewhere and use output from whichever finishes first
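The first two scheduling rules above (a vertex runs once all its inputs are ready; a failed vertex is run again) can be sketched as a single-process simulation. The vertex names, the injected failure, and the loop structure are all hypothetical. The sketch leaves out locality hints, recursive re-execution of upstream vertices, and duplicate copies of slow vertices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical four-vertex job: each vertex lists the vertices it reads from.
var inputsOf = new Dictionary<string, string[]>
{
    ["read0"] = Array.Empty<string>(),
    ["read1"] = Array.Empty<string>(),
    ["merge"] = new[] { "read0", "read1" },
    ["sort"]  = new[] { "merge" },
};

// Simulated fault: the first attempt at "merge" fails.
var failuresLeft = new Dictionary<string, int> { ["merge"] = 1 };

var done = new HashSet<string>();
var log = new List<string>();

while (done.Count < inputsOf.Count)
{
    // A vertex is ready once every one of its inputs has completed.
    var ready = inputsOf.Keys
        .Where(v => !done.Contains(v) && inputsOf[v].All(i => done.Contains(i)))
        .OrderBy(v => v, StringComparer.Ordinal)
        .ToList();

    foreach (var v in ready)
    {
        if (failuresLeft.TryGetValue(v, out var n) && n > 0)
        {
            // "If A fails, run it again": the vertex stays not-done and
            // is picked up again on the next pass.
            failuresLeft[v] = n - 1;
            log.Add($"{v} failed");
            continue;
        }
        done.Add(v);
        log.Add($"{v} ok");
    }
}

Console.WriteLine(string.Join(", ", log));
// prints "read0 ok, read1 ok, merge failed, merge ok, sort ok"
```

Because a vertex's output depends only on its inputs, rerunning it is safe, and the same property is what allows a second copy of a slow vertex to race the first.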

  15. Static/dynamic optimizations • Static optimizer builds execution graph • Dynamic optimizer mutates running graph • Picks number of partitions when size is known • Builds aggregation trees based on locality

  16. LINQ • Constructs/type system in .NET v3.5 • Operators to manipulate datasets • Data elements are arbitrary .NET types • Traditional relational operators • Select, Join, Aggregate, etc. • Extensible • Add new operators • Add new implementations

  17. DryadLINQ (Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey) • Automatically distribute a LINQ program • Few Dryad-specific extensions • Same source program runs on single-core through multi-core up to cluster

  18. A complete DryadLINQ program

          public class LogEntry
          {
              public string user;
              public string ip;
              public string page;

              // Parse one space-separated log line.
              public LogEntry(string line)
              {
                  string[] fields = line.Split(' ');
                  this.user = fields[8];
                  this.ip = fields[9];
                  this.page = fields[5];
              }
          }

          public class UserPageCount
          {
              public string user;
              public string page;
              public int count;

              public UserPageCount(string user, string page, int count)
              {
                  this.user = user;
                  this.page = page;
                  this.count = count;
              }
          }

          DryadDataContext ddc = new DryadDataContext("fs://logfile");
          DryadTable<string> logs = ddc.GetTable<string>();

          var logentries =
              from line in logs
              where !line.StartsWith("#")
              select new LogEntry(line);

          var user =
              from access in logentries
              where access.user.EndsWith(@"\ulfar")
              select access;

          var accesses =
              from access in user
              group access by access.page into pages
              select new UserPageCount("ulfar", pages.Key, pages.Count());

          var htmAccesses =
              from access in accesses
              where access.page.EndsWith(".htm")
              orderby access.count descending
              select access;

          htmAccesses.ToDryadTable("fs://results");

  19. DryadLINQ: From LINQ to Dryad. Figure: a LINQ query (var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);) becomes a query plan (logs, then where, then select) through automatic query plan generation; Dryad then performs the distributed query execution.

  20. How does it work? • Sequential code “operates” on datasets • But really just builds an expression graph • Lazy evaluation • When a result is retrieved • Entire graph is handed to DryadLINQ • Optimizer builds efficient DAG • Program is executed on cluster
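The lazy-evaluation behavior described above is visible in ordinary .NET LINQ, with no cluster involved. The sketch below uses the standard `AsQueryable` provider over an in-memory array as a stand-in for a distributed table. The sample data and the `touch` counter are hypothetical, added only to observe when the filter actually runs.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical in-memory stand-in for the distributed table.
var logs = new[] { "# header", "a.htm", "b.htm" }.AsQueryable();

// Counts how many times the filter actually runs.
int evaluated = 0;
Func<string, bool> touch = s => { evaluated++; return true; };

// Building the query runs nothing; it only records an expression tree.
var q = logs.Where(line => touch(line) && !line.StartsWith("#"))
            .Select(line => line.ToUpperInvariant());

int before = evaluated;
string outermost = ((MethodCallExpression)q.Expression).Method.Name;
Console.WriteLine($"evaluated before retrieval: {before}");
Console.WriteLine($"outermost operator in the tree: {outermost}");

// Retrieving a result is what triggers evaluation of the whole pipeline.
var result = q.ToList();
Console.WriteLine($"evaluated after retrieval: {evaluated}");
Console.WriteLine(string.Join(", ", result));
```

The expression tree held in `q.Expression` is the kind of declarative plan that a provider can inspect and rewrite before anything executes. The built-in in-memory provider simply compiles and runs it locally.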

  21. Terasort • 10 billion 100-byte records (10^12 bytes) • 240 computers, 960 disks • 349 secs • Comparable with record

          public struct TeraRecord : IComparable<TeraRecord>
          {
              public const int RecordSize = 100;
              public const int KeySize = 10;
              public byte[] content;

              // Order records by the first KeySize bytes.
              public int CompareTo(TeraRecord rec)
              {
                  for (int i = 0; i < KeySize; i++)
                  {
                      int cmp = this.content[i] - rec.content[i];
                      if (cmp != 0) return cmp;
                  }
                  return 0;
              }

              public static TeraRecord Read(DryadBinaryReader rd)
              {
                  TeraRecord rec;
                  rec.content = rd.ReadBytes(RecordSize);
                  return rec;
              }

              public static int Write(DryadBinaryWriter wr, TeraRecord rec)
              {
                  return wr.WriteBytes(rec.content);
              }
          }

          class Terasort
          {
              public static void Main(string[] args)
              {
                  DryadDataContext ddc = new DryadDataContext(@"file://\\svc-yuanbyu-00\dryad\terasort");
                  DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
                  var q = records.OrderBy(x => x);
                  q.ToDryadPartitionedTable("sherwood-sort2.pt");
              }
          }

  22. Machine Learning in DryadLINQ (Kannan Achan, Mihai Budiu). Figure labels: Data analysis, Machine learning, Large Vector, DryadLINQ, Dryad.

  23. Linear Regression Code

          Vectors x = input(0), y = input(1);

          Matrices xx = x.PairwiseOuterProduct(x);
          OneMatrix xxs = xx.Sum();

          Matrices yx = y.PairwiseOuterProduct(x);
          OneMatrix yxs = yx.Sum();

          OneMatrix xxinv = xxs.Map(a => a.Inverse());
          OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

  24. Expectation Maximization • 160 lines • 3 iterations shown

  25. Computer vision • Ongoing • Epitomes, features for image search, … • Anecdotal evidence • Nebojsa Jojic, Anitha Kannan • Tutorial from Mihai • Anitha implemented Probabilistic Image Map algorithm in an afternoon

  26. Continuing research • Application-level research • What can we write with DryadLINQ? • System-level research • Performance, usability, etc. • Lots of interest from learning/vision researchers
