Getting the most out of Parallel Extensions for .NET

Dr. Mike Liddell

Senior Developer



Agenda

  • Why parallelism, why now?
  • Parallelism with today’s technologies
  • Parallel Extensions to the .NET Framework
    • PLINQ
    • Task Parallel Library
    • Coordination Data Structures
  • Demos
Hardware Paradigm Shift

[Chart: power density (W/cm²) of Pentium® processors from ’70 to ’10, climbing past a hot plate toward nuclear-reactor, rocket-nozzle, and Sun’s-surface levels.]

Hardware Paradigm Shift

Today’s Architecture: Heat becoming an unmanageable problem!

To Grow, To Keep Up, We Must Embrace Parallel Computing





[Chart: projected many-core peak parallel GOPs vs. single-threaded performance growing ~10% per year, 2004–2015; the widening gap is the parallelism opportunity. Intel Developer Forum, Spring 2004 – Pat Gelsinger]

“… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations.”

Pat Gelsinger

Chief Technology Officer, Senior Vice President, Intel Corporation


It's An Industry Thing

  • OpenMP
  • Intel TBB
  • Java libraries
  • OpenCL
  • CUDA
  • MPI
  • Erlang
  • Cilk
  • (many others)

  • Demo: Raytracer
What's the Problem?
  • Multithreaded programming is “hard” today
    • Robust solutions only by specialists
    • Parallel patterns are not prevalent, well known, nor easy to implement
    • Many potential correctness & performance issues
      • Races, deadlocks, livelocks, lock convoys, cache coherency overheads, missed notifications, non-serializable updates, priority inversion, false-sharing, sub-linear scaling and so on…
    • Hard-to-get-right features are often skimped on
      • Last delta of perf, ensuring no missed exceptions, composable cancellation, dynamic partitioning, efficient and custom scheduling
  • Businesses have little desire to “go deep”
    • Developers should focus on business value, not concurrency hassles and common concerns
Example: Matrix Multiplication

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}

Manual Parallel Solution

Static Work Distribution

int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;
for (int c = 0; c < P; c++) {
    ThreadPool.QueueUserWorkItem(o => {
        int lc = (int)o;
        for (int i = lc * Chunk;
             i < (lc + 1 == P ? N : (lc + 1) * Chunk);
             i++) {
            // original loop body
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0) {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();
Hazards:

  • Static work distribution: potential scalability bottleneck
  • Error-prone chunk arithmetic
  • Manual locking and synchronization

Parallel Solution

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i => {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Parallel Extensions to the .NET Framework
  • What is it?
    • Additional APIs shipping in .NET BCL (mscorlib, System, System.Core)
    • With corresponding enhancements to the CLR & ThreadPool
    • Provides primitives, task parallelism and data parallelism
      • Coordination/synchronization constructs (Coordination Data Structures)
      • Imperative data and task parallelism (Task Parallel Library)
      • Declarative data parallelism (PLINQ)
    • Common exception handling model
    • Common and rich cancellation model
  • Why do we need it?
    • Supports parallelism in any .NET language
    • Delivers reduced concept count and complexity, better time to solution
    • Begins to move parallelism capabilities from concurrency experts to domain experts
Parallel Extensions Architecture

User Code
  ↓
PLINQ Execution Engine
  • Data Partitioning (Chunk, Range, Stripe, Custom)
  • Operators (Map, Filter, Sort, Search, Reduction)
  • Merging (Pipeline, Synchronous, Order-preserving)
Task Parallel Library
  • Structured Task Parallelism
Coordination Data Structures
  • Thread-safe Collections
  • Coordination Types
  • Cancellation Types
Pre-existing Primitives
  • Monitor, Events, Threads

Task Parallel Library

1st-class debugger support!

  • System.Threading.Tasks
    • Task
      • Parent-child relationships
      • Structured waiting and cancellation
      • Continuations on success, failure, cancellation
      • Implements IAsyncResult to compose with the Asynchronous Programming Model (APM)
    • Task<T>
      • A task that has a value on completion
      • Asynchronous execution with blocking on task.Value
      • Combines ideas of futures and promises
    • TaskScheduler
      • We ship a scheduler that makes full use of the (vastly) improved ThreadPool
      • Custom Task Schedulers can be written for specific needs.
    • Parallel
      • Convenience APIs: Parallel.For(), Parallel.ForEach()
      • Automatic, scalable & dynamic partitioning.
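A minimal sketch of Task<T> in use (method and variable names here are illustrative; released versions expose the blocking value as task.Result rather than the CTP's task.Value):

```csharp
using System;
using System.Threading.Tasks;

static class TaskSketch
{
    // Two independent computations run as tasks; the parent blocks on each
    // task's result only when it needs the value (futures-style composition).
    public static int SumOfSquares(int a, int b)
    {
        Task<int> ta = Task.Factory.StartNew(() => a * a);
        Task<int> tb = Task.Factory.StartNew(() => b * b);
        return ta.Result + tb.Result;   // blocking waits; exceptions propagate here
    }
}
```

Blocking on the result is what gives structured waiting: failures in either child surface at the point the parent consumes the value.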
Task Parallel Library: Loops
  • Loops are a common source of work
  • Can be parallelized when iterations are independent
    • Body doesn’t depend on mutable state
      • e.g. static vars, writing to local vars to be used in subsequent iterations

// Sequential
for (int i = 0; i < n; i++) work(i);
foreach (T e in data) work(e);

// Parallel equivalents
Parallel.For(0, n, i => work(i));
Parallel.ForEach(data, e => work(e));
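As a runnable sketch (array and method names are illustrative), the parallel form computes the same results as the sequential loop precisely because each iteration writes only its own slot:

```csharp
using System.Threading.Tasks;

static class LoopSketch
{
    public static int[] SquaresSequential(int n)
    {
        var r = new int[n];
        for (int i = 0; i < n; i++) r[i] = i * i;   // independent iterations
        return r;
    }

    public static int[] SquaresParallel(int n)
    {
        var r = new int[n];
        Parallel.For(0, n, i => r[i] = i * i);      // same body, partitioned over cores
        return r;
    }
}
```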

Task Parallel Library
  • Supports early exit via a Break API
  • Parallel.For, Parallel.ForEach for loops.
  • Parallel.Invoke for easy creation of simple tasks
  • Synchronous (blocking) APIs, but with cancellation support


Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC() );

Parallel.For(…, cancellationToken);
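A self-contained sketch of Parallel.Invoke (the three statements are stand-ins): the call may run the delegates in parallel and in any order, but it blocks until every one has completed.

```csharp
using System.Threading.Tasks;

static class InvokeSketch
{
    // Parallel.Invoke returns only after all delegates have finished.
    public static int[] RunThree()
    {
        var results = new int[3];
        Parallel.Invoke(
            () => results[0] = 1,
            () => results[1] = 2,
            () => results[2] = 3);
        return results;   // all slots are guaranteed to be filled here
    }
}
```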

Parallel LINQ (PLINQ)
  • Enable LINQ developers to leverage parallel hardware
    • Supports all of the .NET Standard Query Operators
      • Plus a few other extension methods specific to PLINQ
    • Abstracts away parallelism details
      • Partitions and merges data intelligently (“classic” data parallelism)
    • Works for any IEnumerable<T>

e.g. data.AsParallel().Select(..).Where(..);

e.g. array.AsParallel().WithCancellation(ct)…
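A runnable sketch of such a query (the data and predicate are illustrative); AsOrdered() is the PLINQ operator that preserves input order in the output:

```csharp
using System.Linq;

static class PlinqSketch
{
    // Filter and map in parallel over any IEnumerable<T>;
    // AsOrdered() keeps results in the original input order.
    public static int[] EvenSquares(int[] data)
    {
        return data.AsParallel()
                   .AsOrdered()
                   .Where(x => x % 2 == 0)
                   .Select(x => x * x)
                   .ToArray();
    }
}
```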

Writing a PLINQ Query
  • Different ways to write PLINQ queries
    • Comprehensions
      • Syntax extensions to C# and Visual Basic
    • Normal APIs (two flavours)
      • Used as extension methods on IParallelEnumerable<T>
      • Direct use of ParallelEnumerable

var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;

var q = Y.AsParallel()
         .Where(x => p(x))
         .OrderBy(x => x.f1)
         .Select(x => x.f2);

var q = ParallelEnumerable.Select(
          ParallelEnumerable.OrderBy(
            ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
            x => x.f1),
          x => x.f2);

PLINQ Partitioning and Merging
  • Input to a single operator is partitioned into p disjoint subsets
  • Operators are replicated across the partitions
  • A merge marshals data back to consumer thread

foreach (int i in D.AsParallel()
                   .Where(x => p(x))
                   .Select(x => x * x * x)) …

  • Each partition executes in (almost) complete isolation

[Diagram: the "where p(x) / select x³" operator pair is replicated across Task 1 … Task n, one per partition, with a merge marshalling results back to the consumer.]


Coordination Data Structures
  • Used throughout PLINQ and TPL
  • Assist with key concurrency patterns
  • Thread-safe collections
    • ConcurrentStack<T>
    • ConcurrentQueue<T>
  • Work exchange
    • BlockingCollection<T>
  • Phased Operation
    • CountdownEvent
  • Locks and Signaling
    • ManualResetEventSlim
    • SemaphoreSlim
    • SpinLock …
  • Initialization
    • LazyInit<T> …
  • Cancellation
    • CancellationTokenSource
    • CancellationToken
    • OperationCanceledException
Common Cancellation
  • A CancellationTokenSource is a source of cancellation requests.
  • A CancellationToken is a notifier of a cancellation request.
  • Linking tokens allows combining of cancellation requesters.
  • Long-running code should poll the token regularly (roughly every 1 ms of work)
  • Blocking calls should observe a Token.


Work co-ordinator:

  • Creates a CTS
  • Starts work
  • Cancels the CTS if required

Worker code:

  • Gets, shares, and copies tokens
  • Routinely polls a token, which observes the CTS
  • May attach callbacks to a token
Common Cancellation (cont.)
  • All blocking calls allow a CancellationToken to be supplied.

var results = data.AsParallel()
                  .WithCancellation(token)
                  .Select(x => f(x))
                  .ToArray();

  • User code can observe the cancellation token and cooperatively enact cancellation

var results = data.AsParallel()
                  .WithCancellation(token)
                  .Select(x => {
                      if (token.IsCancellationRequested)
                          throw new OperationCanceledException(token);
                      return f(x);
                  })
                  .ToArray();
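Linking tokens can be sketched as follows (source names are illustrative): the linked token observes every one of its sources, so cancelling any source cancels the composite.

```csharp
using System.Threading;

static class CancelSketch
{
    // CreateLinkedTokenSource combines several cancellation requesters into one token.
    public static bool LinkedTokenSeesAnySource()
    {
        var userCts = new CancellationTokenSource();
        var timeoutCts = new CancellationTokenSource();
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(
                   userCts.Token, timeoutCts.Token))
        {
            timeoutCts.Cancel();                          // cancel just one source
            return linked.Token.IsCancellationRequested;  // the linked token observes it
        }
    }
}
```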

Extension Points in TPL & PLINQ
  • Partitioning strategies for Parallel & PLINQ
    • Extend via Partitioner<T>, OrderablePartitioner<T>, e.g. partitioners for heterogeneous data.
  • Task scheduling
    • Extend via TaskScheduler, e.g. a GUI-thread scheduler or a throttled scheduler
  • BlockingCollection
    • Extend via IProducerConsumerCollection<T>, e.g. a blocking priority queue.
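Alongside custom Partitioner<T> subclasses, System.Collections.Concurrent.Partitioner also ships ready-made factories. A sketch of range partitioning with Parallel.ForEach (the summation workload is illustrative):

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class PartitionerSketch
{
    // Hand Parallel.ForEach contiguous index ranges instead of single indices,
    // so each task runs a tight sequential inner loop over its range.
    public static long SumSquares(int n)
    {
        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, n), range =>
        {
            long local = 0;                        // per-range partial sum: no sharing in the loop
            for (int i = range.Item1; i < range.Item2; i++) local += (long)i * i;
            Interlocked.Add(ref total, local);     // one synchronized add per range
        });
        return total;
    }
}
```

Range partitioning amortizes per-element overhead, which matters when the loop body is only a few cycles.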
Debugging Parallel Apps in VS2010
  • Two new debugger tool windows
    • “Parallel Tasks”
    • “Parallel Stacks”



Parallel Tasks

[Screenshot: the Parallel Tasks window, with columns for thread assignment, location (with tooltip), parent ID, and task entry point; the current task and frozen task threads are flagged; column and item context menus are available, and a tooltip shows waiting/deadlocked status.]

Parallel Stacks

[Screenshot: the Parallel Stacks window, showing the active frames of the current thread and of other threads, the current frame, method and header tooltips, a context menu, and a bird's-eye view; blue highlights the path of the current thread.]



  • The many-core shift is happening
  • Parallelism in your code is inevitable
  • Invest in a platform that enables parallelism

…like the Parallel Extensions for .NET

Further Info and News

MSDN Concurrency Developer Center

Getting the bits!

June 2008 CTP -

Microsoft Visual Studio 2010 – Beta coming soon.

Parallel Extensions Team Blog


  • Parallel Extensions Team
  • Joe Duffy
  • Daniel Moth

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Parallel Technologies from Microsoft

Local computing

  • CDS
  • TPL
  • PLINQ
  • Concurrency Runtime in Robotics Studio
  • PPL (Native)
  • OpenMP (Native)

Distributed computing

  • WCF

Key Common Types:

AggregateException, OperationCanceledException, TaskCanceledException

CancellationTokenSource, CancellationToken


Key TPL types:

Task, Task<T>

TaskFactory, TaskFactory<T>


Key PLINQ types:

Extension methods IEnumerable.AsParallel(), IEnumerable<T>.AsParallel()

ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T>

Key CDS types:

Lazy<T>, LazyVariable<T>, LazyInitializer,

CountdownEvent, ManualResetEventSlim, SemaphoreSlim

BlockingCollection, ConcurrentDictionary, ConcurrentQueue

Performance Tips
  • Early community technology preview
    • Keep in mind that performance will improve significantly
  • Compute intensive and/or large data sets
    • Work done should be at least 1,000s of cycles
      • Measure, and combine/optimize as necessary
  • Do not be gratuitous in task creation
    • Lightweight, but still requires object allocation, etc.
  • Parallelize only outer loops where possible
    • Unless N is insufficiently large to offer enough parallelism
      • Consider parallelizing only inner, or both, at that point
  • Prefer isolation and immutability over synchronization
    • Synchronization == !Scalable
      • Try to avoid shared data
  • Have realistic expectations
    • Amdahl’s Law
      • Speedup will be fundamentally limited by the amount of sequential computation
    • Gustafson’s Law
      • But what if you add more data, thus increasing the parallelizable percentage of the application?
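The two laws reduce to simple arithmetic; a sketch (the 90%-parallel workload is an illustrative assumption):

```csharp
static class AmdahlSketch
{
    // Amdahl's Law: a program whose parallelizable fraction is p,
    // run on n cores, speeds up by at most 1 / ((1 - p) + p / n).
    public static double Speedup(double p, int n) => 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, eight cores give roughly 4.7×, and even arbitrarily many cores cannot exceed 10×, because the 10% sequential part always remains; Gustafson's observation is that growing the data typically grows p itself.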
Parallelism Blockers

int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 } ??

  • Ordering not guaranteed
  • Exceptions
  • Thread affinity
  • Operations with sub-linear speedup, or even speedup < 1.0
  • Side effects and mutability are serious issues
    • Most queries do not use side effects, but…
      • Race condition if non-unique elements


// Exceptions: the null elements throw when ToString() is called
object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();

// Thread affinity: UI controls must only be updated from the UI thread
controls.AsParallel().ForAll(c => c.Size = ...); // Problem

// Sub-linear speedup: too little work per element
IEnumerable<int> input = …;
var doubled = from x in input.AsParallel() select x * 2;

// Side effects: race condition if elements are non-unique
var q2 = from x in data.AsParallel() select x.f++;

PLINQ Partitioning (cont.)
  • Types of partitioning
    • Chunk
      • Works with any IEnumerable<T>
      • Single enumerator shared; chunks handed out on-demand
    • Range
      • Works only with IList<T>
      • Input divided into contiguous regions, one per partition
    • Stride
      • Works only with IList<T>
      • Elements handed out round-robin to each partition
    • Hash
      • Works with any IEnumerable<T>
      • Elements assigned to partition based on hash code
  • Repartitioning sometimes necessary
PLINQ Merging
  • Pipelined: separate consumer thread
    • Default for GetEnumerator()
      • And hence foreach loops
    • Access to data as it's available
      • But more synchronization overhead
  • Stop-and-go: consumer helps
    • Sorts, ToArray, ToList, GetEnumerator(false), etc.
    • Minimizes context switches
      • But higher latency and more memory
  • Inverted: no merging needed
    • ForAll extension method
    • Most efficient by far
      • But not always applicable
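A sketch of the inverted merge (the counting workload is illustrative): ForAll consumes each result on the worker that produced it, so any shared sink must itself be thread-safe.

```csharp
using System.Collections.Concurrent;
using System.Linq;

static class ForAllSketch
{
    // No merge back to the consumer thread: each partition pushes its
    // results straight into a thread-safe collection.
    public static int CountEvens(int[] data)
    {
        var evens = new ConcurrentBag<int>();
        data.AsParallel()
            .Where(x => x % 2 == 0)
            .ForAll(x => evens.Add(x));
        return evens.Count;
    }
}
```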

[Diagram: pipelined merging (worker threads feed a separate consumer thread), stop-and-go merging (the consumer thread helps the workers), and inverted merging (no merge; each thread consumes its own results).]

Example: “Baby Names”

IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
    if (babyRecord.Name == queryName &&
        babyRecord.State == queryState &&
        babyRecord.Year >= yearStart &&
        babyRecord.Year <= yearEnd)
    {
        results.Add(babyRecord);
    }
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));

Manual Parallel Solution

Requires synchronization knowledge:

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (ManualResetEvent done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally { if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose(); }

Hazards:

  • Inefficient locking
  • Lack of foreach simplicity
  • Manual aggregation
  • Lack of thread reuse
  • Heavy synchronization
  • Non-parallel sort

LINQ Solution


var results = from baby in babyRecords
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

(or in a different syntax…)

var results = babyRecords
    .Where(b => b.Name == queryName &&
                b.State == queryState &&
                b.Year >= yearStart &&
                b.Year <= yearEnd)
    .OrderBy(b => b.Year);
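Parallelizing this query is then a one-call change, AsParallel(). A self-contained sketch (BabyInfo and the query variables follow the deck's illustrative names):

```csharp
using System.Collections.Generic;
using System.Linq;

class BabyInfo
{
    public string Name;
    public string State;
    public int Year;
}

static class BabyQuery
{
    // Same filter and sort as the LINQ version, executed by PLINQ;
    // OrderBy becomes a parallel sort with a final ordered merge.
    public static List<BabyInfo> Search(IEnumerable<BabyInfo> records,
                                        string queryName, string queryState,
                                        int yearStart, int yearEnd)
    {
        return records.AsParallel()
                      .Where(b => b.Name == queryName &&
                                  b.State == queryState &&
                                  b.Year >= yearStart &&
                                  b.Year <= yearEnd)
                      .OrderBy(b => b.Year)
                      .ToList();
    }
}
```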




ThreadPool Task (Work) Stealing

ThreadPool Task Queues

[Diagram: the program thread and worker threads 1…p each maintain their own task queue (Tasks 1–6 spread across them); an idle worker steals queued tasks from another thread's queue.]