Getting the most out of Parallel Extensions for .NET

Dr. Mike Liddell

Senior Developer

Microsoft

(mikelid@microsoft.com)

Agenda
  • Why parallelism, why now?
  • Parallelism with today’s technologies
  • Parallel Extensions to the .NET Framework
    • PLINQ
    • Task Parallel Library
    • Coordination Data Structures
  • Demos
Hardware Paradigm Shift

[Figure: power density (W/cm²) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processors), ‘70 to ‘10, on a log scale from 1 to 10,000, with Hot Plate, Nuclear Reactor, Rocket Nozzle and Sun’s Surface as reference levels.]

Today’s Architecture: Heat becoming an unmanageable problem!

To Grow, To Keep Up, We Must Embrace Parallel Computing

[Figure: GOPS, 2004 to 2015, on a log scale from 16 to 32,768. Many-core peak parallel GOPS versus single-threaded performance growing 10% per year; the gap is labeled "Parallelism Opportunity: 80X".]

Intel Developer Forum, Spring 2004 - Pat Gelsinger

“… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations.”

Pat Gelsinger

Chief Technology Officer, Senior Vice President, Intel Corporation


It's An Industry Thing

  • OpenMP
  • Intel TBB
  • Java libraries
  • OpenCL
  • CUDA
  • MPI
  • Erlang
  • Cilk
  • (many others)
Demo

  • Raytracer
What's the Problem?
  • Multithreaded programming is “hard” today
    • Robust solutions only by specialists
    • Parallel patterns are not prevalent, well known, nor easy to implement
    • Many potential correctness & performance issues
      • Races, deadlocks, livelocks, lock convoys, cache coherency overheads, missed notifications, non-serializable updates, priority inversion, false-sharing, sub-linear scaling and so on…
    • Features that are often skimped on
      • Last delta of perf, ensuring no missed exceptions, composable cancellation, dynamic partitioning, efficient and custom scheduling
  • Businesses have little desire to “go deep”
    • Developers should focus on business value, not concurrency hassles and common concerns
Example: Matrix Multiplication

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}

Manual Parallel Solution

int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;
for (int c = 0; c < P; c++) {
    ThreadPool.QueueUserWorkItem(o => {
        int lc = (int)o;
        for (int i = lc * Chunk;
             i < (lc + 1 == P ? N : (lc + 1) * Chunk);
             i++) {
            // original loop body
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0) {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();

Callouts on the slide:
  • Static Work Distribution
  • Potential scalability bottleneck
  • Error Prone
  • Manual locking
  • Manual Synchronization

Parallel Solution

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i => {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Demo!

Parallel Extensions to the .NET Framework
  • What is it?
    • Additional APIs shipping in .NET BCL (mscorlib, System, System.Core)
    • With corresponding enhancements to the CLR & ThreadPool
    • Provides primitives, task parallelism and data parallelism
      • Coordination/synchronization constructs (Coordination Data Structures)
      • Imperative data and task parallelism (Task Parallel Library)
      • Declarative data parallelism (PLINQ)
    • Common exception handling model
    • Common and rich cancellation model
  • Why do we need it?
    • Supports parallelism in any .NET language
    • Delivers reduced concept count and complexity, better time to solution
    • Begins to move parallelism capabilities from concurrency experts to domain experts
Parallel Extensions Architecture

[Diagram, layers from top to bottom:]
  • User Code / Applications
  • PLINQ Execution Engine
    • Data Partitioning (Chunk, Range, Stripe, Custom)
    • Operators (Map, Filter, Sort, Search, Reduction)
    • Merging (Pipeline, Synchronous, Order preserving)
  • Task Parallel Library
    • Structured Task Parallelism
  • Coordination Data Structures
    • Thread-safe Collections
    • Coordination Types
    • Cancellation Types
  • Pre-existing Primitives
    • ThreadPool
    • Monitor, Events, Threads

Task Parallel Library

1st-class debugger support!

  • System.Threading.Tasks
    • Task
      • Parent-child relationships
      • Structured waiting and cancellation
      • Continuations on success, failure, cancellation
      • Implements IAsyncResult to compose with the Asynchronous Programming Model (APM).
    • Task<T>
      • A task that has a value on completion
      • Asynchronous execution with blocking on task.Value
      • Combines ideas of futures and promises
    • TaskScheduler
      • We ship a scheduler that makes full use of the (vastly) improved ThreadPool
      • Custom Task Schedulers can be written for specific needs.
    • Parallel
      • Convenience APIs: Parallel.For(), Parallel.ForEach()
      • Automatic, scalable & dynamic partitioning.
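
The Task<T> and continuation bullets above can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the class and method names (TaskDemo, SumAsync) are hypothetical, and it assumes the API as released in .NET Framework 4, where the blocking accessor is the Result property rather than the task.Value mentioned on the slide.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskDemo
{
    // Hypothetical workload: sums the integers 1..n on a pool thread.
    public static Task<long> SumAsync(int n)
    {
        return Task.Factory.StartNew(() =>
        {
            long sum = 0;
            for (int i = 1; i <= n; i++) sum += i;
            return sum;
        });
    }

    public static void Main()
    {
        Task<long> sum = SumAsync(1000);

        // Continuation that is scheduled only if the antecedent ran to completion.
        Task<string> report = sum.ContinueWith(
            t => "sum = " + t.Result,
            TaskContinuationOptions.OnlyOnRanToCompletion);

        // Reading Result blocks until the value is available.
        Console.WriteLine(report.Result);
    }
}
```

Continuations for failure or cancellation would use TaskContinuationOptions.OnlyOnFaulted or OnlyOnCanceled in the same position.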
Task Parallel Library: Loops
  • Loops are a common source of work
  • Can be parallelized when iterations are independent
    • Body doesn’t depend on mutable state
      • e.g. static vars, writing to local vars to be used in subsequent iterations

for (int i = 0; i < n; i++) work(i);

foreach (T e in data) work(e);

Parallel.For(0, n, i => work(i));

Parallel.ForEach(data, e => work(e));
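
When iterations need to contribute to a shared result, one way to keep the loop body free of shared mutable state is the thread-local overload of Parallel.For. The sketch below is an illustration under that assumption; LoopDemo and SumOfSquares are hypothetical names, not part of the presentation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LoopDemo
{
    // Sums i * i for i in [0, n) without sharing state between iterations.
    public static long SumOfSquares(int n)
    {
        long total = 0;
        Parallel.For(0, n,
            () => 0L,                                      // initial value of each worker's local sum
            (i, loopState, local) => local + (long)i * i,  // body reads and writes only its own local
            local => Interlocked.Add(ref total, local));   // combine once per worker, atomically
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(1000));
    }
}
```

The only synchronized operation is the final Interlocked.Add, which runs once per worker rather than once per iteration.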

Task Parallel Library
  • Supports early exit via a Break API
  • Parallel.For, Parallel.ForEach for loops.
  • Parallel.Invoke for easy creation of simple tasks
  • Synchronous (blocking) APIs, but with cancellation support

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());

Parallel.For(…, cancellationToken);
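
As a hedged sketch of early exit and cancellation, the following assumes the API shape released in .NET Framework 4, where the loop body receives a ParallelLoopState and the token is passed through ParallelOptions (the CTP shown on the slide may differ). EarlyExitDemo and its methods are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class EarlyExitDemo
{
    // Returns the index of the first negative element, or -1 if there is none.
    public static long FirstNegative(int[] data)
    {
        ParallelLoopResult result = Parallel.For(0, data.Length, (i, state) =>
        {
            // Break stops iterations above i from starting; iterations below i still run,
            // so the lowest break index is the first match.
            if (data[i] < 0) state.Break();
        });
        return result.LowestBreakIteration ?? -1;
    }

    // Shows that a loop given an already-cancelled token throws instead of running.
    public static bool ThrowsWhenCancelled()
    {
        using (var cts = new CancellationTokenSource())
        {
            cts.Cancel();
            var options = new ParallelOptions { CancellationToken = cts.Token };
            try
            {
                Parallel.For(0, 100, options, i => { });
                return false;
            }
            catch (OperationCanceledException)
            {
                return true;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(FirstNegative(new[] { 3, 1, -4, 1, -5 }));
        Console.WriteLine(ThrowsWhenCancelled());
    }
}
```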

Parallel LINQ (PLINQ)
  • Enable LINQ developers to leverage parallel hardware
    • Supports all of the .NET Standard Query Operators
      • Plus a few other extension methods specific to PLINQ
    • Abstracts away parallelism details
      • Partitions and merges data intelligently (“classic” data parallelism)
    • Works for any IEnumerable<T>

e.g. data.AsParallel().Select(..).Where(..);

e.g. array.AsParallel().WithCancellation(ct)…
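
A complete, if small, query in this style might look like the sketch below. PlinqDemo and SquaresOfEvens are hypothetical names; AsOrdered is one of the PLINQ-specific extension methods and is used here only so that the output order matches the input order.

```csharp
using System;
using System.Linq;

public static class PlinqDemo
{
    // Filters and maps in parallel; AsOrdered asks PLINQ to preserve source order in the result.
    public static int[] SquaresOfEvens(int[] data)
    {
        return data.AsParallel()
                   .AsOrdered()
                   .Where(x => x % 2 == 0)
                   .Select(x => x * x)
                   .ToArray();
    }

    public static void Main()
    {
        int[] result = SquaresOfEvens(Enumerable.Range(1, 10).ToArray());
        Console.WriteLine(string.Join(", ", result)); // prints "4, 16, 36, 64, 100"
    }
}
```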

Writing a PLINQ Query
  • Different ways to write PLINQ queries
    • Comprehensions
      • Syntax extensions to C# and Visual Basic
    • Normal APIs (two flavours)
      • Used as extension methods on ParallelQuery<T> (named IParallelEnumerable<T> in the early CTPs)
      • Direct use of ParallelEnumerable

var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;

var q = Y.AsParallel()
         .Where(x => p(x))
         .OrderBy(x => x.f1)
         .Select(x => x.f2);

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                x => x.f1),
            x => x.f2);

PLINQ Partitioning and Merging
  • Input to a single operator is partitioned into p disjoint subsets
  • Operators are replicated across the partitions
  • A merge marshals data back to the consumer thread
  • Each partition executes in (almost) complete isolation

foreach (int i in D.AsParallel()
                   .Where(x => p(x))
                   .Select(x => x * x * x)
                   .OrderBy(x => -x))

[Diagram: PLINQ partitions D across Task 1 … Task n; each task runs where p(x), select x³ and LocalSort() on its own partition, and a Merge step feeds the results to the foreach.]

Coordination Data Structures
  • Used throughout PLINQ and TPL
  • Assist with key concurrency patterns
  • Thread-safe collections
    • ConcurrentStack<T>
    • ConcurrentQueue<T>
  • Work exchange
    • BlockingCollection<T>
  • Phased Operation
    • CountdownEvent
  • Locks and Signaling
    • ManualResetEventSlim
    • SemaphoreSlim
    • SpinLock …
  • Initialization
    • LazyInit<T> …
  • Cancellation
    • CancellationTokenSource
    • CancellationToken
    • OperationCanceledException
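
To illustrate the "work exchange" role of BlockingCollection<T>, here is a minimal producer/consumer sketch. The names (WorkExchangeDemo, ProduceAndConsume) and the capacity of 16 are arbitrary choices for illustration, and the code assumes the type as released in .NET Framework 4.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class WorkExchangeDemo
{
    // One producer task hands the integers 1..itemCount to a consumer through a bounded buffer.
    public static long ProduceAndConsume(int itemCount)
    {
        using (var queue = new BlockingCollection<int>(boundedCapacity: 16))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= itemCount; i++)
                    queue.Add(i);          // blocks while the buffer is full
                queue.CompleteAdding();    // lets the consumer's enumeration end
            });

            long sum = 0;
            // Blocks while the buffer is empty; completes after CompleteAdding and drain.
            foreach (int item in queue.GetConsumingEnumerable())
                sum += item;

            producer.Wait();
            return sum;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ProduceAndConsume(100));
    }
}
```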
Common Cancellation
  • A CancellationTokenSource (CTS) is a source of cancellation requests.
  • A CancellationToken (CT) is a notifier of a cancellation request.
  • Linking tokens allows combining of cancellation requesters.
  • Slow code should poll the token roughly every 1 ms.
  • Blocking calls should observe a token.

Work co-ordinator:
  • Creates a CTS
  • Starts work
  • Cancels the CTS if required

Workers:
  • Get, share, and copy tokens
  • Routinely poll the token, which observes the CTS
  • May attach callbacks to the token

[Diagram: a CTS hands out CTs to the workers; tokens CT1 and CT2 are linked through CTS12, which issues its own CT.]
Common Cancellation (cont.)
  • All blocking calls allow a CancellationToken to be supplied.

var results = data
    .AsParallel()
    .WithCancellation(token)
    .Select(x => f(x))
    .ToArray();

  • User code can observe the cancellation token, and cooperatively enact cancellation

var results = data
    .AsParallel()
    .WithCancellation(token)
    .Select(x => {
        if (token.IsCancellationRequested)
            throw new OperationCanceledException(token);
        return f(x);
    })
    .ToArray();

Extension Points in TPL & PLINQ
  • Partitioning strategies for Parallel & PLINQ
    • Extend via Partitioner<T>, OrderablePartitioner<T>, e.g. partitioners for heterogeneous data.
  • Task scheduling
    • Extend via TaskScheduler, e.g. GUI-thread scheduler, throttled scheduler.
  • BlockingCollection
    • Extend via IProducerConsumerCollection<T>, e.g. blocking priority queue.
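
As one possible illustration of the partitioning extension point, the sketch below derives from Partitioner<T> and hands elements out round-robin. StridePartitioner<T> and PartitionerDemo are hypothetical names, the design is deliberately minimal (static partitions only), and it assumes the .NET Framework 4 shape of Partitioner<T> and of the AsParallel overload that accepts one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical custom partitioner: partition k receives elements k, k + step, k + 2 * step, ...
public sealed class StridePartitioner<T> : Partitioner<T>
{
    private readonly IList<T> _source;

    public StridePartitioner(IList<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        _source = source;
    }

    // Must return exactly partitionCount enumerators.
    public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
    {
        if (partitionCount <= 0) throw new ArgumentOutOfRangeException(nameof(partitionCount));
        var partitions = new List<IEnumerator<T>>(partitionCount);
        for (int k = 0; k < partitionCount; k++)
            partitions.Add(Stride(k, partitionCount));
        return partitions;
    }

    private IEnumerator<T> Stride(int start, int step)
    {
        for (int i = start; i < _source.Count; i += step)
            yield return _source[i];
    }
}

public static class PartitionerDemo
{
    public static int SumWithCustomPartitioner(IList<int> data)
    {
        // PLINQ pulls its input through the custom partitions.
        return new StridePartitioner<int>(data).AsParallel().Sum();
    }

    public static void Main()
    {
        Console.WriteLine(SumWithCustomPartitioner(Enumerable.Range(1, 100).ToList()));
    }
}
```

Because this sketch does not support dynamic partitions, it is shown with PLINQ; Parallel.ForEach expects a partitioner whose SupportsDynamicPartitions is true.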
Debugging Parallel Apps in VS2010
  • Two new debugger tool windows
    • “Parallel Tasks”
    • “Parallel Stacks”

Parallel Tasks

[Screenshot: the Parallel Tasks window. Callouts: Flagging, Current Task, Identifier, Status, Location + Tooltip, Task Entry Point, Parent ID, Thread Assignment, column context menu, item context menu, an indicator that a task’s thread is frozen, and a tooltip that shows info on waiting/deadlocked status.]

Parallel Stacks

[Screenshot: the Parallel Stacks window. Callouts: current frame, active frame of current thread, active frame of other thread(s), method tooltip, header tooltip, context menu, zoom control, and bird’s eye view. Blue highlights the path of the current thread.]


Summary

  • The ManyCore Shift is happening
  • Parallelism in your code is inevitable
  • Invest in a platform that enables parallelism

…like the Parallel Extensions for .NET

Further Info and News

MSDN Concurrency Developer Center

http://msdn.microsoft.com/concurrency

Getting the bits!

June 2008 CTP - http://msdn.microsoft.com/concurrency

Microsoft Visual Studio 2010 – Beta coming soon.

http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx


Blogs

  • Parallel Extensions Team http://blogs.msdn.com/pfxteam
  • Joe Duffy http://www.bluebytesoftware.com
  • Daniel Moth http://www.danielmoth.com/Blog/

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Parallel Technologies from Microsoft

Local computing

  • CDS
  • TPL
  • PLINQ
  • Concurrency Runtime in Robotics Studio
  • PPL (Native)
  • OpenMP (Native)

Distributed computing

  • WCF
  • MPI, MPI.NET
Types

Key Common Types:

AggregateException, OperationCanceledException, TaskCanceledException

CancellationTokenSource, CancellationToken

Partitioner<T>

Key TPL types:

Task, Task<T>

TaskFactory, TaskFactory<T>

TaskScheduler

Key PLINQ types:

Extension methods IEnumerable.AsParallel(), IEnumerable<T>.AsParallel()

ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T>

Key CDS types:

Lazy<T>, LazyVariable<T>, LazyInitializer,

CountdownEvent, ManualResetEventSlim, SemaphoreSlim

BlockingCollection, ConcurrentDictionary, ConcurrentQueue

Performance Tips
  • Early community technology preview
    • Keep in mind that performance will improve significantly
  • Compute intensive and/or large data sets
    • Work done should be at least 1,000s of cycles
      • Measure, and combine/optimize as necessary
  • Do not be gratuitous in task creation
    • Lightweight, but still requires object allocation, etc.
  • Parallelize only outer loops where possible
    • Unless N is insufficiently large to offer enough parallelism
      • Consider parallelizing only inner, or both, at that point
  • Prefer isolation and immutability over synchronization
    • Synchronization == !Scalable
      • Try to avoid shared data
  • Have realistic expectations
    • Amdahl’s Law
      • Speedup will be fundamentally limited by the amount of sequential computation
    • Gustafson’s Law
      • But what if you add more data, thus increasing the parallelizable percentage of the application?
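
The two laws named above can be written compactly. In this sketch the symbols are introduced here rather than taken from the slides: p is the fraction of the work that runs in parallel and N is the number of cores. For Gustafson's Law, p is measured on the scaled workload, where the problem size grows with N.

```latex
S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
S_{\text{Gustafson}}(N) = (1 - p) + p\,N
```

For example, with p = 0.9 and N = 8, Amdahl's Law gives 1 / (0.1 + 0.1125) ≈ 4.7, while Gustafson's Law gives 0.1 + 7.2 = 7.3.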
Parallelism Blockers
  • Ordering not guaranteed

int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 } ??

  • Exceptions (surfaced as an AggregateException)

object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();

  • Thread affinity

controls.AsParallel().ForAll(c => c.Size = ...); // Problem

  • Operations with sub-linear speedup, or even speedup < 1.0

IEnumerable<int> input = …;
var doubled = from x in input.AsParallel() select x*2;

  • Side effects and mutability are serious issues
    • Most queries do not use side effects, but…
      • Race condition if non-unique elements

var q = from x in data.AsParallel() select x.f++;
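
Handling the exception case might look like the following sketch. ExceptionDemo and CountFailures are hypothetical names; the code assumes the released .NET Framework 4 behavior in which exceptions thrown inside a PLINQ query are wrapped in an AggregateException, and it forces parallel execution so that the small input is not run sequentially.

```csharp
using System;
using System.Linq;

public static class ExceptionDemo
{
    // Runs a query that fails on null elements and returns how many exceptions were reported.
    public static int CountFailures(object[] data)
    {
        try
        {
            data.AsParallel()
                .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
                .Select(x => x.ToString())   // throws NullReferenceException for null elements
                .ToArray();
            return 0;
        }
        catch (AggregateException ae)
        {
            // Several partitions may fail; how many are reported can vary from run to run.
            var inner = ae.Flatten().InnerExceptions;
            foreach (Exception e in inner)
                Console.WriteLine(e.GetType().Name);
            return inner.Count;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountFailures(new object[] { "foo", null, null }) > 0);
    }
}
```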

PLINQ Partitioning, cont.
  • Types of partitioning
    • Chunk
      • Works with any IEnumerable<T>
      • Single enumerator shared; chunks handed out on-demand
    • Range
      • Works only with IList<T>
      • Input divided into contiguous regions, one per partition
    • Stride
      • Works only with IList<T>
      • Elements handed out round-robin to each partition
    • Hash
      • Works with any IEnumerable<T>
      • Elements assigned to partition based on hash code
  • Repartitioning sometimes necessary
PLINQ Merging
  • Pipelined: separate consumer thread
    • Default for GetEnumerator()
      • And hence foreach loops
    • Access to data as it's available
      • But more synchronization overhead
  • Stop-and-go: consumer helps
    • Sorts, ToArray, ToList, GetEnumerator(false), etc.
    • Minimizes context switches
      • But higher latency and more memory
  • Inverted: no merging needed
    • ForAll extension method
    • Most efficient by far
      • But not always applicable
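
The merge styles can be approximated with the released .NET Framework 4 API, which exposes them through WithMergeOptions and ForAll rather than the GetEnumerator(false) overload mentioned above. This is a sketch under that assumption; MergeDemo and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

public static class MergeDemo
{
    // Streaming merge: results reach the consuming foreach as soon as they are produced.
    public static int CountStreaming(int n)
    {
        var query = Enumerable.Range(0, n).AsParallel()
            .WithMergeOptions(ParallelMergeOptions.NotBuffered)
            .Select(x => x * 2);

        int count = 0;
        foreach (int value in query) count++;
        return count;
    }

    // Stop-and-go merge: ToArray waits for every partition before returning.
    public static int CountBuffered(int n)
    {
        return Enumerable.Range(0, n).AsParallel()
            .WithMergeOptions(ParallelMergeOptions.FullyBuffered)
            .Select(x => x * 2)
            .ToArray()
            .Length;
    }

    // Inverted: no merge step; each worker pushes results into a thread-safe collection.
    public static int CountInverted(int n)
    {
        var bag = new ConcurrentBag<int>();
        Enumerable.Range(0, n).AsParallel()
            .Select(x => x * 2)
            .ForAll(x => bag.Add(x));
        return bag.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountStreaming(1000));
        Console.WriteLine(CountBuffered(1000));
        Console.WriteLine(CountInverted(1000));
    }
}
```

ForAll is only applicable when the per-element action is safe to run concurrently, which is why the sketch writes into a ConcurrentBag<int> rather than a List<int>.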


Example: “Baby Names”

IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
    if (babyRecord.Name == queryName &&
        babyRecord.State == queryState &&
        babyRecord.Year >= yearStart &&
        babyRecord.Year <= yearEnd)
    {
        results.Add(babyRecord);
    }
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));

Manual Parallel Solution

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (ManualResetEvent done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally { if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose(); }

Callouts on the slide:
  • Synchronization Knowledge
  • Inefficient locking
  • Lack of foreach simplicity
  • Manual aggregation
  • Tricks
  • Lack of thread reuse
  • Heavy synchronization
  • Non-parallel sort

LINQ Solution

var results = from baby in babyRecords.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

(or in different syntax…)

var results = babyRecords.AsParallel()
    .Where(b => b.Name == queryName &&
                b.State == queryState &&
                b.Year >= yearStart &&
                b.Year <= yearEnd)
    .OrderBy(b => b.Year)
    .Select(b => b);

The only change from the sequential LINQ query is the added .AsParallel().

ThreadPool Task (Work) Stealing

[Diagram: the Program Thread and Worker Threads 1 … p around the ThreadPool task queues, which hold Task 1 through Task 6.]