Parallelism in the Standard C++: What to Expect in C++ 17

Parallelism in the Standard C++: What to Expect in C++ 17 Artur Laksberg arturl@microsoft.com Visual C++ Team, Microsoft September 17, 2014

Agenda • Parallel Fundamentals • Task regions • Parallel Algorithms • Parallelization • Vectorization

Part 1: The Fundamentals

Renderscript OpenMP CUDA C++ AMP PPL TBB MPI OpenACC OpenCL Cilk Plus GCD

Parallelism in C++11/14 • Fundamentals: • Memory model • Atomics • Basics: • thread • mutex • condition_variable • async • future

Quicksort: Serial
    void quicksort(int *v, int start, int end) {
        if (start < end) {
            int pivot = partition(v, start, end);
            quicksort(v, start, pivot - 1);
            quicksort(v, pivot + 1, end);
        }
    }
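The slide calls partition without showing it. A minimal sketch follows, assuming a Lomuto-style scheme with an inclusive end index and the last element as the pivot. The talk does not say which scheme it has in mind, so the body of partition here is an illustration only.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not from the talk): partitions v[start..end], end
// inclusive, around the value v[end] and returns the pivot's final index.
int partition(int *v, int start, int end) {
    assert(start <= end);
    int pivot_value = v[end];
    int boundary = start;  // everything left of boundary is < pivot_value
    for (int i = start; i < end; ++i) {
        if (v[i] < pivot_value) {
            std::swap(v[i], v[boundary]);
            ++boundary;
        }
    }
    std::swap(v[boundary], v[end]);  // move the pivot between the two halves
    return boundary;
}

// The serial quicksort from the slide, unchanged.
void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
```

Calling quicksort(v, 0, n - 1) sorts an array of n elements in place.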

Quicksort: Use Threads
    void quicksort(int *v, int start, int end) {
        if (start < end) {
            int pivot = partition(v, start, end);
            std::thread t1([&] { quicksort(v, start, pivot - 1); });
            std::thread t2([&] { quicksort(v, pivot + 1, end); });
            t1.join();
            t2.join();
        }
    }
• Problem 1: expensive
• Problem 2: Fork-join not enforced
• Problem 3: Exceptions??

Andrzej Krzemieński: “Do not use naked threads in the program: use RAII-like wrappers instead”
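One way to read this advice is a wrapper that joins in its destructor. The sketch below is an assumption for illustration: the names joining_thread and sum_in_two_halves are hypothetical and come from neither the talk nor the standard library.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: owns a std::thread and joins it on destruction,
// so the join also happens when the scope is left by an exception.
class joining_thread {
public:
    template <class F>
    explicit joining_thread(F&& f) : t_(std::forward<F>(f)) {}
    joining_thread(const joining_thread&) = delete;
    joining_thread& operator=(const joining_thread&) = delete;
    ~joining_thread() {
        if (t_.joinable()) t_.join();
    }

private:
    std::thread t_;
};

// Sums v[0..n) with two threads; each thread writes only its own accumulator.
int sum_in_two_halves(const int* v, int n) {
    assert(n >= 0);
    int left = 0;
    int right = 0;
    {
        joining_thread t1([&] { for (int i = 0; i < n / 2; ++i) left += v[i]; });
        joining_thread t2([&] { for (int i = n / 2; i < n; ++i) right += v[i]; });
    }  // both threads have been joined once this scope closes
    return left + right;
}
```

This addresses only the missing join (Problem 2). Thread creation stays expensive, and an exception thrown inside a thread body still terminates the program.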

Quicksort: Fork-Join Parallelism
    void quicksort(int *v, int start, int end) {
        if (start < end) {                         // parallel region
            int pivot = partition(v, start, end);
            quicksort(v, start, pivot - 1);        // task
            quicksort(v, pivot + 1, end);          // task
        }
    }

Quicksort: Using Task Regions (N3832)
    void quicksort(int *v, int start, int end) {
        if (start < end) {
            task_region([&] (auto& r) {                          // parallel region
                int pivot = partition(v, start, end);
                r.run([&] { quicksort(v, start, pivot - 1); });  // task
                r.run([&] { quicksort(v, pivot + 1, end); });    // task
            });
        }
    }
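To make the fork-join guarantee concrete, here is a rough emulation of a task region built on std::async. It is a sketch under stated assumptions, not the N3832 design: the names task_region_sketch and task_region_handle_sketch are hypothetical, every task gets its own thread instead of going through a work-stealing scheduler, tasks must return void, and run() may only be called from the thread executing the region body.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the handle passed to the region body.
class task_region_handle_sketch {
public:
    // Spawns f as a child task (assumed to return void).
    template <class F>
    void run(F&& f) {
        children_.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }

    // Joins every child; get() rethrows the first stored exception, if any.
    void wait() {
        for (auto& child : children_) child.get();
        children_.clear();
    }

    bool empty() const { return children_.empty(); }

private:
    std::vector<std::future<void>> children_;
};

// Runs body, then joins all tasks it spawned before returning. If body or
// wait() throws, the futures returned by std::async still block in their
// destructors, so no child outlives the region.
template <class Body>
void task_region_sketch(Body&& body) {
    task_region_handle_sketch r;
    body(r);
    r.wait();
    assert(r.empty());
}
```

A call such as task_region_sketch([&](auto& r) { r.run(f); r.run(g); }); returns only after both f and g have finished, which is the property the hand-written std::thread version did not enforce.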

Under The Hood…

Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4

Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4 Old items New items

Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4 Old items New items “Thief”
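The "old items / new items / thief" picture can be sketched as a double-ended queue per processor: the owner pushes and pops at the new end, while a thief takes from the old end. The class below is a hypothetical illustration that uses a mutex for simplicity. Production schedulers typically use lock-free deques, and nothing here is taken from a specific runtime.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical per-processor queue of task ids.
class work_queue_sketch {
public:
    // Owner: add a newly spawned task at the "new" end.
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(task);
    }

    // Owner: take the most recently pushed task (LIFO).
    std::optional<int> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        int task = items_.back();
        items_.pop_back();
        return task;
    }

    // Thief: take the oldest task from the opposite end.
    std::optional<int> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        int task = items_.front();
        items_.pop_front();
        return task;
    }

private:
    std::mutex m_;
    std::deque<int> items_;
};

// Usage: after pushing 1, 2, 3 the thief gets 1 and the owner gets 3.
void example_owner_and_thief() {
    work_queue_sketch q;
    q.push(1);
    q.push(2);
    q.push(3);
    std::optional<int> stolen = q.steal();
    std::optional<int> popped = q.pop();
    assert(stolen && *stolen == 1);
    assert(popped && *popped == 3);
}
```

Stealing from the old end tends to hand the thief a large piece of work and keeps the owner on recently touched data.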

Fork-Join Parallelism and Work Stealing
    e();
    task_region([] (auto& r) {
        r.run(f);
        g();
    });
    h();
(Diagram: e() forks into f() and g(), which join before h().)
• Q1: What thread runs f?
• Q2: What thread runs g?
• Q3: What thread runs h?

Work Stealing Design Choices
• What Thread Executes After a Spawn?
  • Child Stealing
  • Continuation (parent) Stealing
• What Thread Executes After a Join?
  • Stalling: initiating thread waits
  • Greedy: the last thread to reach join continues
    task_region([] (auto& r) {
        for(int i=0; i<n; ++i)
            r.run(f);
    });

Part 2: The Algorithms

Alex Stepanov: Start With The Algorithms

Inspiration Performing Parallel Operations On Containers • Intel • Threading Building Blocks • Microsoft • Parallel Patterns Library, C++ AMP • Nvidia • Thrust

Parallel STL • Just like STL, only parallel… • Can be faster • If you know what you’re doing • Two Execution Policies: • std::par • std::par_vec

Parallelization: What’s the Big Deal?
• Why not already parallel?
    std::sort(begin, end, [](int a, int b) { return a < b; });
• User-provided closures must be thread safe:
    int comparisons = 0;
    std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });
• But also special-member functions, std::swap etc.
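One way to make the counting comparator safe is an atomic counter. The sketch below is illustrative: the function name sort_counting_comparisons is hypothetical, and it calls the ordinary sequential std::sort to stay portable, on the assumption that the same closure would be handed to a parallel overload.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Sorts vec and returns how many times the comparator was invoked.
int sort_counting_comparisons(std::vector<int>& vec) {
    std::atomic<int> comparisons{0};
    std::sort(vec.begin(), vec.end(), [&](int a, int b) {
        // An atomic increment has no data race even if several threads
        // were to call this closure at the same time.
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    });
    assert(std::is_sorted(vec.begin(), vec.end()));
    return comparisons.load();
}
```

The atomic removes the data race but not the contention: every comparison now touches one shared cache line, so a counter like this can erase much of the speed-up a parallel sort would bring.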

It’s a Contract • What the user can do • What the implementer can do • Asymptotic guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log^2(n)); what about parallel sort? • What is a valid implementation? (see next slide)

Chaos Sort
    template<typename Iterator, typename Compare>
    void chaos_sort( Iterator first, Iterator last, Compare comp ) {
        auto n = last-first;
        std::vector<char> c(n);
        for(;;) {
            bool flag = false;
            for( size_t i=1; i<n; ++i ) {
                c[i] = comp(first[i],first[i-1]);
                flag |= c[i];
            }
            if( !flag ) break;
            for( size_t i=1; i<n; ++i )
                if( c[i] )
                    std::swap( first[i-1], first[i] );
        }
    }

Execution Policies
• Built-in Execution Policies:
    extern const sequential_execution_policy seq;
    extern const parallel_execution_policy par;
    extern const parallel_vector_execution_policy par_vec;
• Dynamic Execution Policy:
    class execution_policy {
    public:
        // ...
        const type_info& target_type() const;
        template<class T> T *target();
        template<class T> const T *target() const;
    };

Using Execution Policy To Write Parallel Code
    std::vector<int> vec = ...
    // standard sequential sort
    std::sort(vec.begin(), vec.end());
    using namespace std::experimental::parallel;
    // explicitly sequential sort
    sort(seq, vec.begin(), vec.end());
    // permitting parallel execution
    sort(par, vec.begin(), vec.end());
    // permitting vectorization as well
    sort(par_vec, vec.begin(), vec.end());

Picking Execution Policy Dynamically
    size_t threshold = ...
    execution_policy exec = seq;
    if(vec.size() > threshold) {
        exec = par;
    }
    sort(exec, vec.begin(), vec.end());

Exception Handling
• In C++ philosophy, no exception is silently ignored
• Exception list: container of exception_ptr objects
    try {
        r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0);
    } catch(const exception_list& list) {
        for(auto& exptr : list) {
            // process exception pointer exptr
        }
    }
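How such a list might be filled and consumed can be sketched with std::exception_ptr alone. The names exception_list_sketch, run_all_collecting and count_failures are hypothetical, and the tasks run sequentially here. A real parallel algorithm would gather the pointers from several threads.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for exception_list: an iterable bag of exception_ptr.
struct exception_list_sketch {
    std::vector<std::exception_ptr> items;
    auto begin() const { return items.begin(); }
    auto end() const { return items.end(); }
    std::size_t size() const { return items.size(); }
};

// Runs every task; instead of stopping at the first failure, captures each
// exception and throws them all together at the end.
void run_all_collecting(const std::vector<std::function<void()>>& tasks) {
    exception_list_sketch errors;
    for (const auto& task : tasks) {
        try {
            task();
        } catch (...) {
            errors.items.push_back(std::current_exception());
        }
    }
    if (errors.size() != 0) throw errors;
}

// Caller side: unpack the list and handle each exception individually.
std::size_t count_failures(const std::vector<std::function<void()>>& tasks) {
    std::size_t failures = 0;
    try {
        run_all_collecting(tasks);
    } catch (const exception_list_sketch& list) {
        for (const auto& exptr : list) {
            assert(exptr != nullptr);
            try {
                std::rethrow_exception(exptr);
            } catch (const std::exception&) {
                ++failures;
            }
        }
    }
    return failures;
}
```

Rethrowing each pointer is what lets the caller handle every failure with ordinary catch clauses.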

Vectorization: What’s the Big Deal?
    int a[n] = ...;
    int b[n] = ...;
    for(int i=0; i<n; ++i) {
        a[i] = b[i] + c;
    }
Generated code (movdqu: Move Unaligned Double Quadword):
    movdqu xmm1, XMMWORD PTR _b$[esp+eax+132]
    movdqu xmm0, XMMWORD PTR _a$[esp+eax+132]
    paddd xmm1, xmm2
    paddd xmm1, xmm0
    movdqu XMMWORD PTR _a$[esp+eax+132], xmm1
Vector form:
    a[i:i+3] = b[i:i+3] + c;

Vector Lane is not a Thread! • Taking locks • Thread with thread_id x takes a lock… • Then another “thread” with the same thread_id enters the lock… • Deadlock!!! • Exceptions • Can we unwind 1/4th of the stack?

Vectorization: Not So Easy Any More…
    void f(int* a, int* b) {
        for(int i=0; i<n; ++i) {
            a[i] = b[i] + c;
            func();
        }
    }
Generated code:
    mov ecx, DWORD PTR _b$[esp+esi+140]
    add ecx, edi
    add DWORD PTR _a$[esp+esi+140], ecx
    call func
• Aliasing?
• Side effects?
• Dependence?
• Exceptions?

How Do We Get This?
From:
    void f(float* a, float* b) {
        for(int i=0; i<n; ++i) {
            a[i] = b[i] + c;
            func();
        }
    }
To:
    for(int i=0; i<n; i+=4) {
        a[i:i+3] = b[i:i+3] + c;
        for(int j=0; j<4; ++j)
            func();
    }
Need a helping hand from the programmer!
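The a[i:i+3] notation is not C++. Written out in plain C++, the transformed loop might look like the sketch below, which assumes a strip width of 4, passes n, c and func as parameters (globals on the slide), and adds a remainder loop for when n is not a multiple of 4. The function name add_scalar_strip_mined is hypothetical.

```cpp
#include <cassert>

// Strip-mined form of: for each i, a[i] = b[i] + c; func();
// Moving the four func() calls after the four stores is only valid if func
// neither reads nor writes a or b; that is the programmer's promise.
template <class F>
void add_scalar_strip_mined(float* a, const float* b, int n, float c, F func) {
    assert(n >= 0);
    const int lanes = 4;
    int i = 0;
    for (; i + lanes <= n; i += lanes) {
        // Stands in for the single vector statement a[i:i+3] = b[i:i+3] + c.
        for (int j = 0; j < lanes; ++j) a[i + j] = b[i + j] + c;
        for (int j = 0; j < lanes; ++j) func();
    }
    // Remainder: the last n % 4 elements are handled one at a time.
    for (; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}
```

The inner loop over j is the part a compiler could turn into one vector add once it knows the reordering is allowed.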

Vectorization Hazard: Locks
Consider: f takes a lock, g releases the lock:
    for(int i=0; i<n; ++i) {
        lock.enter();
        a[i] = b[i] + c;
        lock.release();
    }
Vectorized?
    for(int i=0; i<n; i+=4) {
        for(int j=0; j<4; ++j)
            lock.enter();
        a[i:i+3] = b[i:i+3] + c;
        for(int j=0; j<4; ++j)
            lock.release();
    }
This transformation is not safe!

But Wait, There Is One Little Problem…
Index-based algorithm:
    void f(float* a, float* b) {
        for(int i=0; i<n; ++i) {
            // OK:
            a[i] = b[i] + c;
            func();
        }
    }
Element-based algorithm:
    void f(float* a, float* b) {
        std::for_each(a, b, [&](float f) {
            // Oops, no ‘i’:
            a[i] = b[i] + c;
            func();
        });
    }

Vector Loop with Parallel STL
    void f(float* a, float* b) {
        integer_iterator begin {0}; // almost, see N3976
        integer_iterator end {b-a};
        std::for_each( std::par_vec, begin, end, [&](int i) {
            a[i] = b[i] + c;
            func();
        });
    }
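The slide leaves integer_iterator undefined and points to N3976. A minimal sketch of the idea, assuming an iterator whose "element" is simply its current index, is shown below. The type index_iterator_sketch and the function add_scalar_by_index are hypothetical and do not reproduce the N3976 interface. The example takes an explicit element count n and uses the sequential std::for_each, since parallel overloads place stricter requirements on iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical iterator over the integers: dereferencing yields the index.
struct index_iterator_sketch {
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    int value;

    int operator*() const { return value; }
    index_iterator_sketch& operator++() { ++value; return *this; }
    index_iterator_sketch operator++(int) { auto old = *this; ++value; return old; }
    friend bool operator==(index_iterator_sketch x, index_iterator_sketch y) {
        return x.value == y.value;
    }
    friend bool operator!=(index_iterator_sketch x, index_iterator_sketch y) {
        return x.value != y.value;
    }
};

// Element-based algorithm that still sees the index i.
void add_scalar_by_index(float* a, const float* b, int n, float c) {
    assert(n >= 0);
    index_iterator_sketch first{0};
    index_iterator_sketch last{n};
    std::for_each(first, last, [&](int i) { a[i] = b[i] + c; });
}
```

Iterating over indices instead of elements gives the loop body its ‘i’ back, which is what the index-based loop had and the element-based std::for_each lacked.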

Parallelization vs. Vectorization
Parallelization:
• Threads
• Stack
• Good for divergent code
• Relatively heavy-weight
Vectorization:
• Vector lanes
• No stack
• Lock-step execution
• Very light-weight

When To Vectorize
std::par:
• No race conditions
• No aliasing
std::par_vec:
• Same as std::par, plus:
• No exceptions
• No locks
• No/little divergence

References • N3991: Task Region • N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing • N3724: A Parallel Algorithms Library • N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism • N3976: Multidimensional bounds, index and array_view • parallelstl.codeplex.com
