Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg, arturl@microsoft.com, Visual C++ Team, Microsoft. September 17, 2014.


Presentation Transcript


  1. Parallelism in the Standard C++: What to Expect in C++ 17 Artur Laksberg arturl@microsoft.com Visual C++ Team, Microsoft September 17, 2014

  2. Agenda • Parallel Fundamentals • Task regions • Parallel Algorithms • Parallelization • Vectorization

  3. Part 1: The Fundamentals

  4. Renderscript OpenMP CUDA C++ AMP PPL TBB MPI OpenACC OpenCL Cilk Plus GCD

  5. Parallelism in C++11/14 • Fundamentals: • Memory model • Atomics • Basics: • thread • mutex • condition_variable • async • future
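
As a rough illustration of the C++11/14 basics listed on this slide, the following sketch uses `std::async` and `std::future` to sum a vector in two halves. The function name `parallel_sum` is made up for this example and does not come from the talk.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums a vector by handing the first half to std::async and
// summing the second half on the calling thread.
long long parallel_sum(const std::vector<int>& v) {
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);

    // std::launch::async requests execution on another thread.
    std::future<long long> first_half =
        std::async(std::launch::async, [&v, mid] {
            return std::accumulate(v.begin(), mid, 0LL);
        });

    const long long second_half = std::accumulate(mid, v.end(), 0LL);

    // get() blocks until the asynchronous half is done and would
    // rethrow any exception raised inside the task.
    return first_half.get() + second_half;
}
```

The future acts as the join point: the caller cannot obtain the result without waiting for the forked work.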

  6. Quicksort: Serial void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); quicksort(v, start, pivot - 1); quicksort(v, pivot + 1, end); } }

  7. Quicksort: Use Threads void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); std::thread t1([&] { quicksort(v, start, pivot - 1); }); std::thread t2([&] { quicksort(v, pivot + 1, end); }); t1.join(); t2.join(); } } Problem 1: expensive. Problem 2: Fork-join not enforced. Problem 3: Exceptions??

  8. Andrzej Krzemieński: “Do not use naked threads in the program: use RAII-like wrappers instead”
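
One possible reading of the "RAII-like wrapper" advice is a small class that joins its thread in the destructor, so the join happens even when the scope is left through an exception. The class `joining_thread` and the helper `run_two_tasks` below are hypothetical names for this sketch; they are not taken from the talk or from the C++11/14 standard library.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: joins the owned thread on destruction.
class joining_thread {
public:
    explicit joining_thread(std::thread t) : t_(std::move(t)) {}
    joining_thread(const joining_thread&) = delete;
    joining_thread& operator=(const joining_thread&) = delete;
    ~joining_thread() {
        if (t_.joinable()) t_.join();
    }

private:
    std::thread t_;
};

// Forks two threads that each bump a shared counter, then joins both.
int run_two_tasks() {
    std::atomic<int> counter{0};
    {
        joining_thread a{std::thread([&counter] { ++counter; })};
        joining_thread b{std::thread([&counter] { ++counter; })};
        // Both destructors run at the closing brace and join the threads,
        // whether the scope exits normally or by an exception.
    }
    return counter.load();
}
```

This addresses the "fork-join not enforced" problem from the previous slide, but not the cost of creating a thread per task.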

  9. Quicksort: Fork-Join Parallelism void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); quicksort(v, start, pivot - 1); quicksort(v, pivot + 1, end); } } (Slide labels: the body of the if is the parallel region; each recursive quicksort call is a task.)

  10. Quicksort: Using Task Regions (N3832) void quicksort(int *v, int start, int end) { if (start < end) { task_region([&] (auto& r) { int pivot = partition(v, start, end); r.run([&] { quicksort(v, start, pivot - 1); }); r.run([&] { quicksort(v, pivot + 1, end); }); }); } } (Slide labels: the task_region call is the parallel region; each r.run call is a task.)
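
`task_region` from N3832 is a proposal and is not available in standard library implementations, so the sketch below only imitates its fork-join shape with `std::async`. The names `simple_region`, `simple_task_region` and `partition_range`, the Lomuto partition, and the `depth` cutoff are all assumptions made for this example; a real task region would be scheduled by a work-stealing runtime rather than one thread per task.

```cpp
#include <algorithm>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the task_region handle:
// run() forks a task, wait() joins every task forked so far.
class simple_region {
public:
    template <typename F>
    void run(F&& f) {
        tasks_.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }
    void wait() {
        for (auto& t : tasks_) t.get();  // get() rethrows a task's exception
        tasks_.clear();
    }
    ~simple_region() {
        for (auto& t : tasks_)
            if (t.valid()) t.wait();  // never leave a task unjoined
    }

private:
    std::vector<std::future<void>> tasks_;
};

// Runs body(region) and joins all forked tasks before returning.
template <typename Body>
void simple_task_region(Body&& body) {
    simple_region r;
    body(r);
    r.wait();
}

// Lomuto partition of the inclusive range [start, end]; returns the pivot index.
int partition_range(int* v, int start, int end) {
    const int pivot = v[end];
    int i = start;
    for (int j = start; j < end; ++j)
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    std::swap(v[i], v[end]);
    return i;
}

// depth limits how many levels of the recursion fork new tasks.
void quicksort(int* v, int start, int end, int depth = 3) {
    if (start >= end) return;
    const int pivot = partition_range(v, start, end);
    if (depth <= 0) {  // deep in the recursion: stay serial
        quicksort(v, start, pivot - 1, 0);
        quicksort(v, pivot + 1, end, 0);
        return;
    }
    simple_task_region([=](simple_region& r) {
        r.run([=] { quicksort(v, start, pivot - 1, depth - 1); });
        r.run([=] { quicksort(v, pivot + 1, end, depth - 1); });
    });
}
```

The cutoff exists only because this imitation pays for a thread per task; the point of the proposal is that spawning a task should be cheap enough not to need one.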

  11. Under The Hood…

  12. Work Stealing Scheduling [Diagram: proc 1, proc 2, proc 3, proc 4]

  13. Work Stealing Scheduling [Diagram: proc 1, proc 2, proc 3, proc 4; labels: Old items, New items]

  14. Work Stealing Scheduling [Diagram: proc 1, proc 2, proc 3, proc 4; labels: Old items, New items]

  15. Work Stealing Scheduling [Diagram: proc 1, proc 2, proc 3, proc 4; labels: Old items, New items]

  16. Work Stealing Scheduling [Diagram: proc 1, proc 2, proc 3, proc 4; labels: Old items, New items, “Thief”]
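
The "old items / new items / thief" labels on these slides correspond to the usual work-stealing arrangement: each processor owns a double-ended queue, works on its newest items, and an idle processor steals the oldest item from someone else. The sketch below shows only that queue discipline. The class `work_queue` is hypothetical, it stores plain `int` items for brevity, and it uses a mutex where production schedulers typically use lock-free deques.

```cpp
#include <deque>
#include <mutex>

// Minimal mutex-protected work queue owned by one worker.
class work_queue {
public:
    // The owner adds a newly spawned item at the "new" end.
    void push(int item) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(item);
    }

    // The owner takes its newest item (LIFO order).
    bool pop(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = items_.back();
        items_.pop_back();
        return true;
    }

    // A thief takes the oldest item from the opposite end.
    bool steal(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::deque<int> items_;
};
```

Because owner and thief operate on opposite ends, they rarely compete for the same item.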

  17. Fork-Join Parallelism and Work Stealing e(); task_region([] (auto& r) { r.run(f); g(); }); h(); Q1: What thread runs f? Q2: What thread runs g? Q3: What thread runs h?
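
The answers to these questions depend on the scheduler. As one data point, the sketch below emulates the fork with `std::async(std::launch::async, ...)`, where the forked function runs on another thread and the code after the fork stays on the forking thread. The struct `thread_report` and the function `who_runs_what` are invented for this example, and a work-stealing runtime may answer the same questions differently.

```cpp
#include <future>
#include <thread>

// Records which thread executed each part of the fork-join region.
struct thread_report {
    std::thread::id caller;
    std::thread::id f_thread;
    std::thread::id g_thread;
};

thread_report who_runs_what() {
    thread_report r;
    r.caller = std::this_thread::get_id();

    // Fork: "f" is launched asynchronously and reports its own thread id.
    std::future<std::thread::id> f = std::async(
        std::launch::async, [] { return std::this_thread::get_id(); });

    // "g" runs inline on the forking thread in this emulation.
    r.g_thread = std::this_thread::get_id();

    // Join: wait for "f" before continuing, as a task region would.
    r.f_thread = f.get();
    return r;
}
```

In this emulation the code after the join (the `h()` of the slide) always continues on the calling thread, which matches the "stalling" choice rather than the "greedy" one.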

  18. Work Stealing Design Choices • What Thread Executes After a Spawn? • Child Stealing • Continuation (parent) Stealing • What Thread Executes After a Join? • Stalling: initiating thread waits • Greedy: the last thread to reach join continues task_region([] (auto& r) { for(int i=0; i<n; ++i) r.run(f); });

  19. Part 2: The Algorithms

  20. Alex Stepanov: Start With The Algorithms

  21. Inspiration Performing Parallel Operations On Containers • Intel • Threading Building Blocks • Microsoft • Parallel Patterns Library, C++ AMP • Nvidia • Thrust

  22. Parallel STL • Just like STL, only parallel… • Can be faster • If you know what you’re doing • Two Execution Policies: • std::par • std::par_vec

  23. Parallelization: What’s the Big Deal? • Why not already parallel? std::sort(begin, end, [](int a, int b) { return a < b; }); • User-provided closures must be thread safe: int comparisons = 0; std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; }); • But also special-member functions, std::swap etc.
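
To make the thread-safety requirement concrete, the sketch below counts comparisons with `std::atomic<int>` so that the closure can be called from two threads at once. The function `sort_and_count` and the two-halves-plus-merge strategy are assumptions for illustration only; they stand in for whatever a parallel `sort` implementation would do internally.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Sorts v with two concurrent std::sort calls on disjoint halves and
// returns the number of comparisons made by those two calls.
int sort_and_count(std::vector<int>& v) {
    // A plain int incremented from both threads would be a data race.
    std::atomic<int> comparisons{0};
    auto counting_less = [&comparisons](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    };

    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);

    // The left half is sorted on another thread, the right half on this one.
    auto left = std::async(std::launch::async, [&] {
        std::sort(v.begin(), mid, counting_less);
    });
    std::sort(mid, v.end(), counting_less);
    left.get();

    // Serial merge of the two sorted halves (its comparisons are not counted).
    std::inplace_merge(v.begin(), mid, v.end());
    return comparisons.load();
}
```

The atomic counter fixes the race on `comparisons`, but the element type's copy, move and swap operations must be safe to call concurrently as well, which is the slide's last bullet.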

  24. It’s a Contract • What the user can do • What the implementer can do • Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log²(n)), what about parallel sort? • What is a valid implementation? (see next slide)

  25. Chaos Sort template<typename Iterator, typename Compare> void chaos_sort( Iterator first, Iterator last, Compare comp ) { auto n = last-first; std::vector<char> c(n); for(;;) { bool flag = false; for( size_t i=1; i<n; ++i ) { c[i] = comp(first[i],first[i-1]); flag |= c[i]; } if( !flag ) break; for( size_t i=1; i<n; ++i ) if( c[i] ) std::swap( first[i-1], first[i] ); } }

  26. Execution Policies • Built-in Execution Policies: extern const sequential_execution_policy seq; extern const parallel_execution_policy par; extern const parallel_vector_execution_policy par_vec; • Dynamic Execution Policy: class execution_policy { public: // ... const type_info& target_type() const; template<class T> T *target(); template<class T> const T *target() const; };

  27. Using Execution Policy To Write Parallel Code std::vector<int> vec = ... // standard sequential sort std::sort(vec.begin(), vec.end()); using namespace std::experimental::parallel; // explicitly sequential sort sort(seq, vec.begin(), vec.end()); // permitting parallel execution sort(par, vec.begin(), vec.end()); // permitting vectorization as well sort(par_vec, vec.begin(), vec.end());

  28. Picking Execution Policy Dynamically size_t threshold = ... execution_policy exec = seq; if(vec.size() > threshold) { exec = par; } sort(exec, vec.begin(), vec.end());
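
The dynamic `execution_policy` class is essentially a type-erased holder that remembers which concrete policy it was given. The sketch below imitates that idea with `std::type_info`. The names `seq_policy`, `par_policy`, `any_policy`, `holds` and `choose_policy` are hypothetical and much simpler than the class in the proposal, which also exposes `target<T>()`.

```cpp
#include <cstddef>
#include <typeinfo>

// Hypothetical tag types standing in for the built-in policies.
struct seq_policy {};
struct par_policy {};

// Hypothetical type-erased holder: remembers only which policy type it was given.
class any_policy {
public:
    template <class T>
    any_policy(const T&) : type_(&typeid(T)) {}

    const std::type_info& target_type() const { return *type_; }

    template <class T>
    bool holds() const {
        return *type_ == typeid(T);
    }

private:
    const std::type_info* type_;
};

// Picks a policy at run time, following the threshold test on the slide.
any_policy choose_policy(std::size_t size, std::size_t threshold) {
    any_policy exec = seq_policy{};
    if (size > threshold) {
        exec = par_policy{};
    }
    return exec;
}
```

An algorithm taking such a holder would inspect `target_type()` once and dispatch to its sequential or parallel implementation.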

  29. Exception Handling • In C++ philosophy, no exception is silently ignored • Exception list: container of exception_ptr objects try { r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0); } catch(const exception_list& list) { for(auto& exptr : list) { // process exception pointer exptr } }
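
The idea behind an exception list can be sketched with standard C++11 facilities: run several tasks, capture whatever each one throws as an `std::exception_ptr`, and hand the whole collection to the caller. The functions `run_all` and `messages` and the use of a plain `std::vector<std::exception_ptr>` are assumptions for this example; the proposal's `exception_list` type is not used here.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>

// Runs one task per input on its own thread and gathers every exception
// instead of reporting only the first one.
std::vector<std::exception_ptr> run_all(const std::vector<int>& inputs) {
    std::vector<std::future<void>> tasks;
    for (int x : inputs) {
        tasks.push_back(std::async(std::launch::async, [x] {
            if (x < 0)
                throw std::invalid_argument("negative input: " + std::to_string(x));
        }));
    }

    std::vector<std::exception_ptr> errors;
    for (auto& t : tasks) {
        try {
            t.get();  // rethrows the task's exception, if any
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}

// Processes the collected pointers by rethrowing each one to inspect it.
std::vector<std::string> messages(const std::vector<std::exception_ptr>& errors) {
    std::vector<std::string> out;
    for (const auto& e : errors) {
        try {
            std::rethrow_exception(e);
        } catch (const std::exception& ex) {
            out.push_back(ex.what());
        }
    }
    return out;
}
```

Rethrowing each pointer inside its own `try` block is the usual way to recover the concrete exception type.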

  30. Vectorization: What’s the Big Deal? int a[n] = ...; int b[n] = ...; for(int i=0; i<n; ++i) { a[i] = b[i] + c; } movdqu xmm1, XMMWORD PTR _b$[esp+eax+132] movdqu xmm0, XMMWORD PTR _a$[esp+eax+132] paddd xmm1, xmm2 paddd xmm1, xmm0 movdqu XMMWORD PTR _a$[esp+eax+132], xmm1 (movdqu: Move Unaligned Double Quadword) a[i:i+3] = b[i:i+3] + c;

  31. Vector Lane is not a Thread! • Taking locks • Thread with thread_id x takes a lock… • Then another “thread” with the same thread_id enters the lock… • Deadlock!!! • Exceptions • Can we unwind 1/4th of the stack?

  32. Vectorization: Not So Easy Any More… void f(int* a, int* b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); } } mov ecx, DWORD PTR _b$[esp+esi+140] add ecx, edi add DWORD PTR _a$[esp+esi+140], ecx call func Aliasing? Side effects? Dependence? Exceptions?

  33. How Do We Get This? void f(float* a, float*b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); } } for(int i=0; i<n; i+=4) { a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) func(); } Need a helping hand from the programmer!
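
The transformation on this slide can be written out in plain C++ by processing the arrays in strips of four elements, with a tail loop for lengths that are not a multiple of four. This is a sketch only: `add_scalar`, `add_scalar_strips`, the explicit `n` and `c` parameters and the `calls` counter (which stands in for `func()`) are assumptions, and whether a compiler turns the inner loop into vector instructions depends on the compiler and its flags.

```cpp
#include <cstddef>

// Serial reference version of the loop body.
void add_scalar(float* a, const float* b, std::size_t n, float c) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c;
}

// Strip-mined version: blocks of four "lanes", then a scalar remainder.
void add_scalar_strips(float* a, const float* b, std::size_t n, float c,
                       int& calls) {
    constexpr std::size_t width = 4;
    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        // Corresponds to a[i:i+3] = b[i:i+3] + c on the slide.
        for (std::size_t lane = 0; lane < width; ++lane)
            a[i + lane] = b[i + lane] + c;
        // The four func() calls now happen after all four stores.
        for (std::size_t lane = 0; lane < width; ++lane) ++calls;
    }
    // Leftover elements when n is not a multiple of four.
    for (; i < n; ++i) {
        a[i] = b[i] + c;
        ++calls;
    }
}
```

Note the reordering: in the strip-mined loop the calls for elements `i` to `i+3` run only after all four additions, which is valid only if `func()` does not depend on, or interfere with, the individual stores. A compiler cannot generally prove that, hence the need for the programmer's help.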

  34. Vectorization Hazard: Locks Consider: f takes a lock, g releases the lock: ? for(int i=0; i<n; ++i) { lock.enter(); a[i] = b[i] + c; lock.release(); } for(int i=0; i<n; i+=4) { for(int j=0; j<4; ++j) lock.enter(); a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) lock.release(); } This transformation is not safe!

  35. But Wait, There Is One Little Problem… Index-based algorithm: void f(float* a, float* b) { for(int i=0; i<n; ++i) { // OK: a[i] = b[i] + c; func(); } } Element-based algorithm: void f(float* a, float* b) { std::for_each(a, b, [&](float f) { // Oops, no ‘i’: a[i] = b[i] + c; func(); }); }

  36. Vector Loop with Parallel STL void f(float* a, float* b) { integer_iterator begin {0}; // almost, see N3976 integer_iterator end {b-a}; std::for_each( std::par_vec, begin, end, [&](int i) { a[i] = b[i] + c; func(); }); }
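
The trick on this slide is to iterate over indices instead of elements, so the closure receives `i`. A minimal counting iterator is enough to try this with the ordinary sequential `std::for_each`. The class `index_iterator` and the function `add_scalar_indexed` below are hypothetical; they are not the `integer_iterator` of the slide or anything from N3976, and no execution policy is involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical counting iterator: dereferencing yields the current index.
class index_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = std::ptrdiff_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::ptrdiff_t*;
    using reference = std::ptrdiff_t;

    explicit index_iterator(std::ptrdiff_t i = 0) : i_(i) {}

    std::ptrdiff_t operator*() const { return i_; }
    index_iterator& operator++() { ++i_; return *this; }
    index_iterator operator++(int) { index_iterator old = *this; ++i_; return old; }
    bool operator==(const index_iterator& o) const { return i_ == o.i_; }
    bool operator!=(const index_iterator& o) const { return i_ != o.i_; }

private:
    std::ptrdiff_t i_;
};

// Element-wise a[i] = b[i] + c, written as for_each over the index range.
void add_scalar_indexed(std::vector<float>& a, const std::vector<float>& b, float c) {
    const auto n = static_cast<std::ptrdiff_t>(std::min(a.size(), b.size()));
    std::for_each(index_iterator{0}, index_iterator{n}, [&](std::ptrdiff_t i) {
        const auto k = static_cast<std::size_t>(i);
        a[k] = b[k] + c;
    });
}
```

With a policy-taking `for_each`, the same index range would let the implementation decide how to split or vectorize the iterations.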

  37. Parallelization vs. Vectorization Parallelization: • Threads • Stack • Good for divergent code • Relatively heavy-weight Vectorization: • Vector Lanes • No stack • Lock-step execution • Very light-weight

  38. When To Vectorize std::par: • No race conditions • No aliasing std::par_vec: Same as std::par, plus: • No Exceptions • No Locks • No/Little Divergence

  39. References • N3991: Task Region • N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing • N3724: A Parallel Algorithms Library • N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism • N3976: Multidimensional bounds, index and array_view • parallelstl.codeplex.com
