
Multi-core Programming


Presentation Transcript


  1. Multi-core Programming Threading Concepts

  2. Topics • A Generic Development Cycle • Case Study: Prime Number Generation • Common Performance Issues • Basics of VTune™ Performance Analyzer

  3. What is Parallelism? • Two or more processes or threads execute at the same time • Parallelism for threading architectures • Multiple processes • Communication through Inter-Process Communication (IPC) • Single process, multiple threads • Communication through shared memory

  4. Amdahl’s Law • Describes the upper bound of parallel execution speedup • P = proportion of the computation where parallelization may occur, n = number of processors • Tparallel = [(1-P) + P/n] · Tserial • Speedup = Tserial / Tparallel • Example with P = 0.5: for n = 2, Tparallel = 0.5 + 0.25 and Speedup = 1.0/0.75 = 1.33; for n = ∞, Tparallel = 0.5 + 0.0 and Speedup = 1.0/0.5 = 2.0 • Serial code limits speedup
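A small worked sketch of Amdahl's Law in C (illustrative only; the function name is an assumption, and the P = 0.96 case anticipates the parallel fraction measured later in the deck):

    #include <stdio.h>

    /* Upper bound on speedup per Amdahl's Law:
       Tparallel = ((1 - P) + P / n) * Tserial, Speedup = Tserial / Tparallel */
    static double amdahl_speedup(double P, double n)
    {
        return 1.0 / ((1.0 - P) + P / n);
    }

    int main(void)
    {
        printf("P=0.50, n=2   -> %.2fX\n", amdahl_speedup(0.50, 2));    /* 1.33X */
        printf("P=0.50, n=inf -> %.2fX\n", amdahl_speedup(0.50, 1e9));  /* ~2.00X */
        printf("P=0.96, n=2   -> %.2fX\n", amdahl_speedup(0.96, 2));    /* ~1.92X */
        return 0;
    }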

  5. Processes and Threads • Modern operating systems load programs as processes • Resource holder • Execution • A process starts executing at its entry point (main()) as a thread • Threads can create other threads within the process • Each thread gets its own stack • All threads within a process share the code & data segments
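A minimal sketch of one process spawning threads with the Win32 API (one of the explicit threading models named later in the deck); the worker function and the shared counter are assumptions for illustration:

    #include <windows.h>
    #include <stdio.h>

    static LONG gCounter = 0;              /* shared: lives in the process data segment */

    static DWORD WINAPI Worker(LPVOID arg) /* each thread runs on its own stack */
    {
        int localWork = (int)(INT_PTR)arg; /* local: per-thread stack variable */
        for (int i = 0; i < localWork; i++)
            InterlockedIncrement(&gCounter);
        return 0;
    }

    int main(void)
    {
        HANDLE threads[2];
        for (int t = 0; t < 2; t++)
            threads[t] = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)1000, 0, NULL);
        WaitForMultipleObjects(2, threads, TRUE, INFINITE);
        for (int t = 0; t < 2; t++)
            CloseHandle(threads[t]);
        printf("gCounter = %ld\n", gCounter); /* 2000: both threads updated the shared data */
        return 0;
    }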

  6. Threads – Benefits & Risks • Benefits • Increased performance and better resource utilization • Even on single processor systems - for hiding latency and increasing throughput • IPC through shared memory is more efficient • Risks • Increases complexity of the application • Difficult to debug (data races, deadlocks, etc.)

  7. Commonly Encountered Questions with Threading Applications • Where to thread? • How long would it take to thread? • How much re-design/effort is required? • Is it worth threading a selected region? • What should the expected speedup be? • Will the performance meet expectations? • Will it scale as more threads/data are added? • Which threading model to use?

  8. Prime Number Generation

  bool TestForPrime(int val)
  {
      // let’s start checking from 3
      int limit, factor = 3;
      limit = (long)(sqrtf((float)val)+0.5f);
      while( (factor <= limit) && (val % factor) )
          factor++;
      return (factor > limit);
  }

  void FindPrimes(int start, int end)
  {
      int range = end - start + 1;
      for( int i = start; i <= end; i += 2 )
      {
          if( TestForPrime(i) )
              globalPrimes[gPrimesFound++] = i;
          ShowProgress(i, range);
      }
  }

  (Animation: for each odd candidate i, trial factors 3, 5, 7, … up to √i are tested.)
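A sketch of a complete serial driver for the code above. TestForPrime and FindPrimes are taken from the slide; the globals the deck references (globalPrimes, gPrimesFound, gProgress), the array sizing, the simplified ShowProgress, and the argument handling are assumptions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>
    #include <math.h>

    #define MAX_PRIMES 80000                 /* assumed capacity (~78498 primes below 1e6) */
    static int  globalPrimes[MAX_PRIMES];
    static long gPrimesFound = 0;
    static long gProgress    = 0;

    static void ShowProgress(int val, int range)   /* simplified stub; the real one prints % done */
    {
        gProgress++;
    }

    static bool TestForPrime(int val)              /* from the slide */
    {
        int limit, factor = 3;                     /* let's start checking from 3 */
        limit = (long)(sqrtf((float)val) + 0.5f);
        while ((factor <= limit) && (val % factor))
            factor++;
        return (factor > limit);
    }

    static void FindPrimes(int start, int end)     /* from the slide */
    {
        int range = end - start + 1;
        for (int i = start; i <= end; i += 2) {
            if (TestForPrime(i))
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

    int main(int argc, char *argv[])               /* Usage: ./PrimeSingle 1 1000000 */
    {
        int start = (argc > 2) ? atoi(argv[1]) : 1;
        int end   = (argc > 2) ? atoi(argv[2]) : 1000000;
        if (start % 2 == 0) start++;               /* the loop assumes start is odd */
        FindPrimes(start, end);
        printf("\n%ld primes found between %d and %d\n", gPrimesFound, start, end);
        return 0;
    }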

  9. Development Methodology • Analysis • Find computationally intense code • Design (Introduce Threads) • Determine how to implement threading solution • Debug for correctness • Detect any problems resulting from using threads • Tune for performance • Achieve best parallel performance

  10. Development Cycle • Analysis: VTune™ Performance Analyzer • Design (Introduce Threads): Intel® Performance libraries (IPP and MKL), OpenMP* (Intel® Compiler), Explicit threading (Win32*, Pthreads*) • Debug for correctness: Intel® Thread Checker, Intel Debugger • Tune for performance: Intel® Thread Profiler, VTune™ Performance Analyzer

  11. Analysis - Sampling • Use VTune Sampling to find hotspots in the application • Identifies the time-consuming regions • Let’s use the project PrimeSingle for analysis • PrimeSingle <start> <end> • Usage: ./PrimeSingle 1 1000000

  bool TestForPrime(int val)
  {
      // let’s start checking from 3
      int limit, factor = 3;
      limit = (long)(sqrtf((float)val)+0.5f);
      while( (factor <= limit) && (val % factor) )
          factor++;
      return (factor > limit);
  }

  void FindPrimes(int start, int end)
  {
      // start is always odd
      int range = end - start + 1;
      for( int i = start; i <= end; i += 2 )
      {
          if( TestForPrime(i) )
              globalPrimes[gPrimesFound++] = i;
          ShowProgress(i, range);
      }
  }

  12. Analysis - Call Graph • Used to find the proper level in the call-tree to thread • This is the level in the call tree where we need to thread

  13. Analysis • Baseline measurement • Where to thread? FindPrimes() • Is it worth threading a selected region? • Appears to have minimal dependencies • Appears to be data-parallel • Consumes over 95% of the run time

  14. Foster’s Design Methodology • From Designing and Building Parallel Programs by Ian Foster • Four Steps: • Partitioning • Dividing computation and data • Communication • Sharing data between computations • Agglomeration • Grouping tasks to improve performance • Mapping • Assigning tasks to processors/threads

  15. Designing Threaded Programs • Partition • Divide the problem into tasks • Communicate • Determine the amount and pattern of communication • Agglomerate • Combine tasks • Map • Assign agglomerated tasks to created threads • (Diagram: The Problem → Initial tasks → Communication → Combined tasks → Final program)

  16. Parallel Programming Models • Functional Decomposition • Task parallelism • Divide the computation, then associate the data • Independent tasks of the same problem • Data Decomposition • Same operation performed on different data • Divide data into pieces, then associate computation
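A brief sketch contrasting the two models with OpenMP; the array, its size, and the two task functions are assumptions for illustration:

    #include <omp.h>
    #define N 1000
    double data[N];

    void computeStatistics(double *d, int n) { /* independent task A (assumed) */ }
    void renderHistogram(double *d, int n)   { /* independent task B (assumed) */ }

    void functional_decomposition(void)
    {
        /* Task parallelism: different, independent computations run concurrently */
        #pragma omp parallel sections
        {
            #pragma omp section
            computeStatistics(data, N);
            #pragma omp section
            renderHistogram(data, N);
        }
    }

    void data_decomposition(void)
    {
        /* Data parallelism: the same operation applied to different pieces of the data */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] = data[i] * 2.0;
    }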

  17. Decomposition Methods • Functional Decomposition • Focusing on computations can reveal structure in a problem • Example: a climate code with Atmosphere, Land Surface, Hydrology and Ocean Model components (grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC) • Domain Decomposition • Focus on the largest or most frequently accessed data structure • Data Parallelism • Same operation applied to all data

  18. Pipelined Decomposition • Computation done in independent stages • Functional decomposition • Threads are assigned a stage to compute • Automobile assembly line • Data decomposition • Each thread processes all stages of a single instance • One worker builds an entire car

  19. LAME Pipeline Strategy • Frames pass through four stages: Prelude (fetch next frame, frame characterization, set encode parameters), Acoustics (psycho analysis, FFT long/short, filter assemblage), Encoding (apply filtering, noise shaping, quantize & count bits), Other (add frame header, check correctness, write to disk) • (Pipeline diagram: threads T1–T4 each work a stage, so over time frame N’s Acoustics overlaps frame N+1’s Prelude, coordinated by a hierarchical barrier)

  20. Design • What is the expected benefit? • How do you achieve this with the least effort? • How long would it take to thread? • How much re-design/effort is required? • Rapid prototyping with OpenMP • Speedup(2P) = 100/(96/2 + 4) = ~1.92X
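The 1.92X figure is Amdahl's Law applied with the measured parallel fraction P = 0.96 and n = 2 threads; restating the slide's arithmetic:

\[
\text{Speedup}(2P) \;=\; \frac{T_{\text{serial}}}{T_{\text{parallel}}}
\;=\; \frac{1}{(1-P) + P/n}
\;=\; \frac{1}{0.04 + 0.96/2}
\;=\; \frac{100}{4 + 48} \;\approx\; 1.92\times
\]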

  21. OpenMP • Fork-join parallelism: • The master thread spawns a team of threads as needed • Parallelism is added incrementally • The sequential program evolves into a parallel program • (Diagram: master thread forking into parallel regions and joining back)
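A minimal fork-join sketch (illustrative only; any OpenMP-capable compiler should accept it):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("master thread only\n");          /* serial part */

        #pragma omp parallel                      /* fork: a team of threads is created */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                         /* join: implicit barrier, back to the master */

        printf("master thread again\n");          /* serial part resumes */
        return 0;
    }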

  22. Design • OpenMP: create threads here for this parallel region, and divide the iterations of the for loop among them

  #pragma omp parallel for
  for( int i = start; i <= end; i += 2 )
  {
      if( TestForPrime(i) )
          globalPrimes[gPrimesFound++] = i;
      ShowProgress(i, range);
  }
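The pragma only takes effect when OpenMP is enabled at compile time. Hedged examples of typical build lines, using the toolchains the deck mentions (the PrimeOpenMP file name is hypothetical):

    icl /Qopenmp PrimeOpenMP.cpp        (Intel® C++ Compiler on Windows)
    gcc -fopenmp PrimeOpenMP.c -lm      (GCC; -lm links the math library for sqrtf)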

  23. Design • What is the expected benefit? • How do you achieve this with the least effort? • How long would it take to thread? • How much re-design/effort is required? • Is this the best speedup possible? • Speedup of 1.40X (less than 1.92X)

  24. Debugging for Correctness • Is this threaded implementation right? • No! The answers are different each time …

  25. Debugging for Correctness • Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks • (Tool flow: Primes.exe is binary-instrumented, then run under the VTune™ Performance Analyzer / Intel® Thread Checker runtime data collector with instrumented DLLs, producing threadchecker.thr as the result file)

  26. Thread Checker

  27. Debugging for Correctness • How much re-design/effort is required? • How long would it take to thread? • Thread Checker reported only 2 dependencies, so the effort required should be low

  28. Debugging for Correctness • A critical section will be created for this reference (the globalPrimes/gPrimesFound update) • A critical section will be created for both these references (the gProgress update and read inside ShowProgress)

  #pragma omp parallel for
  for( int i = start; i <= end; i += 2 )
  {
      if( TestForPrime(i) )
      {
          #pragma omp critical
          globalPrimes[gPrimesFound++] = i;
      }
      ShowProgress(i, range);
  }

  #pragma omp critical
  {
      gProgress++;
      percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
  }
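Worth noting: unnamed critical sections all share a single global lock, so the two regions above also serialize against each other. OpenMP lets you name them so each uses its own lock; a drop-in sketch (not part of the original deck, lock names are assumptions):

    #pragma omp critical(primes_lock)       /* lock protecting the primes array and count */
    globalPrimes[gPrimesFound++] = i;

    /* ... and inside ShowProgress() ... */
    #pragma omp critical(progress_lock)     /* independent lock for the progress counter */
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }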

  29. Correctness • Correct answer, but performance has slipped to ~1.33X • Is this the best we can expect from this algorithm? No! From Amdahl’s Law, we expect speedup close to 1.9X

  30. Common Performance Issues • Parallel Overhead • Due to thread creation, scheduling, … • Synchronization • Excessive use of global data, contention for the same synchronization object • Load Imbalance • Improper distribution of parallel work • Granularity • Not enough parallel work per thread

  31. Tuning for Performance • Thread Profiler pinpoints performance bottlenecks in threaded applications • (Tool flow: Primes.c is compiled with source instrumentation (/Qopenmp_profile), or Primes.exe is binary-instrumented, then run under the VTune™ Performance Analyzer / Thread Profiler runtime data collector with instrumented DLLs, producing Bistro.tp/guide.gvs as the result file)

  32. Thread Profiler for OpenMP

  33. Thread Profiler for OpenMP • Speedup Graph • Estimates threading speedup and potential speedup • Based on Amdahl’s Law computation • Gives upper and lower bound estimates

  34. Thread Profiler for OpenMP • (Timeline view: serial regions alternating with the parallel region)

  35. Thread Profiler for OpenMP • (Per-thread view: activity of Thread 0 through Thread 3)

  36. Thread Profiler (for Explicit Threads)

  37. Thread Profiler (for Explicit Threads) • Why so many transitions?

  38. Performance • This implementation has implicit synchronization calls • This limits scaling performance due to the resulting context switches • Back to the design stage

  39. Performance • Is that much contention expected? The algorithm has many more updates than the 10 needed for showing progress

  void ShowProgress( int val, int range )
  {
      int percentDone;
      static int lastPercentDone = 0;
      #pragma omp critical
      {
          gProgress++;
          percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
      }
      if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
      {
          printf("\b\b\b\b%3d%%", percentDone);
          lastPercentDone++;
      }
  }

  • This change should fix the contention issue:

  void ShowProgress( int val, int range )
  {
      int percentDone;
      gProgress++;
      percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
      if( percentDone % 10 == 0 )
          printf("\b\b\b\b%3d%%", percentDone);
  }
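Note that dropping the critical section leaves gProgress as an unsynchronized shared update; the later slides switch to InterlockedIncrement for exactly this reason. A hedged, OpenMP-portable equivalent (not from the deck; gProgress is assumed to be a long global):

    void ShowProgress( int val, int range )
    {
        long localProgress, percentDone;
        #pragma omp atomic capture           /* atomically increment gProgress and read the new value */
        localProgress = ++gProgress;
        percentDone = (long)((float)localProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 )
            printf("\b\b\b\b%3ld%%", percentDone);
    }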

  40. Design • Goals • Eliminate the contention due to implicit synchronization • Speedup is 2.32X! Is that right?

  41. Performance • Our original baseline measurement had the “flawed” progress update algorithm • Is this the best we can expect from this algorithm? • Speedup is actually 1.40X (<< 1.9X)!

  42. Performance Re-visited • Still have 62% of execution time in locks and synchronization

  43. Performance Re-visited • Let’s look at the OpenMP locks… The lock is inside a loop

  void FindPrimes(int start, int end)
  {
      // start is always odd
      int range = end - start + 1;
      #pragma omp parallel for
      for( int i = start; i <= end; i += 2 )
      {
          if( TestForPrime(i) )
          {
              #pragma omp critical
              globalPrimes[gPrimesFound++] = i;
          }
          ShowProgress(i, range);
      }
  }

  • Replacing the critical section with an atomic increment:

  void FindPrimes(int start, int end)
  {
      // start is always odd
      int range = end - start + 1;
      #pragma omp parallel for
      for( int i = start; i <= end; i += 2 )
      {
          if( TestForPrime(i) )
              // InterlockedIncrement returns the incremented value,
              // so subtract 1 to keep the same index as gPrimesFound++
              globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
          ShowProgress(i, range);
      }
  }
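InterlockedIncrement is Win32-specific; on other platforms an OpenMP atomic capture gives the same effect and naturally keeps the pre-increment index. A drop-in sketch for the body of the if (not part of the original deck):

    int idx;
    #pragma omp atomic capture      /* atomically reserve the next free slot */
    idx = gPrimesFound++;
    globalPrimes[idx] = i;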

  44. Performance Re-visited • Let’s look at the second lock; this one is also being called within a loop

  void ShowProgress( int val, int range )
  {
      int percentDone;
      static int lastPercentDone = 0;
      #pragma omp critical
      {
          gProgress++;
          percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
      }
      if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
      {
          printf("\b\b\b\b%3d%%", percentDone);
          lastPercentDone++;
      }
  }

  • Replacing the critical section with an atomic increment:

  void ShowProgress( int val, int range )
  {
      long percentDone, localProgress;
      static int lastPercentDone = 0;
      localProgress = InterlockedIncrement(&gProgress);
      percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
      if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
      {
          printf("\b\b\b\b%3ld%%", percentDone);
          lastPercentDone++;
      }
  }

  45. Thread Profiler for OpenMP • With the default static division of 1–1,000,000 into four blocks (up to 250000 / 500000 / 750000 / 1000000), the work is unbalanced: Thread 0 tests ~342 factors per candidate around i = 116747, Thread 1 ~612 around i = 373553, Thread 2 ~789 around i = 623759, and Thread 3 ~934 around i = 873913, so the higher-numbered threads do far more work

  46. Fixing the Load Imbalance • Distribute the work more evenly

  void FindPrimes(int start, int end)
  {
      // start is always odd
      int range = end - start + 1;
      #pragma omp parallel for schedule(static, 8)
      for( int i = start; i <= end; i += 2 )
      {
          if( TestForPrime(i) )
              globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
          ShowProgress(i, range);
      }
  }

  • Speedup achieved is 1.68X
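schedule(static, 8) deals out interleaved chunks of 8 iterations, which spreads the expensive high-value candidates across threads. A dynamic schedule is a hedged alternative worth comparing (a sketch, not measured in the deck):

    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
        // iterations are handed to threads on demand, in chunks of 8,
        // rather than being divided up front
        #pragma omp parallel for schedule(dynamic, 8)
        for( int i = start; i <= end; i += 2 )
        {
            if( TestForPrime(i) )
                globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
            ShowProgress(i, range);
        }
    }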

  47. Final Thread Profiler Run • Speedup achieved is 1.80X

  48. Comparative Analysis • Threading applications require multiple iterations of going through the software development cycle

  49. Threading Methodology: What’s Been Covered • A four-step development cycle for writing threaded code from serial, and the Intel® tools that support each step • Analysis • Design (Introduce Threads) • Debug for correctness • Tune for performance • Threading applications require multiple iterations of the design, debugging and performance tuning steps • Use tools to improve productivity
