
An Introduction to Parallel Programming


  1. An Introduction to Parallel Programming. Ing. Andrea Marongiu (a.marongiu@unibo.it)

  2. The Multicore Revolution is Here!
  • More instruction-level parallelism is hard to find
  • Very complex designs are needed for small gains
  • Thread-level parallelism appears alive and well
  • Clock frequency scaling is slowing drastically
  • Too much power and heat when pushing the envelope
  • Cannot communicate across the chip fast enough
  • Better to design small local units with short paths
  • Effective use of billions of transistors
  • Easier to reuse a basic unit many times
  • Potential for very easy scaling
  • Just keep adding processors/cores for higher (peak) performance

  3. Vocabulary in the Multi Era
  • AMP, Asymmetric MP: each processor has local memory; tasks are statically allocated to one processor
  • SMP, Shared-Memory MP: processors share memory; tasks are dynamically scheduled to any processor

  4. Vocabulary in the Multi Era
  • Heterogeneous: specialization among processors, often with different instruction sets; usually an AMP design
  • Homogeneous: all processors have the same instruction set and can run any task; usually an SMP design

  5. Future Embedded Systems

  6. The First Software Crisis
  • 60s and 70s:
  • PROBLEM: assembly-language programming
  • Need to get abstraction and portability without losing performance
  • SOLUTION: high-level languages (Fortran and C)
  • Provided a "common machine language" for uniprocessors

  7. The Second Software Crisis
  • 80s and 90s:
  • PROBLEM: inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  • Need composability, malleability and maintainability
  • SOLUTION: Object-Oriented Programming (C++ and Java)
  • Better tools and software-engineering methodology (design patterns, specification, testing)

  8. The Third Software Crisis
  • Today:
  • PROBLEM: solid boundary between hardware and software
  • High-level languages abstract away the hardware
  • Sequential performance is left behind by Moore's Law
  • SOLUTION: What's under the hood?
  • Language features for architectural awareness

  9. The Software Becomes the Problem, AGAIN
  • Parallelism is required to gain performance
  • Parallel hardware is "easy" to design
  • Parallel software is (very) hard to write
  • Fundamentally hard to grasp true concurrency
  • Especially in complex software environments
  • Existing software assumes a single processor
  • Might break in new and interesting ways
  • Multitasking is no guarantee of running on a multiprocessor

  10. Parallel Programming Principles • Coverage (Amdahl's Law) • Communication/Synchronization • Granularity • Load Balance • Locality

  11. Coverage • Can we use more, less powerful (and less power-hungry) cores to achieve the same performance?

  12. Coverage
  • Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
  • Example: Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67

  13. Amdahl's Law
  • p = fraction of work that can be parallelized
  • n = number of processors
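The formula itself does not survive in this transcript; with the definitions above, Amdahl's Law is conventionally written as

    \[
      \text{Speedup}(p, n) = \frac{1}{(1 - p) + \frac{p}{n}}
    \]

Letting n grow without bound leaves only the serial term, which gives the 1/(1 - p) limit quoted on the next slide.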

  14. Implications of Amdahl's Law
  • Speedup tends to 1/(1 - p) as the number of processors tends to infinity; for example, with p = 0.9 the speedup can never exceed 10, however many cores are added
  • Parallel programming is worthwhile when programs have a lot of work that is parallel in nature

  15. Overhead of Parallelism
  • Given enough parallel work, this is the biggest barrier to getting the desired speedup
  • Parallelism overheads include:
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
  • Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work (a small timing sketch follows this slide)
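As a rough, hedged illustration of the first overhead in the list (thread start-up cost), the sketch below is not from the slides: it uses plain POSIX threads and times nothing but the creation and joining of an empty thread.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    /* Empty worker: the thread does no useful work, so the measured
       time is (roughly) pure creation + join overhead. */
    static void *worker(void *arg) { (void)arg; return NULL; }

    int main(void)
    {
        enum { ITERS = 1000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_join(t, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("average create+join overhead: %.0f ns\n", ns / ITERS);
        return 0;
    }

If that per-thread cost is comparable to the work each thread performs, parallelization cannot pay off.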

  16. Parallel Programming Principles • Coverage (Amdahl's Law) • Communication/Synchronization • Granularity • Load Balance • Locality

  17. Communication/Synchronization
  • Only a few programs are "embarrassingly" parallel
  • Programs have sequential parts and parallel parts
  • Need to orchestrate parallel execution among processors
  • Synchronize threads to make sure dependencies in the program are preserved
  • Communicate results among threads to ensure a consistent view of the data being processed

  18. Communication/Synchronization
  • Shared memory
  • Communication is implicit: one copy of the data is shared among many threads
  • Atomicity, locking and synchronization are essential for correctness
  • Synchronization is typically in the form of a global barrier (a minimal example follows this slide)
  • Distributed memory
  • Communication is explicit, through messages
  • Cores access local memory
  • Data distribution and communication orchestration are essential for performance
  • Synchronization is implicit in the messages
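A hedged sketch of the shared-memory style, not taken from the slides (the array, thread count and worker are made up for illustration): each thread writes its own chunk of a shared array, and a global barrier guarantees every chunk is written before any thread reads the combined result.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1024

    static int data[N];                 /* one copy of the data, shared by all threads */
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        int chunk = N / NTHREADS;

        /* Each thread writes its own chunk: communication is implicit. */
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            data[i] = i;

        /* Global barrier: nobody proceeds until every chunk is written. */
        pthread_barrier_wait(&barrier);

        /* Now any thread can safely read data written by the others. */
        if (id == 0) {
            long sum = 0;
            for (int i = 0; i < N; i++)
                sum += data[i];
            printf("sum = %ld\n", sum);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }

In the distributed-memory style the same exchange would be expressed as explicit messages (see slide 34).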

  19. Parallel Programming Principles • Coverage (Amdahl's Law) • Communication/Synchronization • Granularity • Load Balance • Locality

  20. Granularity
  • Granularity is a qualitative measure of the ratio of computation to communication
  • Computation stages are typically separated from periods of communication by synchronization events

  21. Granularity
  • Fine-grain parallelism
  • Low computation-to-communication ratio
  • Small amounts of computational work between communication stages
  • Less opportunity for performance enhancement
  • High communication overhead
  • Coarse-grain parallelism
  • High computation-to-communication ratio
  • Large amounts of computational work between communication events
  • More opportunity for performance increase
  • Harder to load balance efficiently (both styles are contrasted in the sketch below)
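A hedged sketch of the contrast, not from the slides (the array and the two worker variants are invented for illustration): both versions sum an array with several threads, but the fine-grain version synchronizes on every element while the coarse-grain version accumulates privately and synchronizes once per thread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N (1 << 20)

    static double a[N];
    static double sum = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine-grain: lock around every single addition --
       very low computation-to-communication ratio. */
    static void *sum_fine(void *arg)
    {
        long id = (long)arg, chunk = N / NTHREADS;
        for (long i = id * chunk; i < (id + 1) * chunk; i++) {
            pthread_mutex_lock(&lock);
            sum += a[i];
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Coarse-grain: compute a private partial sum, lock once at the end. */
    static void *sum_coarse(void *arg)
    {
        long id = (long)arg, chunk = N / NTHREADS;
        double local = 0.0;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            local += a[i];
        pthread_mutex_lock(&lock);
        sum += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        for (long i = 0; i < N; i++)
            a[i] = 1.0;
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_coarse /* or sum_fine */, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("sum = %.0f\n", sum);
        return 0;
    }

The computation per element is identical; only how often the threads communicate changes, which is exactly what granularity measures.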

  22. Parallel Programming Principles • Coverage (Amdahl's Law) • Communication/Synchronization • Granularity • Load Balance • Locality

  23. The Load Balancing Problem
  • Processors that finish early have to wait for the processor with the largest amount of work to complete
  • Leads to idle time and lowers utilization
  • Particularly urgent with barrier synchronization
  • The slowest core dictates the overall execution time
  [Figure: unbalanced vs. balanced workloads across Core 1 to Core 4]

  24. Static Load Balancing
  • The programmer makes the decisions and assigns a fixed amount of work to each processing core a priori (a minimal partitioning sketch follows this slide)
  • Works well for homogeneous multicores
  • All cores are the same
  • Each core has an equal amount of work
  • Not so well for heterogeneous multicores
  • Some cores may be faster than others
  • Work distribution becomes uneven
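A hedged sketch of a static a-priori partition, not from the slides (N, process_item and the range struct are invented for illustration): each thread is handed a fixed, contiguous block of iterations before execution starts.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000                     /* N need not divide evenly by NTHREADS */

    static int results[N];

    /* Hypothetical per-iteration work. */
    static int process_item(int i) { return i * i; }

    typedef struct { int begin, end; } range_t;

    static void *worker(void *arg)
    {
        range_t *r = (range_t *)arg;
        for (int i = r->begin; i < r->end; i++)
            results[i] = process_item(i);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        range_t r[NTHREADS];

        /* Fixed partition decided up front: core k gets
           iterations [k*N/NTHREADS, (k+1)*N/NTHREADS). */
        for (int k = 0; k < NTHREADS; k++) {
            r[k].begin = k * N / NTHREADS;
            r[k].end   = (k + 1) * N / NTHREADS;
            pthread_create(&t[k], NULL, worker, &r[k]);
        }
        for (int k = 0; k < NTHREADS; k++)
            pthread_join(t[k], NULL);

        printf("results[N-1] = %d\n", results[N - 1]);
        return 0;
    }

If one core is slower or one block happens to be more expensive, the partition cannot adapt, which is exactly the weakness the slide points out for heterogeneous multicores.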

  25. Dynamic Load Balancing
  • The workload is partitioned into small tasks; tasks available for processing are pushed into a work queue (a minimal work-queue sketch follows this slide)
  • When a core finishes its allocated task, it takes further work from the queue; the process continues until all tasks have been assigned to some core for processing
  • Ideal for codes where the work is uneven, and for heterogeneous multicores
  [Figure: Core 1 to Core 4 pulling tasks from a shared work queue]
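A hedged sketch of the idea, not from the slides (the task function and counts are invented; the "queue" is just a shared atomic counter over a fixed task list): whichever thread becomes free claims the next unprocessed task.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NTASKS 1000

    static atomic_int next_task;        /* index of the next unclaimed task */
    static int results[NTASKS];

    /* Hypothetical per-task work; tasks may take very different times. */
    static int run_task(int id) { return id % 7; }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            /* Atomically claim the next task; stop when the queue is empty. */
            int id = atomic_fetch_add(&next_task, 1);
            if (id >= NTASKS)
                break;
            results[id] = run_task(id);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        atomic_store(&next_task, 0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("done, results[0] = %d\n", results[0]);
        return 0;
    }

Faster cores simply end up claiming more tasks, so the load balances itself at the cost of one synchronized queue operation per task.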

  26. Parallel Programming Principles • Coverage (Amdahl's Law) • Communication/Synchronization • Granularity • Load Balance • Locality

  27. Memory Access Latency
  • Uniform Memory Access (UMA) – shared memory
  • Centrally located shared memory
  • All processors are equidistant (equal access times)
  • Non-Uniform Memory Access (NUMA)
  • Shared memory: processors have the same address space; data is directly accessible by all, but the cost depends on the distance
  • Placement of data affects performance
  • Distributed memory: processors have private address spaces; data access is local, but the cost of messages depends on the distance
  • Communication must be efficiently architected

  28. Locality of Memory Accesses (UMA Shared Memory) • Parallel computation is serialized due to memory contention and lack of bandwidth

  29. Locality of Memory Accesses (UMA Shared Memory) • Distribute data to relieve contention and increase effective bandwidth

  30. Locality of Memory Accesses (NUMA Shared Memory)
  • Once parallel tasks have been assigned to different processors...

      int main()
      {
          /* Task 1 */
          for (i = 0; i < n; i++)
              A[i][rand()] = foo();

          /* Task 2 */
          for (j = 0; j < n; j++)
              B[j] = goo();
      }

  [Figure: four CPUs, each with a local SPM (scratchpad memory), connected through an interconnect to an off-chip shared memory]

  31. Locality of Memory Accesses (NUMA Shared Memory)
  • ...the physical placement of the data can have a great impact on performance! (same code as slide 30)
  • Memory reference cost = bus latency + off-chip memory latency (100 cycles)
  [Figure: arrays A and B both placed in the off-chip shared memory]

  32. Locality of Memory Accesses (NUMA Shared Memory) (same code as slide 30)
  • Memory reference cost = bus latency + on-chip memory latency (2-20 cycles)
  [Figure: same system as slide 30, with the data moved into on-chip memory reached through the interconnect]

  33. Locality of Memory Accesses (NUMA Shared Memory) (same code as slide 30)
  • Memory reference cost = local memory latency (1 cycle)
  [Figure: same system as slide 30, with each task's data placed in its own processor's local SPM]

  34. Locality in Communication (Message Passing)
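This slide is image-only in the transcript. As a hedged sketch of the message-passing style described on slide 18 (standard MPI point-to-point calls; the nearest-neighbour ring exchange is an illustration, not the slide's own example), each process keeps its data local and communicates only with its immediate neighbours, so communication stays local and cheap:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process owns its data; communication is explicit.
           Every rank sends a value to its right neighbour and
           receives one from its left neighbour (a ring exchange). */
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        int send_val = rank, recv_val = -1;

        MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,
                     &recv_val, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recv_val, left);

        MPI_Finalize();
        return 0;
    }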
