An Introduction to Parallel Programming
Ing. Andrea Marongiu (a.marongiu@unibo.it)
The Multicore Revolution is Here! • More instruction-level parallelism is hard to find • Very complex designs needed for small gains • Thread-level parallelism appears alive and well • Clock frequency scaling is slowing drastically • Too much power and heat when pushing the envelope • Cannot communicate across the chip fast enough • Better to design small local units with short paths • Effective use of billions of transistors • Easier to reuse a basic unit many times • Potential for very easy scaling • Just keep adding processors/cores for higher (peak) performance
Vocabulary in the Multi Era • AMP, Asymmetric MP: each processor has local memory; tasks are statically allocated to one processor • SMP, Shared-Memory MP: processors share memory; tasks are dynamically scheduled to any processor
Vocabulary in the Multi Era • Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design. • Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.
The First Software Crisis • ’60s and ’70s: • PROBLEM: Assembly Language Programming • Need to get abstraction and portability without losing performance • SOLUTION: High-level Languages (Fortran and C) • Provided a “common machine language” for uniprocessors
The Second Software Crisis • ’80s and ’90s: • PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers • Need composability, malleability and maintainability • SOLUTION: Object-Oriented Programming (C++ and Java) • Better tools and software engineering methodology (design patterns, specification, testing)
The Third Software Crisis • Today: • PROBLEM: Solid boundary between hardware and software • High-level languages abstract away the hardware • Sequential performance is left behind by Moore’s Law • SOLUTION: What’s under the hood? • Language features for architectural awareness
The Software becomes the Problem, AGAIN • Parallelism is required to gain performance • Parallel hardware is “easy” to design • Parallel software is (very) hard to write • Fundamentally hard to grasp true concurrency • Especially in complex software environments • Existing software assumes a single processor • Might break in new and interesting ways • Multitasking is no guarantee of running well on a multiprocessor
Parallel Programming Principles • Coverage (Amdahl’s Law) • Communication/Synchronization • Granularity • Load Balance • Locality
Coverage • Can we use more, less powerful (and less power-hungry) cores to achieve the same performance?
Coverage • Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used • Speedup = old running time / new running time (e.g., 100 seconds / 60 seconds = 1.67 when an optimization cuts the running time from 100 s to 60 s)
Amdahl’s Law • p = fraction of work that can be parallelized • n = number of processors • Speedup(p, n) = old running time / new running time = 1 / ((1 - p) + p/n)
Implications of Amdahl’s Law • Speedup tends to 1/(1-p) as the number of processors tends to infinity • Parallel programming is worthwhile when programs have a lot of work that is parallel in nature
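A minimal sketch of the formula above, assuming a purely illustrative parallel fraction p = 0.9; it prints how the speedup flattens out near 1/(1-p) as cores are added:

/* Amdahl's Law sketch: speedup(p, n) = 1 / ((1 - p) + p/n).
   The value p = 0.9 is an illustrative assumption, not a measurement. */
#include <stdio.h>

static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    double p = 0.9;                               /* assume 90% of the work parallelizes */
    int cores[] = { 1, 2, 4, 8, 16, 1024 };
    for (int i = 0; i < (int)(sizeof cores / sizeof cores[0]); i++)
        printf("n = %4d  speedup = %.2f\n", cores[i], amdahl_speedup(p, cores[i]));
    /* As n grows, the speedup approaches 1 / (1 - p) = 10 and never exceeds it. */
    return 0;
}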
Overhead of Parallelism • Given enough parallel work, this is the biggest barrier to getting the desired speedup • Parallelism overheads include: • cost of starting a thread or process • cost of communicating shared data • cost of synchronizing • extra (redundant) computation • Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
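To make the first of these costs concrete, here is a small timing sketch (POSIX threads and CLOCK_MONOTONIC are assumed available; the sample size is arbitrary) that measures the average cost of creating and joining an empty thread:

/* Sketch: average cost of starting and joining an empty thread.
   Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *empty_task(void *arg) { return arg; }

int main(void)
{
    enum { SAMPLES = 1000 };                      /* arbitrary sample size */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < SAMPLES; i++) {
        pthread_t t;
        pthread_create(&t, NULL, empty_task, NULL);
        pthread_join(t, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total_us = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
    printf("average create+join cost: %.1f us\n", total_us / SAMPLES);
    /* Work handed to a thread should be large enough to dwarf this cost. */
    return 0;
}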
Parallel Programming Principles • Coverage (Amdahl’s Law) • Communication/Synchronization • Granularity • Load Balance • Locality
Communication/Synchronization • Only a few programs are “embarrassingly” parallel • Programs have sequential parts and parallel parts • Need to orchestrate parallel execution among processors • Synchronize threads to make sure dependencies in the program are preserved • Communicate results among threads to ensure a consistent view of the data being processed
Communication/Synchronization • Shared memory • Communication is implicit: one copy of the data is shared among many threads • Atomicity, locking and synchronization are essential for correctness • Synchronization is typically in the form of a global barrier • Distributed memory • Communication is explicit, through messages • Cores access local memory • Data distribution and communication orchestration are essential for performance • Synchronization is implicit in messages
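A minimal shared-memory sketch of the global barrier mentioned above (POSIX threads; the thread count and the partial-sum phases are illustrative assumptions):

/* Shared-memory synchronization sketch: a computation phase followed by a
   global barrier, after which thread 0 can safely combine the results. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static pthread_barrier_t barrier;
static int partial[NTHREADS];                     /* data shared implicitly */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    partial[id] = id + 1;                         /* computation phase */
    pthread_barrier_wait(&barrier);               /* every thread must reach this point */
    if (id == 0) {                                /* safe: all partial[] entries are written */
        int sum = 0;
        for (int i = 0; i < NTHREADS; i++)
            sum += partial[i];
        printf("sum = %d\n", sum);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int id[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}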
Parallel Programming Principles • Coverage (Amdahl’s Law) • Communication/Synchronization • Granularity • Load Balance • Locality
Granularity • Granularity is a qualitative measure of the ratio of computation to communication • Computation stages are typically separated from periods of communication by synchronization events
Granularity • Fine-grain Parallelism • Low computation to communication ratio • Small amounts of computational work between communication stages • Less opportunity for performance enhancement • High communication overhead • Coarse-grain Parallelism • High computation to communication ratio • Large amounts of computational work between communication events • More opportunity for performance increase • Harder to load balance efficiently
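The trade-off can be sketched with a parallel array sum (POSIX threads assumed; array size and thread count are illustrative): the fine-grain version synchronizes on every element, while the coarse-grain version does a large block of private work and synchronizes once per thread.

/* Granularity sketch: same result, very different computation-to-communication ratio. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4
static int data[N];
static long sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *fine_grain(void *arg)                       /* synchronize on every element */
{
    int id = *(int *)arg;
    for (int i = id; i < N; i += NTHREADS) {
        pthread_mutex_lock(&lock);                /* communication dominates computation */
        sum += data[i];
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

void *coarse_grain(void *arg)                     /* synchronize once per thread */
{
    int id = *(int *)arg;
    long local = 0;
    for (int i = id; i < N; i += NTHREADS)
        local += data[i];                         /* large block of private computation */
    pthread_mutex_lock(&lock);
    sum += local;                                 /* a single communication event */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1;
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, coarse_grain, &id[i]);  /* swap in fine_grain to compare */
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %ld (expected %d)\n", sum, N);
    return 0;
}

Replacing coarse_grain with fine_grain turns almost every addition into lock traffic, which is exactly the low computation-to-communication ratio described above.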
Parallel Programming Principles • Coverage (Amdahl’s Law) • Communication/Synchronization • Granularity • Load Balance • Locality
The Load Balancing Problem • Processors that finish early have to wait for the processor with the largest amount of work to complete • Leads to idle time, lowers utilization • Particularly urgent with barrier synchronization • The slowest core dictates overall execution time (Figure: unbalanced vs. balanced workloads across Cores 1-4)
Static Load Balancing • The programmer makes the decisions and assigns a fixed amount of work to each processing core a priori • Works well for homogeneous multicores • All cores are the same • Each core has an equal amount of work • Not so well for heterogeneous multicores • Some cores may be faster than others • Work distribution is uneven
Dynamic Load Balancing • The workload is partitioned into small tasks. Available tasks for processing are pushed into a work queue • When one core finishes its allocated task, it takes on further work from the queue. The process continues until all tasks are assigned to some core for processing • Ideal for codes where work is uneven, and for heterogeneous multicores
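A sketch of the work-queue idea (POSIX threads and C11 atomics assumed; the task cost is made deliberately uneven): cores pull the next task index from a shared counter, so a fast core simply processes more tasks, whereas a static split of the same task list would leave some cores idle.

/* Dynamic load balancing sketch: a work queue implemented as an atomic counter. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS 64
#define NTHREADS 4
static atomic_int next_task = 0;                  /* head of the shared work queue */
static long result[NTASKS];

static long do_task(int t)                        /* uneven work: task t costs ~t^2 */
{
    long acc = 0;
    for (long i = 0; i < (long)t * t * 1000; i++)
        acc++;
    return acc;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* grab the next available task */
        if (t >= NTASKS)
            break;                                /* queue empty: this core is done */
        result[t] = do_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    printf("all %d tasks processed\n", NTASKS);
    return 0;
}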
Parallel Programming Principles • Coverage (Amdahl’s Law) • Communication/Synchronization • Granularity • Load Balance • Locality
Memory Access Latency • Uniform Memory Access (UMA) – Shared memory • Centrally located shared memory • All processors are equidistant (equal access times) • Non-Uniform Memory Access (NUMA) • Shared memory – Processors have the same address space; data is directly accessible by all, but the cost depends on the distance • Placement of data affects performance • Distributed memory – Processors have private address spaces; data access is local, but the cost of messages depends on the distance • Communication must be efficiently architected
Locality of Memory Accesses (UMA Shared Memory) • Parallel computation is serialized due to memory contention and lack of bandwidth
Locality of Memory Accesses (UMA Shared Memory) • Distribute data to relieve contention and increase effective bandwidth
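A software-level analogue of “distributing data” (POSIX threads assumed; the 64-byte padding matches a common, but not universal, cache-line size): instead of every thread updating one shared counter, each thread updates its own padded slot, and the values are combined once at the end.

/* Contention sketch: per-thread, padded counters avoid serializing on one shared location. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 1000000

struct padded { long value; char pad[64 - sizeof(long)]; };
static struct padded slot[NTHREADS];              /* one counter per thread, no shared line */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < ITERS; i++)
        slot[id].value++;                         /* traffic stays local to this thread's data */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += slot[i].value;
    printf("total = %ld\n", total);               /* NTHREADS * ITERS */
    return 0;
}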
Locality of Memory Accesses (NUMA Shared Memory) • Once parallel tasks have been assigned to different processors..
int main() {
  /* Task 1 */
  for (i = 0; i < n; i++)
    A[i][rand()] = foo();
  /* Task 2 */
  for (j = 0; j < n; j++)
    B[j] = goo();
}
(Figure: four CPUs, each with a local scratchpad memory (SPM), connected through an interconnect to an off-chip shared memory)
Locality of Memory Accesses (NUMA Shared Memory) • ..physical placement of data can have a great impact on performance! • With A and B placed in off-chip shared memory: memory reference cost = bus latency + off-chip memory latency (~100 cycles)
Locality of Memory Accesses (NUMA Shared Memory) • With the data moved to on-chip shared memory: memory reference cost = bus latency + on-chip memory latency (2-20 cycles)
Locality of Memory Accesses (NUMA Shared Memory) • With each task's data placed in its own core's local scratchpad memory (SPM): memory reference cost = local memory latency (1 cycle)
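The same idea can be sketched in software for Task 2: results are produced into a buffer that stands in for the core's local SPM and written back to shared memory in one bulk transfer. The body of goo(), the value of N, and the use of memcpy instead of a DMA engine are assumptions of this sketch, not part of the original slides.

/* Locality sketch: compute into a local buffer, then write back to shared memory once. */
#include <stdio.h>
#include <string.h>

#define N 256
static int B[N];                                  /* stands for data in off-chip shared memory */

static int goo(void) { return 42; }               /* placeholder body for the slides' goo() */

static void task2(void)
{
    int local_B[N];                               /* stands for the core's local SPM */
    for (int j = 0; j < N; j++)
        local_B[j] = goo();                       /* every access hits fast local memory */
    memcpy(B, local_B, sizeof B);                 /* one bulk write-back instead of N remote stores */
}

int main(void)
{
    task2();
    printf("B[0] = %d\n", B[0]);
    return 0;
}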