Multicore, parallelism, and multithreading By: Eric Boren, Charles Noneman, and Kristen Janick
Multicore Processing: Why we care
What is it? • A processor with more than one core on a single chip • Core: An independent system capable of processing instructions and modifying registers and memory
Motivation • Advances in component technology and optimization contribute less and less to processor speed • Many CPU-intensive applications attempt to do multiple things at once: • Video editing • Multi-agent simulation • So, use multiple cores to get it done faster
Hurdles • Instruction assignment (who does what?) • Mostly delegated to the operating system • Can be done to a small degree through dependency analysis on the chip • Cores must still communicate at times – how? • Shared-memory • Message passing
Advantages • Multiple programs: • Can be separated across cores • Other programs don't suffer when one hogs the CPU • Multi-threaded applications: • Independent threads don't have to wait as long for each other – results in faster overall execution • Vs. multiple single-core processors: • Less distance between cores – faster communication allows a higher maximum clock rate • Less expensive due to smaller overall chip area and shared components (caches, etc.)
Disadvantages • OS and programs must be optimized for multiple cores, or no gain will be seen • A single-threaded application sees little to no improvement • There is overhead in assigning tasks to cores • The real bottleneck is typically memory and disk access time – independent of the number of cores
Amdahl's Law • The potential performance increase on a parallel computing platform is given by Amdahl's law • Large problems are made up of parallelizable parts and non-parallelizable parts • S = 1 / (1 - P) • S = maximum speed-up of the program (with unlimited processors) • P = fraction of the program that is parallelizable
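As a worked example of the formula above (a minimal C sketch; the function name and sample values are illustrative, not from the slides): the full form of Amdahl's law for N processors is S(N) = 1 / ((1 - P) + P/N), and the slide's S = 1/(1 - P) is its limit as N grows without bound.

    #include <stdio.h>

    /* Amdahl's law: speedup with n processors when fraction p of the
       program is parallelizable; approaches 1/(1-p) as n grows. */
    double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        /* If 90% of a program is parallelizable (P = 0.9): */
        printf("4 cores:  %.2fx\n", amdahl(0.9, 4));      /* ~3.08x */
        printf("16 cores: %.2fx\n", amdahl(0.9, 16));     /* 6.40x  */
        printf("limit:    %.2fx\n", 1.0 / (1.0 - 0.9));   /* 10x    */
        return 0;
    }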
Current State of the Art • Commercial processors: • Most have at least 2 cores • Quad-core processors are highly popular for desktop applications • 6-core processors have recently appeared on the market (Intel's Core i7-980X) • 8-core processors exist but are less common • Academic and research: • MIT: RAW – 16 cores • Intel: Polaris – 80 cores • UC Davis: AsAP – 36 and 167 cores, individually clocked
What is Parallel Computing? • A form of computation in which many calculations are carried out simultaneously • It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently
Types of Parallelism • Bit-level parallelism • Increase processor word size • Instruction-level parallelism • Instructions combined into groups • Data parallelism • Distribute data over different computing environments • Task parallelism • Distribute threads across different computing environments
Single Instruction, Single Data (SISD) • Provides no parallelism in hardware • 1 data stream processed by the CPU in 1 clock cycle • Instructions executed in serial fashion
Multiple Instruction, Single Data (MISD) • Processes a single data stream using multiple instruction streams simultaneously • More a theoretical model than a practical one
Single Instruction, Multiple Data (SIMD) • A single instruction stream processes multiple data streams in 1 clock cycle • Takes the operation specified in one instruction and applies it to more than one data element at a time • Suitable for graphics and image processing
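For concreteness, a minimal SIMD sketch in C (not from the slides) using x86 SSE intrinsics, where a single _mm_add_ps instruction adds four float lanes at once:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics (x86 only) */

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float r[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);   /* one instruction adds all 4 lanes */
        _mm_storeu_ps(r, vr);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);   /* 11 21 31 41 */
        return 0;
    }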
Multiple Instruction, Multiple Data (MIMD) • Different processors can execute different instructions on different pieces of data • Each processor can run independent task
Automatic parallelization • The goal is to relieve programmers from the tedious and error-prone manual parallelization process • A parallelizing compiler tries to split up a loop so that its iterations can be executed on separate processors concurrently • It identifies dependences between memory references – independent operations can run in parallel (see the sketch below)
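As an illustration of that dependence analysis (illustrative C, not from the slides): the first loop's iterations are independent, while the second carries a dependence from one iteration to the next.

    /* Independent iterations: a[i] is computed from b[i] alone, so a
       parallelizing compiler may distribute iterations across processors. */
    void independent(int *a, const int *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = 2 * b[i];
    }

    /* Loop-carried dependence: iteration i reads a[i-1], written by
       iteration i-1, so this loop must run serially (or be restructured,
       e.g., as a parallel prefix sum). */
    void dependent(int *a, const int *b, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];
    }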
Parallel Programming Languages • Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers • Parallel languages make it easier to write parallel algorithms • The resulting code can run more efficiently because the compiler has more information to work with • Data dependencies are easier to identify, so the runtime system can implicitly schedule independent work
fork() • Makes a (nearly) exact duplicate of the calling process • Good when there is little or no need to communicate between processes • Often used for servers
fork() • [Diagram] After fork(), the parent and each child are separate processes, each with its own copy of the globals, heap, and stack
fork()

    pid_t pID = fork();
    if (pID == 0) {
        // child
    } else {
        // parent
    }
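A complete, runnable version of the fragment above (a minimal sketch; the printed messages and the waitpid() call are additions not shown on the slide):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pID = fork();                 /* duplicate the process */
        if (pID == 0) {
            printf("child:  pid %d\n", (int)getpid());
            return 0;                       /* child exits */
        } else if (pID > 0) {
            printf("parent: child is %d\n", (int)pID);
            waitpid(pID, NULL, 0);          /* wait for (reap) the child */
        } else {
            perror("fork");                 /* fork failed */
            return 1;
        }
        return 0;
    }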
POSIX Threads • C library for threading • Available on Linux and OS X • Shared memory • Threads are created and destroyed manually • Has mechanisms for locking shared memory
POSIX Threads • [Diagram] All threads run inside one process and share its globals and heap; each thread has its own stack
POSIX Threads

    pthread_t thread;
    pthread_create(&thread, NULL, function_to_call, (void*) data);
    // Do stuff
    pthread_join(thread, NULL);
POSIX Threads: a race condition

    int total = 0;

    void do_work() {
        // Do stuff to create "result"
        total = total + result;
    }

One possible interleaving: Thread 1 reads total (0) • Thread 2 reads total (0) • Thread 1 adds its result and stores total • Thread 2 adds its result and stores total, overwriting Thread 1's update – one update is lost
POSIX Threads: fixing the race with a mutex (full runnable sketch below)

    int total = 0;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    void do_work() {
        // Do stuff to create "result"
        pthread_mutex_lock(&mutex);
        total = total + result;
        pthread_mutex_unlock(&mutex);
    }
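A complete program assembled from the fragments above (a minimal sketch; the two-thread structure, loop count, and simple counter increment are illustrative choices, not from the slides). Compile with gcc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    int total = 0;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    void *do_work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&mutex);     /* only one thread in here at a time */
            total = total + 1;
            pthread_mutex_unlock(&mutex);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, do_work, NULL);
        pthread_create(&t2, NULL, do_work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("total = %d\n", total);      /* always 200000 with the mutex */
        return 0;
    }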
OpenMP • Library and compiler directives for multi-threading • Supported in Visual C++ and gcc • Code compiles even if the compiler doesn't support OpenMP • Popular in high-performance communities • Easy to add parallelism to existing code
OpenMP: Initialize an Array

    const int array_size = 100000;
    int i, a[array_size];

    #pragma omp parallel for
    for (i = 0; i < array_size; i++) {
        a[i] = 2 * i;
    }
OpenMP: Reduction

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < array_size; i++) {
        total = total + a[i];
    }
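The two fragments above combine into a complete program (a minimal sketch; the long long total and the final printf are additions, to keep the sum from overflowing and to check the result). Compile with gcc -fopenmp; without the flag the pragmas are ignored and the code still builds, as the slide notes:

    #include <stdio.h>

    int main(void) {
        enum { array_size = 100000 };
        static int a[array_size];
        long long total = 0;               /* long long: the sum overflows a 32-bit int */
        int i;

        #pragma omp parallel for
        for (i = 0; i < array_size; i++)
            a[i] = 2 * i;                  /* iterations split across threads */

        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < array_size; i++)
            total = total + a[i];          /* per-thread partial sums, combined at the end */

        printf("total = %lld\n", total);   /* 2 * (0 + 1 + ... + 99999) = 9999900000 */
        return 0;
    }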
Grand Central Dispatch • Apple technology for multi-threading • The programmer puts work into queues • A central system process determines the number of threads to give each queue • Code is added to queues using a closure • Currently Mac-only, but open source • Easy to add parallelism to existing code
Grand Central Dispatch: Initialize an Array

    dispatch_apply(array_size, dispatch_get_global_queue(0, 0),
                   ^(size_t i) {
        a[i] = 2 * i;
    });
Grand Central Dispatch: GUI Example

    void analyzeDocument(doc) {
        do_analysis(doc);    // May take a very long time
        update_display();
    }
Grand Central Dispatch: GUI Example (asynchronous)

    void analyzeDocument(doc) {
        dispatch_async(dispatch_get_global_queue(0, 0), ^{
            do_analysis(doc);
            update_display();
        });
    }
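One caveat the slide leaves out: on the Mac, UI updates must run on the main thread, so in practice update_display() would be dispatched back to the main queue. A sketch keeping the slide's pseudocode names (analyzeDocument, do_analysis, update_display are the slide's placeholders):

    void analyzeDocument(doc) {
        dispatch_async(dispatch_get_global_queue(0, 0), ^{
            do_analysis(doc);                          // heavy work off the main thread
            dispatch_async(dispatch_get_main_queue(), ^{
                update_display();                      // UI work back on the main thread
            });
        });
    }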
Other Technologies • Threading in Java, Python, etc. • MPI – for clusters (see the sketch below)
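Since MPI is only name-dropped above, here is a minimal illustrative MPI program in C (not from the slides); it is typically compiled with mpicc and launched with, e.g., mpirun -np 4 ./hello:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                /* start the MPI runtime */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut the runtime down */
        return 0;
    }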
Supplemental Reading • Introduction to Parallel Computing • https://computing.llnl.gov/tutorials/parallel_comp/#Abstract • Introduction to Multi-Core Architecture • http://www.intel.com/intelpress/samples/mcp_samplech01.pdf • CPU History: A timeline of microprocessors • http://everything2.com/title/CPU+history%253A+A+timeline+of+microprocessors