Lecture 2 The Art of Concurrency

Lecture 2 The Art of Concurrency Qi Zhang (张奇), Fudan University (复旦大学) COMP630030 Data Intensive Computing

Parallelism makes programs run faster

Why Do I Need to Know This? What’s in It for Me? • There’s no way to avoid this topic • Multicore processors are here now and here to stay • “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Dr. Dobb’s Journal, March 2005)

Isn’t Concurrent Programming Hard? • Concurrent programming is no walk in the park. • With a serial program • execution of your code takes a predictable path through the application. • Concurrent algorithms • require you to think about multiple execution streams running at the same time

PRIMER ON CONCURRENT PROGRAMMING • Concurrent programming is all about independent computations that the machine can execute in any order. • Not everything within an application will be independent, so you will still need to deal with serial execution amongst the concurrency.

Four Steps of a Threading Methodology • Step 1. Analysis: Identify Possible Concurrency • Find the parts of the application that contain independent computations. • identify hotspots that might yield independent computations

Four Steps of a Threading Methodology • Step 2. Design and Implementation: Threading the Algorithm • This step is what this book is all about

Four Steps of a Threading Methodology • Step 3. Test for Correctness: Detecting and Fixing Threading Errors • Step 4. Tune for Performance: Removing Performance Bottlenecks

Design Models for Concurrent Algorithms • The way you approach your serial code will influence how you reorganize the computations into a concurrent equivalent. • Task decomposition • The computations are a set of independent tasks that threads can execute in any order. • Data decomposition • Compute every element of the data independently.

Task Decomposition • The first step in any concurrent transformation is to identify computations that are completely independent. • Satisfy or remove dependencies

Example: numerical integration • What are the independent tasks in this simple application? • Are there any dependencies between these tasks and, if so, how can we satisfy them? • How should you assign tasks to threads?

Three key elements for any task decomposition design • What are the tasks and how are they defined? • What are the dependencies between tasks and how can they be satisfied? • How are the tasks assigned to threads?

Two criteria for the actual decomposition into tasks • There should be at least as many tasks as there will be threads (or cores). • The amount of computation within each task (granularity) must be large enough to offset the overhead that will be needed to manage the tasks and the threads.

What are the dependencies between tasks and how can they be satisfied? • Order dependency • some task relies on the completed results of the computations from another task • schedule tasks that have an order dependency onto the same thread • insert some form of synchronization to ensure correct execution order

What are the dependencies between tasks and how can they be satisfied? • Data dependency • assignment of values to the same variable that might be done concurrently • updates to a variable that could be read concurrently • create variables that are accessible only to a given thread. • Atomic Operation
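A minimal Python sketch of the two remedies listed above, assuming a hypothetical task of counting even numbers. Each thread accumulates into a variable that only it can see, and the single update to the shared total is protected by a lock, which stands in here for an atomic operation. The names `count_evens` and `worker` are illustrative and do not come from the lecture.

```python
import threading

def count_evens(values, num_threads=4):
    """Count even numbers using per-thread private counters."""
    total = 0
    total_lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        local_count = 0              # private to this thread: no data dependency
        for v in chunk:
            if v % 2 == 0:
                local_count += 1
        with total_lock:             # the only shared update, done under mutual exclusion
            total += local_count

    # Ceiling division, kept at 1 or more so an empty input does not break range().
    chunk_size = max(1, (len(values) + num_threads - 1) // num_threads)
    threads = [
        threading.Thread(target=worker, args=(values[i:i + chunk_size],))
        for i in range(0, len(values), chunk_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Because each thread touches the shared variable only once, the lock is acquired once per thread rather than once per element.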

How are the tasks assigned to threads? • Tasks must be assigned to threads for execution. • The amount of computation done by threads should be roughly equivalent. • We can allocate tasks to threads in two different ways: static scheduling or dynamic scheduling.

How are the tasks assigned to threads? • In static scheduling, the division of labor is known at the outset of the computation and doesn’t change during the computation. • Static scheduling is best used in those cases where the amount of computation within each task is the same or can be predicted at the outset.
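As a rough illustration of a static schedule, the sketch below computes, before any work starts, which contiguous block of task indices each thread would own. The function name `static_blocks` is hypothetical, and it assumes every task costs about the same.

```python
def static_blocks(num_tasks, num_threads):
    """Return one half-open (start, end) range of task indices per thread."""
    base, extra = divmod(num_tasks, num_threads)
    blocks = []
    start = 0
    for t in range(num_threads):
        # The first `extra` threads take one additional task each.
        size = base + (1 if t < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks
```

For 10 tasks and 4 threads this gives blocks of sizes 3, 3, 2, and 2, so no thread has more than one task beyond any other.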

How are the tasks assigned to threads? • Under a dynamic schedule, you assign tasks to threads as the computation proceeds. • The driving force behind the use of a dynamic schedule is to try to balance the load as evenly as possible between threads.
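One common way to get a dynamic schedule is a shared work queue: each thread takes the next task whenever it becomes free, so a thread that draws short tasks simply takes more of them. The sketch below assumes the tasks are zero-argument callables; `run_dynamic` is an illustrative name, not something from the lecture.

```python
import queue
import threading

def run_dynamic(tasks, num_threads=4):
    """Run callables from a shared queue and return their results in task order."""
    work = queue.Queue()
    for index, task in enumerate(tasks):
        work.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = work.get_nowait()   # grab the next unclaimed task
            except queue.Empty:
                return                            # nothing left to do
            results[index] = task()               # each slot is written by exactly one thread

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The queue itself is the price of the better load balance: every task costs one synchronized queue access, which is overhead a static schedule does not pay.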

Example: numerical integration • What are the independent tasks in this simple application? • Are there any dependencies between these tasks and, if so, how can we satisfy them? • How should you assign tasks to threads?
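The slide does not say which integral is used, so the sketch below assumes a common textbook choice: approximating pi by the midpoint rule applied to 4/(1+x^2) on [0, 1]. Under that assumption, each rectangle is an independent task, the only dependency is the running sum (handled with one private partial sum per thread), and the rectangles are assigned statically in equal blocks. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_area(start, end, num_rects):
    """Sum the midpoint-rule rectangles with indices in [start, end)."""
    width = 1.0 / num_rects
    area = 0.0                        # private partial sum for this block
    for i in range(start, end):
        x = (i + 0.5) * width         # midpoint of rectangle i
        area += 4.0 / (1.0 + x * x)
    return area * width

def integrate_pi(num_rects=100_000, num_threads=4):
    """Approximate pi by splitting the rectangles into equal static blocks."""
    bounds = [
        (t * num_rects // num_threads, (t + 1) * num_rects // num_threads)
        for t in range(num_threads)
    ]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(lambda b: partial_area(b[0], b[1], num_rects), bounds)
        return sum(parts)             # combine the partial sums serially
```

In CPython the global interpreter lock keeps these threads from running the arithmetic in parallel, so this shows the structure of the decomposition rather than a speedup.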

Data Decomposition • Execution is dominated by a sequence of update operations on all elements of one or more large data structures. • These update computations are independent of each other • Dividing up the data structure(s) and assigning those portions to threads, along with the corresponding update computations (tasks)

Three key elements for data decomposition design • How should you divide the data into chunks? • How can you ensure that the tasks for each chunk have access to all data required for updates? • How are the data chunks assigned to threads?

How should you divide the data into chunks?

How should you divide the data into chunks? • Granularity of chunk • Shape of chunk • The shape determines what the neighboring chunks are and how any exchange of data between them is handled • Be more vigilant with chunks of irregular shapes

How can you ensure that the tasks for each chunk have access to all data required for updates?

Example: Game of Life on a finite grid

Example: Game of Life on a finite grid • What is the large data structure in this application and how can you divide it into chunks? • What is the best way to perform the division?
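One possible answer, sketched in Python: the large data structure is the grid, and it can be divided into horizontal bands of rows. Every thread reads the old grid, which nobody modifies during the step, and writes only its own rows of a separate new grid, so each chunk has access to its neighbors' border cells without any locking. Treating cells outside the grid as dead is an assumption, as are all function names.

```python
from concurrent.futures import ThreadPoolExecutor

def live_neighbors(grid, r, c):
    """Count live neighbors of (r, c); cells outside the grid count as dead."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                count += grid[r + dr][c + dc]
    return count

def update_rows(grid, new_grid, row_start, row_end):
    """Compute the next generation for one band of rows."""
    for r in range(row_start, row_end):
        for c in range(len(grid[0])):
            n = live_neighbors(grid, r, c)
            alive = grid[r][c] == 1
            new_grid[r][c] = 1 if (n == 3 or (alive and n == 2)) else 0

def life_step(grid, num_threads=2):
    """Advance the grid one generation using one band of rows per thread."""
    rows = len(grid)
    new_grid = [[0] * len(grid[0]) for _ in range(rows)]
    bounds = [(t * rows // num_threads, (t + 1) * rows // num_threads)
              for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(update_rows, grid, new_grid, lo, hi)
                   for lo, hi in bounds]
        for f in futures:
            f.result()                # wait for the band and surface any exception
    return new_grid
```

Row bands are only one choice of chunk shape; square blocks would shorten the shared borders but complicate the indexing.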

What’s Not Parallel • Algorithms with State • something kept around from one execution to the next. • For example, the seed to a random number generator or the file pointer for I/O would be considered state.

What’s Not Parallel • Recurrences

What’s Not Parallel • Induction Variables
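A small sketch of why an induction variable blocks concurrency and how it can sometimes be removed. In the first function, the index `i2` is incremented by a varying amount on every iteration, so its value depends on all earlier iterations. In the second, the same index is computed from the loop counter alone, which makes the iterations independent. The loop body is invented for illustration.

```python
def fill_serial(n):
    """Write k at a position tracked by an induction variable."""
    out = [0] * (n * (n + 1) // 2 + 1)
    i2 = 0
    for k in range(1, n + 1):
        i2 += k                      # depends on every earlier iteration
        out[i2] = k
    return out

def fill_independent(n):
    """Same result, with the index computed in closed form from k."""
    out = [0] * (n * (n + 1) // 2 + 1)
    for k in range(1, n + 1):
        i2 = k * (k + 1) // 2        # 1 + 2 + ... + k, no carried state
        out[i2] = k
    return out
```

Only the second loop could have its iterations handed to different threads, because no iteration needs a value produced by another.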

What’s Not Parallel • Reduction • Reductions take a collection (typically an array) of data and reduce it to a single scalar value through some combining operation.

Loop-Carried Dependence
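To make the term concrete, the sketch below contrasts a loop that carries a dependence from one iteration to the next with one that does not. Both functions are hypothetical examples rather than code from the lecture.

```python
def running_total(b):
    """a[i] needs a[i-1]: iteration i cannot start before iteration i-1 ends."""
    a = [0] * len(b)
    for i in range(len(b)):
        previous = a[i - 1] if i > 0 else 0
        a[i] = previous + b[i]       # loop-carried dependence on a[i-1]
    return a

def doubled(b):
    """Each output depends only on its own input: iterations can run in any order."""
    return [2 * x for x in b]
```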

Rule 1: Identify Truly Independent Computations • It’s the crux of the whole matter!

Rule 2: Implement Concurrency at the Highest Level Possible • Two directions: bottom-up and top-down • bottom-up • consider threading the hotspots directly • If this is not possible, search up the call stack • top-down • first consider the whole application and what the computation is coded to accomplish • If there is no obvious concurrency, distill the parts of the computation • Video encoding application: individual pixels → frames → videos

Rule 3: Plan Early for Scalability to Take Advantage of Increasing Numbers of Cores • Quad-core processors are becoming the default multicore chip. • Flexible code that can take advantage of different numbers of cores. • C. Northcote Parkinson, “Data expands to fill the processing power available.”

Rule 4: Make Use of Thread-Safe Libraries Wherever Possible • Intel Math Kernel Library (MKL) • Intel Integrated Performance Primitives (IPP)

Rule 5: Use the Right Threading Model • Don’t use explicit threads if an implicit threading model (e.g., OpenMP or Intel Threading Building Blocks) has all the functionality you need.

Rule 6: Never Assume a Particular Order of Execution

Rule 7: Use Thread-Local Storage Whenever Possible or Associate Locks to Specific Data • Synchronization is overhead that does not contribute to the furtherance of the computation • Should actively seek to keep the amount of synchronization to a minimum. • Using storage that is local to threads or using exclusive memory locations

Rule 8: Dare to Change the Algorithm for a Better Chance of Concurrency • When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution • An O(n log n) algorithm will run faster than an O(n^2) algorithm • If you cannot easily turn a hotspot into threaded code, consider using a suboptimal serial algorithm that is easier to transform, rather than the algorithm currently in the code.

Parallel Sum

PRAM Algorithm

PRAM Algorithm Can we use the PRAM algorithm for parallel sum in a threaded code?
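The slide's figure is not available, so the sketch below assumes the usual tree-structured PRAM sum: in each step, elements a fixed stride apart are added pairwise, and the stride doubles until one value remains. The code simulates the PRAM serially. Each pass of the inner `for` loop corresponds to one processor, and each pass of the `while` loop corresponds to one synchronized parallel step.

```python
def pram_sum(values):
    """Simulate a tree-structured PRAM sum in about log2(n) parallel steps."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0
    stride = 1
    while stride < n:
        # On a PRAM, every iteration of this loop would run on its own processor.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2                  # all processors must finish before the next step
    return data[0]
```

Written this way, the cost for threaded code is visible: the first step wants about n/2 processors, and every step ends with a barrier across all of them.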

A More Practical Algorithm • Divide the data array into as many chunks as there are threads. • Assign each thread a unique chunk and sum the values within the assigned subarray into a private variable. • Add these local partial sums to compute the total sum of the array elements.
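The three steps above can be sketched as follows, using Python's `concurrent.futures.ThreadPoolExecutor`. The helper names are illustrative. In CPython the global interpreter lock prevents a real speedup for this arithmetic, so the code is meant to show the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(values, num_threads=4):
    """Sum a sequence using one contiguous chunk and one private sum per thread."""
    n = len(values)
    # Step 1: one half-open index range per thread.
    bounds = [(t * n // num_threads, (t + 1) * n // num_threads)
              for t in range(num_threads)]

    def partial(bound):
        lo, hi = bound
        local = 0                    # Step 2: private accumulator, no sharing
        for i in range(lo, hi):
            local += values[i]
        return local

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial, bounds))
    return sum(partials)             # Step 3: combine the partial sums serially
```

Only the final combination touches data from more than one thread, and it happens after all threads have finished, so no lock is needed.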

Prefix Scan

Prefix Scan • PRAM computation for prefix scan
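The slide's diagram is missing, so this sketch assumes the step-doubling formulation of an inclusive prefix scan often used on a PRAM: in each step, every element at position i adds the element `stride` positions to its left, and the stride doubles. The simulation is serial. The snapshot `prev` models the PRAM rule that all processors read before any of them writes.

```python
def pram_prefix_scan(values):
    """Simulate an inclusive prefix sum in about log2(n) parallel steps."""
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        prev = data[:]               # all reads in a step see the previous step's values
        # On a PRAM, each i below would be handled by its own processor.
        for i in range(stride, n):
            data[i] = prev[i] + prev[i - stride]
        stride *= 2
    return data
```

For the input [3, 1, 4, 1, 5] the steps produce [3, 4, 5, 5, 6], then [3, 4, 8, 9, 11], and finally [3, 4, 8, 9, 14].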
