
Introduction to Parallel Processing with Multi-core Part II—Algorithms



  1. Introduction to Parallel Processing with Multi-core Part II—Algorithms Jie Liu, Ph.D. Professor Department of Computer Science Western Oregon University USA liuj@wou.edu

  2. Part II outline • More about PRAM • Activating PRAM processors • Finding Max in constant amount of time • Algorithms on PRAM • The fan-in algorithm • The list ranking algorithm • The parallel merge algorithm • The prefix sum algorithm • Brent’s theorem and the use of it • Speedup and its calculation • The cost of a parallel algorithm and the Cost Optimal concept • NC class and P Complete • Amdahl’s Law and Gustafson-Barsis’ Law

  3. More About PRAM • Remember, each PRAM processor can either • Perform the prescribed operation (the same for all processors), • Carry out an I/O operation, • Idle, or • Activate another processor • So, it takes n active processors one step to activate another n processors, which gives us 2n active processors • Now two questions • What happens if two processors write to the same memory location? • How many steps does it take to activate n processors?

  4. Handling Writing Conflicts in PRAM • EREW (Exclusive Read Exclusive Write) • CREW (Concurrent Read Exclusive Write) • CRCW (Concurrent Read Concurrent Write) • Common – the concurrent write succeeds only if all the values written are the same • Arbitrary – pick one of the values and set it • Priority – the processor with the highest priority is the winner • A multi-core computer is which one of the above?

  5. Activating n Processors • Let Activate(Pj) represent the activation of processor Pj by an already active processor • Let if ( ) { } else { }, for { }, while { }, and { } have their standard meanings • Let the symbol = denote the assignment operation • Let for all <processor list> do {statement list} represent that the code segment is to be executed in parallel by all the processors in the processor list Spawn(P0) // assuming P0 is already active { for i = 0 to ⌈log n⌉ - 1 do for all Pj where 0 <= j < 2^i if (j + 2^i < n) Activate(P(j + 2^i)) }
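
A minimal Python sketch (not from the slides; the function name spawn_rounds and the set-based bookkeeping are illustrative) that simulates Spawn's doubling rounds sequentially:

    # Sketch: simulate the doubling activation of Spawn(P0). In each round,
    # every active processor Pj activates P(j + 2^i), so the number of active
    # processors doubles until all n are awake.
    import math

    def spawn_rounds(n):
        active = {0}                     # P0 is already active
        i = 0                            # round counter
        while len(active) < n:
            step = 2 ** i
            active |= {j + step for j in active if j + step < n}
            i += 1
        return i

    assert spawn_rounds(8) == 3                              # 2^3 = 8
    assert spawn_rounds(1000) == math.ceil(math.log2(1000))  # ceil(log2 n) rounds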

  6. About the procedure spawn • What is the complexity? It takes ⌈log n⌉ doubling steps to activate n processors • The activation pattern forms a binomial tree

  7. Finding Max in constant time • Input: an array of n integers arrA[0..n-1] • Output: the largest number in arrA[0..n-1] • Global variables arrB[0..n-1], i, j • Assume the computer is a CRCW/Common FindingMax(arrA[0..n-1]) { • for all Pi where 0 <= i <= n-1 • arrB[i] = 1 • for all Pi,j where 0 <= i, j <= n-1 • if (arrA[i] < arrA[j]) • arrB[i] = 0 • for all Pi where 0 <= i <= n-1 • if (arrB[i] == 1) • print arrA[i] }
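
As a sanity check, here is a small sequential Python simulation of the CRCW/Common algorithm above; on a real PRAM the two nested loops would be a single parallel step carried out by n*n processors:

    # Sketch: simulate FindingMax. All concurrent writes store the same
    # value 0, which is exactly what the Common write model permits.
    def finding_max(arr_a):
        n = len(arr_a)
        arr_b = [1] * n                  # line 2: every element is a candidate
        for i in range(n):               # lines 3-5: one comparison per
            for j in range(n):           # (i, j) processor pair
                if arr_a[i] < arr_a[j]:
                    arr_b[i] = 0         # arr_a[i] lost a comparison
        return next(arr_a[i] for i in range(n) if arr_b[i] == 1)

    print(finding_max([7, 3, 9, 1]))     # 9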

  8. Finding Max – how does it work • After line 2, every arrB[i] is 1 • Lines 3 ~ 5: • for all Pi,j where 0 <= i, j <= n-1 • if (arrA[i] < arrA[j]) • arrB[i] = 0 • A 0 is written to arrB[i] if arrA[i] is smaller than some element of arrA; the concurrent writes are legal because every processor writes the same value 0 (CRCW/Common)

  9. Finding Max questions • How do we do it sequentially, and what is the complexity then? • How is it done in parallel, and what is the complexity? • How many processors are needed? • Will the algorithm work if the computer is CRCW/Arbitrary? • On the PRAM, what is the minimum amount of time required to run the algorithm, assuming only P0 is activated initially? [Hint: Remember Spawn(P0)] • Are there other approaches to finding the max?

  10. Fan-in algorithm • Also called reduction; it calculates x0 ⊕ x1 ⊕ … ⊕ x(n-1) where ⊕ is associative. When ⊕ is +, the calculation is a sum. • The figure on the right shows the summing of n numbers using ⌊n/2⌋ processors

  11. Fan-in algorithm (2) FanInTotal(A[0..n-1], n) // n >= 1, the sum of array A ends up in A[0], machine is CREW { Spawn(P0, …, P(⌊n/2⌋-1)) for all Pi where 0 <= i <= ⌊n/2⌋ - 1 for j from 0 to ⌈log n⌉ - 1 if (i mod 2^j == 0 and 2i + 2^j < n) A[2i] = A[2i] + A[2i + 2^j] }
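
The following Python sketch mirrors the fan-in rounds sequentially (an illustration, not the PRAM itself); each pass of the while loop corresponds to one synchronous step of the ⌊n/2⌋ processors:

    # Sketch: fan-in (reduction). At round j, the processor responsible for
    # index 2i adds in the partial sum 2^j positions away, halving the number
    # of live partial sums each round.
    def fan_in_total(values):
        a = list(values)
        n = len(a)
        stride = 1                       # 2^j
        while stride < n:
            for i in range(0, n, 2 * stride):   # one "processor" per pair
                if i + stride < n:
                    a[i] += a[i + stride]
            stride *= 2
        return a[0]

    print(fan_in_total([1, 2, 3, 4, 5]))        # 15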

  12. Fan-in algorithm questions • How do we do it sequentially, and what is the complexity then? • How is it done in parallel, and what is the complexity? • How many processors are needed? • Will the algorithm work if the computer is CRCW or EREW?

  13. Prefix Sum Let x0, x1, …, x(n-1) be n values and ⊕ be an associative operator; the prefix-sums problem is to find the following n quantities: x0, x0 ⊕ x1, x0 ⊕ x1 ⊕ x2, …, x0 ⊕ x1 ⊕ … ⊕ x(n-1) PrefixSums(A[0..n-1], n) // n >= 1 { • Spawn(P1, …, P(n-1)) • for all Pi where 1 <= i <= n - 1 • for j from 0 to ⌈log n⌉ - 1 • if (i - 2^j >= 0) • A[i] = A[i] + A[i - 2^j] }
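
Below is a short Python simulation of the doubling step (illustrative; the snapshot copy stands in for the PRAM's synchronous read-then-write semantics):

    # Sketch: prefix sums by doubling. In round j, every element at least 2^j
    # positions in adds the value 2^j positions to its left.
    def prefix_sums(values):
        a = list(values)
        n = len(a)
        stride = 1                       # 2^j
        while stride < n:
            old = list(a)                # all reads precede all writes
            for i in range(stride, n):   # P1..P(n-1); A[0] is already final
                a[i] = old[i] + old[i - stride]
            stride *= 2
        return a

    print(prefix_sums([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]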

  14. Prefix Sum Questions • How do we do it sequentially, and what is the complexity then? • How is it done in parallel, and what is the complexity? • How many processors are needed? Why isn't processor 0 used?

  15. List Ranking • We are using an array to represent a linked list • Determining, for each element, the number of elements that are in front of it is called the list ranking problem • How do we do list ranking sequentially? • Can we perform list ranking in parallel?

  16. Parallel List Ranking—Algorithm ListRanking(next[0..n-1], n) // array next contains the pointers of the linked list { pos[0..n-1] // local variable, array of int that contains the result – the ranking • Spawn(P0, …, P(n-1)) • for all Pi where 0 <= i <= n - 1 • { • pos[i] = 1 • if (next[i] == i) • pos[i] = 0 • for j from 1 to ⌈log n⌉ • { • pos[i] = pos[i] + pos[next[i]] • next[i] = next[next[i]] • } } } Remember: on a PRAM, all processors must carry out the same operation
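
A compact Python rendering of the pointer-jumping idea (a sketch; the per-round snapshots emulate the PRAM's lock-step execution, and the example list layout is made up):

    # Sketch: list ranking by pointer jumping. pos[i] ends up holding the
    # distance from element i to the end of the list; the tail points to itself.
    import math

    def list_ranking(next_arr):
        n = len(next_arr)
        nxt = list(next_arr)
        pos = [0 if nxt[i] == i else 1 for i in range(n)]
        for _ in range(max(1, math.ceil(math.log2(n)))):
            old_pos, old_nxt = list(pos), list(nxt)   # synchronous snapshot
            for i in range(n):
                pos[i] = old_pos[i] + old_pos[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
        return pos

    # The list 2 -> 0 -> 3 -> 1 gives next = [3, 1, 0, 1]; element 1 is the tail
    print(list_ranking([3, 1, 0, 1]))    # [2, 0, 3, 1]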

  17. Parallel List Ranking—Algorithm Explained • Key steps … for all Pi where 0 <= i <= n - 1 { … … for j from 1 to ⌈log n⌉ { pos[i] = pos[i] + pos[next[i]] next[i] = next[next[i]] } } [Figure: the state of pos[] and next[] after each pointer-jumping round, j = 1, 2, 3, 4]

  21. List Ranking Questions • How do we do it sequentially, and what is the complexity then? • How is it done in parallel, and what is the complexity? • How many processors are needed? • What is the key step that lets an apparently sequential problem be resolved with a concurrent solution? • Will the algorithm work if the computer is CRCW or EREW?

  22. Merging Two Sorted Arrays • The problem: n is an even number. An array of size n stores two sorted sequences of integers, each of size n/2; we need to merge the two sorted segments in O(log n) steps.

  23. Merging Two Sorted Arrays (2) • The sequential approach: two yardsticks • The sequential approach has no concurrency to exploit • Calling for a new algorithm • Key idea: if we know there are k elements smaller than A[i], we can copy A[i] to its final position in one step. • If i <= n/2, then there are i - 1 elements in the first half smaller than A[i] (assuming the array is 1-based). Now how can we find the number of elements in the second half of A that are also smaller than A[i]? Binary search (a log(n) algorithm)!

  24. Merging Two Sorted Arrays In Parallel // A[1] to A[n/2] and A[n/2 + 1] to A[n] are two sorted sections MergeArray(A[1..n]) { int x, low, high, index for all Pi where 1 <= i <= n // the lower half searches the upper half, the upper half searches the lower half { low = 1 // assuming A[i] is in the upper half, search the lower half high = n/2 if (i <= n/2) { low = (n/2) + 1 high = n } // A[i] is in the lower half, search the upper half x = A[i] // perform binary search repeat { index = ⌊(low + high) / 2⌋ if (x < A[index]) high = index - 1 else low = index + 1 } until low > high A[high + i - n/2] = x } }
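
Here is a Python sketch of the same idea (assuming distinct keys so the ranks are unambiguous); bisect_left plays the role of the binary search, and each loop iteration is what one processor Pi would do in parallel:

    # Sketch: parallel merge by ranking. Each element counts how many elements
    # of the opposite half are smaller; that count fixes its final slot.
    import bisect

    def merge_array(a):                  # a[:n2] and a[n2:] are sorted halves
        n = len(a)
        n2 = n // 2
        out = [None] * n
        for i, x in enumerate(a):        # each iteration = one processor Pi
            if i < n2:                   # lower half searches the upper half
                smaller = bisect.bisect_left(a, x, n2, n) - n2
                out[i + smaller] = x
            else:                        # upper half searches the lower half
                smaller = bisect.bisect_left(a, x, 0, n2)
                out[(i - n2) + smaller] = x
        return out

    print(merge_array([2, 5, 8, 1, 3, 9]))   # [1, 2, 3, 5, 8, 9]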

  25. Brent's Theorem • Given A, a parallel algorithm with computation time t, if parallel algorithm A performs m computational operations, then p processors can execute algorithm A in time t + (m – t)/p. • Proof: Let m_i be the number of computational operations performed by A at step i, where 1 ≤ i ≤ t. By definition, we have Σ_{i=1..t} m_i = m. Using p processors, we can simulate the m_i computational operations at step i in time ⌈m_i / p⌉ ≤ (m_i – 1)/p + 1 • Therefore, the total time is Σ_{i=1..t} ⌈m_i / p⌉ ≤ Σ_{i=1..t} ((m_i – 1)/p + 1) = (m – t)/p + t

  26. Applying Brent's Theorem • For the Sum algorithm, the execution time is Θ(log n); however, the total amount of computation is n - 1 operations. • Notice that we use n/2 processors during the first step, n/4 processors during the second step, … …, and 1 processor during the last step. However, we allocated n/2 processors initially to the problem, so after each step more and more processors are idling. • If we only assign p = ⌈n / log n⌉ processors, then the execution time is, according to Brent's theorem, t + (m – t)/p = log n + (n – 1 – log n) / ⌈n / log n⌉ = Θ(log n) • That is, reducing the number of processors to ⌈n / log n⌉ does not change the complexity of the parallel algorithm.
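
A quick numeric check of this bound in Python (the concrete value n = 1024 is chosen purely for illustration):

    # Sketch: evaluate Brent's bound t + (m - t)/p for the Sum algorithm.
    import math

    n = 1024
    t = int(math.log2(n))                # 10 parallel steps
    m = n - 1                            # 1023 additions in total
    p = n // t                           # roughly n / log n = 102 processors
    print(t + (m - t) / p)               # ~19.9 steps, still O(log n)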

  27. Cost of a parallel algorithm • The cost of a parallel algorithm is defined to be the product of the algorithm's time complexity and the number of processors used. • The original Sum algorithm has a cost of Θ(n log n). • The algorithm that uses ⌈n / log n⌉ processors has a cost of Θ((n / log n) × log n) = Θ(n) • Note that Θ(n) is the same as the sequential algorithm. • A cost optimal parallel algorithm is an algorithm for which the cost is in the same complexity class as an optimal sequential algorithm for the same problem. • Sum using ⌈n / log n⌉ processors is an example of a cost optimal algorithm. • Using n*n processors to find the Max in constant time is not cost optimal.

  28. NC class and P Complete • NC is the class of problems solvable on a PRAM in poly-logarithmic time using a number of processors that is a polynomial function of the problem size. • All the algorithms we have discussed, except Finding Max, are in NC. This is the class of problems for which we are interested in finding parallel solutions • P is the class of problems solvable, sequentially, in polynomial time • A problem L ∈ P is P-complete if every other problem in P can be transformed to L in poly-logarithmic time using a PRAM with a polynomial number of processors. Note that the transformation is in NC. • Examples of P-complete problems are depth-first search of an arbitrary graph and the circuit value problem. • P-complete is a class of problems that appear not to have efficient parallel solutions – we just cannot prove it yet!

  29. Speedup – Take II • Speedup = (execution time on one CPU) / (execution time on p CPUs) • Parallelizability: the ratio of time on one CPU vs. that on p CPUs. • For most parallel algorithms, the speedup is less than n, the number of processors • If an algorithm enjoys a speedup of n even when n is large, we consider the algorithm scalable • Superlinear speedup is when the speedup of an algorithm is greater than n, the number of processors • This can happen if • the parallel algorithm introduces a new approach to solving the problem • the parallel algorithm utilizes the cache memory more efficiently • the parallel algorithm gets lucky, for example, when performing breadth-first search.

  30. Amdahl's Law, and Gustafson-Barsis's Law • Amdahl's Law: Let s be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ s ≤ 1. The maximum speedup achievable by a parallel computer with p processors performing the computation is Speedup ≤ 1 / (s + (1 – s)/p) • Gustafson-Barsis's Law: Given a parallel program solving a problem using p processors, let s denote the fraction of the total execution time spent performing sequential operations. The maximum speedup achievable by this program is Speedup ≤ p + (1 – p)s • In a way, these two laws contradict each other. How can we explain this contradiction?
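
The two bounds are easy to compare numerically; a small Python sketch with illustrative values s = 0.1 and p = 16:

    # Sketch: Amdahl's bound (fixed problem size) vs. the Gustafson-Barsis
    # bound (fixed execution time, scaled problem size).
    def amdahl(s, p):
        return 1 / (s + (1 - s) / p)

    def gustafson_barsis(s, p):
        return p + (1 - p) * s

    print(amdahl(0.1, 16))               # 6.4
    print(gustafson_barsis(0.1, 16))     # 14.5

The gap between the two numbers reflects what each law holds fixed: Amdahl's Law assumes the problem size stays constant as processors are added, while Gustafson-Barsis's Law assumes the problem grows to keep the machine busy.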

  31. Part II outline • More about PRAM • Activating PRAM processors • Finding Max in constant amount of time • Algorithms on PRAM • The fan-in algorithm • The list ranking algorithm • The parallel merge algorithm • The prefix sum algorithm • Brent’s theorem and the use of it • Speedup and its calculation • The cost of a parallel algorithm and the Cost Optimal concept • NC class and P Complete • Amdahl’s Law and Gustafson-Barsis’ Law
