

  1. Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

  2. Chapter 1 Introduction and General Concepts

  3. References • Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997, Chapter 1; updated online version available through the website. -- primary reference • Selim Akl, The Design of Efficient Parallel Algorithms, Chapter 2 in “Handbook on Parallel and Distributed Processing”, edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000. • Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003. • Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003. • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004, Chapters 1-2. -- secondary reference • Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994. • Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Ed., Prentice Hall, 2005.

  4. Outline • Need for Parallel & Distributed Computing • Flynn’s Taxonomy of Parallel Computers • Two Main Types of MIMD Computers • Examples of Computational Models • Data Parallel & Functional/Control/Job Parallel • Granularity • Analysis of Parallel Algorithms • Elementary Steps: computational and routing steps • Running Time & Time Optimal • Parallel Speedup • Cost and Work • Efficiency • Linear and Superlinear Speedup • Speedup and Slowdown Folklore Theorems • Amdahl’s and Gustafson’s Laws

  5. Reasons to Study Parallel & Distributed Computing • Sequential computers have severe limits to memory size • Significant slowdowns occur when accessing data that is stored in external devices. • Sequential computational times for most large problems are unacceptable. • Sequential computers cannot meet the deadlines for many real-time problems. • Some problems are distributed in nature and are natural candidates for distributed computation

  6. Grand Challenge Problems Computational Science Categories (1989) • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Materials design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling

  7. Weather Prediction • Atmosphere is divided into 3D cells • Data includes temperature, pressure, humidity, wind speed and direction, etc. • Recorded at regular time intervals in each cell • There are about 5×10⁸ cells of 1-mile cubes. • A modern sequential computer would take over 100 days to perform the calculations needed for a 10-day forecast • Details in Ian Foster’s 1995 online textbook • Designing and Building Parallel Programs • See the author’s online copy for further information

  8. Flynn’s Taxonomy • Best-known classification scheme for parallel computers. • Depends on the parallelism a computer exhibits in its • Instruction stream • Data stream • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream) • The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M) • Four combinations: SISD, SIMD, MISD, MIMD

  9. SISD • Single Instruction, Single Data • Single-CPU systems • i.e., uniprocessors • Note: co-processors don’t count • Concurrent processing allowed • Instruction prefetching • Pipelined execution of instructions • Primary Example: PCs

  10. SIMD • Single instruction, multiple data • One instruction stream is broadcast to all processors • Each processor (also called a processing element or PE) is very simple and is essentially an ALU; • PEs do not store a copy of the program nor have a program control unit. • Individual processors can be inhibited from participating in an instruction (based on a data test).

  11. SIMD (cont.) • All active processors execute the same instruction synchronously, but on different data • On a memory access, all active processors must access the same location in their local memory. • The data items form an array, and an instruction can act on the complete array in one cycle.

  12. SIMD (cont.) • Quinn (2004) calls this architecture a processor array • Two examples are Thinking Machines’ • Connection Machine CM2 • Connection Machine CM200 • Also, MasPar’s MP1 and MP2 are examples • Quinn also considers a pipelined vector processor to be a SIMD • Somewhat non-standard. • An example is the Cray-1

  13. MISD • Multiple instruction, single data • Quinn (2004) argues that a systolic array is an example of a MISD structure (pp. 55-57) • Some authors include pipelined architectures in this category • This category does not receive much attention from most authors

  14. MIMD • Multiple instruction, multiple data • Processors are asynchronous, since they can independently execute different programs on different data sets. • Communication is handled either by message passing or through shared memory. • Considered by most researchers to contain the most powerful, least restricted computers.

  15. MIMD (cont.) • MIMDs have major communication costs that are usually ignored when • comparing them to SIMDs • computing algorithmic complexity • A common way to program MIMDs is for all processors to execute the same program. • Called the SPMD method (for single program, multiple data) • The usual method when the number of processors is large.

  16. Multiprocessors (Shared Memory MIMDs) • All processors have access to all memory locations. • The processors access memory through some type of interconnection network. • This type of memory access is called uniform memory access (UMA). • Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations. • This type of memory access is called nonuniform memory access (NUMA).

  17. Multiprocessors (cont.) • Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs. • This creates the problem of ensuring that all copies of the same data in different memory locations are identical. • Symmetric Multiprocessors (SMPs) are a popular example of shared memory multiprocessors

  18. Multicomputers (Message-Passing MIMDs) • Processors are connected by an interconnection network • Each processor has a local memory and can only access its own local memory. • Data is passed between processors using messages, as dictated by the program. • A common approach to programming multicomputers is to use a message-passing library (e.g., MPI, PVM) • The problem is divided into processes that can be executed concurrently on individual processors. Each processor is normally assigned multiple processes.
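As an added illustration of this message-passing style (not taken from the slides), a minimal sketch in C with MPI follows; it only passes a single integer from process 0 to process 1, and the variable name and message tag are arbitrary choices.

    /* Minimal message-passing sketch (C with MPI); run with at least 2 processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                                   /* data local to process 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);     /* data arrives as a copy */
        }
        MPI_Finalize();
        return 0;
    }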

  19. Multicomputers (cont. 2/3) • Programming disadvantages of message-passing • Programmers must make explicit message-passing calls in the code • This is low-level programming and is error prone. • Data is not shared but copied, which increases the total data size. • Data integrity: difficulty in maintaining correctness of multiple copies of a data item.

  20. Multicomputers (cont. 3/3) • Programming advantages of message-passing • No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently. • Allows PCs on a network to be easily upgraded when faster processors become available.

  21. Data Parallel Computation • All tasks (or processors) apply the same set of operations to different data. • Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor • SIMDs always use data parallel computation. • All active processors execute the operations synchronously.
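As an added sketch, the loop above can be written for a shared-memory MIMD machine using OpenMP in C; the function name and array parameters are illustrative only.

    /* Data-parallel sketch in C with OpenMP: every thread applies the same
       operation (addition) to a different part of the data. */
    #include <omp.h>

    void vector_add(double a[], const double b[], const double c[], int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];                /* same operation, different data */
    }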

  22. Data Parallel Computation for MIMDs • A common way to support data parallel programming on MIMDs is to use SPMD (single program multiple data) programming, as follows: • Processors (or tasks) execute the same block of instructions concurrently but asynchronously • No communication or synchronization occurs within these concurrent instruction blocks. • Each instruction block is normally followed by synchronization and communication steps • If processors have identical tasks (with different data), this method allows them to be executed in a data parallel fashion.
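An added skeleton of this SPMD pattern in C with MPI, assuming a placeholder local_work routine for the concurrent instruction block and a reduction standing in for the communication/synchronization step:

    /* SPMD sketch: every process runs this one program on its own data;
       communication occurs only after the concurrent compute block. */
    #include <mpi.h>
    #include <stdio.h>

    static double local_work(int rank) {
        return (double)rank;                   /* placeholder compute block */
    }

    int main(int argc, char *argv[]) {
        int rank;
        double local, global;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        local = local_work(rank);              /* asynchronous, no communication */
        /* synchronization + communication step following the concurrent block */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) printf("combined result = %g\n", global);
        MPI_Finalize();
        return 0;
    }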

  23. Functional/Control/Job Parallelism • Involves executing different sets of operations on different data sets. • Typical MIMD control parallelism • Problem is divided into different non-identical tasks • Tasks are distributed among the processors • Tasks are usually divided between processors so that their workload is roughly balanced • Each processor follows some scheme for executing its tasks. • Frequently, a task scheduling algorithm is used. • If the task being executed has to wait for data, the task is suspended and another task is executed.
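As an added sketch of control parallelism on a shared-memory machine (one possible realization, not the slides' own), OpenMP sections can run two different tasks concurrently; the task bodies below are placeholders.

    /* Control/functional parallelism sketch in C with OpenMP: different
       operations execute concurrently, unlike the data-parallel loop earlier. */
    #include <stdio.h>
    #include <omp.h>

    static void task_a(void) { printf("task A running\n"); }   /* placeholder task */
    static void task_b(void) { printf("task B running\n"); }   /* placeholder task */

    int main(void) {
        #pragma omp parallel sections
        {
            #pragma omp section
            task_a();
            #pragma omp section
            task_b();
        }
        return 0;
    }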

  24. Grain Size • Quinn defines grain size to be the average number of computations performed between communication or synchronization steps. • Quinn (2004), p. 411 • Data parallel programming usually results in smaller grain size computation • SIMD computation is considered to be fine-grain • MIMD data parallelism is usually considered to be medium grain • Generally, increasing the grain size increases the performance of MIMD programs.

  25. Examples of Parallel Models • Synchronous PRAM (Parallel RAM) • Each processor stores a copy of the program • All execute instructions simultaneously on different data • Generally, the active processors execute the same instruction • The full power of PRAM allows processors to execute different instructions simultaneously. • Here, “simultaneous” means they execute using the same clock. • All processors have access to the same (shared) memory. • Data exchanges occur through the shared memory • All processors have I/O capabilities. • See Akl, Figure 1.2

  26. Examples of Parallel Models (cont) • Binary Tree of Processors (Akl, Fig. 1.3) • No shared memory • Memory distributed equally between PEs • PEs communicate through the binary tree network. • Each PE stores a copy of the program. • PEs can execute different instructions • Normally synchronous, but asynchronous operation is also possible. • Only leaf and root PEs have I/O capabilities.

  27. Examples of Parallel Models (cont) • Symmetric Multiprocessors (SMPs) • Asynchronous processors of the same size. • Shared memory • Memory locations equally distant from each PE in smaller machines. • Formerly called “tightly coupled multiprocessors”.

  28. Analysis of Parallel Algorithms • Elementary Steps: • A computational step is a basic arithmetic or logic operation • A routing step is used to move data of constant size between PEs using either shared memory or a link connecting the two PEs.

  29. Communication Time Analysis • Worst Case • Usually the default for synchronous computation • Difficult to use for asynchronous computation • Average Case • Usual default for asynchronous computation • Only occasionally used with synchronous computation • Important to distinguish which is being used • Often “worst case” synchronous timings are compared to “average case” asynchronous timings. • Default in Akl’s text is “worst case” • Default in the Casanova et al. text is probably “average case”

  30. Shared Memory Communication Time • Communicating a datum requires two memory accesses: • P(i) writes the datum to a memory cell. • P(j) reads the datum from that memory cell • Uniform Analysis • Charge constant time for a memory access • Unrealistic, as shown for the PRAM in Chapter 2 of Akl’s book • Non-Uniform Analysis • Charge O(lg M) cost for a memory access, where M is the number of memory cells • More realistic, based on how memory is accessed • It is traditional to use the Uniform Analysis • Considering a second variable M makes the analysis complex • Most non-constant time algorithms have the same complexity under either the uniform or the non-uniform analysis.
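As an added worked instance of the two charging rules: on a machine with M = 2^20 memory cells, the non-uniform analysis charges lg M = 20 time units per memory access, while the uniform analysis charges a single time unit.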

  31. Communication through links • Assumes constant time to communicate along one link. • An earlier paper (mid-1990s?) estimates a communication step along a link to be on the order of 10-100 times slower than a computational step. • When two processors are not connected, routing a datum between the two processors requires O(m) time units, where m is the number of links along the path used.

  32. Running/Execution Time • Running Time for an algorithm • The time between when the first processor starts executing and when the last processor terminates. • Either worst case or average case can be used. • Often not stated which is used • Normally worst case for synchronous & average case for asynchronous • An upper bound is given by the number of steps of the fastest algorithm known for the problem. • The lower bound gives the minimum number of steps possible to solve a worst case of the problem • Time optimal algorithm: • An algorithm in which the number of steps is the same complexity as the lower bound for the problem. • The Casanova et al. book uses the term efficient instead.

  33. Speedup • A measure of the decrease in running time due to parallelism • Let t1 denote the running time of the fastest (possible/known) sequential algorithm for the problem. • If the parallel algorithm uses p PEs, let tp denote its running time. • The speedup of the parallel algorithm using p processors is defined by S(1,p) = t1 / tp. • The goal is to obtain the largest speedup possible. • For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively.
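As an added practical note, both running times can be measured directly; the sketch below uses OpenMP's wall-clock timer with placeholder solver routines standing in for the sequential and parallel algorithms.

    /* Sketch: estimating speedup S(1,p) = t1 / tp from measured wall-clock times. */
    #include <stdio.h>
    #include <omp.h>

    static void solve_sequential(void) { /* placeholder: best sequential algorithm */ }
    static void solve_parallel(void)   { /* placeholder: parallel algorithm on p PEs */ }

    int main(void) {
        double start, t1, tp;
        start = omp_get_wtime();
        solve_sequential();
        t1 = omp_get_wtime() - start;          /* sequential running time t1 */
        start = omp_get_wtime();
        solve_parallel();
        tp = omp_get_wtime() - start;          /* parallel running time tp */
        printf("speedup S(1,p) = %.2f\n", t1 / tp);
        return 0;
    }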

  34. Optimal Speedup • Theorem: The maximum possible speedup for parallel computers with n PEs for “traditional” problems is n. • Proof (for “traditional problems”): • Assume a computation with sequential running time ts is partitioned perfectly into n processes of equal duration. • Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.). • Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation. • The parallel running time is ts/n. • Then the parallel speedup of this computation is S(n,1) = ts/(ts/n) = n

  35. Linear Speedup • The preceding slide argues that S(n,1) ≤ n and optimally S(n,1) = n. • This speedup is called linear since S(n,1) = Θ(n) • The next slide formalizes this statement and provides an alternate proof.

  36. Speedup Folklore Theorem Theorem: For a traditional problem, the maximum speedup possible for a parallel algorithm using p processors is at most p. That is, S(1,p) ≤ p. An Alternate Proof (for “traditional problems”) • Let t1 and tp be as above and assume t1/tp > p • This parallel algorithm can be simulated by a sequential computer which executes each parallel step with p serial steps. • The simulating sequential algorithm runs in p·tp time; since p·tp < t1, it is faster than the fastest sequential algorithm. • This contradiction establishes the “theorem”. Note: We will see that this result is not valid for some non-traditional problems.

  37. Speedup Usually Suboptimal • Unfortunately, the best speedup possible for most applications is much less than n, since • The assumptions in the first proof are usually invalid. • Usually some portions of the algorithm are sequential, with only one PE active. • Other portions of the algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps. • E.g., during parts of the execution, many PEs may be waiting to receive or to send data.

  38. Speedup Example • Example: Compute the sum of n numbers using a binary tree with n leaves. • It takes tp = lg n steps to compute the sum in the root node. • It takes t1 = n-1 steps to compute the sum sequentially. • The number of processors is p = 2n-1, so for n > 1, S(1,p) = (n-1)/(lg n) ≈ n/(lg n) < n < 2n-1 = p • The speedup is not optimal.
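An added sequential simulation of the tree algorithm, assuming n is a power of two; each pass of the outer loop corresponds to one of the lg n parallel steps, and the pairs combined in the inner loop would run simultaneously on the tree.

    /* Sketch: pairwise (tree-style) summation of n = 2^k values in lg n passes. */
    #include <stdio.h>

    double tree_sum(double x[], int n) {
        for (int stride = 1; stride < n; stride *= 2)      /* lg n combining passes */
            for (int i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];                     /* pairs combine in parallel */
        return x[0];
    }

    int main(void) {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %g\n", tree_sum(x, 8));              /* prints 36 */
        return 0;
    }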

  39. Optimal Speedup Example • Example: Computing the sum of n numbers optimally using a binary tree. • Assume the binary tree has n/(lg n) leaves. • Each leaf is given lg n numbers and computes the sum of these numbers initially in Θ(lg n) time. • The overall sum of the values in the n/(lg n) leaves is found next, as in the preceding example, in lg(n/lg n) = Θ(lg n) time. • Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = cp for some constant c, since p = 2(n/lg n) - 1. • The speedup of this parallel algorithm is optimal, as the Speedup Folklore Theorem is valid for traditional problems like this one, so the speedup is at most p.
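An added sketch of the cost-optimal variant under the same assumptions: roughly n/lg n "leaves" each sum lg n values sequentially (phase 1), and the partial sums are then combined pairwise (phase 2).

    /* Sketch of the cost-optimal scheme: about n/lg n "leaves" each sum lg n
       values, then the partial sums are combined pairwise as before. */
    #include <math.h>
    #include <stdio.h>

    double optimal_sum(const double x[], int n) {
        int group = (int)ceil(log2((double)n));            /* lg n values per leaf */
        if (group < 1) group = 1;                          /* guard for n = 1 */
        int p = (n + group - 1) / group;                   /* about n/lg n leaves */
        double partial[p];                                 /* one partial sum per leaf */
        for (int j = 0; j < p; j++) {                      /* phase 1: per-leaf sums */
            partial[j] = 0.0;
            for (int i = j * group; i < n && i < (j + 1) * group; i++)
                partial[j] += x[i];
        }
        for (int stride = 1; stride < p; stride *= 2)      /* phase 2: tree combine */
            for (int j = 0; j + stride < p; j += 2 * stride)
                partial[j] += partial[j + stride];
        return partial[0];
    }

    int main(void) {
        double x[16];
        for (int i = 0; i < 16; i++) x[i] = 1.0;
        printf("sum = %g\n", optimal_sum(x, 16));          /* prints 16 */
        return 0;
    }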

  40. Superlinear Speedup • Superlinear speedup occurs when S(n,1) > n • Most texts other than Akl’s argue that • Linear speedup is the maximum speedup obtainable. • The preceding “optimal speedup proof” is used to argue that superlinearity is impossible • Linear speedup is optimal for traditional problems • Those stating that superlinearity is impossible claim that any ‘superlinear behavior’ is due to natural reasons of the following types: • the extra memory in the parallel system • a sub-optimal sequential algorithm was used • random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection)

  41. Superlinearity (cont) • Some problems cannot be solved without the use of parallel computation. • It seems reasonable to consider parallel solutions to these to be “superlinear”. • Speedup would be infinite. • Examples include many large software systems • Data too large to fit in the memory of a sequential computer • Problems containing deadlines that must be met, where the required computation is too great for a sequential computer to meet the deadlines. • Those not accepting superlinearity discount these, as indicated in the previous slide.

  42. Superlinearity (cont.) • The textbook by Casanova et al. discusses superlinearity primarily on pp. 11-12 and in Exercise 4.3 on p. 140. • It accepts that superlinearity is possible • It indicates natural causes (e.g., larger memory) as the primary (perhaps only) reason for superlinearity. • It treats cases of infinite speedup (where a sequential processor cannot do the job) as an anomaly. • E.g., two men can carry a piano upstairs, but one man cannot do this, even if given unlimited time.

  43. Some Non-Traditional Problems • Problems with any of the following characteristics are “non-traditional” • Input may arrive in real time • Input may vary over time • For example, find the convex hull of a set of points, where points can be added & removed over time. • Output may affect future input • Output may have to meet certain deadlines. • See the discussion on pages 16-17 of Akl’s text • See Chapter 12 of Akl’s text for additional discussions.

  44. Other Superlinear Examples • Selim Akl has given a large number of examples of different types of superlinear algorithms. • Example 1.17 in Akl’s online textbook (p. 17). • See the last chapter (Ch. 12) of his online text. • See his 2007 book chapter titled “Evolving Computational Systems” • To be posted to the website shortly. • Some of his conference & journal publications

  45. A Superlinearity Example • Example: Tracking • 6,000 streams of radar data arrive during each 0.5 second and each corresponds to an aircraft. • Each must be checked against the “projected position box” for each of 4,000 tracked aircraft • The position of each plane that is matched is updated to correspond with its radar location. • This process must be repeated every 0.5 seconds • A sequential processor cannot process data this fast.
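An added back-of-the-envelope check on the slide's own numbers: matching 6,000 incoming reports against the projected position boxes of 4,000 tracked aircraft is on the order of 6,000 × 4,000 = 24,000,000 box tests every 0.5 seconds (about 48 million tests per second), before any update work, which is what motivates the claim that a single sequential processor cannot keep up.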

  46. Slowdown Folklore Theorem: If a computation can be performed with p processors in time tp, and with q processors in time tq, where q < p, then tp ≤ tq ≤ (1 + p/q) tp Comments: • The new run time tq is O(tp p/q) • This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs. • It establishes a smooth degradation of running time as the number of processors decreases. • For q = 1, this is essentially the speedup theorem, but with a less sharp constant. • Similar to Proposition 1.1, p. 11, in the Casanova et al. textbook.

  47. Proof: (for traditional problems) • Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs). • Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors. • Since all p processors are not necessarily busy all the time, W ≤ p tp. • With q processors and q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do at most ⌈Wi/q⌉ elementary steps. • The time taken for the simulation is tq ≤ ⌈W1/q⌉ + ⌈W2/q⌉ + ... + ⌈Wk/q⌉ ≤ (W1/q + 1) + (W2/q + 1) + ... + (Wk/q + 1) ≤ tp + p tp/q
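As an added numeric illustration of the bound: taking p = 8 and q = 2, the theorem gives t2 ≤ (1 + 8/2) t8 = 5 t8, so running the algorithm with a quarter of the processors increases the running time by at most a factor of five.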

  48. A Non-Traditional Example for “Folklore Theorems” • Problem Statement: (from Akl, Example 1.17) • n sensors are each used to repeat the same set {s1, s2, ..., sn} of measurements n times at an environmental site. • It takes each sensor n time units to make each measurement. • Each si is measured in turn by another sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn) • The sequences produced by 3 sensors are (s1, s2, s3), (s3, s1, s2), (s2, s3, s1)

  49. Nontraditional Example (cont.) • The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time. • The data from a sensor is consistently sent to the same processor. • Each si must be read and stored in unit time or the stream is terminated. • Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored. • If the minimum is below a safe value, the process is stopped • The overall minimum of the sensor reports over all n steps is also computed and recorded.

  50. Sequential Processor Solution • A single processor can only monitor the values in one data stream. • After the first value of a selected stream is stored, it is too late to process the other n-1 items that arrived at the same time in the other streams. • A stream remains active only if its preceding value was read. • Note this is the best we can do to solve the problem sequentially • We assume that a processor can read a value, compare it to the smallest previous value received, and update its “smallest value” in one unit of time. • Sequentially, the computation takes n computational steps plus n(n-1) time units of delay between arrivals, or exactly n² time units to process all n streams. • Summary: Calculating the first minimum takes n² time units
