
Parallel Real-Time Systems



  1. Parallel Real-Time Systems Parallel Computing Overview

  2. References (will be expanded as needed) • Website for Parallel & Distributed Computing: www.cs.kent.edu/~jbaker/PDC-F08/ • Selected slides from “Introduction to Parallel Computing” • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004. Chapter 1 is posted on the website. • Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997. An updated online version is available on the website.

  3. Outline • Why use parallel computing • Moore’s Law • Modern parallel computers • Flynn’s Taxonomy • Seeking Concurrency • Data clustering case study • Programming parallel computers

  4. Why Use Parallel Computers • Solve compute-intensive problems faster • Make infeasible problems feasible • Reduce design time • Solve larger problems in the same amount of time • Improve the answer's precision • Increase memory size • More data can be kept in memory • Dramatically reduces the slowdown caused by accessing external storage, which increases computation time • Gain competitive advantage

  5. 1989 Grand Challenges to Computational Science Categories • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Materials design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling

  6. Weather Prediction • The atmosphere is divided into 3D cells • Data includes temperature, pressure, humidity, wind speed and direction, etc. • Recorded at regular time intervals in each cell • There are about 5×10³ cells of 1 cubic mile each • A modern computer would take over 100 days to perform the calculations needed for a 10-day forecast • Details are in Ian Foster's 1995 online textbook, Designing and Building Parallel Programs • Included in the Parallel Reference List, which will be posted on the website

  7. Moore's Law • In 1965, Gordon Moore [87] observed that the density of transistors on a chip doubled every year • That is, the silicon area needed for a given circuit was being halved yearly • This is an exponential rate of increase • By the late 1980s, the doubling period had slowed to 18 months • Shrinking the silicon area also causes the speed of the processors to increase • Moore's law is sometimes stated as: “The processor speed doubles every 18 months”
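
As a rough worked example (an illustration, not a number from the slides): an 18-month doubling period means the density after t months is D(t) = D(0) · 2^(t/18), so over ten years (t = 120) density grows by a factor of 2^(120/18) ≈ 2^6.7, i.e. roughly 100×.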

  8. The Microprocessor Revolution and Moore's Law [Chart: processor speed on a log scale versus time, with curves for microprocessors, minis, mainframes, and supercomputers]

  9. Some Definitions • Concurrent – Sequential events or processes which seem to occur or progress at the same time • Parallel – Events or processes which occur or progress at the same time • Parallel computing provides simultaneous execution of operations within a single parallel computer • Distributed computing provides simultaneous execution of operations across a number of systems

  10. Flynn's Taxonomy • Best known classification scheme for parallel computers • Classifies a computer by the parallelism it exhibits in its • Instruction stream • Data stream • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream) • The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M) • Four combinations: SISD, SIMD, MISD, MIMD

  11. SISD • Single Instruction, Single Data • Usual sequential computer is primary example • i.e., uniprocessors • Note: co-processors don’t count as more processors • Concurrent processing allowed • Instruction prefetching • Pipelined execution of instructions • Independent concurrent tasks can execute different sequences of operations.

  12. SIMD • Single instruction, multiple data • One instruction stream is broadcast to all processors • Each processor, also called a processing element (or PE), is very simplistic and is essentially an ALU; • PEs do not store a copy of the program nor have a program control unit. • Individual processors can be inhibited from participating in an instruction (based on a data test).

  13. SIMD (cont.) • All active processors execute the same instruction synchronously, but on different data • On a memory access, all active processors must access the same location in their local memory • The data items form an array (or vector), and an instruction can act on the complete array in one cycle
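
A rough illustration of this execution style in C, assuming a compiler or vector unit that maps the loop onto SIMD lanes (the arrays and function name are made up for the example): each lane applies the same instruction to its own element, and the if-test shows how individual lanes can be inhibited by a data test.

    /* Sketch: one instruction stream applied to many data elements.
       Lanes whose test fails are masked off, mirroring PEs that are
       inhibited from participating in an instruction. */
    #include <stddef.h>

    void scale_positive(float *a, const float *b, size_t n)
    {
        #pragma omp simd                /* ask the compiler to vectorize */
        for (size_t i = 0; i < n; i++) {
            if (b[i] > 0.0f)            /* data test: inactive lanes skip this */
                a[i] = 2.0f * b[i];
        }
    }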

  14. SIMD (cont.) • Quinn calls this architecture a processor array • Examples include • The STARAN and the MPP (Kenneth Batcher, architect) • The Connection Machine CM-2, built by Thinking Machines

  15. How to View a SIMD Machine • Think of soldiers all in a unit. • The commander selects certain soldiers as active. • For example, every even numbered row. • The commander barks out an order to all the active soldiers, who execute the order synchronously.

  16. MISD • Multiple instruction streams, single data stream • Primarily corresponds to multiple redundant computations, say for reliability • Quinn argues that a systolic array is an example of an MISD structure (pp. 55-57) • Some authors include pipelined architectures in this category • This category does not receive much attention from most authors, so we won't discuss it further

  17. MIMD • Multiple instruction, multiple data • Processors are asynchronous and can independently execute different programs on different data sets • Communications are handled either • through shared memory (multiprocessors) • by use of message passing (multicomputers) • MIMDs are considered by many researchers to include the most powerful, least restricted computers

  18. MIMD (cont. 2/4) • Have major communication costs • When compared to SIMDs • Internal ‘housekeeping activities’ are often overlooked • Maintaining distributed memory & distributed databases • Synchronization or scheduling of tasks • Load balancing between processors • The SPMD method of programming MIMDs • All processors execute the same program • SPMD stands for single program, multiple data • An easy method of programming when the number of processors is large • While processors have the same code, each can be executing a different part of it at any point in time
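
A minimal SPMD sketch in C with MPI, as an illustration of the idea on this slide (not taken from the course materials; the printed messages are made up): every process runs the same program, but the branch on the process rank lets different processors execute different parts of it.

    /* SPMD: one program, many processes. Each process learns its own
       rank and branches accordingly. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0)
            printf("Process 0 of %d: coordinating\n", size);
        else
            printf("Process %d of %d: working on its share of the data\n",
                   rank, size);

        MPI_Finalize();
        return 0;
    }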

  19. MIMD (cont. 3/4) • A more common technique for programming MIMDs is to use multitasking • The problem solution is broken up into various tasks • Tasks are distributed among processors initially • If new tasks are produced during execution, these may be handled by the parent processor or distributed to other processors • Each processor can execute its collection of tasks concurrently • If some of its tasks must wait for results from other tasks or for new data, the processor can work on its remaining tasks • Larger programs usually require a load-balancing algorithm to rebalance tasks between processors • Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks • E.g., those on the critical path, more important tasks, or tasks with earlier deadlines

  20. MIMD (cont. 4/4) • Recall, there are two principal types of MIMD computers: • Multiprocessors (with shared memory) • Multicomputers (message passing) • Both are important and will be covered in greater detail next.

  21. Multiprocessors (Shared-Memory MIMDs) • Consist of two types • Centralized Multiprocessors • Also called UMA (Uniform Memory Access) • Symmetric Multiprocessor or SMP • Distributed Multiprocessors • Also called NUMA (Nonuniform Memory Access)

  22. Centralized Multiprocessors (SMPs)

  23. Centralized Multiprocessors (SMPs) • Consist of identical CPUs connected by a bus to a common block of memory • Each processor requires the same amount of time to access memory • Usually limited to a few dozen processors due to memory bandwidth • SMPs and clusters of SMPs are currently very popular
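
A small shared-memory sketch in C with OpenMP (an illustration, not part of the original slides; the array and its size are made up): all threads see the same array in the common memory, and the runtime divides the loop iterations among the available CPUs.

    /* Shared-memory parallelism: threads share the array `a` and split
       the iterations of the loop among themselves. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];

        #pragma omp parallel for        /* iterations divided among the CPUs */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }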

  24. Distributed Multiprocessors

  25. Distributed Multiprocessors (or NUMA) • Have a distributed memory system • Each memory location has the same address for all processors • Access time to a given memory location varies considerably for different CPUs • Normally, fast caches are used to reduce the problem of differing memory access times • This creates the problem of ensuring that all copies of the same data item in different memories remain identical (cache coherence)

  26. Multicomputers (Message-Passing MIMDs) • Processors are connected by a network • Usually an interconnection network • Also, may be connected by Ethernet links or a bus. • Each processor has a local memory and can only access its own local memory. • Data is passed between processors using messages, when specified by the program.

  27. Multicomputers (cont.) • Message passing between processors is handled by a message-passing library (e.g., MPI, PVM) • The problem is divided into processes or tasks that can be executed concurrently on individual processors • Each processor is normally assigned multiple tasks
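
A minimal message-passing sketch in C with MPI (illustrative only; the buffer size and values are made up): process 0 copies a block of data into a message and sends it, and process 1 receives it into its own local memory.

    /* Explicit message passing: data is copied between local memories,
       never shared. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double buf[100] = {0.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            buf[0] = 3.14;                        /* data to pass along */
            MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 received buf[0] = %f\n", buf[0]);
        }

        MPI_Finalize();
        return 0;
    }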

  28. Multiprocessors vs Multicomputers • Programming disadvantages of message passing • Programmers must make explicit message-passing calls in the code • This is low-level programming and is error prone • Data is not shared but copied, which increases the total data size • Data integrity: difficulty in maintaining the correctness of multiple copies of a data item

  29. Multiprocessors vs Multicomputers (cont) • Programming advantages of message-passing • No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently. • Allows PCs on a network to be easily upgraded when faster processors become available. • Mixed “distributed shared memory” systems exist • An example is a cluster of SMPs.

  30. Types of Parallel Execution • Data parallelism • Control/Job/Functional parallelism • Pipelining • Virtual parallelism

  31. Data Parallelism • All tasks (or processors) apply the same set of operations to different data • Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor • The operations may be executed concurrently • Accomplished on SIMDs by having all active processors execute the operations synchronously • Can be accomplished on MIMDs by assigning 100/p tasks to each processor and having each processor calculate its share asynchronously
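
A sketch in C of the MIMD version of this loop (the function name is hypothetical; rank and p would normally come from the runtime, e.g. MPI): each of p processors asynchronously computes its own block of roughly 100/p iterations.

    /* MIMD data parallelism: the same operation applied to different
       blocks of the data by different processors. */
    void add_block(double a[100], const double b[100], const double c[100],
                   int rank, int p)
    {
        int chunk = (100 + p - 1) / p;              /* ceiling of 100/p */
        int lo = rank * chunk;
        int hi = (lo + chunk < 100) ? lo + chunk : 100;

        for (int i = lo; i < hi; i++)               /* this processor's share */
            a[i] = b[i] + c[i];
    }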

  32. Supporting MIMD Data Parallelism • SPMD (single program, multiple data) programming is not really data parallel execution, as processors typically execute different sections of the program concurrently. • Data parallel programming can be strictly enforced when using SPMD as follows: • Processors execute the same block of instructions concurrently but asynchronously • No communication or synchronization occurs within these concurrent instruction blocks. • Each instruction block is normally followed by a synchronization and communication block of steps

  33. MIMD Data Parallelism (cont.) • Strict data parallel programming is unusual for MIMDs, as the processors usually execute independently, running their own local program.

  34. Data Parallelism Features • Each processor performs the same computation on different data sets • Computations can be performed either synchronously or asynchronously • Defn: Grain size is the average number of computations performed between communication or synchronization steps • See Quinn textbook, page 411 • Data parallel programming usually results in smaller grain size • SIMD computation is considered to be fine grain • MIMD data parallelism is usually considered to be medium grain

  35. Control/Job/Functional Parallelism • Independent tasks apply different operations to different data elements • Example: a ← 2; b ← 3; m ← (a + b) / 2; s ← (a² + b²) / 2; v ← s - m² • The first and second statements may execute concurrently • The third and fourth statements may execute concurrently
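
A sketch of this functional parallelism in C with OpenMP sections (an illustration, not from the slides): the statements computing m and s are independent and can run concurrently, while v is computed after the implicit barrier because it needs both results.

    /* Control parallelism: different, independent operations run in
       parallel as separate sections. */
    #include <stdio.h>

    int main(void)
    {
        double a = 2.0, b = 3.0, m = 0.0, s = 0.0, v;

        #pragma omp parallel sections
        {
            #pragma omp section
            m = (a + b) / 2.0;              /* task 1 */

            #pragma omp section
            s = (a * a + b * b) / 2.0;      /* task 2, independent of task 1 */
        }                                    /* implicit barrier */

        v = s - m * m;                       /* needs both m and s */
        printf("v = %f\n", v);
        return 0;
    }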

  36. Control Parallelism Features • Problem is divided into different non-identical tasks • Tasks are divided between the processors so that their workload is roughly balanced • Parallelism at the task level is considered to be coarse grained parallelism

  37. Data Dependence Graph • Can be used to identify data parallelism and job parallelism • See page 11 • Most realistic jobs contain both kinds of parallelism • Can be viewed as branches in data parallel tasks • If there is no path from vertex u to vertex v (and none from v to u), then job parallelism can be used to execute the tasks u and v concurrently • If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently

  38. For example, “mow lawn” becomes • Mow N lawn • Mow S lawn • Mow E lawn • Mow W lawn • If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously. • Similarly, if several people are available to “edge lawn” and “weed garden”, then we can use data parallelism to provide more concurrency.

  39. Pipelining • Divide a process into stages • Produce several items simultaneously

  40. Compute Partial Sums • Consider the for loop: p[0] ← a[0]; for i ← 1 to 3 do p[i] ← p[i-1] + a[i] endfor • This computes the partial sums: p[0] ← a[0]; p[1] ← a[0] + a[1]; p[2] ← a[0] + a[1] + a[2]; p[3] ← a[0] + a[1] + a[2] + a[3] • The loop is not data parallel, as there are dependencies between iterations • However, we can stage the calculations in order to achieve some parallelism

  41. Partial Sums Pipeline
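
A sketch of the partial-sums pipeline in C with MPI (illustrative; it assumes one array element per process and makes up the input values): each process receives the running sum from its left neighbor, adds its own element, and forwards the result. The pipeline pays off when many input vectors stream through the stages.

    /* Pipeline stage i: receive p[i-1], compute p[i] = p[i-1] + a[i],
       and pass p[i] on to stage i+1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double a, p, prev;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        a = rank + 1.0;                 /* made-up data: a[i] = i + 1 */
        p = a;

        if (rank > 0) {                 /* receive partial sum from the left */
            MPI_Recv(&prev, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            p = prev + a;
        }
        if (rank < size - 1)            /* forward the new partial sum */
            MPI_Send(&p, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

        printf("p[%d] = %f\n", rank, p);
        MPI_Finalize();
        return 0;
    }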

  42. Virtual Parallelism • In data parallel applications, it is often simpler to initially design an algorithm or program assuming one data item per processor • Particularly useful for SIMD programming • If the program needs more processors than are physically available, each physical processor is given a block of ⌈n/p⌉ or ⌊n/p⌋ data items • Typically requires only a routine adjustment to the program • Will result in a slowdown in running time of at least a factor of ⌈n/p⌉ • Called virtual parallelism, since each processor plays the role of several processors • A SIMD computer has been built that automatically converts code to handle ⌈n/p⌉ items per processor • The Wavetracer SIMD computer

  43. Slides from Parallel Architecture Section • See www.cs.kent.edu/~jbaker/PDC-F08/

  44. References • Slides in this section are taken from the Parallel Architecture Slides at site www.cs.kent.edu/~jbaker/PDC-F08/ • Book reference is Chapter 2 of Quinn’s textbook.

  45. Interconnection Networks • Uses of interconnection networks • Connect processors to shared memory • Connect processors to each other • Different interconnection networks define different parallel machines • The interconnection network's properties influence the type of algorithm used on a machine, as they affect how data is routed

  46. Terminology for Evaluating Switch Topologies • We need to evaluate four characteristics of a network in order to understand its effectiveness • These are • The diameter • The bisection width • The number of edges per node • Whether the edge length is constant • We'll define these and see how they affect algorithm choice • Then we will introduce several different interconnection networks

  47. Terminology for Evaluating Switch Topologies • Diameter – the largest distance between two switch nodes • A low diameter is desirable • It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes

  48. Terminology for Evaluating Switch Topologies • Bisection width – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves. • Or within 1 node of one-half if the number of processors is odd. • High bisection width is desirable. • In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of an algorithm.

  49. Terminology for Evaluating Switch Topologies • Number of edges per node • It is best if the maximum number of edges/node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes. • Degree is the maximum number of edges per node. • Constant edge length? (yes/no) • Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size.

  50. Three Important Interconnection Networks • We will consider the following three well known interconnection networks: • 2-D mesh • linear network • hypercube • All three of these networks have been used to build commercial parallel computers.
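
As a quick reference, here is a small C sketch (not from the slides) that prints the standard diameter and bisection-width formulas for these three networks, assuming n nodes, with n a perfect square for the 2-D mesh and a power of two for the hypercube.

    /* Diameter and bisection width of a linear array, a 2-D mesh
       (no wraparound links), and a hypercube, each with n nodes. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 64;                               /* example network size */
        int side = (int)sqrt((double)n);          /* mesh is side x side  */
        int dim = (int)round(log2((double)n));    /* hypercube dimension  */

        printf("linear array: diameter %d, bisection width %d\n", n - 1, 1);
        printf("2-D mesh:     diameter %d, bisection width %d\n",
               2 * (side - 1), side);
        printf("hypercube:    diameter %d, bisection width %d\n", dim, n / 2);
        return 0;
    }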
