
COMP60611 Fundamentals of Parallel and Distributed Systems



  1. COMP60611 Fundamentals of Parallel and Distributed Systems. Lecture 1: Levels of Abstraction and Implementation Options. Len Freeman, Graham Riley, Centre for Novel Computing, School of Computer Science, University of Manchester

  2. Overview • Application-oriented Levels of Abstraction • Application Level • Specification Level • Algorithm Level • 'Implementation' Levels of Abstraction • Program Level • state-transition architecture • processes and threads (message-passing vs. data-sharing) • Realisation Level • development of parallel architecture • UMA, NUMA and COMA • Summary

  3. Background We are primarily concerned with the design of applications for execution on parallel and distributed computers that give correct results and good (high) performance: these are concurrent systems. Developing concurrent systems that do what they are supposed to do, and which deliver high performance where necessary, requires well-designed interactions between many different facets of computing, ranging from the applications themselves to the structure of the parallel computers which execute them. This interaction covers several Levels of Abstraction, which encompass fundamentally different views. The following five Levels of Abstraction demonstrate how these views gradually change from being application-oriented, at one end of the spectrum, to hardware-oriented, at the other.

  4. Levels of Abstraction • The Application Level – here a relatively simple, and possibly informal, description of the application (i.e., the problem to be solved) is stated or developed. • In a weather prediction example, say for medium-range weather forecasting, this description might be something like: “Once (or perhaps twice) daily, forecast the state of the global weather system for a period of ten days from now. For operational reasons, the computation on which this forecast is based must be achieved within an 8 hour 'window' each day.”

  5. Levels of Abstraction • The Specification Level – here the simple application description is turned into a formal specification (abstract model) of the application problem, expressed in a suitable mathematical notation. • For the weather prediction example, the global situation is modelled (roughly) by a small number of coupled partial differential equations (PDEs) representing conservation of momentum, energy and mass, and a state equation. These describe key continuous variables such as wind speed, density (including moisture content), temperature and pressure.
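A purely schematic illustration of what such a specification looks like (this is a generic sketch, not the operational model): the prognostic variables are governed by coupled conservation laws, closed by a state equation, for example

```latex
% Schematic form only: u = vector of prognostic variables (momentum,
% energy/temperature, mass/moisture), F = flux, S = sources and sinks;
% the ideal-gas state equation closes the system.
\frac{\partial \mathbf{u}}{\partial t} + \nabla \cdot \mathbf{F}(\mathbf{u}) = \mathbf{S}(\mathbf{u}),
\qquad p = \rho R T .
```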

  6. Levels of Abstraction • The Algorithm Level – here a systematic procedure for solving the problem is developed, based on some discrete data domain. In many cases, the problem itself cannot be solved directly, and so an approximation is used to solve a related problem. The discrete domains encountered here reflect the discrete nature of computer storage that will be encountered later during the development; nevertheless, they remain rather abstract entities. Parallelism arises in two basic forms: data-parallelism and task-parallelism. • For the weather prediction example, the global continuum is represented by a grid-point approximation (i.e. using spot values of the key variables at each grid-point). Progress in time is approximated by a series of discrete time-steps, governed by finite difference equations derived from the PDEs – mostly data-parallelism over the finite difference grid.
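As a minimal sketch of the kind of grid-point, time-stepping computation meant here (the field, grid size and update rule below are invented for illustration and are not the Met Office scheme), an explicit finite-difference loop looks like this; the inner loop over grid points is where the data-parallelism arises, since within a time step each point depends only on old values:

```c
/* Minimal sketch: explicit finite-difference time-stepping on a 1-D grid.
 * The field u, grid size N and coefficient r are illustrative only; a real
 * forecast model evolves many coupled 3-D fields.                          */
#include <stdio.h>

#define N      100     /* number of grid points         */
#define NSTEPS 1000    /* number of discrete time steps */

int main(void)
{
    double u[N], u_new[N];
    const double r = 0.25;            /* stability-limited coefficient */

    for (int i = 0; i < N; i++)       /* initial state (spot values)   */
        u[i] = (i == N / 2) ? 1.0 : 0.0;

    for (int t = 0; t < NSTEPS; t++) {
        /* Data-parallelism: every interior grid point can be updated
         * independently within a time step.                           */
        for (int i = 1; i < N - 1; i++)
            u_new[i] = u[i] + r * (u[i-1] - 2.0 * u[i] + u[i+1]);
        u_new[0]   = u[0];            /* fixed boundary values         */
        u_new[N-1] = u[N-1];
        for (int i = 0; i < N; i++)   /* advance to the new time level */
            u[i] = u_new[i];
    }

    printf("u[N/2] after %d steps: %f\n", NSTEPS, u[N/2]);
    return 0;
}
```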

  7. Levels of Abstraction • The Program Level – here the algorithm is expressed as a program, moving the concerns further towards the restrictions associated with real computer storage. Unfortunately, programming languages tend to reflect the hardware architecture of the computer that will execute the compiled code, often leading to 'unnatural' constraints on the algorithm. • A typical example for medium-range weather forecasting is the Met Office’s Unified Model (UM) code, which merges developments in both weather and climate modelling into a single meteorological FORTRAN program with (in the latest versions) message-passing parallel constructs in MPI (Message Passing Interface). The Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF) is another example. • The use of message-passing has forced the (parallel) algorithm in certain directions that may not have been followed if sequential hardware had continued to be used. The message-passing approach was adopted due to the availability of specific parallel hardware. A minimal sketch of the SPMD message-passing program style is shown after this slide.
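The sketch below shows the bare bones of the SPMD message-passing style used in such codes (in C rather than Fortran, and with only an illustrative comment standing in for the real work of the model):

```c
/* Minimal SPMD message-passing skeleton (C + MPI); illustrative only,
 * not the structure of the UM or IFS codes.                           */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the parallel processes  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there?     */

    /* Each process works on its own portion of the (distributed) grid;
     * boundary values would be exchanged with neighbouring processes via
     * MPI_Send / MPI_Recv before each time step.                           */
    printf("process %d of %d: working on my sub-domain\n", rank, size);

    MPI_Finalize();
    return 0;
}
```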

  8. Levels of Abstraction • The Architecture Level – here the program is implemented on parallel digital hardware. The object-code generated by the compiler, or embedded in the runtime system, causes low-level electronic transactions across the data paths of the hardware, between processors, memory modules and peripheral devices. • In the weather prediction example: in 2009, the Met Office bought a large (several thousand multicore processors) IBM Power 6 Cluster system which now runs the UM code in operational mode. The runtime message-passing system is the manufacturer's implementation of MPI. • Generation of suitable parallel code has not been straightforward; it has taken many years, and involved a considerable amount of expert manpower. Tools have been of limited assistance, being rudimentary or overly ambitious. Such experiences abound in the practice of parallel computing; our later focus is on trying to understand why this is so.

  9. Numerical Weather Prediction • Roughly speaking, 1/3rd of the 6-8 hour daily compute-time is pre-processing, to establish initial conditions at each grid-point, 1/3rd is time-stepping iterations, to compute successive new weather states, and 1/3rd is post-processing, to derive the interesting qualitative conditions (cloud, rainfall, etc.). • Limiting factors are: • accuracy of computed initial state; • resolution of space and time; • quality (accuracy and divergence) of time-stepping iterative approximation. • The quality of a forecast can be measured post hoc by the length of time that (some of) its predictions remain within a given error of the actual weather.

  10. Numerical Weather Prediction • In practice, parallelism of several thousand-fold is being used in this production system. Parallelism is exploited across the data 'spaces' (independent vertical columns of the atmosphere), and across the different forecasts of the EPS (Ensemble Prediction System). • Attention is increasingly focussed on additional pre-processing (to improve the computed initial states) and probabilistic algorithms for predicting the 'most likely' outcome based on ensemble forecasts.

  11. Concurrent systems: Issues • Correct behaviour: • In the sense of verification (“did we build it right?”) as opposed to validation (“did we build the right thing?”). • Performance: • What limits the performance of a system? • What happens to performance as the number of processors/cores increases? (This is termed scalability analysis.) • In this module the focus is on the design phase: • What options do we have in building a system, and how do we choose? • Why is developing a concurrent solution so much more difficult than developing a sequential solution?

  12. A wider view… • Concurrent systems occur in many applications beyond computer systems. • For example, airports, supermarkets, banking (e.g. ATM systems). • These involve queueing networks and transaction processing. • Many of these systems contain complex computing systems too (ATM networks, reservation systems and associated databases). • Such systems do not necessarily compute a result, in the same way as the weather prediction system; rather, they support some complex, on-going behaviour. • Mobile systems add further levels of complexity due to their dynamic nature. • For example, mobile telecommunications systems, wireless access for mobile computing (PDAs and laptops) and airport traffic control systems. • The underlying basic concurrency issues remain the same.

  13. Summary • There are several key aspects to concurrent systems • Principally we will be concerned with correctness and performance • Each aspect lends itself to a different treatment at the design stage • Need to use the appropriate techniques in order to construct good concurrent systems • In this module we will look at three techniques: • Formal behaviour modelling (using the FSP language and tools) • Performance modelling (of algorithms) • Queueing theory and Discrete Simulation

  14. On the Nature of Digital Systems • Programs and their hardware realisations, whether parallel or sequential, are essentially the same; i.e., programmed state-transition. • For example, a sequential system (abstract machine or concrete hardware) comprises a state-transition machine (processor) attached to a memory which is divided into two logical sections; one fixed (the code state) and one changeable (the data state). • The code state contains instructions (or statements) that can be executed in a data-dependent sequence, changing the contents of the data state in such a way as to progress the required computation. The sequence is controlled by a program counter. • The data state contains the variables of the computation. These start in an initial state, defining the input of the computation, and finish by holding its logical output. In programming terms, the data state contains the data structures of the program.

  15. Sequential Digital Systems • Performance in the above model is governed by a state-transition cycle. The program counter identifies the 'current' instruction. This is fetched from the code state and then executed. Execution involves reading data from the data state, performing appropriate operations, then writing results back to the data state and assigning a new value to the program counter. • To a first approximation, execution time will depend on the exact number and sequence of these actions (we ignore, for the moment, the effect of any memory buffering schemes). • This is the programmer's model of what constitutes a sequential computation. It is predominantly a model of memory, and the programmer's art is essentially to map algorithms into memory in such a way that they will execute with good performance.
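A minimal sketch of this state-transition cycle for a toy machine (the four-instruction set, the code and the data values below are invented purely for illustration):

```c
/* Minimal sketch of the fetch-execute state-transition cycle for a toy
 * machine. The instruction set (LOAD/ADD/STORE/HALT) is invented here.  */
#include <stdio.h>

enum opcode { LOAD, ADD, STORE, HALT };

struct instr { enum opcode op; int addr; };      /* one code-state entry */

int main(void)
{
    struct instr code[] = {                      /* fixed code state      */
        { LOAD, 0 }, { ADD, 1 }, { STORE, 2 }, { HALT, 0 }
    };
    int data[3] = { 3, 4, 0 };                   /* changeable data state */
    int pc  = 0;                                 /* program counter       */
    int acc = 0;                                 /* processor state       */

    for (;;) {
        struct instr i = code[pc++];             /* fetch, advance the PC */
        switch (i.op) {                          /* execute               */
        case LOAD:  acc = data[i.addr];          break;
        case ADD:   acc += data[i.addr];         break;
        case STORE: data[i.addr] = acc;          break;
        case HALT:  printf("result: %d\n", data[2]); return 0;
        }
    }
}
```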

  16. Sequential Digital Systems • It will be convenient to think of memory in diagrammatic terms. In this sense, the model can be visualised as follows: [Diagram: a single memory image comprising Code (fixed memory) and Data (changeable memory)] • The Code has an associated, data-dependent locus of control, governed by the program counter; there is also some associated processor state which we roll into the Data, for the moment. This whole memory image is called a process (after Unix terminology; other names are used).

  17. Parallel Execution • It is possible to execute more than one process concurrently, and to arrange for the processes to co-operate in solving some large problem using a message-passing protocol (cf. Unix pipes, forks, etc.). • However, the 'start-up' costs associated with each process are large, mainly due to the cost of protecting its data memory from access by any other process. As a consequence, a large parallel grain size is needed. • An alternative is to exploit parallelism within a single process, using some form of 'lightweight' process, or thread. This should allow use of a smaller parallel grain size, but carries risks associated with sharing of data. • We shall look at the case where just two processors are active. This can be readily generalised to a larger number of processors.

  18-20. Two-fold Parallelism • In the message-passing scheme, two-fold parallelism is achieved by simultaneous activation of two 'co-operating' processes. • Each process can construct messages (think of these as values of some abstract data type) and send them to other processes. A process has to receive incoming messages explicitly (this restriction can be overcome, but it is not a straightforward matter to do so). • The message-passing scheme is illustrated in the following diagram: [Diagram: Process A and Process B, each with its own Code and Data, exchanging messages] • A sketch of this scheme in MPI is shown below.
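A concrete, purely illustrative sketch of the scheme in C + MPI: rank 0 plays Process A and constructs and sends a message, while rank 1 plays Process B and must receive it explicitly (the message value and tag are invented for illustration):

```c
/* Illustrative two-process message-passing sketch (C + MPI).
 * Run with exactly two processes, e.g. "mpirun -np 2 ./a.out".          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double msg;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* Process A constructs and sends    */
        msg = 3.14;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {              /* Process B must receive explicitly */
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process B received %f from Process A\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```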

  21-23. Two-fold Parallelism • Within a single process, an obvious way of allowing two-fold parallel execution is to allow two program counters to control progress through two separate, but related, code states. To a first approximation, the two streams of instructions will need to share the sequential data state. [Diagram: a single process in which Thread A executes Code A and Thread B executes Code B, both operating on one shared Data state] • When, as frequently happens, Code A and Code B are identical, this scheme is termed single-program, multiple-data (SPMD). A minimal sketch using threads is shown below.
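A minimal sketch of two threads within one process sharing the data state, using POSIX threads; the SPMD flavour comes from running the same function in both threads (the array, its size and the work done are invented for illustration):

```c
/* Minimal data-sharing sketch: two threads in one process run the same
 * code (SPMD) over different halves of a shared array.                  */
#include <pthread.h>
#include <stdio.h>

#define N 8

static double shared_data[N];              /* shared data state           */

static void *worker(void *arg)             /* same code for both threads  */
{
    long id = (long)arg;                   /* thread id: 0 or 1           */
    for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        shared_data[i] = 2.0 * i;          /* disjoint halves: no locking needed */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0L);   /* Thread A */
    pthread_create(&b, NULL, worker, (void *)1L);   /* Thread B */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("shared_data[N-1] = %f\n", shared_data[N - 1]);
    return 0;
}
```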

  24. Privatising Data • Each stream of instructions (from Code A and from Code B) will issue references to the shared data state using a global addressing scheme (i.e. the same address, issued from whichever stream of instructions, will access the same shared data memory location). • There are obvious problems of contention and propriety associated with this sharing arrangement; it will be necessary to use locks to protect any variable that might be shared, and these will affect performance. • Hence, it is usual to try and identify more precisely which parts of the data state really need to be shared; then at least the use of locks can be confined to those variables (and only those variables) that really need the protection.
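A minimal sketch of the locking just described, again with POSIX threads (the shared counter and the amount of work are invented for illustration). Each thread accumulates into a private variable without locking, and only the update to the genuinely shared variable is protected; without the mutex, the two updates could interleave and lose increments:

```c
/* Sketch: a genuinely shared variable must be protected by a lock;
 * private (per-thread) variables need no protection.                    */
#include <pthread.h>
#include <stdio.h>

static long shared_total = 0;                          /* shared data    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *accumulate(void *arg)
{
    (void)arg;
    long my_private_sum = 0;                           /* private data: no lock needed */
    for (int i = 0; i < 100000; i++)
        my_private_sum += 1;

    pthread_mutex_lock(&lock);                         /* protect the shared update    */
    shared_total += my_private_sum;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, accumulate, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("shared_total = %ld\n", shared_total);      /* always 200000  */
    return 0;
}
```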

  25-27. Privatising Data • In general, there will be some variables that are only referenced from one instruction stream or the other. Assuming that these can be identified, we can segregate the data state into three segments, as follows: [Diagram: the Data state split into Private Data A, Shared Data and Private Data B; Thread A comprises Code A plus Private Data A, Thread B comprises Code B plus Private Data B] • We can then isolate the execution objects, thread A and thread B, within the process, which have the Shared Data as their only part in common.

  28. Identifying Private Data • Determining which variables fall into which category (private-to-A; private-to-B; shared) is non-trivial. In particular, the required category for a certain variable may depend on the values of variables elsewhere in the data state. • In the general case (more than two threads) the procedure for identifying categories must distinguish the following: • Shared variable --- can potentially be accessed by more than one thread. • Private variable (to thread X) --- can only ever be accessed by thread X. • How to achieve this distinction in acceptable time is an interesting research problem.

  29. Parallel Programming Language Requirements • Consider the additional programming language constructs that may be necessary to handle parallelism in either of the ways we have described. • Message-Passing (between processes): • means to create new processes; • means to place data in a process; • means to send/receive messages; • means to terminate 'dead' processes. • Data-Sharing (between threads in one process): • means to create new threads; • means to share/privatise data; • means to synchronise shared accesses; • means to terminate 'dead' threads.

  30. The Architecture Level The final part of this jigsaw is to study how parallel programs get executed in practical parallel and distributed computers. Recall the nature of the parallel programming constructs introduced earlier; we consider their implementation, in both the run-time software library and the underlying hardware.

  31. Conventional View of Computer Architecture We start by recalling the traditional view of a state-based computer, as first expounded by John von Neumann (1945). A finite word-at-a-time memory is attached to a Central Processing Unit (CPU) (a “single Core” processor). The memory contains the fixed code and initial data defining the program to be executed. The CPU contains a Program Counter (PC) which points to the first instruction to be executed. The CPU follows the instruction cycle, similar to the state-transition cycle described earlier for the Program Level. The memory is accessed solely by requests of the following form: <read,address> or <write,data,address>
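A small sketch of that memory interface as a data type (the struct, field names and memory size are invented for illustration): every access is either a read of an address or a write of data to an address.

```c
/* Sketch of the word-at-a-time memory interface: every access is either
 * <read, address> or <write, data, address>. Names are illustrative.     */
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 1024

enum mem_op { MEM_READ, MEM_WRITE };

struct mem_request {
    enum mem_op op;
    uint32_t    address;       /* unique, global identifier of a location */
    uint32_t    data;          /* used only for writes                    */
};

static uint32_t memory[MEM_WORDS];

static uint32_t serve(struct mem_request r)  /* the memory's side of the interface */
{
    if (r.op == MEM_WRITE) { memory[r.address] = r.data; return 0; }
    return memory[r.address];                /* MEM_READ */
}

int main(void)
{
    serve((struct mem_request){ MEM_WRITE, 42, 7 });
    printf("%u\n", serve((struct mem_request){ MEM_READ, 42, 0 }));  /* prints 7 */
    return 0;
}
```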

  32. Conventional View of Computer Architecture This arrangement is conveniently illustrated by the following diagram:

  33. Conventional View of Computer Architecture • A memory address is a unique, global identifier for a memory location – the same address always accesses (the value held in) the same location of memory. The range of possible addresses defines the address space of the computer. • All addresses 'reside' in one logical memory; there is therefore only one interface between CPU and memory. • Memory is often organised as a hierarchy, and the address space can be virtual; i.e. two requests to the same location may not physically access the same part of memory --- this is what happens, for example, in systems with cache memory, or with a disk-based paged virtual memory. Access times to a virtual-addressed memory vary considerably.

  34. Parallel Computer Architecture • Imagine that we have a parallel program consisting of two threads or two processes, A and B. We need two program counters to run them. There are two ways of arranging this: • Time-sharing – we use a sequential computer, as described above, and arrange that A and B are periodically allowed to use the (single) CPU to advance their activity. When the “current” thread or process is deselected from the CPU, its entire state is remembered so that it can restart from the latest position when it is reselected. • Multi-processor – we build a structure with two separate CPUs, both accessing a common memory. Code A and Code B will be executed on the two different CPUs. • Time-sharing is slow (and, hence, not conducive to high performance), but does get used in certain circumstances. However, we shall ignore it from now on.

  35. Parallel Computer Architecture In diagrammatic form, the multi-processor appears as follows:

  36. Parallel Computer Architecture • The structure on the previous slide is known as a shared memory multiprocessor, for obvious reasons. • The memory interface is the same as for the sequential architecture, and both memory and addresses retain the same properties.

  37. Parallel Computer Architecture • However, access to the common memory is subject to contention, when both CPUs try to access memory at the same time. The greater the number of parallel CPUs, the worse this contention problem becomes. • A commonly used solution is to split the memory into multiple banks which can be accessed concurrently. This arrangement is shown below:

  38. Parallel Computer Architecture • In this arrangement, the interconnect directs each memory access to an appropriate memory bank, according to the required address. Addresses may be allocated across the memory banks in many different ways (interleaved, in blocks, etc.). • The interconnect could be a complex switch mechanism, with separate paths from each CPU to each memory bank, but this is expensive in terms of physical wiring. • Hence, cheap interconnect schemes, such as a bus, tend to be used. However, these limit the number of CPUs and memory banks that can be connected together (to a maximum of around 30).
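A brief sketch of two of the address-allocation schemes mentioned above (the number of banks and the block size are invented for illustration): interleaving sends consecutive addresses to different banks, while block allocation keeps each contiguous block in one bank.

```c
/* Sketch: two common ways of allocating addresses across memory banks. */
#include <stdint.h>
#include <stdio.h>

#define NBANKS     8
#define BLOCK_SIZE 4096                    /* words per contiguous block  */

static unsigned bank_interleaved(uint32_t addr)
{
    return addr % NBANKS;                  /* consecutive addresses hit different banks */
}

static unsigned bank_blocked(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NBANKS;   /* each block lives in one bank */
}

int main(void)
{
    for (uint32_t a = 0; a < 4; a++)
        printf("addr %u: interleaved bank %u, blocked bank %u\n",
               a, bank_interleaved(a), bank_blocked(a));
    return 0;
}
```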

  39. Parallel Computer Architecture • Two separate things motivate the next refinement: • Firstly, we can double the capacity of a bus by physically co-locating a CPU and a memory bank and letting them share the same bus interface. • Secondly, we know from analysis of algorithms (and the development of programs from them) that many of the required variables are private to each thread. By placing private variables in the co-located memory, we can avoid having to access the bus in the first place. • Indeed, we don't really need to use a bus for the interconnect. • The resulting structure has the memory physically distributed amongst the CPUs. Each CPU-plus-memory resembles a von Neumann computer, and the structure is called a distributed memory multicomputer.

  40. Parallel Computer Architecture The architecture diagram for a distributed memory multicomputer is shown below: [Diagram: each CPU paired with its own local memory bank (memory1, ...), the CPU-plus-memory nodes connected via an interconnect]

  41. Distributed Computer Architecture • Some distributed memory multicomputer systems have a single address space in which the available addresses are partitioned across the memory banks. • These typically require special hardware support in the interconnect. • Others have multiple address spaces, in which each CPU is able to issue addresses only to its 'own' local memory bank. • Finally, interconnection networks range from very fast, very expensive, specialised hardware to ‘the internet’.

  42. Parallel Computer Architecture • The operation of the single address space version of this architecture, known as distributed shared memory (DSM) is logically unchanged from the previous schemes (shared memory multiprocessor). • However, some memory accesses only need to go to the physically attached local memory bank, while others, according to the address, have to go through the interconnect. This leads to different access times for different memory locations, even in the absence of contention. • This latter property makes distributed shared memory a non-uniform memory access (NUMA) architecture. For high performance, it is essential to place code and data for each thread or process in readily accessible memory banks. • In multiple address space versions of this architecture (known as distributed memory or DM), co-operative parallel action has to be implemented by message-passing software (at least at the level of the run-time system).
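One common placement idiom on NUMA systems (assuming a 'first touch' page-placement policy, which is common but operating-system-dependent, and assuming threads stay on, or are pinned to, cores near their memory bank) is to have each thread initialise the data it will later use, so that those pages end up in its local memory bank. A minimal pthreads sketch, with illustrative names and sizes:

```c
/* Sketch of the 'first touch' placement idiom on a NUMA system.
 * Assumes the OS places a page in the memory bank local to the CPU that
 * first writes it, and that threads stay near that bank.                */
#include <pthread.h>
#include <stdlib.h>

#define N (1L << 20)

static double *field;

static void *init_and_compute(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / 2), hi = (id + 1) * (N / 2);

    for (long i = lo; i < hi; i++)   /* first touch: pages for this half       */
        field[i] = 0.0;              /* end up near the thread that owns them  */

    for (long i = lo; i < hi; i++)   /* later accesses are then (mostly) local */
        field[i] += 1.0;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    field = malloc(N * sizeof *field);     /* untouched: no pages placed yet */
    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, init_and_compute, (void *)i);
    for (long i = 0; i < 2; i++) pthread_join(t[i], NULL);
    free(field);
    return 0;
}
```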

  43. Parallel Computer Architecture • Note that cache memories can be used to solve some of the problems raised earlier; e.g. to reduce the bus traffic in a shared memory architecture. • Many systems have a NUMA structure, but their single address space is virtual. This arrangement is sometimes referred to as virtual shared memory (VSM). • The effect of VSM can be implemented on a DM system entirely in software, in which case it is usually called distributed virtual shared memory (DVSM). • Most very large systems today consist of many ‘shared memory, multicore’ nodes, connected via some form of interconnect.

  44. The advent of multi-core • A modern multi-core processor is essentially a NUMA shared memory multiprocessor on a chip. • Consider a recent offering from AMD, the quad-core Opteron processor. • The next two slides show a schematic of a single quad-core processor and a shared memory system consisting of four quad-core processors, i.e. a “quad quad-core” system with a total of 16 cores. • The number of cores is rising rapidly (to keep up with “Moore’s law”). • Large systems connect thousands of multi-core processors.

  45. Processor: Quad-Core AMD Opteron Source: www.amd.com, Quad-Core AMD Opteron Product Brief

  46. AMD Opteron 4P server architecture Source: www.amd.com, AMD 4P Server and Workstation Comparison

  47. Application-Oriented View • Solving a computational problem involves design and development at several distinct Levels of Abstraction. The totality of issues to be considered well exceeds the capabilities of a normal human being. • At Application Level, the description of the problem to be solved is informal. A primary task in developing a solution is to create a formal (mathematical) application model, or specification. Although formal and abstract, an application model implies computational work that must be done and so ultimately determines the performance that can be achieved in an implementation. • Algorithms are procedures, based on discrete data domains, for solving (approximations to) computational problems. An algorithm is also abstract, although it is generally more clearly related to the computer that will implement it than is the corresponding specification.

  48. Implementation-Oriented View • Concrete implementation of an algorithm is achieved through the medium of a program, which determines how the discrete data domains inherent in the algorithm will be laid out in the memory of the executing computer, and also defines the operations that will be performed on that data, and their relative execution order. Interest is currently focused on parallel execution using concurrent programming languages, based on multiple active processes (with message-passing) or multiple threads (with data-sharing). • Performance is ultimately dictated by the available parallel platform, via the efficiency of its support for processes or threads. Hardware architectures are still evolving, but a clear trend is emerging towards distributed memory structure, with various levels of support for sharing data at the Program Level. • Correctness requires an algorithm which is correct with respect to the specification, AND a correct implementation of the algorithm (as well as the correct operation of computer hardware and network infrastructure).
