COS 497 - Cloud Computing • 2. Distributed Computing
The Cloud operates on the parallel/distributed computing paradigm. The next slides give an overview of this mode of computing.
Reference: https://computing.llnl.gov/tutorials/parallel_comp/
First, some jargon …
• Parallel computing is a form of computation in which a large problem is divided into a number of smaller, discrete, relatively independent parts that are executed simultaneously.
• There are several different forms (or granularities) of parallel computing: bit-level, instruction-level, data, and task parallelism.
• Parallelism has been used for many years, but interest in it has increased in recent years, mainly in the form of multi-core processors.
• Parallel computers can be roughly classified according to the level at which the hardware supports parallelism: multi-core and multi-processor computers have multiple processing units within a single machine, while clusters, grids and clouds use multiple, distributed computers to work on the same task.
Synchronization
The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until the other task(s) reach the same or a logically equivalent point.
Granularity
• In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
• Coarse: relatively large amounts of computational work are done between communication events.
• Fine: relatively small amounts of computational work are done between communication events.
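The synchronization point described above can be sketched with a barrier. This is a minimal illustration, assuming Python threads stand in for the parallel tasks; the function name `run_phases` and the two "phases" are hypothetical.

```python
import threading

def run_phases(num_tasks=3):
    """Each task does phase 1, waits at the barrier, then does phase 2."""
    barrier = threading.Barrier(num_tasks)  # the synchronization point
    log = []                                # shared event log
    log_lock = threading.Lock()             # protects the shared log

    def task(task_id):
        with log_lock:
            log.append(("phase1", task_id))
        barrier.wait()                      # blocks until every task has arrived
        with log_lock:
            log.append(("phase2", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

No task can record a phase 2 entry until all tasks have recorded phase 1, although the order of tasks within each phase may vary from run to run.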
Traditional mode of computation – instructions executed in sequence
[Figure: example of sequential execution]
Parallel Computing
• Problems (i.e. programs) are divided into smaller parts and executed simultaneously/concurrently on different processors.
[Figure: sequential vs. parallel execution]
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism.
• For the Cloud, an important class is distributed systems.
• Distributed System: a loosely-coupled form of parallel computing
  • Uses multiple computers to perform computations in parallel.
  • Computers are connected via a network – computers are distributed in “space”.
• Distributed System: uses a “distributed memory”.
  • Message passing is typically used to exchange information between the processors, as each one has its own private memory.
Flynn’s Taxonomy for Computer Architectures

                       Instructions
                       Single (SI)    Multiple (MI)
Data   Single (SD)     SISD           MISD
       Multiple (MD)   SIMD           MIMD
SISD – Single Instruction stream, Single Data stream
[Figure: one processor fed by a single stream of data (D) and a single stream of instructions]
A single processor executes a single stream of instructions (i.e. a single program), operating on a single set of data. Traditional form of computation.
SIMD – Single Instruction stream, Multiple Data streams
[Figure: one instruction stream applied to the data streams D0, D1, …, Dn]
A number of processing units execute the same instruction stream (Single Instruction stream), each on a different set of data (Multiple Data streams). A type of parallel computer. Two varieties: Processor Arrays and Vector Pipelines. And the Cloud!
MIMD – Multiple Instruction streams, Multiple Data streams
[Figure: two processors, each with its own instruction stream and its own data stream (D)]
A number of processors execute different programs (Multiple Instruction streams) with different sets of data (Multiple Data streams). Provides true parallel processing. Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. And the Cloud!
Memory Typology: Shared
[Figure: several processors connected to one common memory]
Programs executing on different processors share, i.e. have access to, the same common memory (usually via a bus) and communicate with each other via this memory. Memory accesses need to be synchronized.
[Figure: processors P0, P1, …, Pn with shared memory and private memory]
Memory Typology: Distributed
[Figure: processors, each with its own memory, connected by a network]
Each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors via a network link.
Memory Typology: Hybrid – Distributed Shared Memory
[Figure: groups of processors with memory, connected by a network]
Each processor of a cluster has access to a large shared memory. In addition, each processor has access to remote data via a network link.
Programming Models Mirror the Hardware Models
Patterns for Parallelism • Parallel computing has been around for decades. • Here are some well-known architectural patterns …
Master/Slaves
[Figure: one master connected to several slaves]
One of the simplest parallel programming paradigms is "master/slaves". The main computation (the master) generates many sub-problems, which are fired off to be executed by "someone else" (a slave). The only interaction between the master and slave computations is that the master starts the slave computation, and the slave computation returns the result to the master. There are no significant dependencies among the slave computations.
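A minimal sketch of this pattern, assuming a thread pool plays the role of the slaves. The sub-problem (`solve_subproblem`, a sum of squares) and the function names are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(n):
    """Hypothetical independent sub-problem: sum of the squares below n."""
    return sum(i * i for i in range(n))

def master(sizes, num_workers=4):
    """Generate the sub-problems, hand them to the pool, collect the results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() starts each sub-problem and returns the results in input order.
        return list(pool.map(solve_subproblem, sizes))
```

The sub-problems never talk to each other, so the master's only jobs are to start them and to gather what they return. In CPython, threads do not speed up CPU-bound work like this; a process pool or separate machines would be the realistic choice.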
Producer/Consumer Flow
[Figure: producers (P) passing work items to consumers (C)]
Producer “threads” create work items. Consumer “threads” process them. Can be “daisy-chained”, i.e. pipelined.
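A daisy-chained producer/consumer flow might look like the sketch below. The two stage functions (add one, then double) and the `DONE` sentinel are assumptions made for the example.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of the stream

def producer(out_q, items):
    """Create work items and pass them downstream."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)

def stage(in_q, out_q, func):
    """Consume from in_q; act as the producer for the next stage via out_q."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)  # forward the end-of-stream marker
            return
        out_q.put(func(item))

def run_pipeline(items):
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(q1, items)),
        threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)),
        threading.Thread(target=stage, args=(q2, q3, lambda x: x * 2)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:  # the caller is the final consumer
        item = q3.get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Each middle stage is a consumer of the queue before it and a producer for the queue after it, which is what makes the chain a pipeline.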
Work Queues
Used in the Cloud, e.g. Windows Azure.
[Figure: producers (P) and consumers (C) around a shared queue of work items (W)]
The work queue parallel processing model consists of a queue of work items and processes to produce and complete these work items. Each participant can take a work item off the queue, and if necessary, each participant can add newly generated work items to the queue. As each participant completes its work item, it does not wait for some participant to assign it a new task, but instead takes the next item off the work queue and begins execution.
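The sketch below illustrates a work queue in which participants both take items and add newly generated ones. The task (summing a range of integers by repeatedly splitting it) and all names are hypothetical; a cloud work queue would be a networked service rather than an in-process `queue.Queue`.

```python
import queue
import threading

def work_queue_sum(lo, hi, num_workers=4, chunk=10):
    """Sum the integers in [lo, hi) using a shared work queue."""
    work = queue.Queue()
    total = 0
    total_lock = threading.Lock()

    def worker():
        nonlocal total
        while True:
            item = work.get()          # take the next item, no central assigner
            if item is None:           # stop signal
                work.task_done()
                return
            a, b = item
            if b - a > chunk:
                mid = (a + b) // 2
                work.put((a, mid))     # add newly generated work items
                work.put((mid, b))
            else:
                part = sum(range(a, b))
                with total_lock:
                    total += part
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.put((lo, hi))
    work.join()                        # returns once every item is finished
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return total
```

A worker enqueues the two halves before marking its own item as done, so the queue's count of unfinished items cannot reach zero while work is still outstanding.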
Cloud Computing
• A cloud provider has hundreds of thousands of nodes (aka servers).
• Cloud computing is massively parallel computing with multi-processors (i.e. many multi-core processors).
• In principle, your application may run on one, two, … thousands of servers (i.e. processors).
• For your application to run on one, two, … thousands of servers, your application code or data must be parallelized, i.e. split up into independent or relatively independent parts.
Parallelizing code is really hard work!
- It means splitting a program up into relatively independent parts, which communicate now and then with each other.
- Multi-threaded programs are a form of parallelism, but the general case is still a big research problem.
Splitting data up into smaller chunks is easy, though. Most Cloud applications are based on data parallelism.
For parallel processing to work, the computational problem should be able to
- Be broken apart into discrete pieces of work that can be solved simultaneously.
- Execute multiple program instructions at any time.
- Be solved in less time with multiple compute resources than with a single compute resource.
The compute resources might be:
- A single computer with multiple processors
- An arbitrary number of computers connected by a network
- A combination of both – Cloud Computing!
Divide and Conquer
Approach used by MapReduce, a popular cloud approach.
[Figure: the “Work” is partitioned into w1, w2, w3; each part goes to a “worker”; the partial results r1, r2, r3 are combined into the “Result”]
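The partition / work / combine steps can be sketched with a small word count. This is only an illustration of the divide-and-conquer shape, not the MapReduce API; `partition`, `count_words` and `word_count` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(items, n):
    """Split items into n nearly equal contiguous parts (w1, w2, ...)."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts

def count_words(lines):
    """The work done by one worker on its own partition."""
    return Counter(word for line in lines for word in line.split())

def word_count(lines, num_workers=3):
    parts = partition(lines, num_workers)
    result = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(count_words, parts):  # r1, r2, r3
            result.update(partial)                    # combine step
    return result
```

Because each worker touches only its own partition, the only coordination needed is the final combine.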
Different Workers? May be • Different threads in the same core • Different cores in the same CPU • Different CPUs in a multi-processor system • Different machines in a distributed system
Parallelization Problems • How do we assign work units to workers? • What if we have more work units than workers? • What if workers need to share partial results? • How do we aggregate partial results? • How do we know all the workers have finished? • What if workers die? What is the common theme in all of these problems?
Common Theme? • Parallelization problems can arise from • Communication between workers • Access to shared resources (e.g. data) • Thus, we need a synchronization mechanism! Some mechanism that allows workers to synchronize (i.e. keep in step) themselves with other workers. • This is tricky • Finding bugs is hard • Solving bugs is even harder
Managing Multiple Workers
• Difficult because
  • (Often) do not know the order in which workers run
  • (Often) do not know where the workers are running
  • (Often) do not know when workers interrupt each other
• Thus, we need synchronization primitives (used in operating systems!)
  • Semaphores/locks (lock, unlock)
  • Condition variables (wait, notify, broadcast)
  • Barriers
• Still, lots of insidious (i.e. very nasty!) problems:
  • Deadlock, livelock, race conditions, ...
• Moral of the story: be careful!
• Even trickier if the workers are on different machines
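As one illustration of these primitives, the sketch below uses a condition variable (wait / notify) to hand values from one worker to another through a one-slot mailbox. The `Mailbox` class and `transfer` function are hypothetical; `notify_all` plays the role of "broadcast".

```python
import threading

class Mailbox:
    """One-slot mailbox guarded by a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._full = False

    def put(self, value):
        with self._cond:
            while self._full:            # wait until the slot is empty
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()      # wake any waiting reader

    def get(self):
        with self._cond:
            while not self._full:        # wait until a value is present
                self._cond.wait()
            value, self._full = self._value, False
            self._cond.notify_all()      # wake any waiting writer
            return value

def transfer(values):
    """Send values from the calling thread to a consumer thread."""
    box = Mailbox()
    received = []

    def consumer():
        for _ in values:
            received.append(box.get())

    t = threading.Thread(target=consumer)
    t.start()
    for v in values:
        box.put(v)
    t.join()
    return received
```

The `while` loops around `wait()` matter: a woken thread rechecks its condition before proceeding, which guards against acting on stale state.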
Parallel Programming Models
There are several parallel programming models in common use:
- Shared Memory
- Threads
- Distributed Memory / Message Passing
- Data Parallel
- Hybrid
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)
Parallel programming models exist as an abstraction above hardware and memory architectures. Although it might not seem apparent, these models are not specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware.
Shared Memory Model
In this programming model, tasks share a common address space, which they read and write asynchronously (i.e. whenever they need to). Various mechanisms, such as locks/semaphores, may be used to control access to the shared memory – synchronizing access to memory. An advantage of this model from the programmer's point of view is that there is no notion of data "ownership", so there is no need to specify explicitly the communication of data between tasks.
Threads Model
This programming model is a type of shared memory programming. In the threads model of parallel programming, a single "heavyweight" process (i.e. program) can have multiple "lightweight", concurrent execution paths, i.e. threads.
Example: A program a.out is scheduled to run by the operating system.
- a.out loads, and acquires all of the necessary system and user resources to run. This is the "heavyweight" process.
- a.out performs some sequential work, and then creates a number of internal tasks, i.e. threads, that can be scheduled and run by the operating system concurrently.
• Each thread has local data, but also shares all of the resources of a.out.
  - This saves the overhead associated with replicating a program's resources for each thread ("lightweight").
  - Each thread also benefits from a global memory view because it shares the memory space of a.out.
• A thread may best be described as a block of code within the main program.
  - Any thread can execute its code at the same time as other threads.
• Threads communicate with each other through global memory (updating address locations).
  - This requires synchronization constructs to ensure that no more than one thread updates the same global address at any time.
• Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.
• A number of languages support threads, such as Java, C# and Python.
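A minimal sketch of threads communicating through shared memory, assuming a single shared counter as the "global address" and a lock as the synchronization construct. The function name and parameters are hypothetical.

```python
import threading

def count_with_threads(num_threads=4, increments=10_000):
    """Several threads update one shared counter, serialized by a lock."""
    counter = 0                  # shared by all threads of this process
    lock = threading.Lock()      # ensures one updater at a time

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:           # without the lock, updates could be lost
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The increment is a read-modify-write sequence, so two unsynchronized threads could read the same old value and overwrite each other's update; the lock rules that out.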
Distributed Memory - Message Passing Model
This model demonstrates the following characteristics:
- A set of tasks that use their own local memory during computation.
- Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
- Tasks exchange data through communications by sending and receiving messages.
- Data transfer usually requires cooperative operations to be performed by each process.
- For example, a send operation must have a matching receive operation.
The de facto standard for message passing is the Message Passing Interface (MPI), a specification that is implemented as a library. Bindings to this library exist for a number of languages.
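The matching send/receive idea can be sketched without MPI itself. The example below is an assumption-laden stand-in: Python threads play the tasks, each keeps its data in local variables only, and a hypothetical `Channel` class built on `queue.Queue` carries the messages. It is not the MPI API.

```python
import queue
import threading

class Channel:
    """Hypothetical one-way channel: each send() is matched by one recv()."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def recv(self):
        return self._q.get()     # blocks until a message arrives

def worker_task(inbox, outbox):
    local_data = inbox.recv()    # the task works only on data it received
    outbox.send(sum(local_data))

def run(data):
    to_worker, from_worker = Channel(), Channel()
    t = threading.Thread(target=worker_task, args=(to_worker, from_worker))
    t.start()
    to_worker.send(list(data))   # send a copy, so nothing is shared
    result = from_worker.recv()  # matching receive for the worker's send
    t.join()
    return result
```

With real distributed memory the tasks would be separate processes, possibly on separate machines, and the channel would be a network link.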
Data Parallel Model
This model demonstrates the following characteristics:
- Address space is treated globally.
- Most of the parallel work focuses on performing operations on a data set.
- The data set is typically organized into a common data structure, such as an array.
- A set of tasks work collectively on the same data structure. However, each task works on a different partition of the same data structure.
- Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.
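The "add 4 to every array element" example can be sketched as follows, assuming a Python list stands in for the array and a thread pool for the tasks. The names `add_four` and `data_parallel_add` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def add_four(chunk):
    """The same operation, applied by every task to its own partition."""
    return [x + 4 for x in chunk]

def data_parallel_add(data, num_tasks=4):
    # Ceiling division, so the data splits into at most num_tasks chunks.
    size = max(1, (len(data) + num_tasks - 1) // num_tasks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        parts = list(pool.map(add_four, chunks))
    # Reassemble the partitions in their original order.
    return [x for part in parts for x in part]
```

Every task runs identical code; only the slice of data differs, which is why this style scales by adding more partitions.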
Single Program, Multiple Data (SPMD)
SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
Single Program: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
Multiple Data: All tasks may use different data.
SPMD programs may have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute.
- That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.
The SPMD model, using message passing or hybrid programming, is probably the most commonly used parallel programming model for multi-node clusters. MapReduce is based on this model.
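A small sketch of the SPMD idea, assuming threads stand in for the tasks. Every task runs the same `program`; a hypothetical `rank` argument selects its data and decides which branch it executes.

```python
import threading

def program(rank, size, data, partial_sums, barrier, result):
    """The single program: every task runs this, on different data."""
    partial_sums[rank] = sum(data[rank::size])  # this task's share of the data
    barrier.wait()                              # wait for all partial sums
    if rank == 0:                               # only task 0 runs this branch
        result.append(sum(partial_sums))

def run_spmd(data, size=4):
    partial_sums = [0] * size
    barrier = threading.Barrier(size)
    result = []
    threads = [
        threading.Thread(
            target=program,
            args=(rank, size, data, partial_sums, barrier, result),
        )
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The `if rank == 0` branch is the conditional execution mentioned above: all tasks carry the combining code, but only one of them runs it.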
Multiple Program, Multiple Data (MPMD)
Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
Multiple Program: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid.
Multiple Data: All tasks may use different data.
MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional (i.e. code) decomposition than domain (i.e. data) decomposition.
Designing Parallel Programs Partitioning One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning. There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition.
Domain Decomposition In this type of partitioning, the data associated with a problem is broken into smaller chunks. Each parallel task then works on a portion of the data. MapReduce works like this, with all tasks being identical.
Functional Decomposition In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
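A minimal sketch of functional decomposition, assuming the overall job is to summarize a set of readings. Each task performs a different piece of the work (minimum, maximum, mean) on the same data; the `analyze` function and the task names are hypothetical.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

def analyze(readings):
    """Run a different function in each task; the data is not partitioned."""
    tasks = {"minimum": min, "maximum": max, "mean": statistics.mean}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, readings) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}
```

Here the problem is split by what has to be computed rather than by which data each task sees, in contrast to domain decomposition.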
Multi-Tier Cloud Applications
• A cloud application is typically made up of different components
  • Front end: e.g. load-balanced, stateless web servers
  • Middle worker tier: e.g. number/data crunching
  • Backend storage: e.g. SQL tables or files
• Multiple instances of each for scalability and availability
[Figure: My Cloud Application. Components shown: HTTP/HTTPS, Load Balancer, Front-End instances, Middle-Tier instances, Windows Azure Storage, SQL Azure]