
Presentation Overview


Presentation Transcript


  1. Presentation Overview 1. Models of Parallel Computing The evolution of the conceptual framework behind parallel systems. 2. Grid Computing The creation of a structure within the parallel framework to facilitate efficient use of shared resources. 3. Cilk Language and Scheduler A method for scheduling parallel tasks on a small, low-latency network, and a programming language that provides parallel computing with time guarantees.

  2. Presentation Overview (roadmap diagram): Parallel Computing → Grid Computing → Resource Discovery (P2P Techniques, Set Matching, Routing Techniques) and Scheduling (Cilk)

  3. Models of Parallel Computing • All Models of Parallel Computing can be subdivided into these four broad categories: • Synchronous Shared Memory Models • Asynchronous Shared Memory Models • Synchronous Independent Memory Models • Asynchronous Independent Memory Models

  4. What defines a synchronous model? Generally speaking, in a synchronous system: • All memory operations take exactly unit time. • All processors that wish to perform an operation at time t do so simultaneously. • Memory access conflicts are resolved using standard concurrency techniques.

  5. Synchronous Shared Memory: PRAM • Consists of P RAM processors each with a local register set • Unbounded global shared memory • Processors operate synchronously

  6. PRAM Properties • Processing units do not have their own memory. • Processing units communicate only via global memory. • Assumes synchronous memory access. • Each processor has random access to any global memory cell in unit-time.
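
  To make the unit-time, lock-step behaviour concrete, the following minimal C sketch (not from the slides) simulates a PRAM-style parallel sum: each pass of the outer loop stands for one synchronous round, and the inner loop plays the role of the processors that would act simultaneously on the global shared memory.

/* Sketch: PRAM-style parallel sum of N values in O(log N) rounds,
 * simulated sequentially.  Each outer iteration is one synchronous
 * unit-time step; the inner loop stands in for the P processors. */
#include <stdio.h>

#define N 8

int main(void) {
    int shared[N] = {3, 1, 4, 1, 5, 9, 2, 6};   /* global shared memory */

    for (int stride = 1; stride < N; stride *= 2) {      /* one PRAM round */
        for (int p = 0; p + stride < N; p += 2 * stride) {
            /* "processor p" reads two cells and writes one, in unit time */
            shared[p] += shared[p + stride];
        }
        /* implicit barrier: every processor finishes before the next round */
    }

    printf("sum = %d\n", shared[0]);   /* prints: sum = 31 */
    return 0;
}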

  7. Problems with PRAM Problem 1: Assumes that the processors act synchronously without any overhead. Problem 2: Assumes 100% processor and memory reliability. Problem 3: Does not exploit caching or locality (all operations are “performed” in the main memory). Problem 4: Model is unrealistic for real computers.

  8. Asynchronous Shared Memory Models • Most Asynchronous Shared Memory systems build on the PRAM model, making it more feasible for actual implementation. • We can easily make the PRAM model more realistic by assuming asynchronous operation and including an explicit synchronization step after every round. • Here a round is the smallest unit of time that allows every processor to complete its computation for a given step. • These models can generally be implemented on MIMD architectures, and charge appropriately for the cost of synchronization.

  9. Synchronous Independent Memory Models • These models consist of a connected set of processor/memory pairs • Synchronization is assumed during computation • The best example of a synchronous independent memory model is Bulk-Synchronous Parallel (BSP)

  10. Bulk-Synchronous Parallel Model (BSP): • Processing units are processor/memory pairs • There is a router to provide inter-processor communication • There is a barrier synchronizer to explicitly synchronize computation
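
  For reference (the standard BSP cost model, not stated on the slide): a superstep in which every processor performs at most w units of local work and sends or receives at most h messages costs roughly w + h·g + l, where g is the router's per-message cost and l is the cost of the barrier synchronization.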

  11. BSP Properties • BSP is conceptually simple, and provides a nice bridge to future models of computation that do not rely on shared memory. • BSP is intuitive from a programming standpoint. • Can use any network topology with a router. • Inter-processor message delivery time is not guaranteed; only a lower bound (the network latency) can be given. • Synchronous operation is taken for granted in the program's cost. • Synchronization time is not guaranteed.

  12. Asynchronous Independent Memory Models • Most asynchronous independent memory models build on the BSP framework. • These models tend to generalize BSP while providing upper bounds on communication cost and overhead. • We briefly summarize the LogP model.

  13. LogP • Provides an upper bound on network latency, and thus on inter-processor communication time (overhead) • All processors are seen as equidistant (the network diameter is used for analysis) • Resolves problems with router saturation in BSP • Solves some of BSP's practical problems.
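
  For reference (a summary of the standard LogP parameters, not spelled out on the slide): L is an upper bound on the latency of a single message through the network, o is the overhead a processor spends sending or receiving a message, g is the minimum gap between consecutive messages at one processor (the reciprocal of per-processor bandwidth), and P is the number of processors. Under this model a single point-to-point message costs roughly L + 2o: one overhead at the sender, L in transit, and one overhead at the receiver.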

  14. Summary of Computing Models • Shared memory models are conceptually ideal from a programming point of view, but difficult to implement. • Independent memory models are more feasible, but add complexity to synchronization. • We will proceed to discuss Grid Computing with the general LogP model in mind.

  15. Grid Computing Exploring Resource Discovery Protocols

  16. What is Grid Computing? “A grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational resources. These resources include, but are not limited to, processors, data storage devices, software applications, and instruments such as telescopes or satellite dishes”. [Foster, Kesselman 1998]

  17. What is Grid Computing? • Dependability: The system must provide predictable and sustained service. • Consistency: A grid should provide uniform service despite the vast heterogeneity of connected systems. • Pervasiveness: Services should be constantly available regardless of where you move throughout the system (or similar service should be available) • Inexpensiveness: The distributed structure should allow for affordable use of computational power relative to income and use.

  18. What is Grid Computing? “[Grid Computing] is the synergistic use of high-performance networking, computing, and advanced software to provide access to advanced computational capabilities, regardless of the location of users and resources.” [Foster 1998]

  19. What is Grid Computing? Goal: To access and make efficient use of remote resources.

  20. The Power Grid: A Motivating Analogy Computation today is like electricity in 1910 • In 1910 efficient electric power generation was possible, but every user had to have his own generator. • Connecting many heterogeneous electric generators together in a grid provided low-cost access to standardized service. • Similarly, a computational grid could provide reliable low-cost access to computational power.

  21. Why do we want a Grid? • Solving difficult research problems • Running large scale simulations • Increase resource utilization • Efficient use of scarce/distant resources • Collaborative design and education

  22. Major Classes of Grid Use • Distributed Computing • High Throughput • On Demand • Data Intensive • Collaborative

  23. Challenges of Grid Computing • Building a framework for communication • Parallelizing code • Dynamically scheduling resource use • Providing consistent service despite heterogeneity • Providing reliable service despite local failures • Finding resources efficiently

  24. Finding Resources in the Grid Given an instance (or run) of a problem we want to solve, how can we expedite the following? • Determine what resources we will need to solve the problem • Locate sufficient resources in the Grid • Reserve these resources • Execute the problem run

  25. Different Views of the Resource Discovery Problem We can think of the Resource Discovery problem in three ways: • A peer-to-peer indexing problem • A routing problem • A Web search/crawling problem We will need to re-pose the Resource Discovery problem under each of these disciplines.

  26. P2P for Resource Discovery • We will need a separate index for each resource • Several resources may be used in parallel • We want a least-cost fit whenever possible, but an over-fit (more resources than strictly needed) is likely acceptable • We need accountability for resource use, and a way to credit users who share resources • We will want caching, since users are likely to request the same types of resources multiple times

  27. P2P for Resource Discovery Peer-to-Peer structure is desirable, but the search/lookup must be modified. We will start to solve this problem by employing set-matching techniques for peer-to-peer lookups.

  28. Condor Classified Advertisements • Condor Classified Advertisements (ClassAds) provide a mapping from attribute names to expressions • Condor matchmaking takes two ClassAds and evaluates one w.r.t. the other • Two ClassAds match iff each has an attribute “requirements” that evaluates as true in the context of the other ClassAd • A ClassAd can have an attribute “rank” that gives a numerical value to the quality of a particular matching (large rank == better match).
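
  As an illustration (a hand-written sketch in typical ClassAd style, not taken from the slides), a job ad and a machine ad might look like the following; matchmaking evaluates each ad's requirements expression against the other, and rank orders the successful matches (here, by MIPS).

[ Type         = "Job";
  Owner        = "alice";
  Requirements = other.Type == "Machine" && other.Memory >= 1024;
  Rank         = other.Mips ]

[ Type         = "Machine";
  Name         = "node17";
  Memory       = 2048;
  Mips         = 1500;
  Requirements = other.Type == "Job";
  Rank         = 0 ]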

  29. Set Extended ClassAd Syntax We can extend this structure and consider a match between a single set request, and a ClassAd set: • set expressions: Place constraints on collective properties of the set (e.g. Total disk space or total processing power) • individual expressions: Place constraints on each ClassAd in the set (e.g. Each computer must have more than 1GB of RAM) In this context the Set Matching Algorithm will attempt to create a set of ClassAds that meets both individual and set requirements
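
  A hypothetical set request in this spirit (illustrative notation only; the exact set-extended ClassAd syntax is not shown on the slide) might place an individual expression on every machine's memory and a set expression on the aggregate disk space, while ranking candidate sets by total processing power:

[ Type            = "SetRequest";
  Requirements    = other.Memory >= 1024;
  SetRequirements = Sum(other.Disk) >= 500000;
  Rank            = Sum(other.Mips) ]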

  30. Set Matching Algorithm Note: The number of possible set matches is exponential in the number of ClassAds, so we will proceed with a heuristic approach.

  31. Set Matching Algorithm: variables • ClassAdSet: set of all ClassAds to be considered • BestSet: closest matching set found so far • CandidateSet: set considered at each iteration • LastRank: rank of BestSet • Rank: rank of CandidateSet

  32. Set Matching Algorithm

  While (ClassAdSet is not empty) {
      next = {X | X = argmax(rank(Y + CandidateSet)), for all Y in ClassAdSet};
      ClassAdSet   -= next;
      CandidateSet += next;
      Rank = rank(CandidateSet);
      If (requirements(CandidateSet) == true and Rank > LastRank) {
          BestSet  = CandidateSet;
          LastRank = Rank;
      }
  }
  return BestSet;
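
  Below is a minimal runnable sketch of the greedy loop in plain C. The sample ads, the rank function (total MIPS of the candidate set), and the set requirement (at least 4 GB of memory in total) are illustrative assumptions, not part of the slides; the loop simply keeps adding the ad that most improves the rank and remembers the highest-ranked set that satisfied the requirement.

/* Greedy set-matching sketch: rank = total MIPS, requirement = total memory. */
#include <stdio.h>

#define NUM_ADS       5
#define MIN_TOTAL_MEM 4096              /* set requirement: >= 4 GB in total */

struct classad { const char *name; int memory_mb; int mips; };

static struct classad ads[NUM_ADS] = {
    {"hostA", 2048, 1200}, {"hostB", 1024,  800}, {"hostC", 4096,  600},
    {"hostD",  512, 1500}, {"hostE", 2048, 1000},
};

int main(void) {
    int in_candidate[NUM_ADS] = {0};    /* CandidateSet membership           */
    int used[NUM_ADS]         = {0};    /* already removed from ClassAdSet   */
    int best_size = 0, best[NUM_ADS];   /* BestSet                           */
    int last_rank = -1;                 /* LastRank                          */

    for (int step = 0; step < NUM_ADS; step++) {
        /* next = the ad that maximizes rank(CandidateSet + {ad}) */
        int next = -1, next_rank = -1;
        for (int i = 0; i < NUM_ADS; i++) {
            if (used[i]) continue;
            int r = ads[i].mips;
            for (int j = 0; j < NUM_ADS; j++)
                if (in_candidate[j]) r += ads[j].mips;
            if (r > next_rank) { next_rank = r; next = i; }
        }
        used[next] = 1;
        in_candidate[next] = 1;

        /* requirements(CandidateSet): is the total memory large enough? */
        int total_mem = 0;
        for (int j = 0; j < NUM_ADS; j++)
            if (in_candidate[j]) total_mem += ads[j].memory_mb;

        if (total_mem >= MIN_TOTAL_MEM && next_rank > last_rank) {
            last_rank = next_rank;
            best_size = 0;
            for (int j = 0; j < NUM_ADS; j++)
                if (in_candidate[j]) best[best_size++] = j;
        }
    }

    printf("BestSet (rank %d):", last_rank);
    for (int k = 0; k < best_size; k++) printf(" %s", ads[best[k]].name);
    printf("\n");
    return 0;
}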

  33. Resource Discovery We can use Set Matching for Resource Discovery: • User provides mapper that maps workload for a certain application or problem to resource requirements and topology • Resource set is compiled using MDS and a “resource monitor” • Set-matching is applied in conjunction with the mapper to find an appropriate set of resources

  34. Resource Discovery: MDS • MDS: Monitoring and Discovery Service component of Globus™ Toolkit provides information about a server’s configuration, CPU load, etc… • Any query tool can be used in its place • Servers can be queried periodically to maintain central database, or as needed within P2P structure

  35. Resource Discovery: Architecture

  36. P2P Resource Discovery Consider a P2P network with a fixed-degree topology where each node has the ClassAd for all of its neighbors. We could attempt to locate resources using the following technique: • Run Set-Matching locally on the ClassAd NeighborSet • If requirements are not met, forward BestSet to a neighbor • Repeat the process without visiting a node more than once • Report BestSet (or CandidateSet) when the TTL expires

  37. What is Cilk? • Cilk (pronounced “silk”) is a C-based runtime system for multithreaded distributed applications. • It includes: • a C language extension • a thread scheduler

  38. What are Cilk’s Goals? • Provide a guaranteed bound on running time. • Define a set of problems that lend themselves to efficient distributed multithreading. • Encourage programmers to code for multithreading.

  39. Motivation: • Multithreaded programs written in a traditional language like C/C++ typically run within an acceptable approximation of the optimal running time when used in practice. • These same implementations often have poor worst-case performance. • Cilk guarantees performance within a constant factor of optimal, but limits itself to the class of fully strict computations.

  40. What is this fully strict business? • 1. A fully strict computation consists of tasks that pass data only to their direct parent task. • A task is a single time unit of work. • Threads are composed of one or more tasks in order. • 2. In a fully strict computation, threads cannot block. • Instead, a thread spawns a special successor thread to receive return values. • Successor threads do not acquire the CPU until the return values are ready.
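
  As a concrete illustration, here is the classic Fibonacci example in the spawn/sync keyword style of the Cilk language (the runtime described on this slide expresses the same structure with explicit successor threads). Each spawned child passes its result only to its direct parent, and the parent does not touch x or y until after the sync, so the computation is fully strict.

#include <stdio.h>

/* "cilk", "spawn" and "sync" are Cilk keywords layered on C;
   this compiles with the Cilk compiler, not a plain C compiler. */
cilk int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* child: its value flows only to the parent */
        y = spawn fib(n - 2);
        sync;                   /* successor point: the parent resumes only
                                   once both children have returned          */
        return x + y;
    }
}

cilk int main(int argc, char *argv[])
{
    int result = spawn fib(30);
    sync;
    printf("fib(30) = %d\n", result);
    return 0;
}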

  41. Additional Definitions: • Task – A single time unit of work, executed by exactly one processor. • Thread States – A thread can be alive (ready to execute) or stalled (waiting for data from another thread). • Activation Frame – The memory shared by tasks in a single thread, which remains allocated regardless of the state of the thread. • Activation Subtree – At any time t, the activation subtree consists of those threads that are alive. • Activation Depth – The combined size of the activation frames of a thread and all of its ancestors.

  42. The Cilk Model of Multithreaded Computation

  43. Scheduling with Work Stealing • Work Sharing – When a task is created, the host processor tries to migrate it to another processor. The drawback is that tasks are migrated even if the overall workload is high. • Work Stealing – Underutilized processors attempt to migrate tasks from other processors. • The advantage is that under high workload communication is minimized, because task migration only takes place when the recipient of the task has the necessary resources to service it.

  44. Goals of Work Stealing • Keep the processors busy. • Bound runtimes. • Limit number of active threads in order to bound memory usage. • Maximize locality of related tasks (keep them on the same processor). • Minimize communication between remote tasks.

  45. Work Stealing Definitions: • T1 = number of tasks in a computation, which is also the time it would take on a single processor. • TP = time used by a P-processor scheduling of the computation. • T∞ = depth of the computation’s critical path. • S1 = activation depth of the computation on a single processor. • SP = activation depth on P processors. • Remember: • Activation Depth – The combined size of the activation frames (allocated memory) of a task and all of its ancestors.

  46. Greedy Scheduling At each step, execute anything that is ready, in any order, utilizing as many processors as you have ready tasks (i.e., tasks not waiting on a dependency). Analysis: achieves TP <= T1 / P + T∞ In other words, the running time is at most the total work divided evenly among the P processors, plus the time to compute the critical path, i.e. the longest chain of dependencies. Problem: Memory usage is unbounded.
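
  For example (illustrative numbers, not from the slides): with T1 = 10,000 unit-time tasks, a critical path of T∞ = 100, and P = 50 processors, greedy scheduling guarantees TP <= 10,000/50 + 100 = 300 time units, against a lower bound of max(T1/P, T∞) = 200 that holds for any scheduler.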

  47. Memory Usage with Greedy Scheduling Greedy Scheduling can duplicate memory across multiple processors. For example, when a new task is spawned and different processors are handling the parent and the child, the parent’s address space will also be copied to the processor handling the child. We want an algorithm that guarantees that total memory usage will be within a constant of what the computation would consume on a single processor.

  48. Busy-Leaves Scheduling with thread pools • A global pool is kept containing threads not bound to a processor. • All processors follow this algorithm: • If idle, get a new thread A from the pool. • If A spawns a thread B, return A to the pool and commence work on B. • If A stalls, return A to the pool. • If A dies, check whether all children of A’s parent B are dead. If so, commence work on B. • This algorithm essentially guarantees that all leaves in the execution tree are busy.

  49. Analysis of Busy-Leaves Scheduling • TP <= T1 / P + T∞ • SP <= S1 * P. • In other words, the amount of memory allocated for the entire computation will be at most P times the amount of memory it would take to run on a single processor. • Problem: Competition for access to the global thread pool can slow down the overall running time.
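
  As an illustrative calculation (numbers assumed, not from the slides): if a one-processor execution never needs more than S1 = 10 MB of activation frames at once, then on P = 8 processors busy-leaves scheduling guarantees SP <= 8 × 10 MB = 80 MB, regardless of how the threads are distributed.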

  50. Randomized Work-Stealing Algorithm Randomized Work-Stealing eliminates the global shared pool and replaces it with a stack (deque) at each processor. New tasks are put on the top of the stack, and migrated (stolen) tasks are taken off the bottom. Algorithm: • If idle, pop a thread A from the top of the local stack. • If A enables a stalled parent B, B is placed on the stack (B may have to be found and stolen from another stack). • If A spawns a child C, A is put on the stack and work on C commences. • If A dies or stalls, check the local stack for another thread. If one exists, commence execution. If the stack is empty, steal the bottommost thread of a random processor.
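
  The following single-process C sketch illustrates only the per-processor deque discipline described above (the names and the simulation are assumptions; a real runtime keeps one deque per worker and performs the steal atomically): the owner pushes and pops at the top of its own stack, and an idle processor steals the oldest thread from the bottom of a randomly chosen victim.

#include <stdio.h>
#include <stdlib.h>

#define NPROC    4
#define CAPACITY 64

struct deque {
    int items[CAPACITY];          /* thread ids, for illustration        */
    int bottom, top;              /* live entries occupy [bottom, top)   */
};

static struct deque dq[NPROC];

static void push_top(struct deque *d, int t) { d->items[d->top++] = t; }
static int  pop_top(struct deque *d)         { return d->items[--d->top]; }
static int  steal_bottom(struct deque *d)    { return d->items[d->bottom++]; }
static int  is_empty(const struct deque *d)  { return d->bottom >= d->top; }

/* What processor p does when its current thread dies or stalls. */
static int get_work(int p)
{
    if (!is_empty(&dq[p]))
        return pop_top(&dq[p]);               /* local work first, LIFO for locality */

    int start = rand() % NPROC;               /* pick a random victim to start from  */
    for (int k = 0; k < NPROC; k++) {
        int victim = (start + k) % NPROC;
        if (victim != p && !is_empty(&dq[victim]))
            return steal_bottom(&dq[victim]); /* steal the victim's oldest thread    */
    }
    return -1;                                /* nothing to steal anywhere           */
}

int main(void)
{
    /* Processor 0 spawns threads 1..5; each spawn pushes work onto its own stack. */
    for (int t = 1; t <= 5; t++)
        push_top(&dq[0], t);

    printf("processor 0 runs thread %d next\n", get_work(0));  /* newest: 5 */
    printf("processor 3 steals thread %d\n",    get_work(3));  /* oldest: 1 */
    return 0;
}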
