
DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING



  1. DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING Chapter 9 : Distributed Computing vs Distributed High Performance Computing

  2. What is Distributed Computing • Distributed computing is a field of computer science that studies distributed systems. • A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. • A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.

  3. What is Distributed High Performance Computing • Distributed High-Performance Computing (DHPC) combines advances in research and technologies in high-speed networks, software, distributed computing and parallel processing to deliver high-performance, large-scale and cost-effective computational, storage and communication capabilities to a wide range of applications.

  4. Issues in Distributed System • Architectures • Processes • Communication • Naming • Synchronization • Consistency and Replication • Fault Tolerance • Security

  5. 1. Architecture Models • Three basic architectural models for distributed systems: • workstations/servers model; • processor pool (thin client) model; • integrated model.

  6. Example 1: Internet

  7. Example 2: Intranets

  8. Example 5: Distributed Multimedia System

  9. 1. DHPC : Example of Architecture for Distributed High Performance Computing

  10. 2. Processes : Interprocess Communication • Distributed processes or tasks need to communicate • For distributed computing we usually do not have shared memory, so we need to use message passing • process A sends a message to process B • process B receives it • send/receive may be synchronous (A blocks until B receives the message) or asynchronous (some buffering mechanism allows A to proceed as soon as it has sent the data) • simple ideas (send/receive, plus some startup interrogation to find out process identities) form the basis of distributed and parallel computation mechanisms • pairing a receive and a send together into a single unit forms a transaction
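
A minimal sketch of these ideas in Python, using the standard multiprocessing module; the process names and the payload are illustrative, not part of the original slides. Note that Pipe.send is buffered, so the send here is effectively asynchronous, while the paired recv blocks:

    # Two processes with no shared memory communicate by message passing.
    # Pairing the request (send) with the reply (recv) forms a transaction.
    from multiprocessing import Process, Pipe

    def process_a(conn):
        conn.send({"task": "sum", "data": [1, 2, 3]})  # A sends a message to B
        result = conn.recv()                           # A blocks until B replies
        print("A received:", result)

    def process_b(conn):
        msg = conn.recv()             # B blocks until a message arrives
        conn.send(sum(msg["data"]))   # the reply completes the transaction

    if __name__ == "__main__":
        a_end, b_end = Pipe()         # a bidirectional channel between the two
        pa = Process(target=process_a, args=(a_end,))
        pb = Process(target=process_b, args=(b_end,))
        pa.start(); pb.start()
        pa.join(); pb.join()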

  11. 3. Communication : Remote Procedure Calls • We need a standard mechanism for invoking some processing on a remote machine • Remote Procedure Calls (RPC) enable this for procedural languages • They are the C-based precursor to remote method invocation in object-oriented systems like Java and CORBA • RPCs look like normal procedure calls, giving a relatively transparent API • When an RPC happens, input parameters are copied to the destination process • The body of the procedure is executed in the context of the remote process • Output parameters are copied back and the call returns • RPCs are implemented using a structured form of message passing • RPC transparency does break down, however: e.g. timeouts on RPC calls are sometimes desirable • Also, call-by-value (copy) semantics are necessary, so pointer types cannot be passed transparently over RPC • The cost of a remote call can be orders of magnitude greater than that of a local call, so an RPC only pays off when the computation required for the call is much larger than the time to initiate the RPC
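
As a concrete illustration, Python's standard xmlrpc library makes a remote call look like a local one; the server address and the add procedure are assumptions for this sketch (a matching server appears after the next slide):

    # RPC from the caller's side: the proxy makes the network round trip
    # look like an ordinary procedure call.
    import xmlrpc.client

    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))  # input parameters are copied to the remote process
    # Call-by-value semantics: a list passed here is serialized and copied,
    # so the server cannot mutate the caller's original object.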

  12. 3. Communication : Client/Server using RPC • The RPC callee's lifetime is almost always longer than the call • The callee is usually some kind of server • The callee never terminates (in practical terms) • For example: loop: accept_call(...); process_this_call(...); complete_call(...); end loop • Hence the RPCs have a sort of local data persistence • RPC calls of this sort are a form of generator • There is an interesting set of problems in controlling long-lived data and resource allocation and access at the server end
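
A sketch of such a long-lived callee, using the same assumed xmlrpc setup as above; the loop inside serve_forever plays the role of accept_call / process_this_call / complete_call:

    # A persistent RPC server: state ("counter") outlives any single call.
    from xmlrpc.server import SimpleXMLRPCServer

    counter = 0  # local data with persistence across calls

    def add(x, y):
        global counter
        counter += 1        # server-side state survives between RPCs
        return x + y

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add)
    server.serve_forever()  # in practical terms, the callee never terminates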

  13. 4. Naming : Name Space Distribution • Names are used to share resources, to uniquely identify entities, and to refer to locations in computer systems. • An important issue with naming is that a name can be resolved to the entity it refers to. Name resolution allows a process to access the named entity. • To resolve names, it is necessary to implement a naming system. • The difference between naming in DSs and non-DSs lies in the way naming systems are implemented. In a DS, the implementation of a naming system is itself often distributed across multiple machines. • Two major issues in designing naming systems in a DS: efficiency and scalability.

  14. 4. Naming : Name Space Distribution • An example partitioning of the DNS name space, including Internet-accessible files, into three layers.

  15. 4. Naming : Name Space Distribution • A naming service is implemented by name servers. In large DSs with many entities it is necessary to distribute the implementation of a name space over multiple name servers. • To efficiently implement a name space for a large-scale, possibly worldwide, DS, it is usually organized hierarchically and may be partitioned into logical layers: * global layer: formed by the highest-level nodes, e.g., the root and other directory nodes logically close to the root. The directory tables in these nodes are rarely changed. * administrational layer: formed by the directory nodes managed within a single organization. The nodes in this layer are relatively stable, although less stable than those in the global layer.

  16. 4. Naming : Name Space Distribution * managerial layer: formed by the nodes that may change regularly, e.g., nodes representing hosts in a LAN. The nodes in this layer are also maintained by end users of a DS. • The distribution of a name space across multiple name servers affects the implementation of name resolution. • Iterative name resolution: the client contacts each name server in turn, and every server it asks returns a referral to the next server to contact, until the name is fully resolved. • Recursive name resolution: the client contacts only the root name server, and each server passes the remaining resolution on to the next name server on the client's behalf (see the sketch below).
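
The two strategies can be sketched with a toy in-memory model; the server names and tables below are invented for illustration and are not real DNS:

    # Toy name space split across three "name servers".
    servers = {
        "root":           {"com": "com-server"},
        "com-server":     {"example.com": "example-server"},
        "example-server": {"www.example.com": "192.0.2.10"},
    }

    def resolve_iterative(name):
        # The client itself contacts one server after another; each
        # server only returns a referral to the next server to ask.
        server = "root"
        for label in ["com", "example.com", name]:
            answer = servers[server][label]
            if label == name:
                return answer    # final answer reached
            server = answer      # referral: client asks the next server

    def resolve_recursive(name, server="root", rest=("com", "example.com")):
        # The client asks only the root; each server forwards the
        # remaining resolution to the next server on the client's behalf.
        if not rest:
            return servers[server][name]
        return resolve_recursive(name, servers[server][rest[0]], rest[1:])

    print(resolve_iterative("www.example.com"))  # 192.0.2.10
    print(resolve_recursive("www.example.com"))  # 192.0.2.10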

  17. DHPC : 4. Naming System • Refer to article

  18. 5. Synchronization • We need to measure time accurately: • to know the time at which an event occurred at a computer • to do this we need to synchronize its clock with an authoritative external clock • Algorithms for clock synchronization are useful for concurrency control based on timestamp ordering • There is no global clock in a distributed system • Logical time is an alternative • It gives an ordering of events, which is also useful for consistency of replicated data
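
A minimal sketch of logical time using Lamport clocks (the two-process scenario is illustrative): each process ticks before an event, and a receiver advances past the timestamp carried by the message, so causally related events are ordered:

    # Lamport logical clock: timestamps respect happened-before order.
    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):                # local or send event
            self.time += 1
            return self.time

        def receive(self, msg_time):   # receive event
            self.time = max(self.time, msg_time) + 1
            return self.time

    p, q = LamportClock(), LamportClock()
    t_send = p.tick()            # p sends: timestamp 1 travels with the message
    q.tick()                     # unrelated local event at q: timestamp 1
    t_recv = q.receive(t_send)   # q receives: max(1, 1) + 1 = 2
    assert t_send < t_recv       # the send is ordered before the receive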

  19. 6. Consistency and Replication • Two primary reasons for replicating data in a DS: reliability and performance. • Reliability: the system can continue working after one replica crashes by simply switching to one of the other replicas; it also becomes possible to provide better protection against corrupted data. • Performance: when the number of processes accessing data managed by a server increases, performance can be improved by replicating the server and dividing the work; also, a copy of the data can be placed in the proximity of the processes using it, to reduce data access time. • Consistency issue: keeping all replicas up-to-date.

  20. 6.1 Distribution Protocols: Replica Placement • Several ways of distributing (propagating) updates to replicas, independent of the supported consistency model, have been proposed. • Replica placement: deciding where, when, and by whom copies of the data store are to be placed. • Three different types of copies, permanent replicas, server-initiated replicas, and client-initiated replicas, can be distinguished, logically organized as shown in the next slide. • Permanent replicas: the initial set of replicas constituting a distributed data store.

  21. 6.1 Server Initiated Replicas • Server-initiated replicas: copies of a data store for enhancing performance. They are created at the initiative of the (owner of the) data store. • For example, it may be worthwhile to install a number of such replicas of a Web server in regions where many requests are coming from. • One of the major problems with such replicas is to decide exactly where and when the replicas should be created or deleted. • Server-initiated replication is gradually increasing in popularity, especially in the context of Web hosting services. Such hosting services can dynamically replicate files to servers close to demanding clients.

  22. 6.1 Client-initiated replicas • Client-initiated replicas: copies created at the initiative of clients, commonly known as caches. • In principle, managing the cache is left entirely to the client, but there are many occasions in which the client can rely on participation from the data store to inform it when cached data has become stale. • Placement of client caches is relatively simple: a cache is normally placed on the same machine as its client, or on a machine shared by clients on the same LAN. • Data are generally kept in a cache for a limited amount of time, to prevent extremely stale data from being used, or simply to make room for other data.
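
A minimal sketch of such a client cache with a time-to-live, so extremely stale entries are never served; the key names and the TTL value are illustrative:

    import time

    class TTLCache:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.store = {}  # key -> (value, expiry time)

        def put(self, key, value):
            self.store[key] = (value, time.monotonic() + self.ttl)

        def get(self, key):
            entry = self.store.get(key)
            if entry is None:
                return None              # miss: fetch from the data store
            value, expires = entry
            if time.monotonic() > expires:
                del self.store[key]      # stale: evict, forcing a refetch
                return None
            return value

    cache = TTLCache(ttl_seconds=5.0)
    cache.put("/users/42", {"name": "alice"})
    print(cache.get("/users/42"))        # hit while the entry is fresh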

  23. Replica Placement

  24. DHPC : 6. Replica consistency in a Data Grid • A Data Grid is a wide area computing infrastructure that employs Grid technologies to provide storage capacity and processing power to applications that handle very large quantities of data. Data Grids rely on data replication to achieve better performance and reliability by storing copies of data sets on different Grid nodes. When a data set can be modified by applications, the problem of maintaining consistency among existing copies arises.

  25. DHPC : 6. Replica consistency in a Data Grid • The consistency problem also concerns metadata, i.e., additional information about application data sets such as indices, directories, or catalogues. This kind of metadata is used both by the applications and by the Grid middleware to manage the data. For instance, the Replica Management Service (the Grid middleware component that controls data replication) uses catalogues to find the replicas of each data set. Such catalogues can also be replicated and their consistency is crucial to the correct operation of the Grid. • Therefore, metadata consistency generally poses stricter requirements than data consistency. In this paper we report on the development of a Replica Consistency Service based on the middleware mainly developed by the European Data Grid Project. The paper summarises the main issues in the replica consistency problem, and lays out a high-level architectural design for a Replica Consistency Service. Finally, results from simulations of different consistency models are presented.

  26. Cont..

  27. 7: Fault Tolerance • Failure: When a component is not living up to its specifications, a failure occurs • Error: That part of a component's state that can lead to a failure • Fault: The cause of an error • Fault prevention: prevent the occurrence of a fault • Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults
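
One common way to build such a component is to mask transient faults by retrying; a small sketch (the retried operation and the backoff values are illustrative):

    import time

    def with_retries(operation, attempts=3, delay=0.5):
        # The component still meets its specification if any attempt
        # succeeds; only when all attempts fail does a failure occur.
        last_error = None
        for i in range(attempts):
            try:
                return operation()
            except OSError as err:            # e.g. a transient network fault
                last_error = err
                time.sleep(delay * (2 ** i))  # exponential backoff
        raise last_error                      # the fault could not be masked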

  28. 7.1 Failure Models • Different types of failures.

  29. DHPC : 7. Fault Tolerance in Grid • Refer to article

  30. 8. Security • In distributed systems, security is the combination of availability, integrity, and confidentiality. A dependable distributed system is thus fault tolerant and secure.

  31. 8.1 Types of Threats

  32. 8.2 Security Mechanisms • Encryption • Hiding message content • Checking for message modification • Authentication • Verifying a subject's identity • Authorization • Verifying a subject's access rights • Auditing • Recording security-relevant actions for after-the-fact analysis ("closing the barn door")
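
As a sketch of the "check for message modification" mechanism, the Python standard library's hmac module can attach and verify a keyed integrity tag; the key and message here are illustrative:

    import hashlib
    import hmac

    key = b"shared-secret-key"
    message = b"transfer 100 to account 7"

    # Sender computes a tag over the message with the shared key.
    tag = hmac.new(key, message, hashlib.sha256).digest()

    def verify(key, message, tag):
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)  # constant-time compare

    print(verify(key, message, tag))         # True: message unmodified
    print(verify(key, message + b"!", tag))  # False: message was tampered with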

  33. DHPC : 8. Security in Grid • Refer to article

  34. Difference Between Grid Computing Vs. Distributed Computing • Definition of Distributed Computing • Distributed computing is an environment in which a group of independent and geographically dispersed computer systems take part in solving a complex problem, each solving a part of the solution and then combining the results from all computers. These are loosely coupled systems working in coordination towards a common goal. It can be defined as: • A computing system in which services are provided by a pool of computers collaborating over a network. • A computing environment that may involve computers of differing architectures and data representation formats that share data and system resources.

  35. Difference Between Grid Computing Vs. Distributed Computing • Definition of Grid Computing • The basic idea behind grid computing is to utilize the idle CPU cycles and storage of millions of computer systems across a worldwide network, functioning as a flexible, pervasive, and inexpensively accessible pool that can be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid. There are many definitions of the term grid computing: • A service for sharing computer power and data storage capacity over the Internet • An ambitious and exciting global effort to develop an environment in which individual users can access computers, databases and experimental facilities simply and transparently, without having to consider where those facilities are located. [RealityGrid, Engineering & Physical Sciences Research Council, UK 2001] http://www.realitygrid.org/information.html • Grid computing is a model for allowing companies to use a large number of computing resources on demand, no matter where they are located. www.informatica.com/solutions/resource_center/glossary/default.htm

  36. Difference Between Grid Computing Vs. Distributed Computing • Since 1980, two advances in technology have made distributed computing a more practical idea: computer CPU power and communication bandwidth. These technologies make it not only feasible but easy to put together large numbers of computer systems to meet complex computational or storage requirements. But the number of real distributable applications is still somewhat limited, and the challenges are still significant (standardization, interoperability, etc.). • As is clear from the definitions, traditional distributed computing can be characterized as a subset of grid computing. Some of the differences between the two are:

  37. Cont… • Distributed computing normally refers to managing or pooling hundreds or thousands of computer systems which individually are more limited in their memory and processing power. Grid computing, on the other hand, has some extra characteristics. It is concerned with the efficient utilization of a pool of heterogeneous systems, with optimal workload management utilizing an enterprise's entire computational resources (servers, networks, storage, and information) acting together to create one or more large pools of computing resources. There is no limitation on users, departments or organizations in grid computing.

  38. Cont… • Grid computing's focus on the ability to support computation across multiple administrative domains sets it apart from traditional distributed computing. Grids offer a way of using information technology resources optimally inside an organization, involving virtualization of computing resources. Its support for multiple administrative policies and security authentication and authorization mechanisms enables it to be distributed over a local, metropolitan, or wide-area network.

  39. Case Study : Distributed Computing • Air Traffic Management System • The Air Traffic Management System is an example of a distributed problem-solving system. • It has elements of both cooperative and competitive problem-solving. • It includes complex organizations such as Flight Operations Centers, the FAA Air Traffic Control Systems Command Center (ATCSCC), and traffic management units at en route centers that focus on daily strategic planning, as well as individuals concerned more with immediate tactical decisions (such as air traffic controllers and pilots). • The design of this system has evolved over time to rely heavily on the distribution of tasks and control authority, both to keep cognitive complexity manageable for any one individual operator and to provide redundancy (both human and technological) as a safety net to catch the slips or mistakes that any one person or entity might make. • Within this distributed architecture, a number of different conceptual approaches have been applied to deal with cognitive complexity and to provide redundancy.

  40. Cont… • These approaches can be characterized in terms of the strategy for distributing: • (1) control or responsibility, • (2) knowledge or expertise, • (3) access to data, • (4) processing capacity, and • (5) goals and priorities. • This paper will provide an abstract characterization of these alternative strategies for distributing work in terms of these 5 dimensions, and will illustrate and evaluate their effectiveness in terms of concrete realizations found within the National Airspace System.

  41. Case Study : Distributed High Performance Computing • ATM-based Distributed High Performance Computing System • DISCWorld

  42. An ATM-based Distributed High Performance Computing System • We describe the distributed high performance computing system we have developed to integrate a heterogeneous set of high performance computers, high capacity storage systems and fast communications hardware. • Our system is based upon Asynchronous Transfer Mode (ATM) communications technology, and we routinely operate between the geographically distant sites of Adelaide and Canberra (separated by some 1100 km), using Telstra's ATM-based Experimental Broadband Network (EBN). • We discuss some of the latency and performance issues that result from running day-to-day operations across such a long distance network.

  43. DISCWorld: A Distributed High Performance Computing Environment • An increasing number of science and engineering applications require distributed and parallel computing resources to satisfy user response time requirements. • Distributed science and engineering applications require a high performance "middleware" which both allows the embedding of legacy applications and enables new distributed programs, making the best use of existing and specialised (parallel) computing resources. • We are developing a distributed information systems control environment which will meet the needs of a middleware for scientific applications. We describe our DISCWorld system and some of its key attributes; a critical attribute is architectural scalability. • We discuss DISCWorld in the context of some existing middleware systems such as CORBA, and other distributed computing research systems such as Legion and Globus. • Our approach is to embed applications in the middleware as services, which can be chained together. • User interfaces are provided in the form of Java applets downloadable across the World Wide Web. These form a gateway for user requests to be transmitted into a semi-opaque "cloud" of high-performance resources for distributed execution.
