High Performance Cluster Computing

By: Rajkumar Buyya, Monash University, Melbourne.

[email protected] http://www.dgs.monash.edu.au/~rajkumar


Objectives

  • Learn and share recent advances in cluster computing (both in research and commercial settings):

    • Architecture,

    • System Software

    • Programming Environments and Tools

    • Applications


Agenda

  • Overview of Computing

  • Motivations & Enabling Technologies

  • Cluster Architecture & its Components

  • Clusters Classifications

  • Cluster Middleware

  • Single System Image

  • Representative Cluster Systems

    • Berkeley NOW and Solaris-MC

  • Resources and Conclusions


Computing Elements

[Figure: a layered view of computing elements - Hardware, Operating System, Programming Paradigms, and Applications. A multi-processor computing system exposes its processors (P), processes, and threads through a threads interface on top of a microkernel.]


Two Eras of Computing

[Figure: the Sequential Era and the Parallel Era each progress through Architectures, System Software, Applications, and Problem Solving Environments (P.S.Es), and each moves from R&D through commercialization to commodity over the 1940-2030 timeline.]


Announcement formation of

Announcement: formation of

IEEE Task Force on Cluster Computing

(TFCC)

http://www.dgs.monash.edu.au/~rajkumar/tfcc/

http://www.dcs.port.ac.uk/~mab/tfcc/


Tfcc activities

TFCC Activities...

  • Network Technologies

  • OS Technologies

  • Parallel I/O

  • Programming Environments

  • Java Technologies

  • Algorithms and Applications

  • Analysis and Profiling

  • Storage Technologies

  • High Throughput Computing


Tfcc activities1

TFCC Activities...

  • High Availability

  • Single System Image

  • Performance Evaluation

  • Software Engineering

  • Education

  • Newsletter

  • Industrial Wing

    • All of the above have their own pages; see pointers from: http://www.dgs.monash.edu.au/~rajkumar/tfcc/


Tfcc activities2

TFCC Activities...

  • Mailing list, Workshops, Conferences, Tutorials, Web-resources etc.

  • Resources for introducing the subject at senior undergraduate and graduate levels.

  • Tutorials/Workshops at IEEE Chapters..

  • ….. and so on.

  • Visit TFCC Page for more details:

    • http://www.dgs.monash.edu.au/~rajkumar/tfcc/ periodically (updated daily!).


Computing power and computer architectures

Computing Power and Computer Architectures


Need for more Computing Power: Grand Challenge Applications

Solving technology problems using computer modeling, simulation and analysis.

[Figure: example application areas - Life Sciences, Aerospace, Geographic Information Systems, Mechanical Design & Analysis (CAD/CAM).]


How to run app faster

How to Run App. Faster ?

  • There are 3 ways to improve performance:

    • 1. Work Harder

    • 2. Work Smarter

    • 3. Get Help

  • Computer Analogy

    • 1. Use faster hardware: e.g. reduce the time per instruction (clock cycle).

    • 2. Optimized algorithms and techniques

    • 3. Use multiple computers to solve the problem: that is, increase the number of instructions executed per unit time.


Sequential architecture limitations

Sequential Architecture Limitations

  • Sequential architectures are reaching physical limits (speed of light, thermodynamics).

  • Hardware improvements such as pipelining and superscalar execution are not scalable and require sophisticated compiler technology.

  • Vector processing works well only for certain kinds of problems.


Computational Power Improvement

[Figure: computational power improvement (C.P.I.) versus number of processors - a uniprocessor improves slowly, while a multiprocessor scales with the number of processors.]


Human Physical Growth Analogy: Computational Power Improvement

[Figure: growth versus age (5-45), contrasting vertical growth with horizontal growth - analogous to improving a single processor versus adding more processors.]


Why parallel processing now

Why Parallel Processing NOW?

  • The technology of parallel processing is mature and can be exploited commercially; there has been significant R&D work on the development of tools and environments.

  • Significant developments in networking technology are paving the way for heterogeneous computing.


History of parallel processing

History of Parallel Processing

  • Parallel processing can be traced back to a tablet dated around 100 BC.

    • The tablet has 3 calculating positions.

    • Multiple positions suggest they were used for reliability and/or speed.


Motivating factors

Motivating Factors

  • The aggregate speed with which complex calculations are carried out by millions of neurons in the human brain is amazing, even though an individual neuron's response is slow (milliseconds) - this demonstrates the feasibility of parallel processing.


Taxonomy of architectures

Taxonomy of Architectures

  • Simple classification by Flynn:

    (No. of instruction and data streams)

    • SISD - conventional

    • SIMD - data parallel, vector computing

    • MISD - systolic arrays

    • MIMD - very general, multiple approaches.

  • Current focus is on MIMD model, using general purpose processors or multicomputers.


SISD: A Conventional Computer

[Figure: a single processor consumes one instruction stream and one data input stream and produces one data output stream.]

  • Speed is limited by the rate at which the computer can transfer information internally.

Ex: PC, Macintosh, Workstations


The MISD Architecture

[Figure: processors A, B, and C, each driven by its own instruction stream, operate on a single data input stream and produce a single data output stream.]

  • More of an intellectual exercise than a practical configuration. Few have been built, and none are commercially available.


SIMD Architecture

[Figure: a single instruction stream drives processors A, B, and C in lock step; each processor reads its own data input stream and produces its own data output stream, computing Ci <= Ai * Bi.]

Ex: Cray vector processing machines, Thinking Machines CM*
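To make the data-parallel operation above concrete, here is a plain scalar C sketch of the same elementwise computation (the array length is an arbitrary assumption); on a SIMD machine, or with a vectorizing compiler, the single multiply inside the loop is applied to many elements in lock step.

```c
#include <stdio.h>

#define N 8   /* arbitrary vector length for illustration */

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* set up the input streams */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* The data-parallel step from the slide: Ci <= Ai * Bi.
     * A SIMD machine applies this one instruction to all
     * elements (lanes/processors) at the same time. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    for (int i = 0; i < N; i++)
        printf("c[%d] = %.1f\n", i, c[i]);
    return 0;
}
```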


Mimd architecture

MIMD Architecture

[Figure: processors A, B, and C each execute their own instruction stream on their own data input stream and produce their own data output stream.]

  • Unlike SISD and MISD machines, a MIMD computer works asynchronously.

    • Shared memory (tightly coupled) MIMD

    • Distributed memory (loosely coupled) MIMD


Main hpc architectures 1a

Main HPC Architectures..1a

  • SISD - mainframes, workstations, PCs.

  • SIMD Shared Memory - Vector machines, Cray...

  • MIMD Shared Memory - Sequent, KSR, Tera, SGI, Sun.

  • SIMD Distributed Memory - DAP, TMC CM-2...

  • MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).


Main hpc architectures 1b

Main HPC Architectures..1b.

  • NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.


Parallel processing paradox

Parallel Processing Paradox

  • The time required to develop a parallel application for solving a Grand Challenge Application (GCA) is equal to:

    • Half Life of Parallel Supercomputers.


The need for alternative supercomputing resources

The Need for Alternative Supercomputing Resources

  • Vast numbers of under-utilised workstations are available for use.

  • Huge numbers of unused processor cycles and resources could be put to good use in a wide variety of application areas.

  • Reluctance to buy supercomputers, due to their cost and short life span.

  • Distributed compute resources “fit” better into today's funding model.


Scalable parallel computers

Scalable Parallel Computers


Design space of competing computer architecture

Design Space of Competing Computer Architecture


Towards inexpensive supercomputing

Towards Inexpensive Supercomputing

It is:

Cluster Computing..

The Commodity Supercomputing!


Motivation for using clusters

Motivation for using Clusters

  • Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.

  • Performance of workstations and PCs is rapidly improving

  • As performance grows, percent utilisation will decrease even further!

  • Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.


Motivation for using clusters1

Motivation for using Clusters

  • The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.

  • Workstation clusters are easier to integrate into existing networks than special parallel computers.


Motivation for using clusters2

Motivation for using Clusters

  • The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.

  • Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.

  • Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!


Cycle stealing

Cycle Stealing

  • Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of its owner.

  • This brings problems when attempting to form a cluster of workstations for running distributed applications.


Cycle stealing1

Cycle Stealing

  • Typically, there are three types of owners, who use their workstations mostly for:

    1. Sending and receiving email and preparing documents.

    2. Software development - edit, compile, debug and test cycle.

    3. Running compute-intensive applications.


Cycle stealing2

Cycle Stealing

  • Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).

  • However, this requires overcoming the ownership hurdle - people are very protective of their workstations.

  • Usually requires organisational mandate that computers are to be used in this way.


Cycle stealing3

Cycle Stealing

  • Stealing cycles outside standard work hours (e.g. overnight) is easy, but stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.


Rise fall of computing technologies

Rise & Fall of Computing Technologies

[Figure: over 1970, 1980, and 1995 the dominant technologies shift - Mainframes to Minis to PCs, and Minis to PCs to Network Computing.]


Original Food Chain Picture

[Image: the original computer "food chain" picture.]


1984 Computer Food Chain

[Figure: the 1984 computer food chain - PC, Workstation, Mini Computer, Mainframe, Vector Supercomputer.]


1994 Computer Food Chain

[Figure: the 1994 computer food chain - PC, Workstation, Mini Computer (hitting the wall soon), Mainframe, Vector Supercomputer (future is bleak), MPP.]


Computer Food Chain (Now and Future)

[Image: the computer food chain, now and in the future.]


What is a cluster

What is a cluster?

  • Cluster:

    • a collection of nodes connected together

    • Network: Faster, closer connection than a typical network (LAN)

    • Looser connection than symmetric multiprocessor (SMP)


1990s building blocks

1990s Building Blocks

  • There is no “near commodity” component

  • Building block = complete computers (HW & SW) shipped in 100,000s: killer micro, killer DRAM, killer disk, killer OS, killer packaging, killer investment

  • Leverage billion $ per year investment

  • Interconnecting Building Blocks => Killer Net

    • High Bandwidth

    • Low latency

    • Reliable

    • Commodity(ATM?)


Why clusters now beyond technology and cost

Why Clusters Now? (Beyond Technology and Cost)

  • Building block is big enough (vs. the Intel 8086)

  • Workstation performance is doubling every 18 months.

  • Networks are faster

    • Higher link bandwidth (vs. 10 Mbit Ethernet)

    • Switch based networks coming (ATM)

    • Interfaces simple & fast (Active Msgs)

  • Striped files preferred (RAID)

  • Demise of Mainframes, Supercomputers, & MPPs


Architectural drivers cont

Architectural Drivers…(cont)

  • Node architecture dominates performance

    • processor, cache, bus, and memory

    • design and engineering $ => performance

  • Greatest demand for performance is on large systems

    • must track the leading edge of technology without lag

  • MPP network technology => mainstream

    • system area networks

  • System on every node is a powerful enabler

    • very high speed I/O, virtual memory, scheduling, …


Architectural drivers

...Architectural Drivers

  • Clusters can be grown: Incremental scalability (up, down, and across)

    • Individual node performance can be improved by adding additional resources (new memory blocks/disks)

    • New nodes can be added or nodes can be removed

    • Clusters of Clusters and Metacomputing

  • Complete software tools

    • Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.

  • Wide class of applications

    • Sequential and grand challenge parallel applications


Example clusters berkeley now

Example Clusters: Berkeley NOW

  • 100 Sun UltraSparcs

    • 200 disks

  • Myrinet SAN

    • 160 MB/s

  • Fast comm.

    • AM, MPI, ...

  • Ether/ATM switched external net

  • Global OS

  • Self Config


Basic Components

[Figure: node components - a Sun Ultra 170 workstation (processor P, cache $, memory M, I/O bus) with a Myricom NIC attached to the 160 MB/s Myrinet.]


Massive cheap storage cluster

Massive Cheap Storage Cluster

  • Basic unit:

    2 PCs double-ending four SCSI chains of 8 disks each

Currently serving Fine Art at http://www.thinker.org/imagebase/


Cluster of smps clumps

Cluster of SMPs (CLUMPS)

  • Four Sun E5000s

    • 8 processors

    • 4 Myricom NICs each

  • Multiprocessor, Multi-NIC, Multi-Protocol

  • NPACI => Sun 450s


Millennium pc clumps

Millennium PC Clumps

  • Inexpensive, easy to manage Cluster

  • Replicated in many departments

  • Prototype for very large PC cluster


Adoption of the approach

Adoption of the Approach


So what s so different

So What’s So Different?

  • Commodity parts?

  • Communications Packaging?

  • Incremental Scalability?

  • Independent Failure?

  • Intelligent Network Interfaces?

  • Complete System on every node

    • virtual memory

    • scheduler

    • files

    • ...


Opportunities challenges

OPPORTUNITIES & CHALLENGES


Opportunity of Large-scale Computing on NOW

[Figure: a shared pool of computing resources - processors, memory, disks - joined by an interconnect. The goal is to guarantee at least one workstation to many individuals (when active), while delivering a large percentage of the collective resources to a few individuals at any one time.]


Windows of opportunities

Windows of Opportunities

  • MPP/DSM:

    • Compute across multiple systems: parallel.

  • Network RAM:

    • Idle memory in other nodes: page across other nodes' idle memory

  • Software RAID:

    • file system supporting parallel I/O, reliability, and mass storage.

  • Multi-path Communication:

    • Communicate across multiple networks: Ethernet, ATM, Myrinet


Parallel processing

Parallel Processing

  • Scalable Parallel Applications require

    • good floating-point performance

    • low overhead communication and scalable network bandwidth

    • parallel file system


Network ram

Network RAM

  • Performance gap between processor and disk has widened.

  • Thrashing to disk degrades performance significantly

  • Paging across networks can be effective with high performance networks and OS that recognizes idle machines

  • Typically thrashing to network RAM can be 5 to 10 times faster than thrashing to disk
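As a rough back-of-the-envelope illustration of the 5 to 10 times claim, the sketch below compares the time to service an 8 KB page fault from a local disk against fetching the page from a remote node's idle memory; the latency and bandwidth figures are assumptions chosen to be plausible for mid-1990s hardware, not measurements from the slides.

```c
#include <stdio.h>

int main(void) {
    const double page_bytes = 8 * 1024.0;

    /* Assumed figures, for illustration only. */
    const double disk_seek_s = 9e-3;    /* ~9 ms average seek + rotation    */
    const double disk_bw_Bps = 10e6;    /* ~10 MB/s sustained disk transfer */
    const double net_rtt_s   = 0.5e-3;  /* ~0.5 ms request/reply overhead   */
    const double net_bw_Bps  = 10e6;    /* ~10 MB/s delivered LAN bandwidth */

    double t_disk = disk_seek_s + page_bytes / disk_bw_Bps;
    double t_net  = net_rtt_s   + page_bytes / net_bw_Bps;

    printf("page-in from local disk : %.2f ms\n", t_disk * 1e3);
    printf("page-in from remote RAM : %.2f ms (~%.0fx faster)\n",
           t_net * 1e3, t_disk / t_net);
    return 0;
}
```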


Software raid redundant array of workstation disks

Software RAID: Redundant Array of Workstation Disks

  • I/O Bottleneck:

    • Microprocessor performance is improving more than 50% per year.

    • Disk access improvement is < 10%

    • Applications often perform I/O

  • RAID cost per byte is high compared to single disks

  • RAIDs are connected to host computers which are often a performance and availability bottleneck

  • RAID in software - striping data across an array of workstation disks - provides performance, and some degree of redundancy provides availability.
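A minimal sketch of the striping idea behind software RAID (RAID-0 style round-robin placement; the redundancy mentioned above would layer parity or mirroring on top of this mapping). The disk count and stripe size are arbitrary assumptions for illustration.

```c
#include <stdio.h>

#define NUM_DISKS  4            /* assumed number of workstation disks */
#define BLOCK_SIZE (64 * 1024)  /* assumed 64 KB stripe unit           */

/* Map a logical block number to (disk, block-within-disk), RAID-0 style. */
static void map_block(long logical, int *disk, long *block_on_disk) {
    *disk          = (int)(logical % NUM_DISKS);
    *block_on_disk = logical / NUM_DISKS;
}

int main(void) {
    for (long b = 0; b < 8; b++) {
        int disk; long blk;
        map_block(b, &disk, &blk);
        printf("logical block %ld -> disk %d, block %ld (byte offset %ld)\n",
               b, disk, blk, blk * (long)BLOCK_SIZE);
    }
    return 0;
}
```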


Software raid parallel file systems and parallel i o

Software RAID, Parallel File Systems, and Parallel I/O


Enabling technologies

Enabling Technologies

  • Efficient communication hardware and software

  • Global co-ordination of multiple workstation Operating Systems


Efficient communication

Efficient Communication

  • The key Enabling Technology

  • Components of communication overhead:

    • bandwidth

    • network latency and

    • processor overhead

  • Switched LANs allow bandwidth to scale

  • Network latency can be overlapped with computation

  • Processor overhead is the real problem - it consumes CPU cycles


Efficient communication contd

Efficient Communication (Contd...)

  • SS10 connected by Ethernet

    • 456 µs processor overhead

  • With ATM

    • 626 µs processor overhead

  • Target:

    • MPP communication performance: low latency and scalable bandwidth

    • CM-5 user-level network overhead: 5.7 µs


Efficient communication contd1

Efficient Communication (Contd...)

  • Constraints in clusters

    • greater routing delay and less than complete reliability

    • constraints on where the network connects into the node

    • UNIX has a rigid device and scheduling interface


Efficient communication approaches

Efficient Communication Approaches

  • Efficient Network Interface Hardware

  • Minimal Interface into the Operating System

    • user must transmit directly into and receive from the network without OS intervention

    • communication protection domains to be established by interface card and OS

    • treat message loss as an infrequent case


High performance cluster computing

Cluster Computer and its Components


Clustering today

Clustering Today

  • Clustering gained momentum when 3 technologies converged:

    • 1. Very HP Microprocessors

      • workstation performance = yesterday's supercomputers

    • 2. High speed communication

      • Communication speed between cluster nodes is approaching that between processors in an SMP.

    • 3. Standard tools for parallel/ distributed computing & their growing popularity.


Cluster computer architecture

Cluster Computer Architecture


Cluster components 1a nodes

Cluster Components...1a Nodes

  • Multiple High Performance Components:

    • PCs

    • Workstations

    • SMPs (CLUMPS)

    • Distributed HPC Systems leading to Metacomputing

  • They can be based on different architectures and run different OSes


Cluster components 1b processors

Cluster Components...1b Processors

  • There are many (CISC/RISC/VLIW/Vector..)

    • Intel: Pentium, Xeon, Merced….

    • Sun: SPARC, ULTRASPARC

    • HP PA

    • IBM RS6000/PowerPC

    • SGI MIPS

    • Digital Alphas

  • Integrate Memory, processing and networking into a single chip

    • IRAM (CPU & Mem): (http://iram.cs.berkeley.edu)

    • Alpha 21364 (CPU, Memory Controller, NI)


Cluster components 2 os

Cluster Components…2 OS

  • State of the art OS:

    • Linux(Beowulf)

    • Microsoft NT(Illinois HPVM)

    • SUN Solaris(Berkeley NOW)

    • IBM AIX(IBM SP2)

    • HP UX(Illinois - PANDA)

    • Mach (Microkernel based OS) (CMU)

    • Cluster Operating Systems: Solaris MC, SCO Unixware, MOSIX (an academic project)

    • OS gluing layers: Berkeley Glunix


Cluster components 3 high performance networks

Cluster Components…3 High Performance Networks

  • Ethernet (10Mbps),

  • Fast Ethernet (100Mbps),

  • Gigabit Ethernet (1Gbps)

  • SCI (Dolphin - 12 µs MPI latency)

  • ATM

  • Myrinet (1.2Gbps)

  • Digital Memory Channel

  • FDDI


Cluster components 4 network interfaces

Cluster Components…4 Network Interfaces

  • Network Interface Card

    • Myrinet has NIC

    • User-level access support

    • Alpha 21364 processor integrates processing, memory controller, network interface into a single chip..


Cluster components 5 communication software

Cluster Components…5 Communication Software

  • Traditional OS supported facilities (heavy weight due to protocol processing)..

    • Sockets (TCP/IP), Pipes, etc.

  • Light weight protocols (User Level)

    • Active Messages (Berkeley)

    • Fast Messages (Illinois)

    • U-net (Cornell)

    • XTP (Virginia)

  • Systems can be built on top of the above protocols
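As a tiny illustration of the "traditional OS-supported facilities" entry above, the sketch below passes a message between two processes through a plain Unix pipe; every byte crosses the kernel, which is exactly the protocol-processing overhead that the user-level protocols listed above (Active Messages, Fast Messages, U-Net, XTP) are designed to avoid.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    char buf[64];

    if (pipe(fds) != 0) {          /* kernel-mediated channel */
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {             /* child: the "sender" */
        const char *msg = "hello over a heavyweight kernel path";
        write(fds[1], msg, strlen(msg) + 1);
        _exit(0);
    }

    /* parent: the "receiver" - each read/write traps into the OS */
    read(fds[0], buf, sizeof(buf));
    printf("received: %s\n", buf);
    return 0;
}
```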


Cluster components 6a cluster middleware

Cluster Components…6a Cluster Middleware

  • Resides between the OS and applications and offers an infrastructure for supporting:

    • Single System Image (SSI)

    • System Availability (SA)

  • SSI makes the collection appear as a single machine (globalised view of system resources), e.g. telnet cluster.myinstitute.edu

  • SA - checkpointing and process migration, etc.


Cluster components 6b middleware components

Cluster Components…6b Middleware Components

  • Hardware

    • DEC Memory Channel, DSM (Alewife, DASH), SMP techniques

  • OS / Gluing Layers

    • Solaris MC, Unixware, Glunix

  • Applications and Subsystems

    • System management and electronic forms

    • Runtime systems (software DSM, PFS etc.)

    • Resource management and scheduling (RMS):

      • CODINE, LSF, PBS, NQS, etc.


Cluster components 7a programming environments

Cluster Components…7a Programming Environments

  • Threads (PCs, SMPs, NOW..)

    • POSIX Threads

    • Java Threads

  • MPI

    • Linux, NT, on many Supercomputers

  • PVM

  • Software DSMs (Shmem)
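To make the MPI entry concrete, here is a minimal message-passing C program (the classic rank/size hello); it assumes some standard MPI implementation (for example MPICH or LAM) is installed on the cluster, which the slides do not specify.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);               /* join the parallel job            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id (0..size-1)    */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of processes in the job   */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                       /* leave the parallel job           */
    return 0;
}
```

Compile with mpicc and launch one process per node (e.g. mpirun -np 4 ./hello); the same source runs unchanged on a NOW, a Beowulf cluster, or an MPP.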


Cluster components 7b development tools

Cluster Components…7b Development Tools ?

  • Compilers

    • C, C++, Java, etc.

    • Parallel programming with C++ (MIT Press book)

  • RAD (rapid application development tools).. GUI based tools for PP modeling

  • Debuggers

  • Performance Analysis Tools

  • Visualization Tools


Cluster components 8 applications

Cluster Components…8 Applications

  • Sequential

  • Parallel / Distributed (Cluster-aware app.)

    • Grand Challenge applications

      • Weather Forecasting

      • Quantum Chemistry

      • Molecular Biology Modeling

      • Engineering Analysis (CAD/CAM)

      • ……………….

    • PDBs, web servers, data mining


Key operational benefits of clustering

Key Operational Benefits of Clustering

  • System availability (HA): clusters offer inherent high system availability due to the redundancy of hardware, operating systems, and applications.

  • Hardware fault tolerance: redundancy for most system components (e.g. disk RAID), covering both hardware and software.

  • OS and application reliability: run multiple copies of the OS and applications, gaining reliability through this redundancy.

  • Scalability: grow by adding servers to the cluster, by adding more clusters to the network as the need arises, or by adding CPUs to an SMP node.

  • High performance: (running cluster-enabled programs)


High performance cluster computing

Classification of Cluster Computers


Clusters classification 1

Clusters Classification..1

  • Based on Focus (in Market)

    • High Performance (HP) Clusters

      • Grand Challenge Applications

    • High Availability (HA) Clusters

      • Mission Critical applications


Ha cluster server cluster with heartbeat connection

HA Cluster: Server Cluster with "Heartbeat" Connection


Clusters classification 2

Clusters Classification..2

  • Based on Workstation/PC Ownership

    • Dedicated Clusters

    • Non-dedicated clusters

      • Adaptive parallel computing

      • Also called Communal multiprocessing


Clusters classification 3

Clusters Classification..3

  • Based on Node Architecture..

    • Clusters of PCs (CoPs)

    • Clusters of Workstations (COWs)

    • Clusters of SMPs (CLUMPs)


Building scalable systems cluster of smps clumps

Building Scalable Systems: Cluster of SMPs (Clumps)

Performance of SMP Systems Vs. Four-Processor Servers in a Cluster


Clusters classification 4

Clusters Classification..4

  • Based on Node OS Type..

    • Linux Clusters (Beowulf)

    • Solaris Clusters (Berkeley NOW)

    • NT Clusters (HPVM)

    • AIX Clusters (IBM SP2)

    • SCO/Compaq Clusters (Unixware)

    • …….Digital VMS Clusters, HP clusters, ………………..


Clusters classification 5

Clusters Classification..5

  • Based on node components architecture & configuration (Processor Arch, Node Type: PC/Workstation.. & OS: Linux/NT..):

    • Homogeneous Clusters

      • All nodes will have similar configuration

    • Heterogeneous Clusters

      • Nodes based on different processors and running different OSes.


Clusters classification 6a dimensions of scalability levels of clustering

Clusters Classification..6a Dimensions of Scalability & Levels of Clustering

[Figure: three dimensions of scalability - (1) platform: uniprocessor, SMP, cluster, MPP; (2) level of clustering: CPU / I/O / memory / OS, workgroup, department, campus, enterprise, public metacomputing; (3) network technology.]


Clusters classification 6b levels of clustering

Clusters Classification..6b Levels of Clustering

  • Group Clusters (#nodes: 2-99)

    • (a set of dedicated/non-dedicated computers - mainly connected by SAN like Myrinet)

  • Departmental Clusters (#nodes: 99-999)

  • Organizational Clusters (#nodes: many 100s)

    • (using ATM networks)

  • Internet-wide Clusters=Global Clusters: (#nodes: 1000s to many millions)

    • Metacomputing

    • Web-based Computing

    • Agent Based Computing

      • Java plays a major role in web-based and agent-based computing


High performance cluster computing

Cluster Middleware

and

Single System Image


Contents

  • What is Middleware ?

  • What is Single System Image ?

  • Benefits of Single System Image

  • SSI Boundaries

  • SSI Levels

  • Relationship between Middleware Modules.

  • Strategy for SSI via OS

  • Solaris MC: An example OS supporting SSI

  • Cluster Monitoring Software


What is cluster middleware

What is Cluster Middleware ?

  • An interface between user applications and the cluster hardware and OS platform.

  • Middleware packages support each other at the management, programming, and implementation levels.

  • Middleware Layers:

    • SSI Layer

    • Availability Layer: it enables cluster services such as

      • checkpointing, automatic failover, recovery from failure, and

      • fault-tolerant operation among all cluster nodes.


Middleware design goals

Middleware Design Goals

  • Complete Transparency

    • Lets the user see the cluster as a single system.

      • Single entry point, ftp, telnet, software loading...

  • Scalable Performance

    • Easy growth of cluster

      • no change of API & automatic load distribution.

  • Enhanced Availability

    • Automatic Recovery from failures

      • Employ checkpointing & fault tolerant technologies

    • Handle consistency of data when replicated..


What is single system image ssi

What is Single System Image (SSI) ?

  • A single system image is the illusion, created by software or hardware, that a collection of computing elements appear as a single computing resource.

  • SSI makes the cluster appear like a single machine to the user, to applications, and to the network.

  • A cluster without a SSI is not a cluster


Benefits of single system image

Benefits of Single System Image

  • Usage of system resources transparently

  • Improved reliability and higher availability

  • Simplified system management

  • Reduction in the risk of operator errors

  • User need not be aware of the underlying system architecture to use these machines effectively


Ssi vs scalability design space of competing arch

SSI vs. Scalability (design space of competing arch.)


Desired ssi services

Desired SSI Services

  • Single Entry Point

    • telnet cluster.my_institute.edu

    • (rather than telnet node1.cluster.my_institute.edu)

  • Single File Hierarchy: xFS, AFS, Solaris MC Proxy

  • Single Control Point: Management from single GUI

  • Single virtual networking

  • Single memory space - DSM

  • Single Job Management: Glunix, Codine, LSF

  • Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may use Web technology


Availability support functions

Availability Support Functions

  • Single I/O Space (SIO):

    • any node can access any peripheral or disk device without knowledge of its physical location.

  • Single Process Space (SPS)

    • Any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc. as if they were on a single node.

  • Checkpointing and Process Migration.

    • Saves the process state and intermediate in-memory results to disk to support rollback recovery when a node fails; process migration (PM) is also used for load balancing.



Ssi levels

SSI Levels

  • It is a computer science notion of levels of abstraction (a house is at a higher level of abstraction than walls, ceilings, and floors).

  • SSI support can exist at three levels:

    • Application and Subsystem Level

    • Operating System Kernel Level

    • Hardware Level


SSI at Application and Subsystem Level

Level       | Examples                                       | Boundary                                              | Importance
------------|------------------------------------------------|-------------------------------------------------------|-----------------------------------------------------
application | cluster batch system, system management        | an application                                        | what a user wants
subsystem   | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem                                           | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on           | shared portion of the file system                     | implicitly supports many applications and subsystems
toolkit     | OSF DCE, Sun ONC+, Apollo Domain               | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

(c) In Search of Clusters


SSI at Operating System Kernel Level

Level             | Examples                                             | Boundary                                                | Importance
------------------|------------------------------------------------------|---------------------------------------------------------|------------------------------------------------
kernel / OS layer | Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLunix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc                  | type of kernel objects: files, processes, etc.          | modularizes SSI code within kernel
virtual memory    | none supporting operating system kernel              | each distributed virtual memory space                   | may simplify implementation of kernel objects
microkernel       | Mach, PARAS, Chorus, OSF/1 AD, Amoeba                | each service outside the microkernel                    | implicit SSI for all system services

(c) In Search of Clusters


SSI at Hardware Level

Level          | Examples            | Boundary                    | Importance
---------------|---------------------|-----------------------------|------------------------------------------
memory         | SCI, DASH           | memory space                | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O

(c) In Search of Clusters


Ssi characteristics

SSI Characteristics

  • 1. Every SSI has a boundary

  • 2. Single system support can exist at different levels within a system, one able to be built on another


SSI Boundaries -- an application's SSI boundary

[Figure: the SSI boundary of a batch system drawn around part of the cluster.]

(c) In Search of Clusters


Relationship among middleware modules

Relationship Among Middleware Modules


PARMON: A Cluster Monitoring Tool

[Figure: PARMON clients (parmon, running on a JVM) talk over a high-speed switch to PARMON servers (parmond) running on each Solaris node.]


Motivations

  • Monitoring such huge systems is a tedious and challenging task, since typical workstations are designed to work as standalone systems rather than as part of a workstation cluster.

  • System administrators require tools to effectively monitor such huge systems. PARMON provides the solution to this challenging problem.


Parmon salient features

PARMON - Salient Features

  • Allows monitoring of system activities at the component, node, group, or entire-cluster level

  • Monitoring of System Components:

    • CPU, Memory, Disk and Network

  • Allows monitoring of multiple instances of the same component.

  • PARMON provides a GUI for initiating activities/requests and presents results graphically.


Resource utilization at a glance

Resource Utilization at a Glance


Cpu usage monitoring

CPU Usage Monitoring


Memory usage monitoring

Memory Usage monitoring


Kernel data catalog cpu

Kernel Data Catalog - CPU


Strategy for ssi via os

Strategy for SSI via OS

  • 1. Build as a layer on top of the existing OS. (eg. Glunix)

    • Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time.

    • i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. Eg: Glunix/Solaris-MC

  • 2. Build SSI at kernel level, True Cluster OS

    • Good, but cannot leverage OS improvements from the vendor

    • E.g. Unixware and Mosix (built using BSD Unix)


Cluster computing research projects

Cluster Computing - Research Projects

  • Beowulf (CalTech and NASA) - USA

  • CCS (Computing Centre Software) - Paderborn, Germany

  • Condor - University of Wisconsin-Madison, USA

  • DJM (Distributed Job Manager) - Minnesota Supercomputing Center

  • DQS (Distributed Queuing System) - Florida State University, USA

  • EASY - Argonne National Lab, USA

  • HPVM (High Performance Virtual Machine) - UIUC & now UCSD, USA

  • far - University of Liverpool, UK

  • Gardens - Queensland University of Technology, Australia

  • Generic NQS (Network Queuing System) - University of Sheffield, UK

  • NOW (Network of Workstations) - Berkeley, USA

  • NIMROD - Monash University, Australia

  • PBS (Portable Batch System) - NASA Ames and LLNL, USA

  • PRM (Prospero Resource Manager) - Uni. of S. California, USA

  • QBATCH - Vita Services Ltd., USA


Cluster computing commercial software

Cluster Computing - Commercial Software

  • Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany

  • LoadLeveler - IBM Corp., USA

  • LSF (Load Sharing Facility) - Platform Computing, Canada

  • NQE (Network Queuing Environment) - Craysoft Corp., USA

  • OpenFrame - Centre for Development of Advanced Computing, India

  • RWPC (Real World Computing Partnership), Japan

  • Unixware (SCO - Santa Cruz Operation), USA

  • Solaris-MC (Sun Microsystems), USA


Representative cluster systems 1 solaris mc 2 berkeley now 3 their comparison with beowulf hpvm

Representative Cluster Systems
1. Solaris MC
2. Berkeley NOW
3. Their comparison with Beowulf & HPVM


Next generation distributed computing the solaris mc operating system

Next Generation Distributed Computing: The Solaris MC Operating System


Why new software

Why new software?

  • Without software, a cluster is:

    • Just a network of machines

    • Requires specialized applications

    • Hard to administer

  • With a cluster operating system:

    • Cluster becomes a scalable, modular computer

    • Users and administrators see a single large machine

    • Runs existing applications

    • Easy to administer

  • New software makes cluster better for the customer


Cluster computing and solaris mc

Cluster computing and Solaris MC

  • Goal: use computer clusters for general-purpose computing

  • Support existing customers and applications

  • Solution: Solaris MC (Multi Computer) operating system

    A distributed operating system (OS) for multi-computers


What is the solaris mc os

What is the Solaris MC OS ?

  • Solaris MC extends standard Solaris

  • Solaris MC makes the cluster look like a single machine

    • Global file system

    • Global process management

    • Global networking

  • Solaris MC runs existing applications unchanged

    • Supports the Solaris ABI (Application Binary Interface)


Applications

  • Ideal for:

    • Web and interactive servers

    • Databases

    • File servers

    • Timesharing

  • Benefits for vendors and customers

    • Preserves investment in existing applications

    • Modular servers with low entry-point price and low cost of ownership

    • Easier system administration

    • Solaris could become a preferred platform for clustered systems


Solaris mc is a running research system

Solaris MC is a running research system

  • Designed, built and demonstrated Solaris MC prototype

    • Cluster of SPARCstations connected by a Myrinet network

    • Runs unmodified commercial parallel database, scalable Web server, parallel make

  • Next: Solaris MC Phase II

    • High availability

    • New I/O work to take advantage of clusters

    • Performance evaluation


Advantages of solaris mc

Advantages of Solaris MC

  • Leverages continuing investment in Solaris

    • Same applications: binary-compatible

    • Same kernel, device drivers, etc.

    • As portable as base Solaris - will run on SPARC, x86, PowerPC

  • State of the art distributed systems techniques

    • High availability designed into the system

    • Powerful distributed object-oriented framework

  • Ease of administration and use

    • Looks like a familiar multiprocessor server to users, system administrators, and applications


Solaris mc details

Solaris MC details

  • Solaris MC is a set of C++ loadable modules on top of Solaris

    • Very few changes to existing kernel

  • A private Solaris kernel per node: provides reliability

  • Object-oriented system with well-defined interfaces


Key components of solaris mc proving ssi

Key components of Solaris MC providing SSI

  • global file system

  • globalized process management

  • globalized networking and I/O


Solaris mc components

Solaris MC components

  • Object and communication support

  • High availability support

  • PXFS global distributed file system

  • Process management

  • Networking


Object oreintation

Object Orientation

  • Better software maintenance, change, and evolution

    • Well-defined interfaces

    • Separate implementation from interface

    • Interface inheritance

  • Solaris MC uses:

    • IDL: a better way to define interfaces

    • CORBA object model: a better RPC (Remote Procedure Call)

    • C++: a better C


Object and communication framework

Object and Communication Framework

  • Mechanism for nodes and modules to communicate

    • Inter-node and intra-node interprocess communication

  • Optimized protocols for trusted computing base

  • Efficient, low-latency communication primitives

  • Object communication independent of interconnect

    • We use Ethernet, fast Ethernet, FibreChannel, Myrinet

    • Allows interconnect hardware to be upgraded


High availability support

High Availability Support

  • Node failure doesn’t crash entire system

    • Unaffected nodes continue running

    • Better than a SMP

    • A requirement for mission critical market

  • Well-defined failure boundaries

    • Separate kernel per node - OS does not use shared memory

  • Object framework provides support

    • Delivers failure notifications to servers and clients

    • Group membership protocol detects node failures

  • Each subsystem is responsible for its own recovery

    • Filesystem, process management, networking, applications


Pxfs global filesystem

PXFS: Global Filesystem

  • Single-system image of the file system

  • Backbone of Solaris MC

  • Coherent access and caching of files and directories

    • Caching provides high performance

  • Access to I/O devices


Pxfs an object oriented vfs

PXFS: An object-oriented VFS

  • PXFS builds on existing Solaris file systems

    • Uses the vnode/virtual file system interface (VFS) externally

    • Uses object communication internally


Process management

Process management

  • Provide global view of processes on any node

    • Users, administrators, and applications see global view

    • Supports existing applications

  • Uniform support for local and remote processes

    • Process creation/waiting/exiting (including remote execution)

    • Global process identifiers, groups, sessions

    • Signal handling

    • procfs (/proc)


Process management benefits

Process management benefits

  • Global process management helps users and administrators

  • Users see familiar single machine process model

  • Can run programs on any node

  • Location of process in the cluster doesn’t matter

  • Use existing commands and tools: unmodified ps, kill, etc.


Networking goals

Networking goals

  • Cluster appears externally as a single SMP server

    • Familiar to customers

    • Access cluster through single network address

    • Multiple network interfaces supported but not required

  • Scalable design

    • protocol and network application processing on any node

    • Parallelism provides high server performance


Networking implementation

Networking: Implementation

  • A programmable “packet filter”

    • Packets routed between network device and the correct node

    • Efficient, scalable, and supports parallelism

    • Supports multiple protocols with existing protocol stacks

  • Parallelism of protocol processing and applications

    • Incoming connections are load-balanced across the cluster


Status

  • 4-node, 8-CPU prototype with Myrinet demonstrated

    • Object and communication infrastructure

    • Global file system (PXFS) with coherency and caching

    • Networking: TCP/IP with load balancing

    • Global process management (ps, kill, exec, wait, rfork, /proc)

    • Monitoring tools

    • Cluster membership protocols

  • Demonstrated applications

    • Commercial parallel database

    • Scalable Web server

    • Parallel make

    • Timesharing

  • Solaris-MC team is working on high availability


Summary of solaris mc

Summary of Solaris MC

  • Clusters likely to be an important market

  • Solaris MC preserves customer investment in Solaris

    • Uses existing Solaris applications

  • Familiar to customers

    • Looks like a multiprocessor, not a special cluster architecture

  • Ease of administration and use

  • Clusters are ideal for important applications

    • Web server, file server, databases, interactive services

  • State-of-the-art object-oriented distributed implementation

    • Designed for future growth


High performance cluster computing

Berkeley NOW Project


Now @ berkeley

NOW @ Berkeley

  • Design & Implementation of higher-level system

    • Global OS (Glunix)

    • Parallel File Systems (xFS)

    • Fast Communication (HW for Active Messages)

    • Application Support

  • Overcoming technology shortcomings

    • Fault tolerance

    • System Management

  • NOW Goal: Faster for Parallel AND Sequential


NOW Software Components

[Figure: large sequential apps and parallel apps run over Sockets, Split-C, MPI, HPF, and vSM; below them sit Active Messages and the Global Layer Unix (GLUnix) with its name server and scheduler; each Unix (Solaris) workstation runs a VN segment driver and an AM L.C.P.; the nodes are connected by a Myrinet scalable interconnect.]


Active messages lightweight communication protocol

Active Messages: Lightweight Communication Protocol

  • Key Idea: Network Process ID attached to every message that HW checks upon receipt

    • Net PID match: as fast as before

    • Net PID mismatch: interrupt and invoke the OS

  • Can mix LAN messages and MPP messages; invoke the OS & TCP/IP only when not cooperating (if everyone uses the same physical-layer format)


Mpp active messages

MPP Active Messages

  • Key Idea: associate a small user-level handler directly with each message

    • Sender injects the message directly into the network

    • Handler executes immediately upon arrival

    • Pulls the message out of the network and integrates it into the ongoing computation, or replies

    • No buffering (beyond transport), no parsing, no allocation, primitive scheduling


Active message model

Active Message Model

  • Every message contains at its header the address of a user level handler which gets executed immediately in user level

  • No receive side buffering of messages

  • Supports protected multiprogramming of a large number of users onto finite physical network resource

  • Active message operations, communication events and threads are integrated in a simple and cohesive model

  • Provides naming and protection


Active Message Model (Contd..)

[Figure: an active message crosses the network carrying data and a handler PC; on arrival the handler runs immediately, feeding the data into the receiver's data structures while the primary computation continues on both nodes.]


Xfs file system for now

xFS: File System for NOW

  • Serverless File System: All data with clients

    • Uses MP cache coherency to reduce traffic

  • Files striped for parallel transfer

  • Large file cache (“cooperative caching-Network RAM”)

                    Miss Rate   Response Time
    Client/Server      10%          1.8 ms
    xFS                 4%          1.0 ms

    (42 WS, 32 MB/WS, 512 MB/server, 8 KB/access)


Glunix gluing unix

Glunix: Gluing Unix

  • It is built on top of Solaris.

  • It glues together Solaris running on the cluster nodes.

  • Supports transparent remote execution and load balancing, and allows existing applications to run unchanged.

  • Provides a globalized view of system resources, like Solaris MC.

  • Gang-schedules parallel jobs, so the cluster is as good as a dedicated MPP for parallel jobs.


3 paths for applications on now

3 Paths for Applications on NOW?

  • Revolutionary (MPP Style): write new programs from scratch using MPP languages, compilers, libraries,…

  • Porting: port programs from mainframes, supercomputers, MPPs, …

  • Evolutionary: take sequential program & use

    1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then:

    2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then:

    3) Parallel program: change the program until it uses enough processors to be fast.

    => Large speedup without a fine-grain parallel program


High performance cluster computing

Comparison of 4 Cluster Systems


High performance cluster computing

Pointers to Literature on Cluster Computing


Reading resources 1a internet www

Reading Resources..1a Internet & WWW

  • Computer Architecture:

    • http://www.cs.wisc.edu/~arch/www/

  • PFS & Parallel I/O

    • http://www.cs.dartmouth.edu/pario/

  • Linux Parallel Processing

    • http://yara.ecn.purdue.edu/~pplinux/Sites/

  • DSMs

    • http://www.cs.umd.edu/~keleher/dsm.html


Reading resources 1b internet www

Reading Resources..1b Internet & WWW

  • Solaris-MC

    • http://www.sunlabs.com/research/solaris-mc

  • Microprocessors: Recent Advances

    • http://www.microprocessor.sscc.ru

  • Beowulf:

    • http://www.beowulf.org

  • Metacomputing

    • http://www.sis.port.ac.uk/~mab/Metacomputing/


Reading resources 2 books

Reading Resources..2 Books

  • In Search of Clusters

    • by G. Pfister, Prentice Hall (2nd ed.), 1998

  • High Performance Cluster Computing

    • Volume1: Architectures and Systems

    • Volume2: Programming and Applications

      • Edited by Rajkumar Buyya, Prentice Hall, NJ, USA.

  • Scalable Parallel Computing

    • by K. Hwang & Z. Xu, McGraw Hill, 1998


Reading resources 3 journals

Reading Resources..3 Journals

  • A Case for NOW (Networks of Workstations), IEEE Micro, Feb ’95

    • by Anderson, Culler, and Patterson

  • Fault Tolerant COW with SSI, IEEE Concurrency, (to appear)

    • by Kai Hwang, Chow, Wang, Jin, Xu

  • Cluster Computing: The Commodity Supercomputing, Software - Practice and Experience journal (available from my web page)

    • by Mark Baker & Rajkumar Buyya


Clusters revisited

Clusters Revisited


Summary

  • We have discussed Clusters

    • Enabling Technologies

    • Architecture & its Components

    • Classifications

    • Middleware

    • Single System Image

    • Representative Systems


Conclusions

  • Clusters are promising..

    • Solve the parallel processing paradox

    • Offer incremental growth and match funding patterns

    • New trends in hardware and software technologies are likely to make clusters even more promising, so that

    • cluster-based supercomputers can be seen everywhere!


Breaking High Performance Computing Barriers

[Figure: delivered GFLOPS grow as the system scales from a single processor, to shared memory, to a local parallel cluster, to a global parallel cluster built from racks of "2100" servers.]


Well read my book for

Thank You ...

?

Well, Read my book for….

http://www.dgs.monash.edu.au/~rajkumar/cluster/

