Cluster Computing

Kick-start seminar

16 December, 2009

High Performance Cluster Computing Centre (HPCCC)

Faculty of Science

Hong Kong Baptist University

Outline

Overview of Cluster Hardware and Software

Basic Login and Running Programs in a Job Queuing System

Introduction to Parallelism

Why Parallelism & Cluster Parallelism

Message Passing Interface

Parallel Program Examples

Policy for using sciblade.sci.hkbu.edu.hk

Contact

http://www.sci.hkbu.edu.hk/hpccc2009/sciblade

2

Cluster Hardware

This 256-node PC cluster (sciblade) consists of:

Master node x 2

IO nodes x 3 (storage)

Compute nodes x 256

Blade Chassis x 16

Management network

Interconnect fabric

1U console & KVM switch

Emerson Liebert Nxa 120k VA UPS

4

Sciblade Cluster

A 256-node cluster supported by funding from the RGC

5

Hardware Configuration

Master Node

Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

IO nodes (Storage)

Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

3TB storage Dell PE MD3000

Compute nodes x 256, each with:

Dell PE M600 blade server w/ Infiniband network

2x Xeon E5430 2.66GHz (Quad Core)

16GB RAM, 73GB SAS drive

6

Hardware Configuration

Blade Chassis x 16

Dell PE M1000e

Each hosts 16 blade servers

Management Network

Dell PowerConnect 6248 (Gigabit Ethernet) x 6

Interconnect fabric

Qlogic SilverStorm 9120 switch

Console and KVM switch

Dell AS-180 KVM

Dell 17FP Rack console

Emerson Liebert Nxa 120kVA UPS

7

Software List

Operating System

ROCKS 5.1 Cluster OS

CentOS 5.3 kernel 2.6.18

Job Management System

Portable Batch System

MAUI scheduler

Compilers, Languages

Intel Fortran/C/C++ Compiler for Linux V11

GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler

8

Software List

Message Passing Interface (MPI) Libraries

MVAPICH 1.1

MVAPICH2 1.2

Open MPI 1.3.2

Mathematical libraries

ATLAS 3.8.3

FFTW 2.1.5/3.2.1

SPRNG 2.0a (C/Fortran)

9

Software List
  • Molecular Dynamics & Quantum Chemistry
    • Gromacs 4.0.5
    • Gamess 2009R1
    • Namd 2.7b1
  • Third-party Applications
    • GROMACS, NAMD
    • MATLAB 2008b
    • TAU 2.18.2, VisIt 1.11.2
    • VMD 1.8.6, Xmgrace 5.1.22
    • etc.

10

Software List

Queuing system

Torque/PBS

Maui scheduler

Editors

vi

emacs

11

Hostnames

Master node

External : sciblade.sci.hkbu.edu.hk

Internal : frontend-0

IO nodes (storage)

pvfs2-io-0-0, pvfs2-io-0-1, pvfs-io-0-2

Compute nodes

compute-0-0.local, …, compute-0-255.local

12

Basic login

Remote login to the master node

Terminal login

using secure shell

ssh -l username sciblade.sci.hkbu.edu.hk

Graphical login

PuTTY & vncviewer, e.g.

[username@sciblade]$ vncserver

New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3

This means that your session will run on display 3.

14

Graphical login

Using PuTTY to set up a secure connection: Host Name = sciblade.sci.hkbu.edu.hk

15

Graphical login (con’t)

ssh protocol version

16

Graphical login (con’t)

Port = 5900 + display number (i.e. 3 in this case)

17

Graphical login (con’t)

Next, click Open and log in to sciblade.

Finally, run VNC Viewer on your PC and enter "localhost:3" (3 is the display number).

You should terminate your VNC session after you have finished your work. To terminate the VNC session running on sciblade, run the command

[username@sciblade]$ vncserver -kill :3

18

Linux commands

Both the master and compute nodes are installed with Linux.

Frequently used Linux commands on the PC cluster: http://www.sci.hkbu.edu.hk/hpccc2009/sciblade/faq_sciblade.php

19

ROCKS specific commands

ROCKS provides the following commands for running programs on all compute nodes, e.g.

cluster-fork

Run a program on all compute nodes

cluster-fork ps

Check user processes on each compute node

cluster-kill

Kill user processes on all nodes at once

tentakel

Similar to cluster-fork, but runs faster

20

Ganglia
  • Web-based management and monitoring
  • http://sciblade.sci.hkbu.edu.hk/ganglia

21

Why Parallelism – Passively

Suppose you are using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit into your machine's memory.

Parallelization is then the last resort.

23

Why Parallelism – Initiative

Faster

Finish the work earlier

Same work in shorter time

Do more work

More work in the same time

Most importantly, you want to predict the result before the event occurs

24

Examples

Many scientific and engineering problems require enormous computational power.

The following are a few such fields:

Quantum chemistry, statistical mechanics, and relativistic physics

Cosmology and astrophysics

Computational fluid dynamics and turbulence

Material design and superconductivity

Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling

Medicine, and modeling of human organs and bones

Global weather and environmental modeling

Machine Vision

25

Parallelism

The upper bound on the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.

The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.

Synchronization and exchange of partial results among processors are therefore unavoidable.

26

Multiprocessing Clustering

Parallel Computer Architecture:

Distributed Memory: Cluster

Shared Memory: Symmetric multiprocessors (SMP)

[Figure: block diagrams of the two architectures. In the distributed-memory cluster, each node has its own CPU and local memory (LM), and nodes exchange data streams (DS) over an interconnecting network. In the shared-memory SMP, processing units (PU) with their control units (CU) receive instruction streams (IS) and data streams (DS) from a single shared memory.]

27

Clustering: Pros and Cons

Advantages

Memory scalable to number of processors.

∴ Increasing the number of processors increases the total memory size and bandwidth as well.

Each processor can rapidly access its own memory without interference

Disadvantages

Difficult to map existing data structures to this memory organization

User is responsible for sending and receiving data among processors

28

Parallel Programming Paradigm
  • Multithreading (shared memory only)
    • OpenMP
  • Message Passing (shared memory or distributed memory)
    • MPI (Message Passing Interface)
    • PVM (Parallel Virtual Machine)

31

Distributed Memory

Programmer's view:

Several CPUs

Several block of memory

Several threads of action

Parallelization

Done by hand

Example

MPI

[Figure: a serial program (parts P1, P2, P3) decomposed across processes 0, 1 and 2, which exchange data via message passing over the interconnection network, shown along a time axis.]

32

Message Passing Model

[Figure: as on the previous slide, a serial process (P1, P2, P3) is split across processes 0, 1 and 2 that exchange data by message passing as time progresses.]

Process

A process is a set of executable instructions (a program) that runs on a processor. Message passing systems generally associate only one process with each processor, so the terms "process" and "processor" are used interchangeably.

Message Passing

The method by which data from one processor's memory is copied to the memory of another processor.

33
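To make the model concrete, here is a minimal sketch (not part of the seminar handout) that copies one integer from the memory of process 0 to the memory of process 1 with MPI_Send and MPI_Recv; run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;    /* data living in process 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* copy the data into process 1's memory */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}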

MPI

MPI is a library, not a language, for parallel programming

An MPI implementation consists of

a subroutine library with all MPI functions

include files for the calling application program

some startup script (usually called mpirun, but not standardized)

Include the header file mpi.h (or its equivalent) in the source code

Libraries available for all major imperative languages (C, C++, Fortran …)

35

General MPI Program Structure

/* MPI include file */
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* variable declarations */
    int np, rank, ierr;

    /* Initialize MPI environment */
    ierr = MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Do work and make message passing calls */

    /* Terminate MPI environment */
    ierr = MPI_Finalize();
    return 0;
}

36

Sample Program: Hello World!

In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD.

Notes:

Makes use of the pre-defined communicator MPI_COMM_WORLD.

Not testing for error status of routines!

37

Sample Program: Hello World!

#include <stdio.h>
#include "mpi.h"    // MPI header file

int main(int argc, char **argv)
{
    int nproc, myrank, ierr;

    ierr = MPI_Init(&argc, &argv);    // MPI initialization

    // Get number of MPI processes
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // Get process id for this processor
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello World!! I'm process %d of %d\n", myrank, nproc);

    ierr = MPI_Finalize();    // Terminate all MPI processes
    return 0;
}

38

Performance

When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it.

The goals are:

load balance

memory usage balance

minimize communication overhead

reduce sequential bottlenecks

scalability

39

Compiling & Running MPI Programs
  • Using mvapich 1.1
  • Set the path; at the command prompt, type:

export PATH=/u1/local/mvapich1/bin:$PATH

(uncomment this line in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) listing the compute nodes, one per line:

compute-0-0
compute-0-1
compute-0-2
compute-0-3

  • Run the program on a number of compute nodes:

mpirun -np 4 -machinefile machines ./cpi

40

Compiling & Running MPI Programs
  • Using mvapich2 1.2
  • Prepare .mpd.conf and .mpd.passwd, saved in your home directory, containing:

MPD_SECRETWORD=gde1234-3

(you may set your own secret word)

  • Set the environment for mvapich2 1.2:

export MPD_BIN=/u1/local/mvapich2
export PATH=$MPD_BIN:$PATH

(uncomment these lines in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section

41

Compiling & Running MPI Programs
  • Boot the MPD ring with mpdboot and the hostfile:

mpdboot -n 4 -f machines

  • Run the program with a number of processes:

mpiexec -np 4 ./cpi

  • Remember to clean up after running jobs with mpdallexit:

mpdallexit

42

Compiling & Running MPI Programs
  • Using openmpi 1.3.2
  • Set the environment for openmpi:

export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/u1/local/openmpi/bin:$PATH

(uncomment these lines in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section
  • Run the program on a number of compute nodes:

mpirun -np 4 -machinefile machines ./cpi

43

Submit parallel jobs to the Torque batch queue
    • Prepare a job script (name it runcpi)
    • For a program compiled with mvapich 1.1
    • For a program compiled with mvapich2 1.2
    • For a program compiled with openmpi
    • For GROMACS

(refer to your handout for the detailed scripts)

  • Submit the script (e.g. gromacs.pbs) with:

qsub gromacs.pbs

44

Example 1: Matrix-vector Multiplication

The figure below demonstrates schematically how a matrix-vector multiplication, A=B*C, can be decomposed into four independent computations involving a scalar multiplying a column vector.

This approach is different from that which is usually taught in a linear algebra course because this decomposition lends itself better to parallelization.

These computations are independent and require no communication; communication is what usually reduces the performance of parallel code.

46

Example 1: Matrix-vector Multiplication (Column wise)

Figure 1: Schematic of the parallel decomposition for matrix-vector multiplication, A=B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions (columns and elements, respectively) assigned to each processor.

47

Example 1: MV Multiplication

[Figure: A = B*C computed in parallel. Each process P0 to P3 multiplies its block of columns of B by the corresponding elements of C, and the partial result vectors are combined with a reduction (SUM).]

48
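The seminar's own source for this example is not reproduced in the transcript; the following is a minimal sketch of the column-wise decomposition and reduction described above, assuming the matrix size N is divisible by the number of MPI processes and filling B and C with test values on every process.

#include <stdio.h>
#include <mpi.h>

#define N 8    /* matrix dimension (chosen for illustration) */

int main(int argc, char *argv[])
{
    double B[N][N], C[N], partial[N], A[N];
    int np, rank, cols, first, i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill B and C with test values on every process */
    for (i = 0; i < N; i++) {
        C[i] = 1.0;
        for (j = 0; j < N; j++)
            B[i][j] = i + j;
    }

    /* Each process handles its own block of columns: partial += C[j]*B[:,j] */
    cols = N / np;
    first = rank * cols;
    for (i = 0; i < N; i++) {
        partial[i] = 0.0;
        for (j = first; j < first + cols; j++)
            partial[i] += B[i][j] * C[j];
    }

    /* Combine the partial vectors on process 0 (the reduction step) */
    MPI_Reduce(partial, A, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < N; i++)
            printf("A[%d] = %g\n", i, A[i]);

    MPI_Finalize();
    return 0;
}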

Example 2: Quick Sort

The quick sort is an in-place, divide-and-conquer, massively recursive sort.

The efficiency of the algorithm is strongly affected by which element is chosen as the pivot point.

The worst-case efficiency of the quick sort, O(n²), occurs when the list is already sorted and the left-most element is chosen as the pivot.

If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected complexity of O(n log n).

Pros: Extremely fast.

Cons: Very complex algorithm, massively recursive.

49
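The course's sorting/qsort.c is not included in the transcript; below is a minimal serial sketch of a quick sort with a randomly chosen pivot, as recommended above (Lomuto-style partitioning, hypothetical test data).

#include <stdio.h>
#include <stdlib.h>

/* In-place quick sort of a[lo..hi] using a randomly chosen pivot */
static void quicksort(int *a, int lo, int hi)
{
    int p, pivot, i, j, tmp;

    if (lo >= hi)
        return;

    /* pick a random pivot and move it to the end of the range */
    p = lo + rand() % (hi - lo + 1);
    tmp = a[p]; a[p] = a[hi]; a[hi] = tmp;

    /* partition: elements smaller than the pivot go to the front */
    pivot = a[hi];
    i = lo;
    for (j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;   /* place the pivot */

    quicksort(a, lo, i - 1);                 /* recurse on both halves */
    quicksort(a, i + 1, hi);
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3, 8, 4 };
    int n = sizeof a / sizeof a[0];
    int i;

    quicksort(a, 0, n - 1);
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}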

Example 3: Bubble Sort

The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.

The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.

The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).

This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.

51
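Again, the course's sorting/bubblesort.c is not reproduced here; the sketch below illustrates the pass-until-no-swaps logic just described, on hypothetical test data.

#include <stdio.h>

/* Bubble sort: repeatedly swap adjacent out-of-order items until a clean pass */
static void bubblesort(int *a, int n)
{
    int i, swapped, tmp;

    do {
        swapped = 0;
        for (i = 0; i < n - 1; i++) {
            if (a[i] > a[i + 1]) {    /* swap adjacent items out of order */
                tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                swapped = 1;
            }
        }
    } while (swapped);                /* stop after a pass with no swaps */
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3 };
    int n = sizeof a / sizeof a[0];
    int i;

    bubblesort(a, n);
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}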

Monte Carlo Integration

"Hit and miss" integration

The integration scheme is to take a large number of random points and count the number that fall under f(x) to estimate the area

53

SPRNG (Scalable Parallel Random Number Generators )

Monte Carlo Integration to Estimate Pi

for (i = 0; i < n; i++) {
    x = sprng();
    y = sprng();
    if ((x*x + y*y) < 1)
        in++;
}
pi = ((double)4.0)*in/n;

SPRNG is used to generate different sets of random numbers on different compute nodes while running in parallel

54
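For a self-contained picture of the whole computation, here is a minimal MPI sketch of the same estimate. It uses drand48 seeded per rank as a stand-in for SPRNG (which needs its own headers and initialization), so it illustrates the method rather than the centre's SPRNG example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    long n = 1000000, i, in = 0, total_in = 0;
    int np, rank;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* stand-in for SPRNG: give each rank a different random stream */
    srand48(12345 + rank);

    /* each rank throws n darts at the unit square */
    for (i = 0; i < n; i++) {
        x = drand48();
        y = drand48();
        if (x * x + y * y < 1.0)
            in++;
    }

    /* combine the hit counts from all ranks on process 0 */
    MPI_Reduce(&in, &total_in, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total_in / ((double)n * np);
        printf("Estimated pi = %f\n", pi);
    }

    MPI_Finalize();
    return 0;
}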


Example 1: Ring

ring/ring.c
ring/Makefile
ring/machines

Compile the program by the command: make

Run the program in parallel by

mpirun -np 4 -machinefile machines ./ring < in

[Figure: four processes (0, 1, 2, 3) passing messages around a ring.]

Example 2: Prime

prime/prime.c
prime/prime.f90
prime/primeParallel.c
prime/Makefile
prime/machines

Compile by the command: make

Run the serial program by: ./prime or ./primef

Run the parallel program by

mpirun -np 4 -machinefile machines ./primeParallel

Example 3: Sorting

sorting/qsort.c
sorting/bubblesort.c
sorting/script.sh
sorting/qsort
sorting/bubblesort

Submit the job to the PBS queuing system by

qsub script.sh

Example 4: mcPi

mcPi/mcPi.c
mcPi/mc-Pi-mpi.c
mcPi/Makefile
mcPi/QmcPi.pbs

Compile by the command: make

Run the serial program by: ./mcPi ##

Submit the job to the PBS queuing system by

qsub QmcPi.pbs

55

Policy
  • Every user shall apply for his/her own computer user account to log in to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.
  • Users must not share their account and password with other users.
  • Every user must submit jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.
  • Users are not allowed to log in to the compute nodes.
  • Foreground jobs on the PC cluster are restricted to program testing, and their duration should not exceed 1 minute of CPU time per job.
Policy (continued)
  • Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.
  • The current restrictions of the job queuing system are as follows:
    • The maximum number of running jobs in the job queue is 8.
    • The maximum total number of CPU cores in use at any one time cannot exceed 1024.
  • These restrictions will be reviewed from time to time as the number of users and the computational needs grow.
Contacts
  • Discussion mailing list
    • hpccc_hkbu@googlegroups.com
  • Technical Support
    • hpccc@sci.hkbu.edu.hk