
Cluster Computing

Kick-start seminar

16 December, 2009

High Performance Cluster Computing Centre (HPCCC)

Faculty of Science

Hong Kong Baptist University


Outline

Overview of Cluster Hardware and Software

Basic Login and Running Program in a job queuing system

Introduction to Parallelism

Why Parallelism & Cluster Parallelism

Message Passing Interface

Parallel Program Examples

Policy for using sciblade.sci.hkbu.edu.hk

Contact

http://www.sci.hkbu.edu.hk/hpccc2009/sciblade

2


Overview of Cluster Hardware and Software


Cluster Hardware

This 256-node PC cluster (sciblade) consists of:

Master node x 2

IO nodes x 3 (storage)

Compute nodes x 256

Blade Chassis x 16

Management network

Interconnect fabric

1U console & KVM switch

Emerson Liebert Nxa 120k VA UPS

4


Sciblade Cluster

256-node cluster supported by funding from the RGC

5


Hardware Configuration

Master Node

Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

IO nodes (Storage)

Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

3TB storage Dell PE MD3000

Compute nodes x 256, each with:

Dell PE M600 blade server w/ Infiniband network

2x Xeon E5430 2.66GHz (Quad Core)

16GB RAM, 73GB SAS drive

6


Hardware Configuration

Blade Chassis x 16

Dell PE M1000e

Each hosts 16 blade servers

Management Network

Dell PowerConnect 6248 (Gigabit Ethernet) x 6

Interconnect fabric

Qlogic SilverStorm 9120 switch

Console and KVM switch

Dell AS-180 KVM

Dell 17FP Rack console

Emerson Liebert Nxa 120kVA UPS

7


Software List

Operating System

ROCKS 5.1 Cluster OS

CentOS 5.3 kernel 2.6.18

Job Management System

Portable Batch System

MAUI scheduler

Compilers, Languages

Intel Fortran/C/C++ Compiler for Linux V11

GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler

8


Software List

Message Passing Interface (MPI) Libraries

MVAPICH 1.1

MVAPICH2 1.2

OPEN MPI 1.3.2

Mathematical libraries

ATLAS 3.8.3

FFTW 2.1.5/3.2.1

SPRNG 2.0a (C/Fortran)

9


Software List

  • Molecular Dynamics & Quantum Chemistry

    • Gromacs 4.0.5

    • Gamess 2009R1

    • Namd 2.7b1

  • Third-party Applications

    • GROMACS, NAMD

    • MATLAB 2008b

    • TAU 2.18.2, VisIt 1.11.2

    • VMD 1.8.6, Xmgrace 5.1.22

    • etc.

10


Software List

Queuing system

Torque/PBS

Maui scheduler

Editors

vi

emacs

11


Hostnames

Master node

External : sciblade.sci.hkbu.edu.hk

Internal : frontend-0

IO nodes (storage)

pvfs2-io-0-0, pvfs2-io-0-1, pvfs2-io-0-2

Compute nodes

compute-0-0.local, …, compute-0-255.local

12



Basic Login and Running Program in a Job Queuing System


Basic login

Remote login to the master node

Terminal login

using secure shell

ssh -l username sciblade.sci.hkbu.edu.hk

Graphical login

PuTTY & vncviewer, e.g.

[username@sciblade]$ vncserver

New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3

It means that your session will run on display 3.

14


Graphical login

Using PuTTY to set up a secure connection: Host Name = sciblade.sci.hkbu.edu.hk

15


Graphical login (cont'd)

ssh protocol version

16


Graphical login (cont'd)

Port = 5900 + display number (i.e. 3 in this case, giving port 5903)

17


Graphical login (cont'd)

Next, click Open, and login to sciblade

Finally, run VNC Viewer on your PC and enter "localhost:3" (3 is the display number)

You should terminate your VNC session after you have finished your work. To terminate your VNC session running on sciblade, run the command

[username@sciblade]$ vncserver -kill :3

18


Linux commands

Both master and compute nodes are installed with Linux

Frequently used Linux commands in the PC cluster are listed at http://www.sci.hkbu.edu.hk/hpccc2009/sciblade/faq_sciblade.php

19


ROCKS specific commands

ROCKS provides the following commands for users to run programs on all compute nodes, e.g.

cluster-fork

Run program in all compute nodes

cluster-fork ps

Check user process in each compute node

cluster-kill

Kill the user's processes on all compute nodes at one time

tentakel

Similar to cluster-fork but runs faster

20


Ganglia

  • Web based management and monitoring

  • http://sciblade.sci.hkbu.edu.hk/ganglia

21



Why Parallelism – Passively

Suppose you are already using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit into your machine's memory.

Parallelization is then the last resort.

23


Why Parallelism – Initiative

Faster

Finish the work earlier

Same work in shorter time

Do more work

More work in the same time

Most importantly, you want to predict the result before the event occurs

24


Examples

Many scientific and engineering problems require enormous computational power.

The following are a few fields to mention:

Quantum chemistry, statistical mechanics, and relativistic physics

Cosmology and astrophysics

Computational fluid dynamics and turbulence

Material design and superconductivity

Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling

Medicine, and modeling of human organs and bones

Global weather and environmental modeling

Machine Vision

25


Parallelism

The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.

The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.

Synchronization and exchange of partial results among processors are therefore unavoidable.

26


Multiprocessing Clustering

[Diagram: Parallel computer architecture. Shared memory (symmetric multiprocessors, SMP): CPUs 1 to n all access one shared memory. Distributed memory (cluster): processors 1 to n each have their own local memory (LM) and exchange data through an interconnecting network.]

27


Clustering: Pros and Cons

Advantages

Memory is scalable with the number of processors: increasing the number of processors increases the total memory size and bandwidth as well.

Each processor can rapidly access its own memory without interference

Disadvantages

Difficult to map existing data structures to this memory organization

User is responsible for sending and receiving data among processors

28




Parallel Programming Paradigm

  • Multithreading (shared memory only)

    • OpenMP

  • Message Passing (shared memory and distributed memory)

    • MPI (Message Passing Interface)

    • PVM (Parallel Virtual Machine)

31
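To make the multithreading paradigm concrete, here is a minimal OpenMP sketch in C (an illustration added here, not part of the original slides): a single directive splits a loop across threads that all share one address space. Compile it with an OpenMP-capable compiler, e.g. gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Each thread handles part of the iterations; the reduction clause
       combines the per-thread partial sums in the shared memory. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("harmonic sum = %f computed by up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}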


Distributed Memory

Programmers view:

Several CPUs

Several blocks of memory

Several threads of action

Parallelization

Done by hand

Example

MPI

[Diagram: a serial process with steps P1, P2, P3 is decomposed into Process 0, Process 1 and Process 2, which exchange data via the interconnection (message passing) as time proceeds.]

32


Message Passing Model

[Diagram: the same decomposition as on the previous slide; the serial steps P1, P2, P3 become Process 0, Process 1 and Process 2, which exchange data by message passing over time.]

Process

A process is a set of executable instructions (program) which runs on a processor.

Message passing systems generally associate only one process per processor, and the terms "processes" and "processors" are used interchangeably

Message Passing

The method by which data from one processor's memory is copied to the memory of another processor.

33
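To make the message passing model concrete, the sketch below (an illustration added here, not from the original slides) copies one integer from the memory of process 0 into the memory of process 1 with MPI_Send and MPI_Recv; run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                        /* data lives in process 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}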



MPI

MPI is a library, not a language, for parallel programming

An MPI implementation consists of

a subroutine library with all MPI functions

include files for the calling application program

some startup script (usually called mpirun, but not standardized)

Include the header file mpi.h (or whatever it is called) in the source code

Libraries available for all major imperative languages (C, C++, Fortran …)

35


General MPI Program Structure

#include <mpi.h>                          /* MPI include file                */

int main(int argc, char *argv[])
{
    int np, rank, ierr;                   /* variable declarations           */

    ierr = MPI_Init(&argc, &argv);        /* initialize MPI environment      */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Do some work and make message passing calls */

    ierr = MPI_Finalize();                /* terminate MPI environment       */
    return 0;
}

36


Sample Program: Hello World!

In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD.

Notes:

Makes use of the pre-defined communicator MPI_COMM_WORLD.

Not testing for error status of routines!

37


Sample Program: Hello World!

#include <stdio.h>
#include "mpi.h"                               // MPI compiler header file

int main(int argc, char **argv)
{
    int nproc, myrank, ierr;

    ierr = MPI_Init(&argc, &argv);             // MPI initialization

    // Get number of MPI processes
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // Get process id for this processor
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello World!! I'm process %d of %d\n", myrank, nproc);

    ierr = MPI_Finalize();                     // Terminate all MPI processes
    return 0;
}

38
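For illustration, compiling this program and launching it with four processes, e.g. mpirun -np 4 ./hello, prints output like the following (the line order varies from run to run because the processes run concurrently):

Hello World!! I'm process 2 of 4
Hello World!! I'm process 0 of 4
Hello World!! I'm process 3 of 4
Hello World!! I'm process 1 of 4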


Performance

When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it.

The goals are:

load balance

memory usage balance

minimize communication overhead

reduce sequential bottlenecks

scalability

39
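The importance of maximizing the parallel fraction is usually quantified by Amdahl's law (a standard result, not stated on the original slide): if a fraction p of the run time can be parallelized across N processors, the achievable speedup is bounded by

Speedup(N) = 1 / ((1 - p) + p / N)

For example, with p = 0.9 the speedup can never exceed 10 no matter how many cores are used, which is why reducing sequential bottlenecks appears in the list above.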


Compiling & Running MPI Programs

  • Using mvapich 1.1

  • Setting path, at the command prompt, type:

    export PATH=/u1/local/mvapich1/bin:$PATH

    (uncomment this line in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

    mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) listing the compute nodes, one per line:

    compute-0-0

    compute-0-1

    compute-0-2

    compute-0-3

  • Run the program with the desired number of processes:

    mpirun -np 4 -machinefile machines ./cpi

40


Compiling & Running MPI Programs

  • Using mvapich2 1.2

  • Prepare .mpd.conf and .mpd.passwd and save them in your home directory:

    MPD_SECRETWORD=gde1234-3

    (you may set your own secret word)

  • Setting environment for mvapich2 1.2

    export MPD_BIN=/u1/local/mvapich2

    export PATH=$MPD_BIN:$PATH

    (uncomment this line in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

    mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section

41


Compiling & Running MPI Programs

  • Start the MPD daemons with the hostfile (mpdboot)

    mpdboot -n 4 -f machines

  • Run the program with the desired number of processes:

    mpiexec -np 4 ./cpi

  • Remember to clean up after running jobs with mpdallexit

    mpdallexit

42


Compiling & Running MPI Programs

  • Using openmpi 1.2

  • Setting environment for openmpi

    export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH

    export PATH=/u1/local/openmpi/bin:$PATH

    (uncomment this line in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

    mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section

  • Run the program with the desired number of processes:

    mpirun -np 4 -machinefile machines ./cpi

43


Submit parallel jobs to the Torque batch queue

  • Prepare a job script (name it runcpi)

  • For program compiled with mvapich 1.1

  • For program compiled with mvapich 1.2

  • For program compiled with openmpi 1.2

  • For GROMACS

    (refer to your handout for the detailed scripts; a rough sketch for the Open MPI case is shown after this slide)

  • Submit the above script (gromacs.pbs)

    qsub gromacs.pbs

  • 44
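The detailed job scripts are given in the handout. Purely as an illustration, a script for the Open MPI case might look like the sketch below; the job name, node count and walltime are assumptions to adapt, while the paths are the ones shown on the Open MPI slide above, and Torque itself supplies $PBS_NODEFILE and $PBS_O_WORKDIR.

    #!/bin/bash
    #PBS -N cpi                       # job name (assumed)
    #PBS -l nodes=4:ppn=8             # 4 nodes x 8 cores each (adjust as needed)
    #PBS -l walltime=00:10:00         # requested run time (assumed)

    cd $PBS_O_WORKDIR                 # directory from which qsub was issued

    export PATH=/u1/local/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH

    # Torque lists the allocated hosts in $PBS_NODEFILE
    mpirun -np 32 -machinefile $PBS_NODEFILE ./cpi

Submit it with qsub runcpi, as described above.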



    Example 1: Matrix-vector Multiplication

    The figure below demonstrates schematically how a matrix-vector multiplication, A=B*C, can be decomposed into four independent computations involving a scalar multiplying a column vector.

    This approach is different from that which is usually taught in a linear algebra course because this decomposition lends itself better to parallelization.

    These computations are independent and do not require communication, something that usually reduces performance of parallel code.

    46
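    As a rough illustration of this decomposition (not the course's own example code), the MPI sketch below has each process multiply its own block of columns of B by the corresponding elements of C and then sum the partial vectors with MPI_Reduce, matching the Reduction (SUM) step shown on a later slide. The matrix size N = 8 and the assumption that N is divisible by the number of processes are choices made only for this sketch.

    #include <stdio.h>
    #include <mpi.h>

    #define N 8                                /* matrix dimension (assumed) */

    int main(int argc, char *argv[])
    {
        int np, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int cols = N / np;                     /* columns owned by this process;
                                                  assumes N is divisible by np */
        double B[N][N], C[N], partial[N], A[N];

        /* Fill B and C with test values; for simplicity every process holds
           the full arrays, but only its own columns are used below. */
        for (int i = 0; i < N; i++) {
            C[i] = i + 1;
            for (int j = 0; j < N; j++)
                B[i][j] = 1.0;
        }

        /* Partial result: sum of (column j of B) * C[j] over the local columns */
        for (int i = 0; i < N; i++)
            partial[i] = 0.0;
        for (int j = rank * cols; j < (rank + 1) * cols; j++)
            for (int i = 0; i < N; i++)
                partial[i] += B[i][j] * C[j];

        /* Reduction (SUM): add the partial vectors into A on process 0 */
        MPI_Reduce(partial, A, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("A[0] = %g\n", A[0]);

        MPI_Finalize();
        return 0;
    }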


    Example 1: Matrix-vector Multiplication (Column wise)

    Figure 1: Schematic of the parallel decomposition for the vector-matrix multiplication A = B*C. The vector A is depicted in yellow. The matrix B and the vector C are depicted in multiple colors representing the portions (columns and elements, respectively) assigned to each processor.

    47


    Example 1: MV Multiplication

    [Diagram: A = B*C computed in parallel; processors P0 to P3 each form a partial result vector from their own columns of B and elements of C, and the partial vectors are combined by a Reduction (SUM).]

    48


    Example 2: Quick Sort

    The quick sort is an in-place, divide-and-conquer, massively recursive sort.

    The efficiency of the algorithm depends strongly on which element is chosen as the pivot point.

    The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.

    If the data to be sorted are not random, choosing the pivot point randomly is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected complexity of O(n log n).

    Pros: Extremely fast.

    Cons: Very complex algorithm, massively recursive

    49
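    For reference, here is a minimal serial quicksort in C. It is only an illustration (not the course file sorting/qsort.c) and picks the middle element as the pivot to avoid the sorted-input worst case mentioned above.

    #include <stdio.h>

    /* Sort a[lo..hi] in place with the divide-and-conquer quicksort. */
    static void quicksort(int a[], int lo, int hi)
    {
        if (lo >= hi)
            return;

        int pivot = a[(lo + hi) / 2];     /* middle element as pivot */
        int i = lo, j = hi;

        while (i <= j) {                  /* partition around the pivot */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        quicksort(a, lo, j);              /* recurse on both halves */
        quicksort(a, i, hi);
    }

    int main(void)
    {
        int a[] = {5, 2, 9, 1, 7, 3};
        int n = sizeof a / sizeof a[0];

        quicksort(a, 0, n - 1);
        for (int i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }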



    Example 3: Bubble Sort

    The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.

    The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.

    The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).

    This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.

    51
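    A minimal C version of the algorithm just described, again only an illustration and not the course file sorting/bubblesort.c:

    #include <stdio.h>

    /* Repeatedly compare neighbours and swap until a full pass makes no swaps. */
    static void bubblesort(int a[], int n)
    {
        int swapped = 1;
        while (swapped) {
            swapped = 0;
            for (int i = 0; i < n - 1; i++) {
                if (a[i] > a[i + 1]) {        /* larger values "bubble" upward */
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                    swapped = 1;
                }
            }
        }
    }

    int main(void)
    {
        int a[] = {5, 2, 9, 1, 7, 3};
        int n = sizeof a / sizeof a[0];

        bubblesort(a, n);
        for (int i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }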



    Monte Carlo Integration

    "Hit and miss" integration

    The integration scheme is to take a large number of random points and count how many fall under f(x) to estimate the area

    53


    SPRNG (Scalable Parallel Random Number Generators)

    Monte Carlo Integration to Estimate Pi

    for (i = 0; i < n; i++) {
        x = sprng();                  /* random point in the unit square */
        y = sprng();
        if ((x*x + y*y) < 1)          /* inside the quarter circle?      */
            in++;
    }
    pi = ((double)4.0)*in/n;          /* hit ratio estimates pi/4; times 4 gives pi */

    SPRNG is used to generate different streams of random numbers on different compute nodes while running in parallel

    54
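    To show how the per-node samples are combined, here is a hedged MPI sketch of the same estimator. It substitutes the standard rand_r() seeded by the process rank for SPRNG, so it only illustrates the parallel structure (independent sampling plus a SUM reduction of the hit counts), not the course's SPRNG-based code in mcPi/mc-Pi-mpi.c.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int np, rank;
        long n = 1000000, in = 0, total_in = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        unsigned int seed = 1234 + rank;          /* different stream per process  */
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y < 1.0)
                in++;                             /* point falls inside the circle */
        }

        /* Sum the hit counts from all processes onto process 0. */
        MPI_Reduce(&in, &total_in, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi ~= %f\n", 4.0 * total_in / ((double)n * np));

        MPI_Finalize();
        return 0;
    }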


    Example 1: Ring

    [Diagram: four processes 0, 1, 2, 3 connected in a ring]

    ring/ring.c

    ring/Makefile

    ring/machines

    Compile the program by the command: make

    Run the program in parallel by

    mpirun -np 4 -machinefile machines ./ring < in

    Example 2: Prime

    prime/prime.c

    prime/prime.f90

    prime/primeParallel.c

    prime/Makefile

    prime/machines

    Compile by the command: make

    Run the serial program by

    ./prime or ./primef

    Run the parallel program by

    mpirun -np 4 -machinefile machines ./primeParallel

    Example 3: Sorting

    sorting/qsort.c

    sorting/bubblesort.c

    sorting/script.sh

    sorting/qsort

    sorting/bubblesort

    Submit the job to the PBS queuing system by

    qsub script.sh

    Example 4: mcPi

    mcPi/mcPi.c

    mcPi/mc-Pi-mpi.c

    mcPi/Makefile

    mcPi/QmcPi.pbs

    Compile by the command: make

    Run the serial program by: ./mcPi ##

    Submit the job to the PBS queuing system by

    qsub QmcPi.pbs

    55



    Policy

    • Every user shall apply for his/her own computer user account to login to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.

    • The user must not share his/her account and password with other users.

    • Every user must deliver jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.

    • Users are not allowed to login to the compute nodes.

    • Foreground jobs on the PC cluster are restricted to program testing, and the duration should not exceed 1 minute of CPU time per job.


    Policy (continued)

    • Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.

    • The current restrictions of the job queuing system are as follows,

      • The maximum number of running jobs in the job queue is 8.

      • The maximum total number of CPU cores used at one time cannot exceed 1024.

    • The above restrictions will be reviewed from time to time in view of the growing number of users and their computational needs.


    Contacts

    • Discussion mailing list

      • [email protected]

    • Technical Support

      • [email protected]

