Distributed & Parallel Computing Cluster
Patrick McGuigan ([email protected])
2/12/04

DPCC Background
  • NSF-funded Major Research Instrumentation (MRI) grant
  • Goals
  • Personnel
    • PI
    • Co-PIs
    • Senior Personnel
    • Systems Administrator

DPCC Goals
  • Establish a regional Distributed and Parallel Computing Cluster at UTA ([email protected])
  • An inter-departmental and inter-institutional facility
  • Facilitate collaborative research that requires large-scale storage (terabytes to petabytes), high-speed access (gigabit or more), and massive processing (hundreds of processors)

DPCC Research Areas
  • Data mining / KDD
    • Association rules, graph mining, stream processing, etc.
  • High Energy Physics
    • Simulation, moving towards a regional Dø Center
  • Dermatology/skin cancer
    • Image database, lesion detection and monitoring
  • Distributed computing
    • Grid computing, PICO
  • Networking
    • Non-intrusive network performance evaluation
  • Software Engineering
    • Formal specification and verification
  • Multimedia
    • Video streaming, scene analysis
  • Facilitate collaborative efforts that need high-performance computing

DPCC Personnel
  • PI Dr. Chakravarthy
  • Co-PIs
    • Drs. Aslandogan, Das, Holder, Yu
  • Senior Personnel
    • Paul Bergstresser, Kaushik De, Farhad Kamangar, David Kung, Mohan Kumar, David Levine, Jung-Hwan Oh, Gergely Zaruba
  • Systems Administrator
    • Patrick McGuigan

DPCC Components
  • Establish a distributed-memory cluster (150+ processors)
  • Establish a symmetric (shared-memory) multiprocessor system
  • Establish large, shareable, high-speed storage (hundreds of terabytes)

DPCC Cluster as of 2/1/2004
  • Located in 101 GACB
  • Inauguration 2/23 as part of E-Week
  • 5 racks of equipment + UPS

DPCC Resources
  • 97 machines
    • 81 worker nodes
    • 2 interactive nodes
    • 10 IDE based RAID servers
    • 4 nodes support Fibre Channel SAN
  • 50+ TB storage
    • 4.5 TB in each IDE RAID
    • 5.2 TB in FC SAN

DPCC Resources (continued)
  • 1 Gb/s network interconnections
    • core switch
    • satellite switches
  • 1 Gb/s SAN network
  • UPS

DPCC Resource Details
  • Worker nodes
    • Dual Xeon processors
      • 32 machines @ 2.4 GHz
      • 49 machines @ 2.6 GHz
    • 2 GB RAM
    • IDE storage
      • 32 machines @ 60 GB
      • 49 machines @ 80 GB
    • Red Hat Linux 7.3 (2.4.20 kernel)

DPCC Resource Details (cont.)
  • RAID server
    • Dual Xeon processors (2.4 GHz)
    • 2 GB RAM
    • 4 RAID controllers
      • 2-port controller (qty 1): mirrored OS disks
      • 8-port controllers (qty 3): RAID 5 with hot spare
    • 24 × 250 GB disks
    • 2 × 40 GB disks
    • NFS used to serve storage to worker nodes

DPCC Resource Details (cont.)
  • FC SAN
    • RAID 5 array
    • 42 × 142 GB FC disks
    • FC switch
    • 3 GFS nodes
      • Dual Xeon (2.4 GHz)
      • 2 GB RAM
      • Global File System (GFS)
      • Served to the cluster via NFS
    • 1 GFS lock server

Using DPCC
  • Two nodes available for interactive use
    • master.dpcc.uta.edu
    • grid.dpcc.uta.edu
  • More nodes are likely to be added to support other services (web, DB access)
  • Access is through SSH (version 2 client)
    • Freeware Windows clients are available (ssh.com)
    • File transfers through SCP/SFTP (example commands below)
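As an illustration, typical commands from a Linux/Unix machine might look like the following; the user name is a placeholder:

$ ssh jsmith@master.dpcc.uta.edu                      # interactive login
$ scp results.tar.gz jsmith@master.dpcc.uta.edu:~/    # copy a file to your home directory
$ sftp jsmith@grid.dpcc.uta.edu                       # interactive file transfer session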

Using DPCC (continued)
  • User quotas are not yet implemented on home directories, so be sensible in your usage.
  • Large data sets should be stored on the RAID servers (requires coordination with the systems administrator)
  • All storage visible to all nodes.

Getting Accounts
  • Have your supervisor request an account
  • The account will be created
  • Bring your ID to 101 GACB to receive your password
    • Keep your password safe
  • Log in to either interactive machine
    • master.dpcc.uta.edu
    • grid.dpcc.uta.edu
  • Use the yppasswd command to change your password
  • If you forget your password:
    • See me in my office (with ID) and I will reset it
    • Or call or e-mail me and I will reset it to the original password

User environment
  • Default shell is bash
    • Change it with ypchsh
    • Customize your environment using the startup files
      • .bash_profile (login sessions)
      • .bashrc (non-login sessions)
    • Customize with statements like the following (see the sketch after this list):
      • export <variable>=<value>
      • source <shell file>
    • Much more information is in the bash man page
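A minimal sketch of what these startup files might contain; the specific paths and variables (the MPICH bin directory, EDITOR) are illustrative assumptions, not site defaults:

# ~/.bash_profile -- read by login shells
# pull in the non-login settings as well
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi

# ~/.bashrc -- read by non-login shells
# example customizations (adjust to taste)
export PATH=$PATH:/usr/local/mpich-1.2.5/bin
export EDITOR=vi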

Program development tools
  • GCC 2.96
    • C
    • C++
    • Java (gcj)
    • Objective C
    • CHILL
  • Java
    • Sun J2SDK 1.4.2
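For example, small programs might be built and run as follows (file and class names are made up for illustration):

$ gcc -Wall -o hello hello.c          # C with GCC 2.96
$ g++ -Wall -o hello++ hello.cpp      # C++
$ javac Hello.java && java Hello      # Java with the Sun J2SDK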

Development tools (cont.)
  • Python
    • python = version 1.5.2
    • python2 = version 2.2.2
  • Perl
    • Version 5.6.1
  • Flex, Bison, gdb…
  • If your favorite tool is not available, we’ll consider adding it!

Batch Queue System
  • OpenPBS
    • Server runs on master
    • pbs_mom runs on worker nodes
    • Scheduler runs on master
    • Jobs can be submitted from any interactive node
    • User commands (see the sample session below)
      • qsub – submit a job for execution
      • qstat – determine status of job, queue, server
      • qdel – delete a job from the queue
      • qalter – modify attributes of a job
    • Single queue (workq)
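A quick sketch of how these commands fit together; the job script name and job ID are hypothetical:

$ qsub myjob.sh                                   # submit; PBS prints the job ID
16001.master.cluster
$ qstat 16001.master.cluster                      # check its status
$ qalter -l cput=12:00:00 16001.master.cluster    # change a queued job's CPU-time limit
$ qdel 16001.master.cluster                       # remove it from the queue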

PBS qsub
  • qsub used to submit jobs to PBS
    • A job is represented by a shell script
    • Shell script can alter environment and proceed with execution
    • Script may contain embedded PBS directives
    • Script is responsible for starting parallel jobs (not PBS)

Hello World

[[email protected] pbs_examples]$ cat helloworld
echo Hello World from $HOSTNAME
[[email protected] pbs_examples]$ qsub helloworld
15795.master.cluster
[[email protected] pbs_examples]$ ls
helloworld helloworld.e15795 helloworld.o15795
[[email protected] pbs_examples]$ more helloworld.o15795
Hello World from node1.cluster

Hello World (continued)
  • Job ID is returned from qsub
  • Default attributes allow job to run
    • 1 Node
    • 1 CPU
    • 36:00:00 CPU time
  • Standard output and standard error streams are returned as <jobname>.o<id> and <jobname>.e<id> files

Hello World (continued)
  • Environment of job
    • Defaults to the login shell (override with #! or the -S switch)
    • The login environment variable list, with PBS additions:
      • PBS_O_HOST
      • PBS_O_QUEUE
      • PBS_O_WORKDIR
      • PBS_ENVIRONMENT
      • PBS_JOBID
      • PBS_JOBNAME
      • PBS_NODEFILE
      • PBS_QUEUE
    • Additional environment variables may be passed using the -v switch (example below)
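For instance, a hypothetical submission that overrides the shell and passes two extra variables (the variable names and values are illustrative only):

$ qsub -S /bin/bash -v DATASET=/data11/mydata,RUNMODE=test helloworld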

PBS Environment Variables
  • PBS_ENVIRONMENT
  • PBS_JOBCOOKIE
  • PBS_JOBID
  • PBS_JOBNAME
  • PBS_MOMPORT
  • PBS_NODENUM
  • PBS_O_HOME
  • PBS_O_HOST
  • PBS_O_LANG
  • PBS_O_LOGNAME
  • PBS_O_MAIL
  • PBS_O_PATH
  • PBS_O_QUEUE
  • PBS_O_SHELL
  • PBS_O_WORKDIR
  • PBS_QUEUE
  • PBS_TASKNUM
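As a sketch, a small job script that simply prints a few of these variables so you can see what PBS sets:

#!/bin/sh
#PBS -N printenv-demo
echo "Job ID:      $PBS_JOBID"
echo "Job name:    $PBS_JOBNAME"
echo "Submit host: $PBS_O_HOST"
echo "Submit dir:  $PBS_O_WORKDIR"
echo "Queue:       $PBS_QUEUE"
echo "Node file:   $PBS_NODEFILE"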

qsub options
  • Output streams:
    • -e (error output path)
    • -o (standard output path)
    • -j (join error + output as either output or error)
  • Mail options
    • -m [aben] when to mail (abort, begin, end, none)
    • -M who to mail
  • Name of job
    • -N (up to 15 printable characters; the first must be alphabetic)
  • Which queue to submit job to
    • -q [name] Unimportant for now
  • Environment variables
    • -v pass specific variables
    • -V pass all environment variables of qsub to job
  • Additional attributes
    • -W specify dependencies (e.g., -W depend=…)

Qsub options (continued)
  • -l switch used to specify needed resources
    • Number of nodes
      • nodes = x
    • Number of processors
      • ncpus = x
    • CPU time
      • cput=hh:mm:ss
    • Walltime
      • walltime=hh:mm:ss
  • See man page for pbs_resources

Hello World
  • qsub -l nodes=1 -l ncpus=1 -l cput=36:00:00 -N helloworld -m a -q workq helloworld
  • Options can be included in script:

#PBS -l nodes=1
#PBS -l ncpus=1
#PBS -m a
#PBS -N helloworld2
#PBS -l cput=36:00:00
echo Hello World from $HOSTNAME

qstat
  • Used to determine status of jobs, queues, server
  • $ qstat
  • $ qstat <job id>
  • Switches
    • -u <user> list jobs of user
    • -f provides extra output
    • -n provides nodes given to job
    • -q status of the queue
    • -i show idle jobs
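A few illustrative invocations; the user name is a placeholder and the job ID comes from the earlier Hello World example:

$ qstat -u jsmith                   # jobs belonging to user jsmith
$ qstat -n 15795.master.cluster     # show the nodes given to a job
$ qstat -q                          # status of the queue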

qdel & qalter
  • qdel used to remove a job from a queue
    • qdel <job ID>
  • qalter used to alter the attributes of a currently queued job
    • qalter <attributes> <job id> (attributes are similar to qsub options)
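For example, a sketch using the job ID from the earlier Hello World run (the new limit is arbitrary):

$ qalter -l cput=48:00:00 15795.master.cluster
$ qdel 15795.master.cluster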

Processing on a worker node
  • All RAID storage visible to all nodes
    • /dataxy, where x is the RAID server ID and y is the volume (1-3)
    • /gfsx, where x is the GFS volume (1-3)
  • Local storage on each worker node
    • /scratch
  • Data-intensive applications should, when possible, copy input data to /scratch for manipulation and copy results back to RAID storage (see the sketch below)
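A sketch of that staging pattern as a PBS job script; the data paths and program name are hypothetical:

#!/bin/sh
#PBS -N staging-demo
#PBS -l nodes=1
#PBS -l cput=02:00:00
# stage input from shared RAID storage to fast local disk
mkdir -p /scratch/$PBS_JOBID
cp /data11/mydataset/input.dat /scratch/$PBS_JOBID/
# run against the local copy
cd /scratch/$PBS_JOBID
~/bin/myanalysis input.dat > output.dat
# copy results back to RAID storage and clean up
cp output.dat /data11/mydataset/
rm -rf /scratch/$PBS_JOBID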

Parallel Processing
  • MPI installed on interactive + worker nodes
    • MPICH 1.2.5
    • Path: /usr/local/mpich-1.2.5
  • Asking for multiple processors (worker nodes have two CPUs each)
    • -l nodes=x
    • -l ncpus=2x

Parallel Processing (continued)
  • PBS node file created when job executes
  • Available to job via $PBS_NODEFILE
  • Used to start processes on remote nodes
    • mpirun
    • rsh

Using node file (example job)

#!/bin/sh
#PBS -m n
#PBS -l nodes=3:ppn=2
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -o helloworld.out
#PBS -N helloworld_mpi
NN=`cat $PBS_NODEFILE | wc -l`
echo "Processors received = "$NN
echo "script running on host `hostname`"
cd $PBS_O_WORKDIR
echo
echo "PBS NODE FILE"
cat $PBS_NODEFILE
echo
/usr/local/mpich-1.2.5/bin/mpirun -machinefile $PBS_NODEFILE -np $NN ~/mpi-example/helloworld

MPI
  • Shared Memory vs. Message Passing
  • MPI
    • C-based library that allows programs to communicate
    • Each cooperating process runs the same program image
    • Different processes can perform different computations based on their rank
    • MPI primitives allow for construction of more sophisticated synchronization mechanisms (barrier, mutex)

helloworld.c

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int rank, size;
    char host[256];
    int val;

    /* get this node's hostname, falling back to a placeholder */
    val = gethostname(host, 255);
    if ( val != 0 ) {
        strcpy(host, "UNKNOWN");
    }

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );   /* total number of processes */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );   /* this process's rank */
    printf( "Hello world from node %s: process %d of %d\n", host, rank, size );
    MPI_Finalize();
    return 0;
}

Using MPI programs
  • Compiling
    • $ /usr/local/mpich-1.2.5/bin/mpicc -o helloworld helloworld.c
  • Executing
    • $ /usr/local/mpich-1.2.5/bin/mpirun <options> helloworld
    • Common options:
      • -np number of processes to create
      • -machinefile list of nodes to run on
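For example (the machine file and process count are hypothetical; on the cluster the machine file would normally be $PBS_NODEFILE inside a batch job, as in the earlier example):

$ /usr/local/mpich-1.2.5/bin/mpicc -o helloworld helloworld.c
$ /usr/local/mpich-1.2.5/bin/mpirun -np 4 -machinefile mynodes ./helloworld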

Resources for MPI
  • http://www-hep.uta.edu/~mcguigan/dpcc/mpi
    • MPI documentation
  • http://www-unix.mcs.anl.gov/mpi/indexold.html
    • Links to various tutorials
  • Parallel programming course
