
Southgreen HPC system


Presentation Transcript


  1. Southgreen HPC system 2014

  2. Concepts • Cluster: a compute farm, i.e. a collection of compute servers that can be shared and accessed through a single “portal” (marmadais.cirad.fr) • Job Manager (SGE): software that lets you initiate and/or send jobs to the cluster compute servers (also known as compute hosts or nodes)

  3. Architecture

  4. Useful links • Available software: http://gohelle.cirad.fr/cluster/doc_logiciels_cluster.xls • Software installation form: http://gohelle.cirad.fr/cluster/logiciel.php

  5. Do not run jobs on marmadais Jobs or programs found running on marmadais could be killed. Submit batch jobs (qsub) or, if really needed, work on a cluster node (qrsh or qlogin)
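  For example, instead of launching a program directly on marmadais, wrap it in a script and submit it from there (a minimal sketch; myscript.sh is a hypothetical script name):

      qsub myscript.sh    # run the script as a batch job on a compute node
      qrsh                # or open an interactive shell on a compute node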

  6. Compute node characteristics • CPU: x86_64 • cc-adm-calcul-1, …, cc-adm-calcul-23: 8 cores, 32 GB RAM • cc-adm-calcul-24, …, cc-adm-calcul-26: 8 cores, 92 GB RAM • cc-adm-calcul-27: 32 cores, 1 TB RAM

  7. Types of jobs/programs • Sequential: uses only one core. • Multi-threaded: uses multiple CPU cores via program threads on a single cluster node. • Parallel (MPI): uses many CPU cores on many cluster nodes through a Message Passing Interface (MPI) environment.

  8. File systems • home2: GPFS (on slow disks) • work: GPFS (on fast disks) • Everything else: NFS

  9. Step by step • Preparation: data and script • Submission • Job computation • Checking the output

  10. Using the cluster All computing on the cluster is done by logging into marmadais.cirad.fr and submitting batch or interactive jobs. • Job submission: qsub, qrsh, qsh/qlogin • Job management: qmod, qdel, qhold, qrls, qalter • Cluster information display: qstat, qhost, qconf • Accounting: qacct
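  A typical session might look like this (a sketch; the job name and job-ID are illustrative):

      qsub -N myjob myscript.sh   # submit a batch job named "myjob"
      qstat                       # check its state (qw: waiting, r: running, ...)
      qhost                       # list compute hosts and their current load
      qacct -j 4242               # accounting summary once job 4242 has finished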

  11. Interactive jobs • qrsh establishes a remote shell connection (ssh) on a cluster node which meets your requirements. • qlogin (bigmem.q) establishes a remote shell connection (ssh) with X display export on a cluster node which meets your requirements.
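  For example (the memory request is illustrative; resource requests are covered in more detail below):

      qrsh -l mem_free=4G     # interactive shell on a node with at least 4 GB free
      qlogin -l mem_free=4G   # same, with X display export for graphical tools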

  12. A first look at batch job submission • Create a shell script: vi script1.sh • The shell script contains: hostname (show the name of the cluster node this job is running on), ps -ef (see what's running on the cluster node), sleep 30 (sleep for 30 seconds), echo -e "\n --------- Done" • Submit the shell script with a given job name (note the use of the -N option): qsub -N mytest script1.sh • Check the status of your job(s): qstat • The output (stdout and stderr) goes to files in your home directory: ls mytest.o* mytest.e*
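  Put together, script1.sh would contain (assembled from the commands above; the #!/bin/bash first line is an assumption, as SGE can also be told which shell to use via the -S option):

      #!/bin/bash
      hostname                      # show the name of the cluster node this job runs on
      ps -ef                        # see what's running on the cluster node
      sleep 30                      # sleep for 30 seconds
      echo -e "\n --------- Done"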

  13. A few options • Run the job in the current working directory (where qsub was executed) rather than the default (home directory): -cwd • Send the standard output (error) stream to a different file: -o path/filename, -e path/filename • Merge the standard error stream into the standard output: -j y
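  Combining these options in one submission (the log file name is illustrative):

      qsub -cwd -N mytest -o mytest.log -j y script1.sh   # stdout and stderr merged into mytest.log in the current directory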

  14. Memory limitations Resources are limited and shared … so specify your resource requirements Ask for the amount of memory that you expect your job will need so that it runs on a cluster node with sufficient memory available … AND … place a limit on the amount of memory your job can use so that you don’t accidentally crash a compute node and take down other users’ jobs along with yours: qsub -cwd -l mem_free=6G,h_vmem=8G myscript.sh qrsh -l mem_free=8G,h_vmem=10G If not set, we do it for you. 
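  These requests can also be embedded in the script itself as #$ directives, which SGE reads at submission time (a sketch; myprogram is hypothetical, the 6G/8G values are those from the example above):

      #!/bin/bash
      #$ -cwd
      #$ -l mem_free=6G,h_vmem=8G
      myprogram input.dat   # hypothetical program; killed if it exceeds h_vmem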

  15. Simple (sequential) job • Template: /usr/local/bioinfo/template/job_sequentiel.sh
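  The template itself is not reproduced here; a minimal sequential job script along the same lines might look like this (all names are illustrative):

      #!/bin/bash
      #$ -N seqjob
      #$ -cwd
      ./my_sequential_program   # uses a single core, i.e. a single slot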

  16. Multi-threaded job • Template: /usr/local/bioinfo/template/job_multithread.sh
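  A sketch of a multi-threaded job script, assuming (from its name) that parallel_smp is the single-node parallel environment listed below:

      #!/bin/bash
      #$ -N smpjob
      #$ -cwd
      #$ -pe parallel_smp 8        # reserve 8 slots on one node
      ./my_threaded_program -t 8   # hypothetical: tell the program to use 8 threads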

  17. MPI job • Template: /usr/local/bioinfo/template/job_parallel.sh
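  A sketch of an MPI job script, assuming parallel_fill allocates slots across several nodes; $NSLOTS is set by SGE to the number of slots actually granted:

      #!/bin/bash
      #$ -N mpijob
      #$ -cwd
      #$ -pe parallel_fill 16               # slots may span several nodes
      mpirun -np $NSLOTS ./my_mpi_program   # hypothetical MPI binary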

  18. Parallel environments • parallel_2 • parallel_4 • parallel_8 • parallel_fill • parallel_rr • parallel_smp
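  A parallel environment is chosen at submission time with -pe, followed by the number of slots. The semantics of each environment are assumptions based on the names (smp: single node, fill: fill nodes in turn, rr: round-robin across nodes, 2/4/8: fixed slots per node):

      qsub -pe parallel_rr 12 job_parallel.sh   # 12 slots spread round-robin across nodes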

  19. How many jobs can I submit? As many as you want … BUT only a limited number of “slots” will run; the rest stay in the queued-waiting state ‘qw’ and start as your other jobs finish. In SGE, a slot generally corresponds to a single CPU core. The maximum number of slots per user may change depending on the availability of cluster resources or on special needs and requests.

  20. Queues
  • bigmem.q: 92 GB RAM nodes; access: everybody; runtime: no limit; normal priority
  • bioinfo.q: 32 GB RAM nodes; access: internal users; runtime: 48 hours maximum; normal priority
  • long.q: for long-running jobs (32 GB RAM nodes); access: everybody; runtime: no limit; low priority
  • normal.q: 32 GB RAM nodes; access: external users; runtime: 48 hours maximum; normal priority
  • urgent.q: any node; access: limited; runtime: 24 hours maximum; highest priority
  • hypermem.q: 1 TB RAM node; access: everybody; runtime: no limit; normal priority
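  A queue can be requested explicitly with -q; otherwise the scheduler picks one that matches your resource requests (bigjob.sh is a hypothetical script needing one of the 92 GB nodes):

      qsub -q bigmem.q -l mem_free=50G bigjob.sh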

  21. qstat • Info for a given user: qstat -u username (or just qstat or qu for your own jobs) • Full dump of queue and job status: qstat -f • What do the column labels mean? job-ID: a unique identifier for the job; name: the name of the job; state: the state of the job (r: running, s: suspended, t: being transferred to an execution host, qw: queued and waiting to run, Eqw: an error occurred with the job) • Why is my job in the Eqw state? qstat -j job-ID -explain E

  22. Visualisation

  23. qdel • To terminate a job, first get its job-id with qstat: qstat (or qu) • Terminate the job: qdel job-id • Forced termination of a running job (admins only): qdel -f job-id
