Ian C. Smith*

Introduction to research computing using Condor. Ian C. Smith*. *Advanced Research Computing, University of Liverpool. Overview: what Condor is and what it can be used for; typical Condor pool operation; the University of Liverpool Condor Pool; support for MATLAB and R applications.


  1. Introduction to research computing using Condor • Ian C. Smith* • *Advanced Research Computing, University of Liverpool

  2. Overview • what is Condor and what can it be used for? • typical Condor pool operation • University of Liverpool Condor Pool • support for MATLAB and R applications • some research computing examples • quick introduction to UNIX with a walk-through example

  3. What is Condor? • a specialized system for delivering High Throughput Computing • a harvester of unused computing resources • developed by the Computer Science Dept at the University of Wisconsin in the late ’80s • free and (now) open source software • widely used in academia and increasingly in industry • available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

  4. Types of Condor application • typically - large numbers of independent calculations (“pleasantly parallel”) • data parallel applications – split large datasets into smaller parts and process them in parallel • biological sequence analysis (e.g. BLAST) • processing of field trial data • optimisation problems • microprocessor design and testing • applications based on Monte Carlo methods • radiotherapy treatment analysis • epidemiological studies
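
The data-parallel idea above (split a large dataset into smaller parts and process each part as a separate job) can be sketched with standard UNIX tools. This is only an illustration: the file name `dataset.txt`, the `part_` prefix and the partition size of 250 records are assumptions, not part of the Liverpool tools.

```shell
# Create a stand-in dataset of 1000 records, one per line.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/dataset.txt"

# Split it into partitions of 250 records each. split names the pieces
# part_aa, part_ab, ... so each Condor job could be given one partition.
split -l 250 "$workdir/dataset.txt" "$workdir/part_"

# Count the partitions and show how many records each one holds.
n_parts=$(ls "$workdir"/part_* | wc -l | tr -d ' ')
echo "created $n_parts partitions"
wc -l "$workdir"/part_*
```

Choosing the partition size is how job length is tuned: smaller partitions give shorter jobs, which suits a pool where jobs can be evicted.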

  5. A “typical” Condor pool [diagram: desktop PC, Condor server, execute hosts] • login and upload input data (desktop PC to Condor server)

  6. A “typical” Condor pool [diagram: desktop PC, Condor server, execute hosts] • jobs are sent from the Condor server to the execute hosts

  7. A “typical” Condor pool [diagram: desktop PC, Condor server, execute hosts] • results are returned from the execute hosts to the Condor server

  8. A “typical” Condor pool [diagram: desktop PC, Condor server, execute hosts] • download results (Condor server to desktop PC)

  9. University of Liverpool Condor Pool • contains around 700 classroom PCs running the CSD Managed Windows 7 Service (mostly 64 bit from next year) • most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots per PC (total of 1400 job slots) • single job submission point for Condor jobs provided by powerful UNIX server • jobs continue to run while classroom PCs are unused but ... • if load (or memory use) becomes significant, job will be killed and usually any results will be lost (job will start again from scratch) • tools provided for running large numbers of MATLAB and R jobs

  10. Condor caveats • only suitable for non-interactive applications • no communication between jobs possible • all files needed by application must be present on local disk • shorter jobs more likely to run to completion (10-20 min seems to work best) • long running jobs can be run if save/restore mechanism (checkpointing) is built into them • tricky to begin with but usually worth the initial effort
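
The save/restore (checkpointing) idea mentioned above can be sketched in a few lines of shell. This is a toy under stated assumptions: the function name `run_sum`, the checkpoint format (two numbers on one line) and the "work" (summing integers) are all made up for illustration. A real job would save whatever state it needs, for example MATLAB variables in a .mat file.

```shell
# Sum the integers 1..TARGET, saving progress after every step so that a
# job killed part-way through can pick up where it left off.
# Usage: run_sum CHECKPOINT_FILE TARGET MAX_STEPS_THIS_RUN
run_sum() {
    ckpt=$1; target=$2; max_steps=$3
    i=0; total=0
    # Restore the state left behind by an earlier, interrupted run (if any).
    if [ -f "$ckpt" ]; then
        read -r i total < "$ckpt"
    fi
    steps=0
    while [ "$i" -lt "$target" ] && [ "$steps" -lt "$max_steps" ]; do
        i=$((i + 1))
        total=$((total + i))
        steps=$((steps + 1))
        # Write to a temporary file and rename it, so a kill during the
        # write cannot leave a half-written checkpoint behind.
        printf '%s %s\n' "$i" "$total" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
    echo "$total"
}

ckpt_file=$(mktemp)
rm -f "$ckpt_file"                       # start with no checkpoint
first=$(run_sum "$ckpt_file" 100 40)     # simulated eviction after 40 steps
second=$(run_sum "$ckpt_file" 100 1000)  # restarted job resumes at step 41
echo "after first run: $first, after restart: $second"
```

The restarted run does only the remaining 60 steps rather than starting again from scratch, which is the point of building checkpointing into long-running jobs.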

  11. Running MATLAB jobs under Condor • need to create standalone application from M-file(s) using MATLAB compiler • standalone application can run without a MATLAB license • run-time libraries still need to be accessible to MATLAB jobs • nearly all toolbox functions available to standalone applications • simple (but powerful) file input/output makes checkpointing easier • tools available to simplify job submission - see Liverpool Condor website for more information

  12. Running R jobs under Condor • limited support at present • R is installed on-the-fly as part of the job • currently only R version 2.6.2 available with standard packages • tools available to simplify job submission • checkpointing may be possible for long running jobs

  13. Personalised Medicine example • project is a Genome-Wide Association Study • aims to identify genetic predictors of response to anti-epileptic drugs • try to identify regions of the human genome that differ between individuals (referred to as SNPs) • 800 patients genotyped at 500 000 SNPs along the entire genome • test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects) • very large data-parallel problem using R – ideal for Condor • divide datasets into small partitions so that individual jobs run for 15-30 minutes • batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC
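
The figures quoted above can be cross-checked with some integer arithmetic. The sketch below uses only numbers from the slide (2 600 jobs, 26 chromosomes, ~5 hours on Condor, ~5 weeks on a single PC). The variable names are illustrative, and the results are rough because the inputs are approximate.

```shell
jobs=2600
chromosomes=26
condor_hours=5
single_pc_hours=$((5 * 7 * 24))          # 5 weeks expressed in hours

jobs_per_chromosome=$((jobs / chromosomes))
speedup=$((single_pc_hours / condor_hours))
# Average job length implied by the single-PC time, in whole minutes.
avg_job_minutes=$((single_pc_hours * 60 / jobs))

echo "jobs per chromosome: $jobs_per_chromosome"
echo "approximate speed-up: ${speedup}x"
echo "implied average job length: ~$avg_job_minutes minutes"
```

The implied average job length of about 19 minutes sits inside the 15-30 minute target stated for the partition size.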

  14. Radiotherapy example • large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods • tried running simulation on 56 cores of high performance computing cluster but no progress after 5 weeks • divided problem into 250 then 5 000 and eventually 50 000 Condor jobs • required ~ 2 600 days of cpu time (equivalent to ~ 3.5 years on dual core PC) • Condor simulation completed in less than one week • average run time was ~ 70 min • only ~ 10 % of compute time wasted due to evictions
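
A similar back-of-envelope check works for the radiotherapy figures. Only the slide's numbers are used (50 000 jobs, ~70 min average run time, ~2 600 days of CPU time, under one week elapsed). How the CPU time was measured is not stated, so treat the comparison as indicative only.

```shell
cpu_days=2600
jobs=50000
avg_job_minutes=70

# CPU time implied by job count x average run time, in whole days.
implied_days=$((jobs * avg_job_minutes / 60 / 24))

# Equivalent time on a dual-core PC, kept in tenths of a year so that
# integer arithmetic is enough.
dual_core_tenths=$((cpu_days * 10 / 2 / 365))

# If 2600 CPU-days finish within 7 days, at least this many job slots
# must have been busy on average.
min_avg_slots=$((cpu_days / 7))

echo "implied CPU time: ~$implied_days days"
echo "dual-core equivalent: ~$((dual_core_tenths / 10)).$((dual_core_tenths % 10)) years"
echo "average busy slots needed: at least $min_avg_slots"
```

The implied total is in the same region as the quoted ~2 600 days, and the required average of a few hundred busy slots fits within the pool's 1400 job slots.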

  15. Condor service prerequisites • will need a Sun UNIX service account (contact CSD helpdesk@liv.ac.uk) and a Condor account (http://www.liv.ac.uk/csd/registration/eScienceform.pdf)
      • to log in to the Condor server:
          - on MWS use PuTTy: Install University Applications | Internet | PuTTy 0.60
          - Mac/Linux: open a terminal window and use ssh
          - off campus: use Apps Anywhere (PuTTy is in the Utilities group)
      • to upload/download files to/from the Condor server:
          - on MWS use CoreFTPLite: Install University Applications | Internet | CoreFTP LE2.1
          - Mac/Linux: open a terminal window and use sftp/scp
          - off campus: need to use a virtual private network (VPN), then FTP
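
For Mac/Linux users, the ssh and scp commands might look like the sketch below. The values are placeholders: `smithic` is the example user from the walk-through, and the host name is an assumption taken from the condor_q output shown later, not a confirmed login address. The commands are printed rather than run, so the sketch is safe to try anywhere.

```shell
# Placeholders: substitute your own account name and the server name you
# are given when your Condor account is set up.
condor_user="smithic"               # example user from the walk-through
condor_host="ulgp5.liv.ac.uk"       # assumption: host seen in condor_q output
condor_dir="/condor_data/${condor_user}"

login_cmd="ssh ${condor_user}@${condor_host}"
upload_cmd="scp input0.mat ${condor_user}@${condor_host}:${condor_dir}/matlab/"
download_cmd="scp ${condor_user}@${condor_host}:${condor_dir}/matlab/output.zip ."

# Print the commands instead of executing them.
printf '%s\n' "$login_cmd" "$upload_cmd" "$download_cmd"
```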

  16. PuTTy login

  17. PuTTy login

  18. PuTTy login

  19. CoreFTPLite

  20. CoreFTPLite

  21. CoreFTPLite

  22. CoreFTPLite

  23. CoreFTPLite

  24. CoreFTPLite

  25. CoreFTPLite – download files

  26. CoreFTPLite – download files

  27. Condor server directory tree • / (or ‘root’) contains /condor_data, /usr, /bin, /sbin, /home and /tmp

  28. Condor server directory tree • login ‘home’ directories are under /home: /home/smithic, /home/jim, /home/fred

  29. Condor server directory tree • ‘home’ directories for Condor are under /condor_data: /condor_data/smithic, /condor_data/jim

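  30. MATLAB Condor example • calculate the sum of p matrix-matrix products • each product calculation is independent and can be performed in parallel • MATLAB M-file (product.m):

        function product
        load input.mat;              % read matrices A and B for this job
        C=A*B;                       % one independent matrix-matrix product
        save( 'output.mat', 'C' );   % write this job's result
        quit;                        % exit so the standalone job terminates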

  37. Job submission example

        [smithic@ulgp5 multiple]$ cd /condor_data/smithic                    #change directory
        [smithic@ulgp5 smithic]$ tar xf /opt1/condor/examples/handson.tar   #get examples
        [smithic@ulgp5 smithic]$ cd matlab                                   #now in /condor_data/smithic/matlab
        [smithic@ulgp5 matlab]$ ls                                           #list files
        input0.mat  input2.mat  input4.mat  product
        input1.mat  input3.mat  product.m   product.exe
        [smithic@ulgp5 matlab]$ matlab_build product.m                       #create standalone executable
        Submitting job(s).
        1 job(s) submitted to cluster 503.
        [smithic@ulgp5 matlab]$ condor_q                                     #get Condor queue status
        -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
         ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
         503.0   smithic   6/7  15:19  0+00:00:10 R  0   0.0  runscript.bat wrap
        1 jobs; 0 idle, 1 running, 0 held
        [smithic@ulgp5 matlab]$ condor_q                                     #job has finished when gone from queue
        -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
         ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
        0 jobs; 0 idle, 0 running, 0 held

  41. Job submission example

        [smithic@ulgp5 matlab]$ ls
        input0.mat  input2.mat  input4.mat  product.bat  product.exe.manifest  product.sub
        input1.mat  input3.mat  product     product.exe  product.m
        [smithic@ulgp5 matlab]$ cat product                                  #display file contents
        executable=product.exe
        indexed_input_files=input.mat
        indexed_output_files=output.mat
        total_jobs=5
        [smithic@ulgp5 matlab]$ matlab_submit product                        #submit multiple Matlab jobs
        Submitting job(s).....
        5 job(s) submitted to cluster 511.
        [smithic@ulgp5 matlab]$ condor_q                                     #get status of jobs
        -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
         ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
         511.0   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
         511.1   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
         511.2   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
         511.3   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
         511.4   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
        5 jobs; 0 idle, 5 running, 0 held
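
The simplified job file `product` hides a conventional Condor submit description file (the generated `product.sub`, whose contents are not shown in the slides). The sketch below writes a hypothetical equivalent to show roughly what the four settings expand to. The submit commands used are standard Condor ones, but the specific values, the log file name and the assumption that the wrapper `product.bat` renames the indexed input file to `input.mat` are guesses for illustration only.

```shell
# Write a hypothetical submit description file to a temporary location.
# The quoted 'EOF' stops the shell from expanding $(Process) itself.
subfile=$(mktemp)
cat > "$subfile" <<'EOF'
# Hypothetical equivalent of the simplified job file "product".
universe                = vanilla
executable              = product.bat
arguments               = product.exe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Assumption: the wrapper renames input$(Process).mat to input.mat.
transfer_input_files    = product.exe, input$(Process).mat
transfer_output_remaps  = "output.mat = output$(Process).mat"
log                     = product.log
queue 5
EOF

# Show which files each of the 5 jobs (index 0..4) would read and write.
i=0
while [ "$i" -lt 5 ]; do
    echo "job $i: input$i.mat -> output$i.mat"
    i=$((i + 1))
done
```

The index substitution is what lets one description cover every job: Condor replaces `$(Process)` with 0, 1, 2, ... for each queued job, matching the `inputN.mat` and `outputN.mat` names seen in the directory listings.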

  45. Job submission example

        [smithic@ulgp5 matlab]$ condor_q                                     #some jobs completed, one still running
        -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
         ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
         511.0   smithic   6/7  16:01  0+00:00:25 R  0   0.0  product.bat produc
        1 jobs; 0 idle, 1 running, 0 held
        [smithic@ulgp5 matlab]$ condor_q                                     #all jobs complete
        -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
         ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
        0 jobs; 0 idle, 0 running, 0 held
        [smithic@ulgp5 matlab]$ ls                                           #check output files
        input0.mat  input3.mat   output1.mat  output4.mat  product.exe           product.sub
        input1.mat  input4.mat   output2.mat  product      product.exe.manifest
        input2.mat  output0.mat  output3.mat  product.bat  product.m
        [smithic@ulgp5 matlab]$ zip output.zip output*.mat                   #bundle output files
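
Checking the output files by eye works for 5 jobs but not for thousands. A small helper along the following lines could list any missing results before they are bundled. The function name `missing_outputs` is made up for this sketch, and the demonstration uses empty stand-in files in a temporary directory rather than real job output.

```shell
# Print the name of every expected output file that does not exist.
# Usage: missing_outputs DIRECTORY TOTAL_JOBS
missing_outputs() {
    dir=$1; total=$2
    n=0
    while [ "$n" -lt "$total" ]; do
        [ -f "$dir/output$n.mat" ] || echo "output$n.mat"
        n=$((n + 1))
    done
}

# Demonstration: pretend jobs 0-3 produced output and job 4 did not.
demo_dir=$(mktemp -d)
for k in 0 1 2 3; do
    : > "$demo_dir/output$k.mat"
done

missing=$(missing_outputs "$demo_dir" 5)
echo "missing: ${missing:-none}"
```

Running the check before `zip` means an incomplete batch is noticed on the server instead of after the download.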

  46. Summary • Condor can speed up processing by running large numbers of jobs in parallel • shorter jobs work best but can deal with jobs of arbitrary length • user-written codes easiest to run (MATLAB, R, C/C++, FORTRAN etc) • commercial 3rd party software may work • needs to run on standard MWS PC without user interaction • all Condor jobs submitted via central UNIX server

  47. Further Information • Condor: http://www.liv.ac.uk/e-science/condor, i.c.smith@liverpool.ac.uk • other research computing services: http://www.liv.ac.uk/csd/research/, arc-support@liverpool.ac.uk
