
Ian C. Smith Computing Services Dept


Presentation Transcript


  1. The University of Liverpool Condor Pool Ian C. Smith Computing Services Dept

  2. University of Liverpool Condor Pool
  • contains around 300 machines running the University's Managed Windows (XP) Service
  • most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and an 80 GB disk, configured with two job slots per machine
  • a single submission point for Condor jobs is provided by a Sun Solaris V440 SMP server
  • access to the submit host is restricted to registered Condor users
  • the policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours
  • a job will be killed off if it is still running when a user logs in to the PC

  3. Condor service caveats
  • aimed at high-throughput computing, not high-performance computing
  • only suitable for DOS-based applications running in batch mode
  • no communication between processes is possible ("pleasantly parallel" applications only)
  • statically linked executables work best (although Condor can cope with DLLs)
  • all files needed by the application must be present on the local disk (it cannot access network drives)
  • no built-in checkpointing or standard output/error streaming
  • shorter jobs are more likely to run to completion (10–20 min seems to work best)
  • very long-running jobs can be accommodated using Condor DAGMan or user-level checkpointing (details available soon on the Condor website)

  4. Submitting jobs
  • log in to the Condor server (condor.liv.ac.uk) using secure shell (ssh) – available in PuTTY on MWS
  • upload any input files using secure FTP (sftp) – available in CoreFTP Lite on MWS
  • create a job description file
  • submit job(s) using:
    $ condor_submit <job_description_file>
  • Condor jobs should be submitted from the Condor data filesystem, i.e. under /condor_data/<your_username>
    • this sits on a large (1.2 TB) fast RAID system
    • the filestore is not backed up !

  5. Typical job description file

  universe = vanilla
  transfer_files = always
  executable = example.exe
  output = stdout.out$(PROCESS)
  error = stderr.out$(PROCESS)
  log = mylog.log$(PROCESS)
  transfer_input_files = common.txt, myinput$(PROCESS).txt
  requirements = ( Arch == "INTEL" ) && ( OpSys == "WINNT51" )
  notification = never
  queue 10
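With `queue 10`, Condor starts ten processes numbered 0–9, and `$(PROCESS)` expands to that number in each filename. A quick shell loop (filenames taken from the example above) shows what each job reads and writes:

```shell
# Show how $(PROCESS) expands for "queue 10": process numbers run 0..9.
for p in 0 1 2 3 4 5 6 7 8 9; do
    echo "process $p: reads common.txt + myinput${p}.txt, writes stdout.out${p}"
done
```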

  6. Simplified job description file

  input_files = common.txt
  indexed_input_files = myinput.txt
  executable = example.exe
  indexed_stdout = stdout.out
  indexed_stderr = stderr.out
  indexed_log = mylog.log
  total_jobs = 10

  • submit job(s) using:
    $ mws_submit <job_description_file>
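The `indexed_input_files = myinput.txt` line means each job i expects a file myinput<i>.txt, so those files must exist before running `mws_submit`. A minimal sketch for preparing them (the per-job file contents here are placeholders):

```shell
# Create the indexed input files that "total_jobs = 10" implies:
# myinput0.txt .. myinput9.txt, one per job.
workdir=$(mktemp -d)
i=0
while [ "$i" -lt 10 ]; do
    echo "parameter set $i" > "$workdir/myinput${i}.txt"
    i=$((i + 1))
done
ls "$workdir"   # ten files, ready for mws_submit
```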

  7. Monitoring progress
  • to see the current state of the pool:
    $ condor_status
  • to see what jobs are doing:
    $ condor_q                    # all jobs
    $ condor_q <your_username>    # just your own
    (it may take several minutes before the first job runs)
  • to find out why jobs aren't running:
    $ condor_q -analyze <jobID>
  • to cancel/remove jobs:
    $ condor_rm <jobID>      # a single job
    $ condor_rm -all         # all of them !
    $ condor_rm -f <jobID>   # forcibly remove jobs quickly

  8. Power-saving at Liverpool
  • there are around 2,000 centrally managed PCs across campus which were previously powered up overnight, at weekends and during vacations
  • the original power-saving policy was to power off machines after 30 minutes of inactivity; they are now hibernated after 10 minutes of inactivity
  • the policy has reduced wasteful inactivity time by ~200,000–250,000 hours per week (equivalent to 20–25 MWh), leading to an estimated saving of approx. £125,000 p.a.
  • makes extensive use of the PowerMAN system from Data Synergy, comprising:
    • a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform
    • the Management Reporting Platform – a central server from which usage stats can be retrieved and viewed via a web browser
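The stated 20–25 MWh follows from the hours figure if each idle PC draws roughly 100 W. The wattage is an assumption made only for the sake of the arithmetic; the hours and MWh figures come from the slide.

```shell
# Sanity-check the energy figure: avoided idle hours per week x assumed draw.
hours=200000   # low end of avoided inactivity hours per week (from the slide)
watts=100      # assumed average draw of an idle PC (not stated in the talk)
echo "$(( hours * watts / 1000000 )) MWh per week"
```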

  9. MATLAB advantages
  • originally developed for linear algebra algorithms, but now contains many built-in functions geared to different disciplines, divided into toolboxes
  • intuitive interactive environment allows rapid code development
  • simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing)
  • allows users to create their own functions, stored as M-files
  • "standalone" applications can be built from M-files:
    • can run on platforms without MATLAB installed
    • do not need a licence to run
    • can include all toolbox functions
  • APIs available for FORTRAN and C code ("MEX files")

  10. MATLAB disadvantages
  • even standalone applications can run slower than equivalent C or FORTRAN implementations
  • standalone applications aren't quite what they may seem:
    • more than just an .exe – a "manifest" file is needed to locate the run-time libraries
    • need access to the MATLAB run-time libraries, usually via the MATLAB Component Runtime (a 150 MB self-extracting .exe)
  • luckily we have MATLAB pre-installed on all PCs in the Condor pool (originally it was run from a network drive)
  • run-time errors can be difficult to trace when MATLAB jobs are run under Condor

  11. Local tools for MATLAB
  • to run MATLAB on the Condor server without the GUI:
    $ matlab_run <M-file>
    (should be used sparingly !)
  • to build a standalone application using the pool:
    $ matlab_build <M-file>
  • to run a simple M-file job on the pool:
    $ m_file_submit <simplified_job_description_file>
  • to run standalone applications on the pool:
    $ matlab_submit <simplified_job_description_file>
    (ideal for large clusters of jobs)

  12. Power-saving at Liverpool
  • there are around 2,000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
  • the original power-saving policy was to power off machines after 30 minutes of inactivity; they are now hibernated after 15 minutes of inactivity
  • the policy has reduced wasteful inactivity time by ~200,000–250,000 hours per week (equivalent to 20–25 MWh), leading to an estimated saving of approx. £125,000 p.a.
  • great news for sustainability, but not so good for Condor ...

  13. Adapting Condor for use with power-saving PCs
  • two main problems:
    • how to ensure Condor jobs are not evicted by hibernating PCs
    • how to wake up dormant PCs to run Condor jobs on demand
  • the power-saving software is configured not to cause hibernation if a Condor job is running
  • PCs are woken up using new features of Condor
    • Condor will only attempt to wake up PCs which match the users' job requirements
    • the number of machines woken up is ramped up gradually to avoid overloading the server
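The ramp-up mentioned above could look like the following sketch. The doubling schedule and the `ramp_batches` helper are assumptions for illustration, not the actual Liverpool configuration; in practice the wake-ups are handled by Condor itself.

```shell
# Illustrative ramp-up: wake machines in doubling batches so the server is
# never hit with every wake-up request at once. (Schedule is assumed, not
# taken from the Liverpool setup.)
ramp_batches() {
    total=$1 batch=1 woken=0
    while [ "$woken" -lt "$total" ]; do
        n=$((total - woken))
        if [ "$batch" -lt "$n" ]; then n=$batch; fi
        echo "wake $n machines"     # e.g. send wake-on-LAN packets here
        woken=$((woken + n))
        batch=$((batch * 2))
    done
}
ramp_batches 20   # batches of 1, 2, 4, 8, then the remaining 5
```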

  14. Automatic wakeup in action

  15. Pool status with power-saving

  smithic(ulgp4)smithic$ condor_status
  Name               OpSys   Arch  State     Activity LoadAv Mem  ActvtyTime
  LSTC-01.livad.liv. WINNT51 INTEL Owner     Idle     0.160  2045 0+00:21:1
  LSTC-01.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
  LSTC-02.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
  LSTC-24.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
  LSTC-25.livad.liv. WINNT51 INTEL Owner     Idle     0.030  2045 0+00:15:1
  LSTC-26.livad.liv. WINNT51 INTEL Owner     Idle     0.040  2045 0+00:35:1
  LSTC-26.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
  LSTC-27.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
  slot1@ARC2-12.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 [Unknown
  slot1@ARC2-13.liva WINNT51 INTEL Owner     Idle     0.100  1006 0+00:07:1
  slot1@ARC2-15.liva WINNT51 INTEL Unclaimed Idle     0.010  1006 0+00:08:0
  slot1@ARC2-16.liva WINNT51 INTEL Owner     Idle     0.520  1006 0+00:05:1
  slot1@ARC2-17.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 [Unknown
  slot1@ARC2-17.liva WINNT51 INTEL Owner     Idle     0.010  1006 0+00:03:1
  slot2@MSC2-12.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 0+00:00:0
  ...
                 Total Owner Claimed Unclaimed Matched Preempting Backfill
  INTEL/WINNT51    608   171       1       436       0          0        0
          Total    608   171       1       436       0          0        0

  (the rows showing [Unknown activity time are offline machines)

  16. Pool status with power-saving
  • to see all offline machines:
    $ condor_status -constraint Offline==True
  • to see all powered-up machines:
    $ condor_status -constraint Offline=!=True
  • offline machines also appear in condor_q -analyze:

  1238.999: Run analysis summary. Of 613 machines,
      0 are rejected by your job's requirements
    246 reject your job because of their own requirements
     21 match but are serving users with a better priority in the pool
     16 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
    330 match but are currently offline
      0 are available to run your job

  17. Registering for the Condor Service
  • you will need an account on the Sun UNIX service first (contact the CSD helpdesk: helpdesk@liverpool.ac.uk)
  • then contact Ian Smith in CSD (i.c.smith@liverpool.ac.uk) with your UNIX username and brief details of your project
  • your Condor account should go live within a few days

  18. More information
  • there is much more background information on the CSD Condor Pool website: http://www.liv.ac.uk/e-science/condor
